Open retrievers released
- LightOnIO released two open retriever models, DenseOn and LateOn, under an Apache 2.0 license.
- The 149M‑parameter models reportedly top BEIR benchmarks for retrieval tasks and ship with an open data pipeline.
- Open, high‑performing retrievers reduce friction for RAG systems and make production retrieval easier to integrate (x.com).
Search systems for chatbots work like a library clerk: they turn a question into vectors, then fetch the nearest documents before the model writes an answer. LightOn released two of those retrievers, DenseOn and LateOn, on April 21 under the Apache 2.0 license. (huggingface.co)

DenseOn is the simpler design: one vector for a query, one vector for a document, then a similarity score. LateOn is a ColBERT-style model that keeps multiple vectors per document, which usually costs more compute but can match passages more precisely. (huggingface.co 1) (huggingface.co 2)

Both models use the ModernBERT backbone and have 149 million parameters. LightOn says that size is meant for “latency-sensitive production systems,” where teams need retrieval quality without running very large models. (huggingface.co 1) (huggingface.co 2)

The benchmark in this release is BEIR, a standard retrieval test bed with more than 15 datasets and a common evaluation framework for information retrieval models. LightOn reports LateOn at 57.22 average nDCG@10 on 14 BEIR datasets and DenseOn at 56.75, with both models also posting higher scores on a decontaminated 12-dataset split. (github.com) (huggingface.co 1) (huggingface.co 2)

That decontaminated split is meant to answer a basic question in retrieval: did the model learn to generalize, or did it see overlapping examples during training? LightOn said it stripped training-overlap samples from BEIR corpora and found LateOn rose to 60.36 nDCG@10 and DenseOn to 57.71. (huggingface.co)

The release is not just model weights. LightOn also published an open training pipeline, including a pre-training dataset with 1.4 billion query-document pairs and a fine-tuning dataset with 1.88 million samples and mined negatives. (huggingface.co)
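For readers unfamiliar with the metric behind those scores, nDCG@10 rewards rankings that place relevant documents near the top of the first ten results. The following is a minimal sketch, not LightOn's or BEIR's evaluation code: the function names and the toy document IDs are hypothetical, and it assumes linear gain with a log2 rank discount, which is one common formulation.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain assumed)."""
    # Rank 0 is the top result; the discount grows with log2 of the 1-based rank + 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query: DCG of the ranking divided by the best achievable DCG."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical judgments: two relevant documents for a single query.
    relevance = {"contract_a": 1.0, "policy_b": 1.0}
    # A retriever that puts one relevant document first and the other third.
    ranking = ["contract_a", "ticket_x", "policy_b"]
    print(f"nDCG@10 = {ndcg_at_k(ranking, relevance):.4f}")
```

Scores on this scale run from 0 to 1. Figures such as 57.22 appear to be the same quantity multiplied by 100 and averaged over queries and datasets.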
That matters for companies building retrieval-augmented generation, or RAG, systems that answer questions from internal files instead of only from model memory. In those systems, the retriever often decides whether the final answer sees the right contract, ticket, or policy at all. (lighton.ai) (huggingface.co)

LightOn framed the launch as a response to a market where many top retrievers are either API-only or trained on undisclosed data. The company said open data and model releases let others test for leakage, rebuild the pipeline, and swap in their own filters instead of treating retrieval quality as a black box. (huggingface.co)

The near-term test is whether developers adopt the full stack rather than just the leaderboard numbers. LightOn has already packaged the models on Hugging Face and tied its late-interaction tooling to PyLate and FastPLAID, which lowers the work needed to put an open retriever into production. (huggingface.co)
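The difference between the two designs comes down to how a query and a document are scored. The sketch below contrasts a single-vector similarity with the MaxSim scoring typically used by ColBERT-style late-interaction models. It is an illustration under stated assumptions, not LightOn's code: the two-dimensional vectors are made up, the function names are hypothetical, and real models produce learned embeddings with hundreds of dimensions.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Single-vector scoring: one vector per query, one per document."""
    return cosine(query_vec, doc_vec)


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction (MaxSim) scoring: each query vector is matched to its
    best document vector, and the best-match similarities are summed."""
    return sum(
        max((cosine(q, d) for d in doc_vecs), default=0.0)
        for q in query_vecs
    )


if __name__ == "__main__":
    # Toy data (assumption): two query token vectors pointing in different directions.
    query_tokens = [[1.0, 0.0], [0.0, 1.0]]
    doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
    doc_b_tokens = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

    print("MaxSim, doc A:", maxsim_score(query_tokens, doc_a_tokens))
    print("MaxSim, doc B:", maxsim_score(query_tokens, doc_b_tokens))

    # Single-vector view of the same data, using hand-picked summary vectors.
    print("Dense, doc A:", dense_score([1.0, 1.0], [1.0, 1.0]))
    print("Dense, doc B:", dense_score([1.0, 1.0], [1.0, 0.0]))
```

The multi-vector path does one comparison per query-token and document-token pair, which is where the extra compute goes. In exchange, each part of the query can be matched to the passage that fits it best.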