Liquid AI Introduces LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: Dense Bi-Encoder and Late-Interaction Models for Fast Multilingual Search Across 11 Languages

This week, Liquid AI released two new retrieval models: LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M. Both have 350M parameters, and both are the first bidirectional members of the LFM family. They build on LFM2.5-350M-Base, which was released in March. The pair targets fast multilingual and cross-lingual search across 11 languages, with a footprint small enough to run almost anywhere. Both are available now on Hugging Face under the LFM Open License v1.0.
The LFM2.5 Retrievers
Both models share the same backbone but represent text differently. LFM2.5-Embedding-350M is a dense bi-encoder. It converts each document into a single vector. Choose it when you want the fastest search and the smallest, cheapest index.
LFM2.5-ColBERT-350M is a late-interaction model. It produces one vector per token rather than one vector per document. This lets it match queries against documents token by token, which gives higher accuracy and better generalization. The trade-off is a larger index. Choose it when accuracy matters more than storage. Its query length is limited to 32 tokens. It can also rerank the results of a first-stage retriever without building an index.
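The difference between the two scoring styles can be sketched in a few lines of NumPy. This is an illustrative toy, not Liquid AI's implementation: the function names `dense_score` and `maxsim_score` and the 2-dimensional vectors are assumptions made for the example.

```python
import numpy as np

def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Bi-encoder style: cosine similarity between one vector per text."""
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return float(q @ d)

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Late-interaction style: each query token picks its best-matching
    document token, and those maxima are summed over the query tokens."""
    q = q_tokens / np.linalg.norm(q_tokens, axis=1, keepdims=True)
    d = d_tokens / np.linalg.norm(d_tokens, axis=1, keepdims=True)
    sim = q @ d.T                      # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy inputs (hypothetical 2-dim token vectors, not real model outputs).
q_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
d_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

late = maxsim_score(q_tokens, d_tokens)
dense = dense_score(q_tokens.mean(axis=0), d_tokens.mean(axis=0))
```

Because MaxSim needs every document token vector at search time, the index stores many vectors per document, which is where the larger index comes from.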
Both are aimed at short-context retrieval. Good fits include product catalogs, FAQ knowledge bases, and support documentation. Liquid AI positions both as drop-in replacements in existing RAG pipelines.
Architecture Change: A Two-Patch Recipe
Both models are built on LFM2.5-350M-Base, a general-purpose pretrained checkpoint. Liquid AI applies a small set of two patches to the LFM2 architecture, which converts it from a causal decoder into a bidirectional encoder.
In the causal setting, each token sees only itself and the tokens before it. That suits left-to-right generation but is suboptimal for retrieval. The team replaces the causal attention mask with a bidirectional one, so every token can attend to both left and right context. They also make LFM2’s short convolutions non-causal, so these mix local information symmetrically around each token rather than only from the preceding ones.
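The two patches can be illustrated with a toy NumPy sketch. It assumes a scalar sequence and a fixed odd-length kernel; the real LFM2 blocks are learned and more elaborate, and the helper names `attention_mask` and `short_conv` are hypothetical.

```python
import numpy as np

def attention_mask(n: int, causal: bool) -> np.ndarray:
    """Boolean mask; entry [i, j] is True when token i may attend to token j."""
    full = np.ones((n, n), dtype=bool)
    return np.tril(full) if causal else full

def short_conv(x: np.ndarray, kernel: np.ndarray, causal: bool) -> np.ndarray:
    """Toy 1-D convolution with an odd-length kernel.
    Causal: pad only on the left, so output i sees x[i-k+1 .. i].
    Non-causal: pad both sides, so output i sees a window centred on i."""
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = k - 1 - left
    padded = np.pad(x, (left, right))
    return np.array([padded[i:i + k] @ kernel for i in range(len(x))])

# An impulse at position 2 shows where information is allowed to flow.
x = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
kernel = np.ones(3)

causal_out = short_conv(x, kernel, causal=True)      # spreads only to the right
symmetric_out = short_conv(x, kernel, causal=False)  # spreads to both sides
```

In the causal case the impulse influences only positions 2 to 4; in the symmetric case it influences positions 1 to 3, which is the behaviour a bidirectional encoder needs.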
The conversion preserves the efficiency of the LFM2 backbone while producing the full-context representations that retrieval requires. Each model has 17 layers: 10 convolution, 6 attention, and 1 pooling or dense layer. The context length reaches 32,768 tokens, although documents are tuned at 512 tokens. Beyond the shared encoder, the two models differ only in their output heads. Embedding uses CLS-style pooling to produce a single 1024-dim vector. ColBERT keeps a 128-dim embedding per token for late-interaction MaxSim scoring.
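A rough back-of-the-envelope calculation shows why the per-token head leads to a larger index. The sketch below assumes uncompressed FP16 storage and an assumed document length of 256 tokens; real late-interaction indexes such as PLAID compress token vectors, so actual sizes will be smaller than this upper bound.

```python
BYTES_FP16 = 2  # assumption: vectors stored as uncompressed 16-bit floats

def dense_bytes_per_doc(dim: int = 1024) -> int:
    """One vector per document."""
    return dim * BYTES_FP16

def colbert_bytes_per_doc(n_tokens: int, dim: int = 128) -> int:
    """One vector per document token."""
    return n_tokens * dim * BYTES_FP16

dense = dense_bytes_per_doc()           # 1024 * 2 bytes
colbert = colbert_bytes_per_doc(256)    # 256 * 128 * 2 bytes
ratio = colbert / dense
```

Under these assumptions a 256-token document costs 2 KiB as a dense vector and 64 KiB as raw token vectors, a 32x gap before any compression.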
Training and Data
Both models follow the same three-stage recipe:
- The first stage is contrastive pre-training in English.
- The second stage is multilingual and cross-lingual distillation from a strong teacher across all 11 languages.
- The third stage is a final contrastive fine-tuning with hard negatives.
The embedding model sees a slightly different mix of cross-lingual data than ColBERT, because cross-lingual retrieval emerges more naturally in the late-interaction setup. The training data includes curated internal data and open-source English retrieval datasets. LLM-based translation expands the multilingual and cross-lingual pairs.
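The article does not give the exact training objective, so the following is only a sketch of a common contrastive loss with in-batch negatives (InfoNCE), the kind of objective typically used for stages like these. The function name `info_nce_loss` and the temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive loss with in-batch negatives.
    Row i of `q` is paired with row i of `d`; every other row of `d`
    in the batch acts as a negative for query i."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives on the diagonal

# Toy batch: aligned pairs should score a lower loss than misaligned ones.
queries = np.eye(3)
aligned = info_nce_loss(queries, np.eye(3))
shuffled = info_nce_loss(queries, np.roll(np.eye(3), 1, axis=0))
```

Hard negatives, as in the third stage, are usually added as extra columns in the logits matrix so that each query is also contrasted against documents that look relevant but are not.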
Benchmarks
Liquid AI evaluated two capabilities. The first is multilingual retrieval, measured with NanoBEIR. The second is open-domain cross-lingual QA, measured with MKQA-11. Both report results across all 11 languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish.
On average, both models lead their class. Here are the comparison details:
| Model | Type | NanoBEIR ML (NDCG@10) | MKQA-11 (Recall@20) |
|---|---|---|---|
| LFM2.5-ColBERT-350M | late interaction | 0.605 | 0.694 |
| LFM2.5-Embedding-350M | dense | 0.577 | 0.691 |
| Qwen/Qwen3-Embedding-0.6B | dense | 0.556 | 0.638 |
| LFM2-ColBERT-350M | late interaction | 0.540 | 0.646 |
| Alibaba-NLP/gte-multilingual-base | dense | 0.528 | 0.675 |
| lightonai/GTE-ModernColBERT-v1 | late interaction | 0.489 | 0.459 |
| BAAI/bge-large-en-v1.5 | dense | 0.359 | 0.413 |
ColBERT leads on both metrics. The embedding model is close behind on MKQA-11 with 0.691. Both beat Qwen3-Embedding-0.6B, a larger model. The new ColBERT also improves on the previous LFM2-ColBERT-350M, from 0.540 to 0.605 on NanoBEIR. Liquid AI also notes that English NanoBEIR tracks the more expensive full BEIR benchmark. The two remain highly correlated, with NanoBEIR scoring roughly 15% higher. The research team therefore uses NanoBEIR as an efficient proxy during training.
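For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how NDCG@k and Recall@k are commonly computed for a single query. The benchmarks' own evaluation code may differ in details (for example, QA benchmarks often count a hit when a retrieved passage contains the answer), and the function names are hypothetical.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """Normalized discounted cumulative gain for one query.
    `relevance` maps document id -> graded relevance (0 if absent)."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

perfect = ndcg_at_k(["a", "b"], {"a": 1})    # relevant doc ranked first
demoted = ndcg_at_k(["b", "a"], {"a": 1})    # relevant doc ranked second
partial = recall_at_k(["a", "b", "c"], {"a", "z"})
```

The reported scores are averages of such per-query values over every query and language in the benchmark.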
Latency and Edge Deployment
Liquid AI released GGUF variants for llama.cpp. These let both models run on CPUs, laptops, and edge devices. The measurements below use a MacBook Pro M4 Max at FP16. Queries are 32 tokens; documents are 256 tokens.
| Model | Stage | Documents cached | p50 |
|---|---|---|---|
| LFM2.5-Embedding-350M | Query embedding | yes | 7.3 ms |
| LFM2.5-ColBERT-350M | Query embedding + MaxSim | yes | 8.2 ms |
| LFM2.5-ColBERT-350M | Query + Document embedding + MaxSim | no | 34.3 ms |
When document embeddings are pre-computed, the median query latency (p50) stays below 10 ms. Encoding documents at query time pushes ColBERT to 34.3 ms. For enterprise scale, Liquid AI has also built an internal GPU stack. On an H100 at FP16, it reports latency as low as 1 ms; query embedding latency is 1.5 ms at p50.
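To reproduce a p50 figure on your own hardware, a simple timing harness is enough. The sketch below is an assumption about how such a measurement could be taken, not Liquid AI's benchmark script; `p50_latency_ms` is a hypothetical helper, and the lambda stands in for a real call such as encoding a query.

```python
import statistics
import time

def p50_latency_ms(fn, n_warmup: int = 5, n_runs: int = 50) -> float:
    """Median wall-clock latency of `fn` in milliseconds.
    Warm-up calls are discarded so one-time setup costs do not skew results."""
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Placeholder workload; swap in the model call you want to measure.
latency = p50_latency_ms(lambda: sum(range(10_000)))
```

Using the median rather than the mean keeps occasional slow runs, such as those caused by background processes, from dominating the number.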
Use Cases with Examples
- E-commerce: Search a product catalog in multiple languages with one index. A buyer types a Korean query and the system surfaces an English product listing. Cross-lingual retrieval makes this work without a separate index per language.
- FAQ and support knowledge bases: Reliably return the right answer across customer-facing locales. A French support question is matched to an English help article.
- On-device semantic search: Search files, emails, and notes locally on consumer hardware. The GGUF builds keep data on the device at near-zero cost.
- Enterprise knowledge assistants: Retrieve internal legal, financial, and business documents across languages. ColBERT suits this case when answer accuracy matters more than index size.
Code: Getting Started
The embedding model runs on sentence-transformers. Always pass the asymmetric prompts, query: and document:. Omitting them silently degrades retrieval quality.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "LiquidAI/LFM2.5-Embedding-350M",
    trust_remote_code=True,
)
queries = ["What is the capital of France?"]
documents = ["Paris is the capital and largest city of France."]
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(documents, prompt_name="document", normalize_embeddings=True)
scores = q_emb @ d_emb.T # shape: (n_queries, n_documents)
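A score matrix like the one above still has to be turned into a ranked result list. The following is a small NumPy sketch of that step; the helper `top_k` is hypothetical, and the hard-coded matrix stands in for real model scores.

```python
import numpy as np

def top_k(scores: np.ndarray, k: int):
    """For each query row, return (document_index, score) pairs for the
    k highest-scoring documents, best first."""
    k = min(k, scores.shape[1])
    order = np.argsort(-scores, axis=1)[:, :k]
    return [[(int(j), float(scores[i, j])) for j in order[i]]
            for i in range(scores.shape[0])]

# Stand-in for `q_emb @ d_emb.T`: one query scored against three documents.
example_scores = np.array([[0.2, 0.9, 0.5]])
hits = top_k(example_scores, k=2)
```

For large collections a vector index replaces this brute-force sort, but the output shape stays the same: document identifiers with scores, ordered by relevance.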
The ColBERT model runs on PyLate. Its PLAID index uses FastPLAID for efficient similarity search.
from pylate import indexes, models, retrieve
model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M",
    trust_remote_code=True,
)
model.tokenizer.pad_token = model.tokenizer.eos_token
index = indexes.PLAID(index_folder="pylate-index", index_name="index", override=True)
docs_emb = model.encode(["document 1 text", "document 2 text"], is_query=False)
index.add_documents(documents_ids=["1", "2"], documents_embeddings=docs_emb)
retriever = retrieve.ColBERT(index=index)
q_emb = model.encode(["a search query"], is_query=True)
scores = retriever.retrieve(queries_embeddings=q_emb, k=10)
To rerank the output of an existing first-stage pipeline instead, skip the index and use rank.rerank.
from pylate import models, rank
model = models.ColBERT(model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M", trust_remote_code=True)
queries = ["query A"]
documents = [["candidate doc 1", "candidate doc 2"]]
documents_ids = [[1, 2]]
q_emb = model.encode(queries, is_query=True)
d_emb = model.encode(documents, is_query=False)
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=q_emb,
    documents_embeddings=d_emb,
)
You can also fine-tune either model on your own data. The embedding model card provides a snippet using sentence-transformers and MultipleNegativesRankingLoss.
Key Takeaways
- Liquid AI’s LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M are the first bidirectional LFMs, designed for multilingual search across 11 languages.
- Both 350M models lead their class on NanoBEIR and MKQA-11, beating the larger Qwen3-Embedding-0.6B.
- Embedding provides the smallest, cheapest index; ColBERT trades a larger index for higher token-level accuracy.
- GGUF builds run on CPUs, laptops, and the edge with llama.cpp, with p50 query latency under 10 ms.
- They fit into existing RAG pipelines via sentence-transformers and PyLate, under the LFM Open License v1.0.