Production System Benchmarks
How Latimal's production search and matching pipeline compares to leading embedding models on food-domain tasks.
By Aditya Patni
The FoodEval leaderboard measures bare embeddings: encode, cosine, done. That tells you how good the embedding space is in isolation, and it's the right way to compare models on a level playing field. But nobody ships bare cosine to production. Real search and matching systems layer reranking, query understanding, and retrieval augmentation on top of the embedding. Those layers matter.
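This page shows how the full production systems compare. Every matching and search result below comes from two-stage retrieval (embedding retrieval followed by a reranker); the one exception is cuisine classification, which uses a probe on frozen embeddings with no reranker. The methodology is disclosed per table so you can see exactly what each system is running.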
All systems are compared at 384 dimensions. Competitors are paired with bge-reranker-v2-m3, a strong open-source reranker. Latimal uses its own proprietary reranker. This is intentional: we want to show how a purpose-built food system compares against the best general stack you could assemble today.
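The two-stage shape described above can be sketched in a few lines. This is an illustration only, not Latimal's pipeline or the bge-reranker-v2-m3 API: `two_stage_search`, `toy_rerank`, and the `n_candidates` and `top_k` parameters are hypothetical names, and the token-overlap scorer merely stands in for a real cross-encoder.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_stage_search(
    query_text: str,
    query_vec: Vector,
    docs: Sequence[tuple[str, Vector]],
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 50,
    top_k: int = 10,
) -> list[str]:
    """Retrieve by cosine, then re-order the shortlist with a reranker."""
    # Stage 1: bare-embedding retrieval ("encode, cosine, done").
    by_cosine = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    candidates = by_cosine[:n_candidates]
    # Stage 2: a (query, document) scorer re-ranks only the shortlist.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query_text, d[0]), reverse=True
    )
    return [text for text, _ in reranked[:top_k]]


def toy_rerank(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: shared-token count."""
    return float(len(set(query.split()) & set(doc.split())))
```

The point of the split is cost: the cheap first stage scans the whole catalogue, and the expensive pairwise scorer only ever sees `n_candidates` items.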
Menu Intelligence
Production matching, F1. All models use embedding + reranker (Latimal: proprietary reranker; competitors: bge-reranker-v2-m3).
| Task (metric) | Latimal | Cohere Embed v4 | Nomic AI Nomic v1.5 | BAAI BGE-M3 | Alibaba GTE-large | Microsoft E5-large |
|---|---|---|---|---|---|---|
| Indian cuisine matching (F1) | 87.6% | 79.1% | 79.1% | 79.1% | 78.8% | 79.1% |
| Global cuisine matching (F1) | 87.0% | 81.6% | 81.6% | 81.6% | 81.2% | 81.6% |
| Beverage matching (F1) | 86.8% | 70.2% | 70.1% | 70.4% | 69.0% | 70.6% |
| Bakery & dessert matching (F1) | 86.2% | 70.5% | 70.2% | 70.8% | 70.4% | 70.5% |
| Portion size sensitivity (F1) | 95.5% | 65.5% | 65.7% | 65.5% | 65.5% | 65.5% |
| Noisy menu matching (F1) | 88.4% | 94.4% | 94.4% | 94.4% | 94.0% | 94.4% |
| Cross-lingual matching (F1) | 81.2% | 70.7% | 70.9% | 70.9% | 70.6% | 70.9% |
| Cuisine classification (Macro F1) | 73.7% | 73.7% | 71.0% | 70.1% | 71.6% | 39.9% |
| Average (8 tasks) | 85.7% | 75.7% | 75.4% | 75.3% | 75.1% | 71.6% |
Latimal leads six of the eight tasks outright and ties Cohere Embed v4 on cuisine classification (73.7% each). The one loss is noisy menu matching, where the reranker-augmented competitors score 94.0% to 94.4% against Latimal's 88.4%. Noisy matching rewards pattern stripping (prices, promo tags, size labels), and bge-reranker-v2-m3 handles that well. Across the full sweep, the average gap to the next-best system (Cohere Embed v4, 75.7%) is 10 points.
The widest margins are on portion size sensitivity (+30 points) and beverage matching (+16 points). Those are the tasks where the model needs to know which modifiers change dish identity and which don't. "Large Coffee" and "Small Coffee" are the same item. "Iced Latte" and "Hot Chocolate" are not. General embeddings treat all modifiers the same way.
Search
Production search, NDCG@10. Latimal uses its full search pipeline (reranker, concept centroids, query expansion, document attributes, and pseudo-relevance feedback, PRF). Competing models use embedding + bge-reranker-v2-m3.
| Task (metric) | Latimal | Cohere Embed v4 | Nomic AI Nomic v1.5 | BAAI BGE-M3 | Alibaba GTE-large | Microsoft E5-large |
|---|---|---|---|---|---|---|
| Food search (NDCG@10) | 93.8% | 58.9% | 56.4% | 55.2% | 57.2% | 55.4% |
| Concept search (NDCG@10) | 80.9% | 39.1% | 35.7% | 33.6% | 37.4% | 32.8% |
| Diet & allergen search (NDCG@10) | 80.2% | 16.5% | 13.2% | 13.2% | 13.5% | 13.6% |
| Noisy search (NDCG@10) | 92.5% | 66.0% | 63.5% | 61.4% | 62.8% | 62.8% |
| Average (4 tasks) | 86.9% | 45.1% | 42.2% | 40.8% | 42.7% | 41.1% |
The search gap is larger. Latimal averages 86.9% NDCG@10 against the next best at 45.1%. Diet and allergen search shows the starkest difference: 80.2% vs 16.5%. Queries like "celiac friendly" or "keto" rarely have any lexical overlap with the items that satisfy them. General models and a general reranker can't bridge that gap. The production pipeline can, because it knows what those dietary terms mean in a food context.
Concept search follows the same pattern. "Warm comfort food" and "crispy appetizer" are abstract queries that need food-specific understanding to resolve. The production system scores 80.9%, roughly double the next competitor.
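For readers who want to see what the metric in this section measures, here is a minimal NDCG@10 sketch. It assumes linear gain and a log2 rank discount; the exact gain function FoodEval uses is not stated on this page, so treat `dcg_at_k` and `ndcg_at_k` as illustrative names rather than the benchmark's own code.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    # Rank 0 is the top result, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """NDCG@k: DCG of the returned order divided by DCG of the ideal order.

    `ranked_relevances` holds the graded relevance of each returned item,
    in the order the system returned them. `all_relevances` optionally
    lists the relevance of every judged item for the query, so that
    relevant items the system missed still count against it.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant items judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal
```

A perfect ordering scores 1.0, and a relevant item pushed from rank 1 to rank 2 is discounted by `1 / log2(3)`, which is why the metric rewards getting the top of the list right.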
Methodology
Tasks and evaluation data come from FoodEval, the same benchmark used for the bare-embedding leaderboard. The difference here is the inference path: instead of raw cosine similarity, each model runs through its respective production-grade retrieval pipeline.
- Matching tasks use embedding retrieval followed by a cross-encoder reranker to produce the final similarity score. F1 is computed at the optimal threshold for each model.
- Search tasks use embedding retrieval to generate a candidate set, then a reranker to re-score and re-order. NDCG@10 measures ranking quality.
- Classification uses a probe trained on frozen embeddings, identical to the FoodEval methodology. The reranker is not involved.
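"F1 at the optimal threshold" in the matching bullet above can be read as a sweep over candidate cut-offs on the reranker's similarity score. The sketch below is one plausible way to do that, not the FoodEval implementation: `f1_at_threshold` and `best_f1` are hypothetical names, and the sketch tunes the threshold on the same pairs it scores, since this page does not say whether a held-out split is used.

```python
from typing import Sequence


def f1_at_threshold(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> float:
    """F1 when every pair scoring >= threshold is predicted a match."""
    tp = fp = fn = 0
    for score, is_match in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1(scores: Sequence[float], labels: Sequence[bool]) -> tuple[float, float]:
    """Return (best F1, threshold), trying each distinct score as a cut-off."""
    best = (0.0, float("inf"))
    for threshold in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, threshold)
        if f1 > best[0]:
            best = (f1, threshold)
    return best
```

Picking the threshold per model removes calibration differences between rerankers from the comparison, so the reported F1 reflects how well each system separates matches from non-matches rather than where its scores happen to sit.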
The competitor pairing (any embedding + bge-reranker-v2-m3) represents the strongest general-purpose two-stage system we could assemble around an open-source reranker today. If you know of a better open reranker for food tasks, we will add it.
Bare-embedding comparison
If you want to see how models compare without reranking, the FoodEval leaderboard has ten models evaluated on the same tasks using only cosine similarity. That comparison isolates the embedding quality. This page shows what happens when you add the rest of the stack.
Try the search and matching endpoints directly in the playground.