June 10, 2026

Production System Benchmarks

How Latimal's production search and matching pipeline compares to leading embedding models on food-domain tasks.

The FoodEval leaderboard measures bare embeddings: encode, cosine, done. That tells you how good the embedding space is in isolation, and it's the right way to compare models on a level playing field. But nobody ships bare cosine to production. Real search and matching systems layer reranking, query understanding, and retrieval augmentation on top of the embedding. Those layers matter.

This page shows how the full production systems compare. Every model below uses two-stage retrieval (embedding retrieval followed by a reranker). The methodology is disclosed per table so you can see exactly what each system is running.

All systems are compared at 384 dimensions. Competitors are paired with bge-reranker-v2-m3, a strong open-source reranker. Latimal uses its own proprietary reranker. This is intentional: we want to show how a purpose-built food system compares against the best general stack you could assemble today.

Menu Intelligence

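Production matching, F1. All models use embedding + reranker (Latimal: proprietary reranker; competitors: bge-reranker-v2-m3). Cuisine classification is the exception: it uses a probe on frozen embeddings with no reranker (see Methodology).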

Task	Latimal	Cohere Embed v4	Nomic AI Nomic v1.5	BAAI BGE-M3	Alibaba GTE-large	Microsoft E5-large
Indian cuisine matching (F1)	87.6%	79.1%	79.1%	79.1%	78.8%	79.1%
Global cuisine matching (F1)	87.0%	81.6%	81.6%	81.6%	81.2%	81.6%
Beverage matching (F1)	86.8%	70.2%	70.1%	70.4%	69.0%	70.6%
Bakery & dessert matching (F1)	86.2%	70.5%	70.2%	70.8%	70.4%	70.5%
Portion size sensitivity (F1)	95.5%	65.5%	65.7%	65.5%	65.5%	65.5%
Noisy menu matching (F1)	88.4%	94.4%	94.4%	94.4%	94.0%	94.4%
Cross-lingual matching (F1)	81.2%	70.7%	70.9%	70.9%	70.6%	70.9%
Cuisine classification (Macro F1)	73.7%	73.7%	71.0%	70.1%	71.6%	39.9%
Average (8 tasks)	85.7%	75.7%	75.4%	75.3%	75.1%	71.6%

Latimal leads six of the eight tasks and ties Cohere Embed v4 on cuisine classification (73.7% each). The exception is noisy menu matching, where the reranker-augmented competitors score 94.0% to 94.4% against Latimal's 88.4%. Noisy matching rewards pattern stripping (prices, promo tags, size labels), and bge-reranker-v2-m3 handles that well. Across the full sweep, Latimal averages 85.7% against 75.7% for the next-best system, a gap of 10 points.

The widest margins are on portion size sensitivity (+30 points) and beverage matching (+16 points). Those are the tasks where the model needs to know which modifiers change dish identity and which don't. "Large Coffee" and "Small Coffee" are the same item. "Iced Latte" and "Hot Chocolate" are not. General embeddings treat all modifiers the same way.

Search

Production search, NDCG@10. Latimal uses its full search pipeline (reranker, concept centroids, query expansion, document attributes, PRF). Competing models use embedding + bge-reranker-v2-m3.

Task	Latimal	Cohere Embed v4	Nomic AI Nomic v1.5	BAAI BGE-M3	Alibaba GTE-large	Microsoft E5-large
Food search (NDCG@10)	93.8%	58.9%	56.4%	55.2%	57.2%	55.4%
Concept search (NDCG@10)	80.9%	39.1%	35.7%	33.6%	37.4%	32.8%
Diet & allergen search (NDCG@10)	80.2%	16.5%	13.2%	13.2%	13.5%	13.6%
Noisy search (NDCG@10)	92.5%	66.0%	63.5%	61.4%	62.8%	62.8%
Average (4 tasks)	86.9%	45.1%	42.2%	40.8%	42.7%	41.1%

The search gap is larger. Latimal averages 86.9% NDCG@10 against the next best at 45.1%. Diet and allergen search shows the starkest difference: 80.2% vs 16.5%. Queries like "celiac friendly" or "keto" rarely have any lexical overlap with the items that satisfy them. General models and a general reranker can't bridge that gap. The production pipeline can, because it knows what those dietary terms mean in a food context.

Concept search follows the same pattern. "Warm comfort food" and "crispy appetizer" are abstract queries that need food-specific understanding to resolve. The production system scores 80.9%, roughly double the next-best competitor's 39.1%.
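
For readers less familiar with the metric in the search table, here is a minimal sketch of NDCG@10. It assumes linear gain (the relevance grade itself) and a log2 rank discount; FoodEval's exact gain formulation is not stated on this page, and `ndcg_at_k` and the toy relevance lists are illustrative only.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    # rank is 0-based, so the top result is discounted by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked: Sequence[float], judged: Sequence[float], k: int = 10) -> float:
    """NDCG@k for one query.

    ranked: relevance grades of the results, in the order the system returned them.
    judged: relevance grades of every judged document for the query, so that
            relevant items the system failed to retrieve still count against it.
    """
    ideal_dcg = dcg_at_k(sorted(judged, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked, k) / ideal_dcg


# Toy query (made-up grades): three relevant documents exist,
# the system surfaces two of them at ranks 1 and 3.
judged = [1, 1, 1]
ranked = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranked, judged, k=10)   # about 0.704
perfect = ndcg_at_k([1, 1, 1], judged, k=10)
```

A benchmark score such as the ones in the table would typically be the mean of this per-query value over all queries in a task.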

Methodology

Tasks and evaluation data come from FoodEval, the same benchmark used for the bare-embedding leaderboard. The difference here is the inference path: instead of raw cosine similarity, each model runs through its respective production-grade retrieval pipeline.

Matching tasks use embedding retrieval followed by a cross-encoder reranker to produce the final similarity score. F1 is computed at the optimal threshold for each model.
Search tasks use embedding retrieval to generate a candidate set, then a reranker to re-score and re-order. NDCG@10 measures ranking quality.
Classification uses a probe trained on frozen embeddings, identical to the FoodEval methodology. The reranker is not involved.
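
As a rough illustration of the F1-at-optimal-threshold step for matching tasks, the sketch below sweeps every distinct similarity score as a candidate threshold and keeps the one that maximizes F1. The function names (`f1_at_threshold`, `best_f1`) and the toy scores are hypothetical, not FoodEval's actual harness, and this page does not say whether the threshold is selected on the evaluation set or on a held-out split.

```python
from typing import List, Tuple


def f1_at_threshold(scores: List[float], labels: List[int], threshold: float) -> float:
    """F1 when every pair scoring >= threshold is predicted to be a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Try each distinct score as the threshold; return (best F1, threshold)."""
    best = (0.0, float("inf"))
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best[0]:
            best = (f1, t)
    return best


# Toy reranker scores for six candidate pairs (1 = true match, 0 = not a match).
scores = [0.92, 0.81, 0.77, 0.40, 0.35, 0.10]
labels = [1, 1, 0, 1, 0, 0]
f1, threshold = best_f1(scores, labels)   # F1 = 6/7 at threshold 0.40
```

Because each model gets its own threshold, this procedure compares models on how well their scores separate matches from non-matches, not on how well a single fixed cutoff happens to suit them.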

The competitor pairing (any embedding + bge-reranker-v2-m3) represents the strongest general-purpose two-stage system available as open source today. If you know of a better open reranker for food tasks, we will add it.
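
The general two-stage shape (embedding retrieval, then a reranker) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark code: `two_stage_search`, the 2-dimensional toy vectors, and the `token_overlap` scorer are all hypothetical. In the competitor setup a cross-encoder such as bge-reranker-v2-m3 would take the place of `token_overlap`, and the vectors would come from the embedding model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(
    query_text: str,
    query_vec: Vector,
    docs: List[Tuple[str, Vector]],
    rerank_score: Callable[[str, str], float],
    candidates: int = 50,
    top_k: int = 10,
) -> List[str]:
    """Stage 1: shortlist by cosine. Stage 2: re-order the shortlist by reranker score."""
    by_cosine = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    shortlist = by_cosine[:candidates]
    reranked = sorted(shortlist, key=lambda d: rerank_score(query_text, d[0]), reverse=True)
    return [text for text, _ in reranked[:top_k]]


def token_overlap(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: counts shared lowercase tokens (assumption)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


# Toy corpus with made-up 2-d embeddings.
docs = [
    ("iced latte", [0.9, 0.1]),
    ("hot chocolate", [0.8, 0.3]),
    ("garlic naan", [0.1, 0.9]),
]
# The query vector is deliberately closest to "hot chocolate" so that
# stage 1 ranks it first and stage 2 has something to correct.
results = two_stage_search("latte", [0.8, 0.3], docs, token_overlap, candidates=2, top_k=2)
```

In this toy run the cosine stage drops "garlic naan" and the rerank stage moves "iced latte" above "hot chocolate". The reranker can only re-order what the first stage retrieves, which is why the size of the candidate set matters in practice.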

Bare-embedding comparison

If you want to see how models compare without reranking, the FoodEval leaderboard has ten models evaluated on the same tasks using only cosine similarity. That comparison isolates the embedding quality. This page shows what happens when you add the rest of the stack.

Try the search and matching endpoints directly in the playground.