June 18, 2026

Your Embedding Model Thinks Bread and Honeycomb Are the Same Food

General-purpose embeddings confuse co-occurrence with identity. Bread and honeycomb share a breakfast table but are different foods, and that breaks search.

A loaf of bread and a honeycomb sitting across from each other at a small table with two coffee cups, under a single hanging lamp

The bug you can't grep for

A customer searches "bread" on your app. The results include honeycomb, honey butter, and bee pollen granola. Your search model learned from web text that bread and honey often appear together, so it pulled them close in vector space. The customer sees nonsense, gives up on search, and starts scrolling.

This failure has a name: the model treats co-occurrence as equivalence. Bread and honey share a breakfast table, but a customer searching for bread wants naan, pita, sourdough, or a baguette. The search engine needs to know that bread is a carbohydrate and honeycomb is a sweetener, that they belong to different categories, dietary profiles, and aisles.

Where general models break down

We tested over a dozen general-purpose embedding models on food retrieval tasks for FoodEval, a 12-task benchmark covering search, matching, and classification across cuisines and languages. Three failure modes showed up everywhere.

Vocabulary mismatch comes first. Take "Paneer Tikka" and "Cottage Cheese Tikka." Same dish, but "paneer" and "cottage cheese" share no tokens. A general multilingual model gives the pair a cosine similarity of around 0.4: it can bridge languages, but it has no food-domain alignment to tell it these are identical. A food-specialized model puts them at 0.92. That gap is the difference between returning the item and burying it on page three.
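For readers who want to see what such a score measures, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are invented for illustration and do not come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, made up for this example only.
paneer_tikka = [0.9, 0.1, 0.3]
cottage_cheese_tikka = [0.8, 0.2, 0.4]
honeycomb = [0.1, 0.9, 0.2]

same_dish = cosine_similarity(paneer_tikka, cottage_cheese_tikka)
different_food = cosine_similarity(paneer_tikka, honeycomb)

print(f"same dish:      {same_dish:.2f}")
print(f"different food: {different_food:.2f}")
```

A score near 1.0 means the two vectors point in nearly the same direction, and a score near 0 means they are unrelated. Retrieval systems typically rank candidates by this score, which is why a pair that should be identical but scores low ends up far down the results.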

Category collapse is subtler. "Chicken biryani" and "Veg biryani" land close together because they share the word "biryani." But for a vegetarian customer, these are on opposite sides of a hard constraint. The protein is the defining axis, not the preparation style. On FoodEval's protein-conflict task, general models score below 0.80 F1. A domain model hits 0.95.
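As a reminder of what an F1 number summarizes, the sketch below computes it for a binary "do these two dishes conflict on protein?" decision. The task framing and the labels are assumptions made up for this example; FoodEval's actual task format and scoring code may differ.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: True means the pair conflicts on protein,
# e.g. ("Chicken biryani", "Veg biryani").
y_true = [True, True, True, False, False, True]
y_pred = [True, False, True, False, True, True]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 penalizes both missed conflicts and false alarms, a model that groups dishes by preparation style alone loses points either way: it misses real conflicts or flags pairs that are fine.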

Then there are intent gaps. When someone searches "something light," they mean salads, soups, smoothies. A model trained on Wikipedia has little concept of meal weight. It has rarely, if ever, seen "light" used to describe caloric density. So the query returns whatever happens to co-occur with "light" in web text.

What food specialization changes

A food-specialized model knows that "soya chaap" is a vegetarian protein, not a sauce. That "cap" in a menu context means cappuccino, not a hat. That "Kadhai Chicken" and "Karahi Murgh" are the same dish: "kadhai" and "karahi" are two romanizations of the same word, and "murgh" is chicken.

These ambiguities are the majority of real menu queries. A food delivery platform with 500 items across 30 stores will hit dozens of them on every search.

The numbers

We built FoodEval to measure this precisely. Twelve tasks, ten models, public leaderboard. The top general-purpose model (OpenAI text-embedding-3-large at 3072 dimensions) scores 0.577 task-weighted. A food-domain model at 384 dimensions scores 0.718. Smaller vectors, better results, because the model understands the domain.
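The size difference also matters in practice. The arithmetic below is a rough sketch that assumes each dimension is stored as a 32-bit float and uses a hypothetical catalog of one million vectors; real indexes add overhead and may use quantization.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed 32-bit floats

def index_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value

NUM_VECTORS = 1_000_000  # hypothetical catalog size

large = index_bytes(NUM_VECTORS, 3072)  # general-purpose model dimensions
small = index_bytes(NUM_VECTORS, 384)   # food-domain model dimensions
ratio = large / small

print(f"3072-dim: {large / 1e9:.2f} GB")
print(f"384-dim:  {small / 1e9:.2f} GB")
print(f"ratio:    {ratio:.0f}x")
```

Under these assumptions the 384-dimension vectors take one eighth of the raw storage, and similarity computations touch one eighth as many numbers per comparison.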

The benchmark is open source. Run it on any model. The leaderboard accepts community submissions.

Try it yourself

The Latimal Playground lets you run real queries against real menus. Search for "something spicy and filling" or "healthy breakfast" and see what comes back. If the results make sense, you can try the API with a 14-day free trial.
