The Hidden Cost of Duplicate Menu Listings
A single dish can appear under five different names across POS systems, aggregators, and languages. The business cost is bigger than most platforms realize.
I spent a weekend last year manually deduplicating a catalog of 12,000 menu items from restaurants in Bangalore. By 2 AM I'd cleared about 800. The spreadsheet had five columns open, I was Googling whether "Murg Makhni" and "Murgh Makhani" were the same thing (they are), and I still had 11,200 rows to go.
Here's what those rows looked like:
- Butter Chicken
- Murgh Makhani
- बटर चिकन
- Murg Makhni
- **BUY1GET1** Butter Chkn [Med]
Five entries. Five SKUs. Five rows in the analytics dashboard. One dish. Every food delivery platform that aggregates menus across restaurants has some version of this problem.
Five ways the same dish becomes five listings
Menu data enters the system from dozens of independent sources, each with its own conventions. A restaurant owner types "Chole Bhature" into their POS. The aggregator's onboarding team transcribes it as "Chana Bhatura." Another branch uploads a menu image where OCR reads "Choley Bhatoore."
| Source | Example |
|---|---|
| POS variation | Same chain, two POS systems, two name formats for every item |
| Transliteration and script | Murgh Makhani / Murg Makhni, Butter Chicken / बटर चिकन |
| Promotional noise | "BESTSELLER Margherita Pizza [Reg, Thin Crust]" |
| Regional naming | Parotta (Chennai) = Lachha Paratha (Delhi), Pani Puri = Golgappa = Puchka |
| Abbreviations | Chkn, Pnr, Spl, Veg Thali = V.T. |
A 45-item menu that shows 120 listings
Consider a real scenario. A mid-tier restaurant serves 45 dishes. After aggregation across two POS exports, one OCR scan, and three months of the owner editing names for promotions, the catalog shows 120 listings. The customer sees a wall of near-identical items and bounces.
Now multiply across the platform. If a typical restaurant has around 40 real dishes, an aggregator with 50,000 restaurants and an average 2.5x duplication rate is carrying roughly 3 million phantom listings. Search results are polluted. "Butter Chicken" may be the top-ordered dish in Bangalore, but if its orders are split across four variant names, none of them individually cracks the top 10. The trend report says Biryani is number one. It might be wrong.
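The platform-level figure is quick arithmetic. The sketch below assumes about 40 distinct dishes per restaurant, an illustrative figure rather than a measured one, and the function name `phantom_listings` is made up for this example.

```python
def phantom_listings(restaurants: int, dishes_per_restaurant: int, duplication_rate: float) -> int:
    """Listings beyond the true dish count, where total listings = rate * real dishes."""
    real = restaurants * dishes_per_restaurant
    total = real * duplication_rate
    return round(total - real)

# Single-restaurant example from above: 45 dishes showing up as 120 listings.
print(f"duplication rate: {120 / 45:.2f}x")

# Platform-wide estimate; 40 dishes per restaurant is an assumption.
print(phantom_listings(50_000, 40, 2.5))
```

Change the dishes-per-restaurant assumption and the phantom count scales linearly with it, so the estimate is only as good as that input.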
Price comparison, the feature that makes multi-restaurant platforms valuable, breaks entirely. You can't compare Chicken Biryani across 200 restaurants if the catalog doesn't know that "Chkn Biryani (Dum Style)" and "चिकन बिरयानी" are the same thing.
Edit distance gets you about 30% of the way
The first instinct is always Levenshtein distance. It catches typos. "Margherita" vs "Margarita" is only two edits apart (drop the "h", swap the "e" for an "a"). Fine.
But "Murgh Makhani" and "Butter Chicken"? The edit distance is close to the full length of the strings, and the handful of letters they share is coincidence. Completely different strings that mean the same dish. No amount of fuzzy matching bridges that gap, because the relationship lives in meaning, not spelling.
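To see the gap for yourself, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance, plus a length-normalized similarity. The `similarity` helper and its lowercasing step are choices made for this example, not part of any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1 so names of different lengths are comparable."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a.lower(), b.lower()) / longest

pairs = [
    ("Margherita", "Margarita"),
    ("Murgh Makhani", "Murg Makhni"),
    ("Murgh Makhani", "Butter Chicken"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

The spelling variants score high and the synonym pair scores low, which is exactly the wrong answer for the pair that matters most.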
Regex mapping tables are the next attempt. "Murgh" = "Chicken," "Makhani" = "Butter." Works for the 50 dishes your catalog team knows. Then you add Japanese, then Thai, then Italian (is "Arrabbiata" the same as "Arrabiata"?), then Korean. The rule set grows into thousands. Maintenance becomes a full-time job. A creative restaurant owner names their biryani "The Royal Feast" and every rule breaks.
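A mapping table of this kind might look like the sketch below. The `TOKEN_MAP` entries and the `normalize` helper are hypothetical, written only to show where the approach works and where it stops working.

```python
import re

# Hypothetical hand-maintained synonym table; every entry is illustrative.
TOKEN_MAP = {
    "murgh": "chicken",
    "murg": "chicken",
    "makhani": "butter",
    "makhni": "butter",
    "chkn": "chicken",
    "pnr": "paneer",
}

def normalize(name: str) -> str:
    """Lowercase, keep Latin-letter tokens, map known synonyms, sort the tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    mapped = [TOKEN_MAP.get(token, token) for token in tokens]
    return " ".join(sorted(mapped))  # sorted so word order does not matter

# Works for the dishes the table knows about.
print(normalize("Murgh Makhani") == normalize("Butter Chicken"))

# A creative name has no rule, so it never matches the dish it describes.
print(normalize("The Royal Feast") == normalize("Chicken Biryani"))

# Non-Latin script falls through this tokenizer entirely.
print(repr(normalize("बटर चिकन")))
```

Every new cuisine, script, or marketing name means another batch of entries, which is where the maintenance cost comes from.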
What the API call looks like
The POST /dedup endpoint takes a list of menu items and returns clusters. Here's a real call with items we scraped from three restaurants on the same street in Koramangala:
```python
import requests

response = requests.post(
    "https://dish-embed.latimal.com/dedup",
    headers={"X-API-Key": "YOUR_KEY", "Content-Type": "application/json"},
    json={
        "items": [
            "Butter Chicken", "Murgh Makhani", "Murg Makhni",
            "Paneer Tikka", "Pnr Tikka", "Panir Tikka",
            "Chicken Biryani", "Chkn Biryani",
            "Manchurian", "Veg Manchurian Dry",
        ],
        "threshold": 0.85,
    },
)
```

Response:

```json
{
  "clusters": [
    {
      "canonical": "Butter Chicken",
      "items": ["Butter Chicken", "Murgh Makhani", "Murg Makhni"],
      "scores": [1.0, 0.94, 0.91]
    },
    {
      "canonical": "Paneer Tikka",
      "items": ["Paneer Tikka", "Pnr Tikka", "Panir Tikka"],
      "scores": [1.0, 0.93, 0.96]
    },
    {
      "canonical": "Chicken Biryani",
      "items": ["Chicken Biryani", "Chkn Biryani"],
      "scores": [1.0, 0.92]
    }
  ],
  "singletons": ["Manchurian", "Veg Manchurian Dry"]
}
```

Notice "Manchurian" and "Veg Manchurian Dry" stayed separate. Different dishes. The model doesn't merge them just because they share a word.
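Once the response is in hand, most pipelines want a flat lookup from raw name to canonical name. A minimal sketch, assuming the response has the `clusters` and `singletons` shape shown above; `canonical_lookup` is a helper name invented for this example, and `sample` is a trimmed copy of that response rather than a live call.

```python
def canonical_lookup(result: dict) -> dict:
    """Map every raw item name to the canonical name of its cluster."""
    lookup = {}
    for cluster in result.get("clusters", []):
        for item in cluster["items"]:
            lookup[item] = cluster["canonical"]
    # Singletons had no match, so they stand for themselves.
    for item in result.get("singletons", []):
        lookup[item] = item
    return lookup

# Trimmed copy of the response above (in real use: result = response.json()).
sample = {
    "clusters": [
        {
            "canonical": "Butter Chicken",
            "items": ["Butter Chicken", "Murgh Makhani", "Murg Makhni"],
            "scores": [1.0, 0.94, 0.91],
        },
        {
            "canonical": "Chicken Biryani",
            "items": ["Chicken Biryani", "Chkn Biryani"],
            "scores": [1.0, 0.92],
        },
    ],
    "singletons": ["Manchurian", "Veg Manchurian Dry"],
}

lookup = canonical_lookup(sample)
print(lookup["Murg Makhni"])         # Butter Chicken
print(lookup["Veg Manchurian Dry"])  # Veg Manchurian Dry
```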
The problems that keep catalog teams up at night
The first 90% of duplicates fall to a reasonable threshold. The last 10% are the ones where the dedup engine earns its keep.
Same name, different dish. "Manchurian" at a North Indian restaurant is a gravy-heavy Indo-Chinese dish. "Manchurian" at a Chinese restaurant might be something closer to the original Cantonese preparation. "Chicken Biryani Regular" and "Chicken Biryani Family" share a name but are different SKUs with different prices, and your dedup pipeline needs to know the difference. Modifier groups and combo meals make this worse: is "McChicken Meal" a duplicate of "McChicken"?
Threshold tradeoffs. At 0.92, you catch clean duplicates and almost nothing else. At 0.85, you catch transliterations and abbreviations but you'll also merge some pairs that shouldn't be merged. At scale, a 0.88 threshold across 100,000 items can produce hundreds of false positive merges. One bad merge visible to a customer (their usual order pointing to the wrong dish) is worse than leaving the duplicate in place.
The practical approach: auto-merge above 0.92, route 0.85 to 0.92 to a human review queue, and leave anything below 0.85 unmerged. The API returns scores for exactly this reason. Your pipeline decides where the cutoffs sit based on how much you trust automation vs. how big your review team is.
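That three-way split is a few lines of code on top of the scores. A minimal sketch, assuming clusters shaped like the response shown earlier; the bucket names, the `route` and `triage` helpers, and the choice to send a score of exactly 0.92 to review are all decisions made for this example.

```python
AUTO_MERGE_ABOVE = 0.92  # strictly above this: merge without review
REVIEW_FLOOR = 0.85      # from here up to 0.92 inclusive: human review

def route(score: float) -> str:
    """Decide what to do with one candidate duplicate based on its score."""
    if score > AUTO_MERGE_ABOVE:
        return "auto_merge"
    if score >= REVIEW_FLOOR:
        return "review"
    return "keep_separate"

def triage(cluster: dict) -> dict:
    """Split a cluster's non-canonical items into the three buckets."""
    buckets = {"auto_merge": [], "review": [], "keep_separate": []}
    for item, score in zip(cluster["items"], cluster["scores"]):
        if item == cluster["canonical"]:
            continue  # the canonical entry is what the others merge into
        buckets[route(score)].append(item)
    return buckets

cluster = {
    "canonical": "Butter Chicken",
    "items": ["Butter Chicken", "Murgh Makhani", "Murg Makhni"],
    "scores": [1.0, 0.94, 0.91],
}
print(triage(cluster))
```

Keeping the cutoffs as named constants makes it easy to tighten or loosen them as the review queue grows or shrinks.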
After you find the clusters, then what?
Finding duplicates is half the problem. The other half is deciding what to do with them. Three common patterns:
- Canonical name selection. Pick the most common or most "standard" name from the cluster and display that. "Butter Chicken" wins over "Murg Makhni" because more customers search for it.
- Listing merge. Collapse duplicate listings into one, aggregate ratings and order counts. The restaurant sees one item in their dashboard instead of three.
- POS feedback. Flag duplicates to the restaurant owner so they can clean up at the source. This is slow but prevents the duplicates from recurring on every catalog refresh.
Most teams use all three: auto-merge the high-confidence clusters, human-review the borderlines, and push corrections back to the restaurant for anything structural.
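The first two patterns can be sketched together. The example below assumes each listing carries a name, an order count, and an average rating; those field names are hypothetical, and using order volume as the proxy for "most common name" is one possible rule among several.

```python
def merge_cluster(listings: list) -> dict:
    """Collapse duplicate listings into one record.

    Canonical name: the variant with the most orders (an assumed proxy for
    the name customers recognize). Orders are summed; ratings are averaged,
    weighted by order count.
    """
    canonical = max(listings, key=lambda listing: listing["orders"])["name"]
    total_orders = sum(listing["orders"] for listing in listings)
    weighted = sum(listing["rating"] * listing["orders"] for listing in listings)
    rating = round(weighted / total_orders, 2) if total_orders else None
    aliases = sorted(l["name"] for l in listings if l["name"] != canonical)
    return {
        "name": canonical,
        "orders": total_orders,
        "rating": rating,
        "aliases": aliases,  # kept so search can still match the old names
    }

# Made-up numbers for illustration only.
duplicates = [
    {"name": "Butter Chicken", "orders": 300, "rating": 4.5},
    {"name": "Murgh Makhani", "orders": 100, "rating": 4.0},
    {"name": "Murg Makhni", "orders": 100, "rating": 4.0},
]
merged = merge_cluster(duplicates)
print(merged)
```

Keeping the aliases on the merged record also gives you the list of names to flag back to the restaurant for the POS feedback step.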
Clean data changes what you can build
Dedup is one piece of the menu intelligence stack, and it makes everything downstream possible. Once your catalog knows that "Butter Chicken," "Murgh Makhani," and "बटर चिकन" are the same dish, search recall improves without touching the search system. Cuisine classification gets more accurate because you classify once per unique dish instead of five noisy variants. Price comparison across restaurants becomes possible. Trend reports stop lying to you.
You can audit overall catalog health using the menu health report endpoint, which flags duplicates alongside other quality issues. The interactive playground lets you paste items and watch clusters form in real time, no API key needed.