Thursday, September 9, 2021

Basic Latin, Diacritical Marks & IMDB

This is a story of not placing trust in public libraries.

The IMDB website has an auto-complete input element. While its mechanism isn't documented anywhere, you can easily explore it with curl:

$ alias labels='json d | json -a l'
$ imdb=https://v2.sg.media-imdb.com/suggestion

$ curl -s $imdb/a/ameli.json | labels
Amélie
Amelia Warner (I)
Austin Amelio
Amelia Clarkson
Amelia Rose Blaire
Amelia Heinle
Amelia Bullmore
Amelia Eve

The endpoint understands acute accents & strokes:

$ curl -s $imdb/b/boże+ciało.json | labels
Corpus Christi
Corpus Christi
Olecia Obarianyk
Alecia Orsini Lebeda
Zwartboek: The Special
The Cult: Edie (Ciao Baby)
Anne-Marie: Ciao Adios
The C.I.A.: Oblivion

(Corpus Christi is the translation of Boże Ciało.)

The funny part starts when you try to enter the same string (boże ciało) in the input field on the IMDB website:

Where's the movie? Turns out, the actual query that a page makes looks like

https://v2.sg.media-imdb.com/suggestion/b/boe_ciao.json

boe_ciao? Apparently, it tried to convert the string to a basic latin set, replacing spaces with an undescore along the way. It's not terribly hard to spot a little problem here.

This is the actual function that does the convertion:

var ae = /[àÀáÁâÂãÃäÄåÅæÆçÇèÈéÉêÊëËìÍíÍîÎïÏðÐñÑòÒóÓôÔõÕöÖøØùÙúÚûÛüÜýÝÿþÞß]/
, oe = /[àÀáÁâÂãÃäÄåÅæÆ]/g
, ie = /[èÈéÉêÊëË]/g
, le = /[ìÍíÍîÎïÏ]/g
, se = /[òÒóÓôÔõÕöÖøØ]/g
, ce = /[ùÙúÚûÛüÜ]/g
, ue = /[ýÝÿ]/g
, de = /[çÇ]/g
, me = /[ðÐ]/g
, pe = /[ñÑ]/g
, fe = /[þÞ]/g
, be = /[ß]/g;

function ve(e) {
if (e) {
var t = e.toLowerCase();
return t.length > 20 && (t = t.substr(0, 20)),
t = t.replace(/^\s*/, "").replace(/[ ]+/g, "_"),
ae.test(t) && (t = t.replace(oe, "a").replace(ie, "e")
.replace(le, "i").replace(se, "o")
.replace(ce, "u").replace(ue, "y")
.replace(de, "c").replace(me, "d")
.replace(pe, "n").replace(fe, "t").replace(be, "ss")),
t = t.replace(/[\W]/g, "")
}
return ""
}

(It took me some pains to extract it from god-awful obfuscated mess that IMDB returns to browsers.)

It's not only the Polish folks whose alphabet gets mangled. The Turks are out of luck too:

ve('Ruşen Eşref Ünaydın')     // => ruen_eref_unaydn

I say the function above sometimes does its job rather wrong:

ve('ąśćńżółıźćę')             // => o

deburr() from lodash is available publicly since February 5, 2015 &, unlike the forlorn IMDB attempt, works fine:

deburr('Boże Ciało')          // => Boze Cialo
deburr('Ruşen Eşref Ünaydın') // => Rusen Esref Unaydin
deburr('ąśćńżółıźćę') // => ascnzolizce

Why not use it?