Thursday, September 9, 2021

Basic Latin, Diacritical Marks & IMDB

This is a story of not placing trust in public libraries.

The IMDB website has an auto-complete input element. While its mechanism isn't documented anywhere, you can easily explore it with curl:

$ alias labels='json d | json -a l'
$ imdb=https://v2.sg.media-imdb.com/suggestion

$ curl -s $imdb/a/ameli.json | labels
Amélie
Amelia Warner (I)
Austin Amelio
Amelia Clarkson
Amelia Rose Blaire
Amelia Heinle
Amelia Bullmore
Amelia Eve

The endpoint understands acute accents & strokes:

$ curl -s $imdb/b/boże+ciało.json | labels
Corpus Christi
Corpus Christi
Olecia Obarianyk
Alecia Orsini Lebeda
Zwartboek: The Special
The Cult: Edie (Ciao Baby)
Anne-Marie: Ciao Adios
The C.I.A.: Oblivion

(Corpus Christi is the translation of Boże Ciało.)

The funny part starts when you try to enter the same string (boże ciało) in the input field on the IMDB website:

Where's the movie? Turns out, the actual query that the page makes looks like

https://v2.sg.media-imdb.com/suggestion/b/boe_ciao.json

boe_ciao? Apparently, it tried to convert the string to the Basic Latin character set, replacing spaces with underscores along the way. It's not terribly hard to spot a little problem here.

This is the actual function that does the conversion:

var ae = /[àÀáÁâÂãÃäÄåÅæÆçÇèÈéÉêÊëËìÍíÍîÎïÏðÐñÑòÒóÓôÔõÕöÖøØùÙúÚûÛüÜýÝÿþÞß]/
, oe = /[àÀáÁâÂãÃäÄåÅæÆ]/g
, ie = /[èÈéÉêÊëË]/g
, le = /[ìÍíÍîÎïÏ]/g
, se = /[òÒóÓôÔõÕöÖøØ]/g
, ce = /[ùÙúÚûÛüÜ]/g
, ue = /[ýÝÿ]/g
, de = /[çÇ]/g
, me = /[ðÐ]/g
, pe = /[ñÑ]/g
, fe = /[þÞ]/g
, be = /[ß]/g;

function ve(e) {
    if (e) {
        var t = e.toLowerCase();
        return t.length > 20 && (t = t.substr(0, 20)),
            t = t.replace(/^\s*/, "").replace(/[ ]+/g, "_"),
            ae.test(t) && (t = t.replace(oe, "a").replace(ie, "e")
                .replace(le, "i").replace(se, "o")
                .replace(ce, "u").replace(ue, "y")
                .replace(de, "c").replace(me, "d")
                .replace(pe, "n").replace(fe, "t").replace(be, "ss")),
            t = t.replace(/[\W]/g, "")
    }
    return ""
}

(It took me some pains to extract it from the god-awful obfuscated mess that IMDB returns to browsers.)

It's not only the Polish folks whose alphabet gets mangled. The Turks are out of luck too:

ve('Ruşen Eşref Ünaydın')     // => ruen_eref_unaydn

Sometimes the function above does its job spectacularly wrong:

ve('ąśćńżółıźćę')             // => o

Its replacement tables only cover Latin-1 Supplement letters; everything from Latin Extended-A (which is where most Polish diacritics live) falls through to the final \W cleanup & simply vanishes.

deburr() from lodash has been publicly available since February 5, 2015 &, unlike the forlorn IMDB attempt, it works fine:

deburr('Boże Ciało')          // => Boze Cialo
deburr('Ruşen Eşref Ünaydın') // => Rusen Esref Unaydin
deburr('ąśćńżółıźćę')         // => ascnzolizce
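
For comparison, here's a minimal sketch of the whole sanitizer rebuilt on top of deburr, roughly mirroring the original's steps (my code, assuming lodash is available; not anything IMDB ships):

var deburr = require('lodash/deburr')  // assumes lodash is installed

function sanitize(query) {
    if (!query) return ''
    return deburr(query.toLowerCase())
        .substr(0, 20)
        .replace(/^\s*/, '')
        .replace(/[ ]+/g, '_')
        .replace(/[\W]/g, '')
}

sanitize('Boże Ciało')          // => "boze_cialo"
sanitize('Ruşen Eşref Ünaydın') // => "rusen_esref_unaydin"
sanitize('ąśćńżółıźćę')         // => "ascnzolizce"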

Why not use it?

Tuesday, May 25, 2021

Missing charsets in String to FontSet conversion

After upgrading to Fedora 34 I started to get a strange warning when running vintage X11 apps:

$ xclock
Warning: Missing charsets in String to FontSet conversion

With gv(1) it was much worse: multi-line errors, all related to misconfigured fonts. Some of the errors I was able to fix via

# dnf reinstall xorg-x11-fonts\*

Why exactly the rpm post-install scripts miscarried during the distro upgrade remains unknown. Still, the main warning about charsets persisted.

Most classic X11 apps (gv included) are written with the (now ancient) libXt library. By grepping through the libXt code, I found the function that emits the warning in question. It calls XCreateFontSet(3) & dutifully reports the error, but fails to describe which charsets weren't found for a particular font.

A simple patch to libXt:

--- libXt-1.2.0/src/Converters.c.save   2021-05-22 00:18:36.359273335 +0300
+++ libXt-1.2.0/src/Converters.c 2021-05-22 00:21:08.550340341 +0300
@@ -973,6 +973,10 @@
XtNmissingCharsetList,"cvtStringToFontSet",XtCXtToolkitError,
"Missing charsets in String to FontSet conversion",
NULL, NULL);
+ fprintf(stderr, "XFontSet fonts: %s\n", fromVal->addr);
+ for (int i = 0; i < missing_charset_count; i++) {
+ fprintf(stderr, " missing charset: %s\n", missing_charset_list[i]);
+ }
XFreeStringList(missing_charset_list);
}
if (f != NULL) {
@@ -1006,6 +1009,10 @@
XtCXtToolkitError,
"Missing charsets in String to FontSet conversion",
NULL, NULL);
+ fprintf(stderr, "XFontSet fonts: %s\n", value.addr);
+ for (int i = 0; i < missing_charset_count; i++) {
+ fprintf(stderr, " missing charset: %s\n", missing_charset_list[i]);
+ }
XFreeStringList(missing_charset_list);
}
if (f != NULL)
@@ -1030,6 +1036,10 @@
XtNmissingCharsetList,"cvtStringToFontSet",XtCXtToolkitError,
"Missing charsets in String to FontSet conversion",
NULL, NULL);
+ fprintf(stderr, "XFontSet fonts: %s\n", "-*-*-*-R-*-*-*-120-*-*-*-*,*");
+ for (int i = 0; i < missing_charset_count; i++) {
+ fprintf(stderr, " missing charset: %s\n", missing_charset_list[i]);
+ }
XFreeStringList(missing_charset_list);
}
if (f != NULL)

gave me some clue:

$ gv
Warning: Missing charsets in String to FontSet conversion
...
... missing charset: KSC5601.1987-0

What is KSC5601.1987-0? Looks Korean. Why did XCreateFontSet(3) suddenly stop finding it? I didn't uninstall any fonts during the distro upgrade.
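
One way to double-check is to ask the X server which core fonts advertise that charset (the charset registry & encoding are the last two fields of an XLFD name); an empty result means none do:

$ xlsfonts | grep -i 'ksc5601.1987-0'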

Turns out, the only bitmap font that provided the KSC5601.1987-0 charset, daewoo-misc, was removed from the xorg-x11-fonts-misc package due to licensing concerns. This is very rude.

This forced me to make a custom rpm package for the daewoo-misc fonts. The spec file is here. Note that I didn't bother to provide a fontconfig configuration (hence the installed font is invisible to Xft); all I cared about was silencing the annoying gv warning.

Saturday, January 30, 2021

Fixing “30 seconds of code”

In the past, the JS portion of 30 seconds of code was a single, big README in a github repo. You can still browse an old revision, of course. It was near perfect for a cursory inspection or a quick search.

In full conformance with the all-that's-bright-must-fade adage, the README was scrapped in favour of an alternative version that looks like this:

Why, why did they do that?

Thankfully, they put each code "snippet" into a separate .md file (there are 511 of them), which means we can concatenate them into 1 gargantuan file & create a TOC. I thought about the absolute minimum amount of code one would need for that & came up with this:

$ cat Makefile
$(if $(i),,$(error i= param is missing))
out := _out

$(out)/%.html: $(i)/%.md
	@mkdir -p $(dir $@)
	echo '<h2 id="$(title)">$(title)</h2>' > $@
	pandoc $< -t html --no-highlight >> $@

title = $(notdir $(basename $@))

$(out)/30-seconds-of-code.html: template.html $(patsubst $(i)/%.md, $(out)/%.html, $(sort $(wildcard $(i)/*.md)))
	cat $^ > $@
	echo '</main>' >> $@

.DELETE_ON_ERROR:

(i should be a path to a repo directory with .md files, e.g. make -j4 i=~/Downloads/30-seconds-of-code/snippets)

This converts each .md file to its .html counterpart & prepends template.html to the result:

What's in the template file?

  1. a TOC generator that runs once after DOM is ready;
  2. a handler for the <input> element that filters the TOC according to user's input;
  3. CSS for a 2-column layout.

There is nothing interesting about #3, hence I'm skipping it.

Items 1-2 could be accomplished using 3 trivial functions (look Ma, no React!):

$ sed -n '/script/,$p' template.html
<script>
document.addEventListener('DOMContentLoaded', main)

function main() {
    let list = mk_list()
    document.querySelector('#toc input').oninput = evt => {
        render(list, evt.target.value)
    }
    render(list)
}

function render(list, filter) {
    document.querySelector('#toc__list').innerHTML = list(filter).map( v => {
        return `<li><a href="#${v}">${v}</a></li>`
    }).join`\n`
}

function mk_list() {
    let h2s = [...document.querySelectorAll('h2')].map( v => v.innerText)
    return query => {
        return query ? h2s.filter( v => v.toLowerCase().indexOf(query.toLowerCase()) !== -1) : h2s
    }
}
</script>

<nav id="toc"><div><input type="search"><ul id="toc__list"></ul></div></nav>
<main id="doc">

This is all fine & dandy, but 30 seconds of code has many more interesting repos, like snippets of css or reactjs code. They share the same lamentable fate as the js one: once in a single readme, they have lately converged on a single, badly-searchable website that displays 1 recipe per user's query.

The difference between the css/react snippets & the plain js ones is the necessity of a preview: if you see a tasty recipe for a “Donut spinner”, you'd like to see how the donut spins before copying the example into your editor.

In such cases, people often resort to pasting code into one of the “online IDEs” & embedding the result into their tutorial. CodePen, for example, has an even more convenient feature: you create a form (submitted with a POST request) that holds a field with a json-formatted string containing the html/css/js assets. That way you can easily make a “check this out on codepen” button. The downside is that the user leaves your page to play with the code.
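
A rough sketch of such a button (my code, not something from the 30 seconds of code site; the endpoint & the data field are per CodePen's prefill documentation):

function codepen_button(snippet) {
    // snippet is a hypothetical {html, css, js} object for one recipe
    let form = document.createElement('form')
    form.action = 'https://codepen.io/pen/define'
    form.method = 'POST'
    form.target = '_blank'

    let data = document.createElement('input')
    data.type = 'hidden'
    data.name = 'data'
    data.value = JSON.stringify(snippet)

    let btn = document.createElement('input')
    btn.type = 'submit'
    btn.value = 'check this out on codepen'

    form.append(data, btn)
    return form
}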

Another way to show previews alongside the docs is to create an iframe & inject all the assets from a snippet into it: with this implementation you don't rely on 3rd parties & the docs stay fully usable off-line (nobody actually needs that, but it sounds like a useful option to have).
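
A minimal sketch of that, again assuming each snippet has been reduced to an {html, css, js} object first:

function preview(snippet, parent) {
    // an isolated sandbox for the snippet: its css & js can't leak into the docs page
    let iframe = document.createElement('iframe')
    iframe.srcdoc = `<style>${snippet.css || ''}</style>
${snippet.html || ''}
<script>${snippet.js || ''}<\/script>`
    parent.appendChild(iframe)
}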

This requires greatly expanding the examples above: either we need 3 separate templates (one for js snippets, another for css recipes & a disheartening one for reactjs chunks), or we force a single template to act differently depending on the payload content.

For the latter approach, see this repo.

Wednesday, January 6, 2021

Twitter stats using gnuplot, json & make

Twitter allows you to download a subset of your activities as a zip archive. Unfortunately, there are no useful visualisations of the provided data, except for a simple list of tweets with date filtering.

For example, here's what I expected to find but saw no sign of:

  1. a graph of activities over time;
  2. a list of:
    1. the most popular tweets;
    2. users to whom I reply the most.

Inside the archive there is a data/tweet.js file that contains an array (assigned to a global variable) of "tweet" objects:

window.YTD.tweet.part0 = [ {
  "tweet" : {
    "retweeted" : false,
    "source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
    "favorite_count" : "2",
    "id" : "12345",
    "created_at" : "Sat Jun 23 16:52:42 +0000 2012",
    "full_text" : "hello",
    "lang" : "en",
    ...
  }
}, ...]

The array is already json-formatted, hence it's trivial to convert the file to proper json for filtering with the json(1) tool.

Say we want a list of the top 5 languages in which tweets were written. A small makefile:

$ cat lang.mk
lang: tweets.json
	json -a tweet.lang < $< | $(aggregate) | $(sort)
tweets.json: $(i)
	unzip -qc $< data/tweet.js | sed 1d | cat <(echo [{) - > $@

aggregate = awk '{r[$$0] += 1} END {for (k in r) print k, r[k]}'
sort = sort -k2 -n | column -t
SHELL := bash -o pipefail

yields:

$ make -f lang.mk i=1.zip | tail -5
cs 16
und 286
ru 333
en 460
uk 1075

(1.zip is the archive that Twitter permits us to download.)

To draw activity bars, the same technique is applied: we extract a date from each tweet object & aggregate the results by day:

2020-12-31 5
2021-01-03 10
2021-01-04 5

This can be fed to gnuplot:

$ make -f plot.mk i=1.zip activity.svg

This makefile has an embedded gnuplot script:

$ cat plot.mk
include lang.mk

%.svg: dates.txt
	cat <(echo "$$plotscript") $< | gnuplot - > $@

dates.txt: tweets.json
	json -e 'd = new Date(this.tweet.created_at); p = s => ("0"+s).slice(-2); this.tweet.date = [d.getFullYear(), p(d.getMonth()+1), p(d.getDate())].join`-`' -a tweet.date < $< | $(aggregate) > $@

export define plotscript =
set term svg background "white"
set grid

set xdata time
set timefmt "%Y-%m-%d"
set format x "%Y-%m"

set xtics rotate by 60 right

set style fill solid
set boxwidth 1

plot "-" using 1:2 with boxes title ""
endef

Listing the users to whom one replies the most is quite simple:

$ cat users.mk
users: tweets.json
	json -e 'this.users = this.tweet.entities.user_mentions.map( v => v.screen_name).join`\n`' -a users < $< | $(aggregate) | $(sort)

include lang.mk

I'm not much of a tweeter:

$ make -f users.mk i=1.zip | tail -5
<redacted> 41
<redacted> 49
<redacted> 60
<redacted> 210
<redacted> 656

Printing the most popular tweets is more cumbersome. We need to:

  1. calculate the rating of each tweet (by such a complex formula as favorite_count + retweet_count);
  2. sort all the tweet objects;
  3. slice N tweet objects.

A Make recipe for it is a little too long to show here, but you can grab a makefile that contains the recipe + all the recipes shown above.
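
The gist of it fits in a few lines of plain node, though; a rough sketch (not the recipe from that makefile), assuming tweets.json produced by the rule in lang.mk above:

// print the 5 highest-rated tweets, one per line
const tweets = require('./tweets.json')

tweets
    .map(v => v.tweet)
    .map(t => ({
        rating: +t.favorite_count + +t.retweet_count,  // the "complex formula"
        text: t.full_text.replace(/\s+/g, ' ')
    }))
    .sort((a, b) => b.rating - a.rating)
    .slice(0, 5)
    .forEach(t => console.log(t.rating, t.text))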