Saturday, December 23, 2017

Sometimes being stubborn does not pay off

From the GNU coreutils TODO file:

sort: Investigate better sorting algorithms; see Knuth vol. 3.

We tried list merge sort, but it was about 50% slower than the recursive algorithm currently used by sortlines, and it used more comparisons. We're not sure why this was, as the theory suggests it should do fewer comparisons, so perhaps this should be revisited.

List merge sort was implemented in the style of Knuth algorithm 5.2.4L, with the optimization suggested by exercise 5.2.4-22. The test case was 140,213,394 bytes, 426,4424 lines, text taken from the GCC 3.3 distribution, sort.c compiled with GCC 2.95.4 and running on Debian 3.0r1 GNU/Linux, 2.4GHz Pentium 4, single pass with no temporary files and plenty of RAM.

Since comparisons seem to be the bottleneck, perhaps the best algorithm to try next should be merge insertion.

$ git log -S 'sort: Investigate better sorting algorithms' --source --all
commit 93f9ffc6141ad028223a33e5ab0614aebbd4aa2e refs/tags/CPPI-1_11
Author: Jim Meyering 
Date:   Sat Aug 2 06:27:13 2003 +0000

    Document in TODO Paul's desire to make sort faster (and how he
    was foiled this time around).
    from Paul Eggert.

Tuesday, September 12, 2017

'indent-region' in the Emacs batch mode

I had an old project of mine that was using an archaic tab-based indentation w/ the assumed tab width of 4. The mere opening of its files in editors other than Emacs was causing such pain that I decided to do what you should rarely do--reindent the whole project.

(A side note: if you're setting the tab width to a value other than 8 & simultaneously is using tabs for indentation, you're a bad person.)

The project had ~40 files. Manually using Emacs indent-region for each file would have been too wearisome. Then I remembered that Emacs has the batch mode.

A quick googling gave me a recipe akin to:

$ emacs -Q -batch FILE --eval '(indent-region (point-min) (point-max))' \
  -f save-buffer

It would have worked but it had several drawbacks as a general solution, for it:

  1. modifies a file in-place;
  2. can't read from the stdin;
  3. doesn't work w/ multiple files forcing a user to use xargs(1) of smthg;
  4. if FILE doesn't exist, Emacs quietly creates an empty file, whereas a user probably expects to see an error message.

Whereas an ideal little script should:

  1. never overwrite the source file;
  2. read the input from the stdin or from its arguments;
  3. spit out the result to the stdout;
  4. handle the missing input reasonably well.

Everything else should be accommodated by the shell, including the item #2 from the 'drawbacks' list above.

Using Emacs for such a task is tempting, for it supports a big number of programming modes, giving us the ability to employ the editor as a universal reindenting tool, for example, in makefiles or Git hooks.

Writing to the stdout

We can slightly modify the above recipe to:

$ emacs -Q -batch FILE --eval '(indent-region (point-min) (point-max))' \
  --eval '(princ (buffer-string))'

In -f stead we are using --eval twice. buffer-string function does exactly what it says: returns the contents of the current buffer as a string.

$ cat 1.html
hey, <i>
what's up</i>?

$ emacs -Q -batch 1.html --eval '(indent-region (point-min) (point-max))' \
  --eval '(princ (buffer-string))'
Indenting region...
Indenting region...done
  hey, <i>
    what's up</i>?

The "Indenting region" lines come from Emacs message function (which the progress reporter uses). In the batch mode message prints the lines to the stderr.

The solution also address the item #3 from the "drawbacks" list--Emacs doesn't create an empty file on the invalid input, although it doesn't indicate the error properly, i.e., it fails w/ the exit code 0.

Processing all files at once

If you know what are you doing & the files you're going to process are under Git, overwriting the source it not a big deal. Perhaps for a quick hack this script will do:

:; exec emacs -Q --script "$0" -- "$@" # -*- emacs-lisp -*-

(defun indent(file)
  (set-buffer "*scratch*")
  (if (not (file-directory-p file))
      (when (and (file-exists-p file) (file-writable-p file))
        (message "Indenting `%s`" file)

        (find-file file)
        (indent-region (point-min) (point-max))

(setq args (cdr argv))                  ; rm --
(dolist (val args)
  (indent val))

If you put the src above into a file named emacs-indent-overwrite & add executable bits to it, then the shell thinks it's a usual sh-script that doesn't have a shebang line. A colon in sh is a noop, but on stumbling upon exec the shell replaces itself with the command

emacs -Q --script emacs-indent-overwrite -- arg1 arg2 ...

When Emacs reads the script, it doesn't expect it to be a sh one, but hopefully the file masks itself as a true Elisp, for in Elisp a symbol whose name starts with a colon is called a keyword symbol. : is a symbol w/o a name (a more usual constant would be :omg) that passes the check because it satisfies keywordp predicate:

ELISP> (keywordp :omg)
ELISP> (keywordp :)

Everything else in the file is usual Elisp code. Here's how it works:

$ emacs-indent-overwrite src/*
Indenting ‘src/1.html‘
Indenting region...
Indenting region...done
Indenting ‘src/2.html‘
Indenting region...
Indenting region...done

The only thing worth mentioning about the script is why indent procedure has (set-buffer "*scratch*") call. It's an easy way to switch the current directory to the directory from which the script was started. This is required, for find-file modifies the default directory of the current buffer (via modifying a buffer-local default-directory variable). The other way is to modify args w/ smthg like

(setq args (mapcar 'expand-file-name args))

& just pass the file argument as a full path.

Reading from the stdin

There is next to none info about the stdin in the Emacs Elisp manual. The section about minibuffers hints us that for reading from the standard input in the batch mode we ought to seek out for functions that normally read kbd input from the minibuffer.

One of such functions is read-from-minibuffer.

$ cat read-from-minibuffer
:; exec emacs -Q --script "$0"
(message "-%s-" (read-from-minibuffer ""))

$ printf hello | ./read-from-minibuffer
$ printf 'hello\nworld\n' | ./read-from-minibuffer

The function only read the input up to the 1st newline which it also impertinently ate. This leaves us w/ a dilemma: if we read multiple lines in a loop, should we unequivocally assume that the last line contained the newline?

read-from-minibuffer has another peculiarity: on receiving the EOF character it raises an unhelpful error:

$ ./read-from-minibuffer
Error reading from stdin
$ echo $?

The simplest Elisp cat program must watch out for that:

$ ./cat.el < cat.el
:; exec emacs -Q --script "$0"
(while (setq line (ignore-errors (read-from-minibuffer "")))
  (princ (concat line "\n")))

Next, if we read the stdin & print the result to the stdout, our "ideal" reindent script cannot rely on find-file anymore, unless we save the input in a tmp file. Or we can leave out "heavy" find-file altogether & just create a temp buffer & copy the input there for the further processing. The latter implies we must manually set the proper major mode for the buffer, otherwise indent-region won't do anything good.

set-auto-mode function does the major mode auto-detection. One of the 1st hints it looks for is the file extension, but the stdin has none. We can ask the user to provide one in the arguments of the script.

:; exec emacs -Q --script "$0" -- "$@" # -*- emacs-lisp -*-

(defun indent(mode text)
    (set-visited-file-name mode)
    (insert text)
    (set-auto-mode t)
    (message "`%s` major mode: %s" mode major-mode)

    (indent-region (point-min) (point-max))

(defun read-file(file)
    (insert-file-contents file)

(defun read-stdin()
  (let (line lines)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (push line lines))
    (push "" lines)
    (mapconcat 'identity (reverse lines) "\n")


(setq args (cdr argv))                  ; rm --

(setq mode (car args))
(if (equal "-" mode)
      (setq mode (nth 1 args))
      (if (not mode) (error "No filename hint argument, like .js"))
      (setq text (read-stdin)))
  (setq text (read-file mode)))

(princ (indent mode text))

If the 1st argument to the script is '-', we look for the hint in the next argument & then start reading the stdin. It the 1st arg isn't '-' we assume it's a file that we insert into a temp buffer & return its contents as a string.


(mapconcat 'identity (reverse lines) "\n")

line in read-stdin procedure is an equivalent of

['world', 'hello'].reverse().join("\n")

in JavaScript.

Some examples:

$ printf "<p>\nhey, what's up?\n</p>" | ./emacs-indent - .html
‘.html‘ major mode: html-mode
Indenting region...
Indenting region...done
  hey, what's up?

$ printf "<p>\nhey, what's up?\n</p>" | ./emacs-indent - .txt
‘.txt‘ major mode: text-mode
Indenting region...
Indenting region...done
hey, what's up?

$ ./emacs-indent src/2.html 2>/dev/null
  not much

Saturday, June 17, 2017

Texinfo is back

I've revived the old project of converting the Nodejs API docs into the Texinfo format.

The orig docs are in markdown, but node folks have made sure that no plain markdown parser would ever suffice for the conversion of the docs into anything other than html.

Each function's description starts w/ a heading w/ the name of the function + its args (which is handy, for we can use the label as an index entry) but after the heading there is an html comment which in reality is a YAML block w/ a metadata.

There is no uniformity in the way of writing tables: in some sections there are github-style tables, the other contain html <table> blocks which are a royal pain to dissect in case the table contains a tiny markup error that is invisible to a visual inspection (for browsers are known to be quite liberal at consuming sloppy <table>s) but wholly breaks the assumptions of your parser.

Sometimes there are additional anchors before the headings, like

<a id="FOO"></a>
### foo(a, b, c, [bar], [baz])

that supposedly serve as a quick way of linking to the subsection, i.e., [read this](#FOO) instead of [read this](#module_foo_a_b_c_bar_baz) but, again, forces us to apply additional logic during the header parsing.

The Texinfo itself can be rather irritating: it doesn't allow to reliably use stops or colons within a header text. texi2any bitterly complains about the issue if the output format is info but prints nothing worth mentioning when the output is html.

Yet, the ability to use indices in Emacs outweighs the disadvantages of maintaining a custom markdown->texi util.

Sunday, March 26, 2017

How to prevent Wine from auto-running .exe files in Fedora

After installing Wine on Fedora 25 (for a reason that isn't terribly important here) I've noticed that I can run Window executables straight from the bash command line:

$ file -b ~/.wine/drive_c/Program\ Files/Internet\ Explorer/iexplore.exe
PE32+ executable (GUI) x86-64, for MS Windows
$ !$

That iexplore.exe file is clearly not a ELF, yet the kernel successfully executes it.

Since Windows ecosystems is known for its total absence of malware & ransomware; & Wine, in turn, is known for a military grade, bulletproof sandbox (i.e., it provides no protection whatsoever), I find the notion of auto-running very difficult to reconcile w/ common sence.

How does it work

If a kernel was compiled w/ CONFIG_BINFMT_MISC option, execve(2) (I'm simplifying here a little) obtains an ability to delegate the execution of unknown binaries to external processes. The subsystem that manages the format → interpreter association table is called binfmt_misc.

binfmt_misc maintains its ephemeral database in the procfs. At the boot time you mount /proc/sys/fs/binfmt_misc directory & feed /proc/sys/fs/binfmt_misc/register file w/ specially crafted text lines to create association entries (or rules as systemd docs like to call them).

E.g., in the good old days before systemd, we would have put

echo :win:M:0:MZ::/usr/bin/wine: > /proc/sys/fs/binfmt_misc/register

somewhere in /etc/rc.local.

Such a command creates /proc/sys/fs/binfmt_misc/win file:

interpreter /usr/bin/wine
offset 0
magic 4d5a

where the offset & the magic 4d5a (MZ) means the 1st 2 bytes of a typical Windows executable:

$ hexdump -C -n10 iexplore.exe
00000000  4d 5a 40 00 01 00 00 00  06 00                    |MZ@.......|

We can be even more fancier & make an extension → interpreter association, e.g.:

# echo :xv:E::gif::/usr/bin/xv: > /proc/sys/fs/binfmt_misc/register
$ chmod +x slave-ship-daily-schedules.gif
$ ./!$

creates .gif → xv entry, & starts /usr/bin/xv slave-ship-daily-schedules.gif under the hood when we try to "run" the .gif image.

To delete all the entries, we write -1 to /proc/sys/fs/binfmt_misc/status file or if we want to delete only a particular entry:

# echo -1 > /proc/sys/fs/binfmt_misc/xv

The systemd way

systemd doesn't allow us to mingle w/ binfmt_misc subsystem directly. We ought to write the text lines in the same format binfmt_misc undestands but put them in special .conf files, where a separate program systemd-binfmt expects to find them.

If we provide an rpm for our app, we should put my-app.conf file in /usr/lib/binfmt.d/ directory during the installation & run

/usr/lib/systemd/systemd-binfmt my-app.conf

in the post-install hook.

The algo systemd-binfmt uses is fairly straightforward. When run w/o any command line args, it deletes all the existing entries & recreates the ones specified in the .conf files. If provided w/ the name of the .conf file (w/o a directory prefix!), it scans the file to add new entries.

An excerpt from src/binfmt/binfmt.c (commit 1539a65):

if (argc > optind) {
  int i;

  for (i = optind; i < argc; i++) {
    k = apply_file(argv[i], false);
    if (k < 0 && r == 0)
      r = k;
} else {
  _cleanup_strv_free_ char **files = NULL;
  char **f;

  r = conf_files_list_nulstr(&files, ".conf", NULL, conf_file_dirs);
  if (r < 0) {
    log_error_errno(r, "Failed to enumerate binfmt.d files: %m");
    goto finish;

  /* Flush out all rules */
  write_string_file("/proc/sys/fs/binfmt_misc/status", "-1", 0);

  STRV_FOREACH(f, files) {
    k = apply_file(*f, true);
    if (k < 0 && r == 0)
      r = k;

It sounds fine, until we remember that a typical user (a) doesn't know much about binfmt-misc kernel module, (b) doesn't know anything about the standalone systemd-binfmt program, for he uses systemd-binfmt.service unit, as in

# systemctl restart systemd-binfmt

That unit (/usr/lib/systemd/system/systemd-binfmt.service) incorporates a handful of preconditions. The relevant ones:


When we install Wine, it makes 2 entries in the systemd-binfmt DB. If we (a) remove the offending package wine-systemd that contains the evil /usr/lib/binfmt.d/wine.conf file & (b) dutifully restart systemd-binfmt service--no binfmt-misc association entries gets removed!

# rpm -qf /lib/binfmt.d/wine.conf
# rpm --nodeps -e wine-systemd
# systemctl restart systemd-binfmt
# ls -l /proc/sys/fs/binfmt_misc/
total 0K
--w------- 1 root root 0 Mar 24 20:49 register
-rw-r--r-- 1 root root 0 Mar 25 20:48 status
-rw-r--r-- 1 root root 0 Mar 25 20:58 windows
-rw-r--r-- 1 root root 0 Mar 25 20:58 windowsPE

because systemd will forbid systemd-binfmt to run at all, because all the conf directories are empty. The remedy is to run systemd-binfmt by hand:

# /usr/lib/systemd/systemd-binfmt
# ls -l /proc/sys/fs/binfmt_misc/
total 0K
--w------- 1 root root 0 Mar 24 20:49 register
-rw-r--r-- 1 root root 0 Mar 25 20:48 status

You should not manually delete the rpm w/ the evil .conf file for dnf will reinstall wine-systemd package during the next Wine update. The recommended systemd-style solution is to create a symlink in /etc/binfmt.d/ to /dev/null named wine.conf & then restart systemd-binfmt service:

# ln -s /dev/null /etc/binfmt.d/wine.conf
# systemctl restart systemd-binfmt

Saturday, March 4, 2017

Bloatware comes when nobody's lookin'

If you write a Chrome extension for distribution outside of Google Webstore, you will inevitably need to pack the extension into a .crx file. It can be done either using the Chrome UI (the "Pack extension..." button) or "manually" via creating a .zip file & prepending a specially crafted header to it.

Obviously, the 2nd variant can be easily automated. If you have several extensions to maintain, a part of your build system that's responsible for the .crx generation could be extracted to some kind of a "plugin" that can be shared between all the extensions you maintain. In my case, the "plugin" consists of 2 files: a small sh script & a tiny makefile (that can be safely included into another makefile).

I was thinking about uploading those 2 files to npm (& hesitating a little for if a package wouldn't contain any JS code soever, should it be on npm?) but then decided to check the registry first.

The npm registry indeed contains multiple packages for dealing w/ .crx files. One of the most popular is called "crx" (at the time of writing it had 192 ★ on Github). Glancing through its code I failed not to notice the unfortunate difference between the classic Unix approach to such a problem & the JavaScript one. Perhaps, the latter is everything you may do ironically, unless you're a grand, hardcore troll.

Before I explain myself further, let's digress for a bit to explain what .crx files are & how to create them.

CRX Package Format

In the ideal world, Alice would package her extension (a set of .js & .json files) into a .zip file, would upload the file to her page & would tell her dear friend Bob about it.

In reality, the extension must be protected from tampering. If Eve, the evil sysadmin of zone, will entertain an idea of mingling w/ Alice's extension via adding a hot Bitcoin miner to it, Bob should be able to detect that before the extension gets intalled.

One way of doing it is to generate a pair of public & private keys. You sign the extension w/ your private key & users check the downloaded .zip w/ you public key. The trouble is there is no standard way to "sign" a .zip file for neither its metadata supports such a thing as an embedded signature, not any of "archive managers" would know what to do w/ such an upgraded metadata.

This is where .crx files come in: they contain a copy of an RSA public key + an encrypted SHA1 of a .zip archive in question. This metadata simply gets prepended to the original .zip file.

E.g., Alice, after testing her Chrome extension, zips it to a file:

$ touch manifest.json
$ zip !$

Next, she generates a private 1024-bit RSA key:

$ openssl genrsa 1024 > private.pem

Then prepends the aforementioned block to As it's a multiple step process, she writes it down in zip2crx script (this is a modified version of the script from


[ $# -ne 2 ] && {
    echo "Usage: `basename $0` private_key" 1>&2
    exit 1


trap 'rm -f "$pub" "$sig"' EXIT

# signature
openssl sha1 -sha1 -binary -sign "$key" < "$zip" > "$sig"

# public key
openssl rsa -pubout -outform DER < "$key" > "$pub" 2>/dev/null

# Take "abcdefgh" and return it as "ghefcdab"
byte_swap() {
    echo "${1:6:2}${1:4:2}${1:2:2}${1:0:2}"

crmagic_hex="4372 3234" # Cr24
version_hex="0200 0000" # 2
pub_len_hex=$(byte_swap $(printf '%08x\n' $(stat -c %s "$pub")))
sig_len_hex=$(byte_swap $(printf '%08x\n' $(stat -c %s "$sig")))

    echo "$crmagic_hex $version_hex $pub_len_hex $sig_len_hex" | xxd -r -p
    cat "$pub" "$sig" "$zip"
) > "$crx"

Her script signs the .zip, automatically derives a public key from private.pem, combines together all the necessary data for a .crx file consumer (like the length of the private key, &c) & concatenates the obtained header w/ the .zip

$ ./zip2crx private.pem
$ file foo.crx
foo.crx: Google Chrome extension, version 2

The resulting .crx is ready to be used in Chrome. She uploads it to her web page & sends to Bob (via email) her public key (possibly encrypted w/ his GPG public key, but we won't get into that).

Bob, having obtained Alice's public key, downloads foo.crx, extracts from it the embedded public key, the signature & & checks (w/ Alice's public key) the validity of

(The whole verification process is a little too convoluted to present it here, but if you're interested, download this script & run it against the provided key:

$ ./crx2zip foo.crx
RSA key                 1024-bit
Total header size:      306 bytes
Public key:   
Signature status:       Verified OK

where would be a public key in the DER-format.)

If Eve, the evil sysadmin, did indeed modify foo.crx in any way, Bob is able to detect that, for despite that Eve can do all the same operations as Bob did, she doesn't have Alice's private key & thus is unable to re-sign the tampered properly in an undetectable way. All she can hope for is that Bob, being a lazybones, won't bother to do the necessary checks before installing foo.crx into his browser.

The example is somewhat contrived, for if Alice decides to upload to Google Webstore instead, by this virtue she exempts herself from managing the crypto keys. The Webstore makes the key pair for her, does all the checks of all the sub-sequential updates of & generates the correct .crx. The final extension is being delivered to Bob via HTTPS, thus leaving Eve, the evil sysadmin, out of luck.

The JavaScript way

Back to our findings from the npm registry.

crx package has dual nature: it's a library & a CLI util. The distinguishing feature of this program is that "It is written purely in JavaScript and does not require OpenSSL!".

The author states it as if the mere act of depending on OpenSSL code is somehow inconvenient or morally repugnant. I must say he's not alone in his view. I may sympatise for the overly cautious approach in the case of maintaining a big farm of cloud services like Amazon does, but I cannot reconcile w/ this stance in the case of a local developer machine.

A user of a .crx file doesn't have to interfere w/ OpenSSL, it's solely the developer's job to create the proper .crx. To find a developer machine that doesn't have OpenSSL intalled already is a challenge that only a few of us can achieve. (Certainly, we may be so bold; we may think of some poor souls who have to use old versions of Windows but even they can always download Cygwin.)

Then there is the size. Our zip2crx example is 37 lines long. Anybody can read it (even indolent Bob) & in a minute understand what the script is doing. crx package on the other hand

$ find node_modules/crx/{bin,src} -type f | xargs wc -l
  150 node_modules/crx/bin/crx.js
  294 node_modules/crx/src/crx.js
   47 node_modules/crx/src/resolver.js
  491 total

is ~13 times bigger.

The package also provides some kind of an artisan build system surrogate. It "packs" the source code into different places depending on its command line options. It generates RSA keys.

I know that it's quite fashionable in the JavaScript world to write a replacement for Make every year, but it still amazes me why would anyone do the job in the courageous crx package way, instead of reusing available, well-tested tools on your machine.

The Unix way

Everything except zip2crx is already here. Suppose you have foobar extension:

├── Makefile
├── src/
│   ├── hello.js
│   └── manifest.json
└── zip2crx*

src directory contains the source code for the extension. Makefile is a primitive 18-lines long set of shell instructions: := foobar
out := _build
src := $(shell find src -type f)

mkdir = @mkdir -p $(dir $@)     # a canned recipe

.PHONY: crx
crx: $(out)/$(

$(out)/$( $(src)
        cd $(dir $<) && zip -qr $(CURDIR)/$@ *

%.crx: private.pem
        ./zip2crx $< private.pem

        openssl genrsa 2048 > $@

If you type make it'll generate in _build/ directory & foobar.crx. If you don't have an RSA key, Make will generate it for you too.

Why would you need JavaScript for that?

I'll conclude by quoting Doug McIlroy:

"A first engineering question to ask is: how often is one likely to have to do this exact task? Not at all often, I contend. It is plausible, though, that similar, but not identical, problems might arise. A wise engineering solution would produce--or better, exploit--reusable parts."

Wednesday, February 1, 2017

A JavaScript client for

Following the recent exodus from a hostile LiveJournal, I've noticed that I'm unable to perform 2 things from the command line in the good old DreamWidth:

  • Posting in markdown
  • Uploading the image

Turns out, DW still uses a subset of ancient LJ-like xmlrpc APIs. Instead of reusing some of the available LJ clients, I've decided to hack my own.

The API has a weird auth scheme. The 1st step is to obtain a "challenge", then md5 it w/ a password (yes, you've read it correctly: it's md5, folks! Sonja Henie's tutu.) & send it over w/ all the subsequent requests. The weird part comes when you realise that such a token lives only 60 sec & for a long-lived session you need to obtain another generated token that doesn't actually work w/ the xmlrpc API & useful only for manual scraping/posting purposes.

The CLI client hides everything under 1 user-visible command dreamwidth-js. To upload an image to the DW cloud, type:

  $ dreamwidth-js img-upload < cat.jpg  

To make a new post:

$ dreamwidth-js entry-post-md
subject: On today's proceedings
tags: meeting, an exciting waste of time
security: friends

## Agenda for Mon. budget meeting

It was a dark and stormy night; the rain fell in torrents