Friday, June 1, 2018

Emacs & gpg files: use the minibuffer for password prompts

In the past, Emacs communicated w/ gnupg directly & hence was responsible for reading/sending/caching passwords. In contrast, Emacs 26.1, by default, fully delegates the password handling to gpg2.

The code for the interoperation w/ gpg1 is still present in the Emacs core, but it's no longer advertised in favour of gpg2 + elpa pinentry.

If you don't want an additional overhead or a special gpg-agent setup, it's still possible to use gpg1 for (en|de)crypting ops.

Say we have a text file we want to encrypt & then transparently edit in Emacs afterwards. The editor should remember the correct pw for the file & not bother us w/ the pw during the file saving op.

$ rpm -qf `which gpg gpg2`
gnupg-1.4.22-6.fc28.x86_64
gnupg2-2.2.6-1.fc28.x86_64

$ echo rain raine goe away, little Johnny wants to play | gpg -c > nr.gpg
$ file nr.gpg
nr.gpg: GPG symmetrically encrypted data (AES cipher)

If you have both gpg1 & gpg2 installed, Emacs ignores gpg1 completely. E.g., run 'emacs -Q' & open nr.gpg file–gpg2 promptly contacts gpg-agent, which, in turn, runs the pinentry app:

Although it may look as if everything is alright, try to edit the decrypted file & then save it. The pinentry window will reappear & you'll be forced to enter the pw twice.

The Emacs mode that handles the gnupg dispatch is called EasyPG Assistant. To check its current state, use epg-find-configuration fn:

ELISP> (car (epg-find-configuration 'OpenPGP))
(program . "/usr/bin/gpg2")

We can force EasyPG to use gpg1, even though this is not documented anywhere.

The actual config data is located in epg-config--program-alist var:

ELISP> epg-config--program-alist
((OpenPGP epg-gpg-program
("gpg2" . "2.1.6")
("gpg" . "1.4.3"))
(CMS epg-gpgsm-program
("gpgsm" . "2.0.4")))

Here, if we shadow the gpg2 entry in the alist, EasyPG would regenerate a new config for all the (en|de)crypting ops on the fly:

(require 'epg-config)
(add-to-list 'epg-config--program-alist `(OpenPGP epg-gpg-program ("gpg" . ,epg-gpg-minimum-version)))
(setq epa-file-cache-passphrase-for-symmetric-encryption t)
(setq epg--configurations nil)
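
To verify that the switch has taken effect, query EasyPG again (the cache was just reset, so it recomputes the config; the exact path depends on the distro):

ELISP> (car (epg-find-configuration 'OpenPGP))
(program . "/usr/bin/gpg")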

Now, if you open nr.gpg afresh, Emacs should no longer use gpg-agent, nor should it ask for the pw when you edit & save the file later on.

To clear the internal pw cache, type

ELISP> (setq epa-file-passphrase-alist nil)

Friday, May 11, 2018

Writing a podcast client in GNU Make

Why? First, because I wanted parallel downloads & my old podcacher didn't support that. Second, because it sounded like a joke.

The result is gmakepod.

Evidently, some ingredients for such a client are practically impossible to write in plain Make (like an xml parser). Would the client then be considered a truly Make program? E.g., there is a clever bash json parser, but when you look in its src you see that it now shies away from awk & grep.

At least we can try to write in Make as many components as possible, even if (when) they become a bottleneck. Again, why?

Overview

Using Make means constructing a proper DAG. gmakepod uses 6 vertices: run, .download.mk, .files.new, .files, .enclosures, .feeds. All except the 1st one are file targets.

target        desc
.feeds        parse a config file to extract feeds names & urls
.enclosures   fetch & parse each feed to extract enclosures urls
.files        generate a proper output file name for each url
.files.new    check if we've already downloaded a url in the past; filter such urls out
.download.mk  generate a makefile, where we list all the rules for all the enclosures
run           (phony, default) run the makefile

Every time a user runs gmakepod, it remakes all those files anew.

Config file

A user needs to keep a list of feed subscriptions somewhere. The 1st thing that comes to mind is to use a list of newline-separated urls, but what if we want to have diff options for each feed? E.g., an enclosures filter of some sort? We can just add 'options' to the end of the line (Idk, like url=http://example.com!filter.type=audio), but then we need to choose a record sep that isn't a space, which means either escaping the record sep in urls or living w/ the notion 'no ! char is allowed' or similar nonsense.

The next question is: how does a makefile process a single record? It turns out, we can eval foo=bar inside of a recipe, so if we pass

make -f feed_parse.mk 'url=http://example.com!filter.type=audio'

where feed_parse.mk looks like

parse-record = ... # replace ! w/ a newline
%:
$(eval $(call parse-record,$*))
@echo $(url)
@echo $(filter.type)

then every 'option' becomes a variable! This sounds good but is actually a gimcrack.

Make will think that url=http://example.com!filter.type=audio is a variable override & complain about missing targets. To ameliorate that we can prefix the line w/ # or :. Sounds easy, but then we need to slice the line in the parse-record macro. This is the easiest job in any lang except Make--you won't do it correctly w/o invoking awk or some other external tool.

If we use an external tool for parsing a mere config line, why use a self-inflicted parody of the config file instead of a human-readable one?

Ini format would perfectly fit. E.g.,

[JS Party]
url = https://changelog.com/jsparty/feed

lines are self-explanatory to anyone who has seen a computer once. We can use Ruby, for example, to convert the lines to :name=JS_Party!url=http://changelog.com... or even better to

:\{\".name\":\"JS_Party\",\".url\":\"https://changelog.com/jsparty/feed\"\}

(notice the amount of shell escaping) & use Ruby again in the makefile to transform that json into name=val pairs that we eval in the recipe later on.

How do we pass them to the makefile? If we escape each line correctly, xargs will suffice:

ruby ini-parse.rb subs.ini | xargs make -f feed-parse.mk

Parsing xml

Ruby, of course, has a full-fledged rss parser in its stdlib, but do we need it? A fancy podcast client (that tracks your every inhalation & exhalation) would display all the metadata it can obtain from an rss, but I don't want a fancy podcast client; what I want, what's most important to me, is a program that reliably downloads the last N enclosures from a list of feeds.

Thus the minimal parser looks like

$ curl -s https://emacsel.com/mp3.xml | \
nokogiri -e 'puts $_.css("enclosure,link[rel=\"enclosure\"]").\
map{|e| e["url"] || e["href"]}' \
| head -2
https://cdn.emacsel.com/episodes/emacsel-ep7.mp3
https://cdn.emacsel.com/episodes/emacsel-ep6.mp3

Options

One of the obviously helpful user options is the number of enclosures the user wants to download. E.g., when the user types

$ gmakepod g=emacs e=5

the client produces a .files file that has a list of 5 shell-escaped json 'records'. The e=5 option could also appear in an .ini file. To distinguish options passed from the CL from options read from the .ini, we prefix options from the .ini w/ a dot. The opt macro is used to get the final value:

opt = $(or $($1),$(.$1),$2)

E.g.: $(call opt,e,2) checks the CL opt first, then the .ini opt, &, as a last resort, returns the def value 2.
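
A tiny self-contained sketch of that precedence (the variable values here are made up to simulate a CL override & an .ini-derived option):

# as if the user typed `gmakepod e=5`:
e := 5
# as if g came from the .ini:
.g := emacs

opt = $(or $($1),$(.$1),$2)

# prints "e=5 g=emacs j=1" at parse time
$(info e=$(call opt,e,2) g=$(call opt,g,all) j=$(call opt,j,1))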

Output file names

Not every enclosure url has a nice path name. What file name should we assign to an .mp3 from the url below?

https://play.podtrac.com/npr-510289/npr.mc.tritondigital.com/NPR_510289/media/anon.npr-mp3/npr/pmoney/2018/05/20180504_pmoney_pmpod839v2.mp3?orgId=1&d=1606&p=510289&story=608577210&t=podcast&e=608577210&ft=pod&f=510289

Maybe we can use the URI path, 20180504_pmoney_pmpod839v2.mp3 in this case. Is it possible to extract it in pure Make?

In the most extreme case, the uri path may not even be unique. Say a feed has 2 entries & each article has 1 enclosure; then they both may have the same path name:

<entry>
<title>Foo</title>
<link rel="enclosure" type="audio/mpeg" length="1234"
href="http://example.com/podcast?episode=2"/>
<id>2ba2a6ee-52fb-11e8-9176-000c2945132f</id>
</entry>

<entry>
<title>Bar</title>
<link rel="enclosure" type="audio/mpeg" length="5678"
href="http://example.com/podcast?episode=1"/>
<id>3f66b198-52fb-11e8-a2a8-000c2945132f</id>
</entry>

In addition, the output name must be 'safe' in terms of Make. This means no spaces or $, :, %, ?, *, [, ~, \, # chars.

All of this leads us to another use of Ruby in Make's stead. We extract the uri path from the url, strip out the extension, prefix the path w/ the name of the feed (listed in the .ini) & append a random string + an extension name, so that the output file for the above url looks similar to:

media/NPR_Planet_Money/20180504_pmoney_pmpod839v2.84a8b172.mp3
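
A rough Ruby sketch of that mangling (the trimmed url, the feed name & the way the random suffix is produced are my assumptions, not necessarily what gmakepod does):

require 'uri'
require 'securerandom'

url = 'https://play.podtrac.com/npr-510289/20180504_pmoney_pmpod839v2.mp3?orgId=1&d=1606'
feed = 'NPR Planet Money'.tr(' ', '_')

path = URI(url).path                 # /npr-510289/20180504_pmoney_pmpod839v2.mp3
ext = File.extname(path)             # .mp3
base = File.basename(path, ext)      # 20180504_pmoney_pmpod839v2
suffix = SecureRandom.hex(4)         # e.g. 84a8b172

puts "media/#{feed}/#{base}.#{suffix}#{ext}"
# media/NPR_Planet_Money/20180504_pmoney_pmpod839v2.84a8b172.mp3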

A homework question: what should we do if a uri path lacks a file extension?

History

If we successfully downloaded an enclosure, there is rarely a need to download it again. A 'real' podcast client would look at id/guid (the date is usually useless) to determine if the entry has any updated enclosures; our Mickey Mouse parser relies on urls only.

Make certainly doesn't have any key/value store. We could try employing the sqlite CL interface or dig out gdbm or just append a url+'\n' to some history.txt file.

The last one is tempting, for grep is uber-fast. As the history file becomes a shared resource, though, we might get ourselves in trouble during parallel downloads. The lockfile rubygem provides a CL wrapper around a user-specified command, hence it can protect our 'db':

rlock history.lock -- ruby -e 'IO.write "history.txt", ARGV[0]+"\n", mode: "a"' 'http://example.com/file.mp3'

It works similarly to flock(1), but supposedly is more portable.
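
For comparison, a non-portable flock(1) version of the same guard would be roughly:

flock history.lock -c 'echo http://example.com/file.mp3 >> history.txt'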

Makefile generation

The last but one step is to generate a makefile named .download.mk. After we've collected all the enclosure urls, we write to the .mk file a set of rules like

media/Foobar_Podcast/file.84a8b172.mp3:
@mkdir -p $(dir $@)
curl 'http://example.com/file.mp3' > $@
@rlock history.lock -- ruby -e 'IO.write "history.txt", ARGV[0]+"\n", mode: "a"' 'http://example.com/file.mp3'

Our last step is to run

make -f .download.mk -k -j1 -Oline

The number of jobs is 1 by default, but is controllable via the CL param (gmakepod j=4).



Monday, April 2, 2018

Node FeedParser & Transform streams

FeedParser is itself a Transform stream that operates in object mode. Nevertheless, in the majority of examples it appears at the end of a pipeline, e.g.:

let fp = new FeedParser()
fp.on('readable', () => {
    // get the data
})
my_readable_stream.pipe(fp)

Say we want to get first 2 headlines from an rss:

$ curl -s https://emacsel.com/mp3.xml | node headlines1.js | head -2
Episode 7 - Jorgen Schäfer
Episode 6 - Charles Lowell

Let's use FeedParser as god hath intended it: as a transform stream that reads the rss from the stdin & writes the articles to another transform stream that grabs only the headlines &, in turn, pipes them to the stdout:

$ cat headlines1.js
let Transform = require('stream').Transform
let FeedParser = require('feedparser')

class Filter extends Transform {
    constructor() {
      super()
      this._writableState.objectMode = true // we can eat objects
    }
    _transform(input, encoding, done) {
      this.push(input.title + '\n')
      done()
    }
}

process.stdin.pipe(new FeedParser()).pipe(new Filter()).pipe(process.stdout)

(Type npm i feedparser before running the example.)

This works, although it throws an EPIPE error, because head command abruptly closes the stdout descriptor, while the script tries to write into it.

You may add smthg like

process.stdout.on('error', (e) => e.code !== 'EPIPE' && console.error(e))

to silence it, but it's better to use pump module (npm i pump) & catch all errors from all the streams. There's a chance that the module will be added to the node core, so get used to it already.

$ diff -u1 headlines1.js headlines3.js | tail -n+4
 let FeedParser = require('feedparser')
+let pump = require('pump')

@@ -14,2 +15,3 @@

-process.stdin.pipe(new FeedParser()).pipe(new Filter()).pipe(process.stdout)
+pump(process.stdin, new FeedParser(), new Filter(), process.stdout,
+     err => err && err.code !== 'EPIPE' && console.error(err))

Now, what if we want to control the exact number of articles our Filter stream receives? I.e., if an rss is many MBs long & we want only n articles from it? First, we add a CL parameter to our script:

$ cat headlines4.js
let Transform = require('stream').Transform
let FeedParser = require('feedparser')
let pump = require('pump')

class Filter extends Transform {
    constructor(articles_max) {
      super()
      this._writableState.objectMode = true // we can eat objects
      this.articles_max = articles_max
      this.articles_count = 0
    }
    _transform(input, encoding, done) {
      if (this.articles_count++ < this.articles_max) {
          this.push(input.title + '\n')
      } else {
          console.error('ignore', this.articles_count)
      }
      done()
    }
}

let articles_max = Number(process.argv.slice(2)) || 1
pump(process.stdin, new FeedParser(), new Filter(articles_max), process.stdout,
     err => err && err.code !== 'EPIPE' && console.error(err))

Although this works too, it still downloads & parses the articles we don't want:

$ curl -s https://emacsel.com/mp3.xml | node headlines4.js 2
Episode 7 - Jorgen Schäfer
Episode 6 - Charles Lowell
ignore 3
ignore 4
ignore 5
ignore 6
ignore 7

Unfortunately, to be able to 'unpipe' a readable stream (from the Filter standpoint it's the FeedParser instance) we have to have a ref to it, & I don't know a way to get such a ref from within a Transform stream, except via explicitly passing a pointer:

$ diff -u headlines4.js headlines5.js | tail -n+4
 let pump = require('pump')

 class Filter extends Transform {
-    constructor(articles_max) {
+    constructor(articles_max, feedparser) {
      super()
      this._writableState.objectMode = true // we can eat objects
      this.articles_max = articles_max
      this.articles_count = 0
+
+     if (feedparser) {
+         this.once('unpipe', () => {
+             this.end()      // ensure 'finish' event gets emited
+         })
+     }
+     this.feedparser = feedparser
     }
     _transform(input, encoding, done) {
      if (this.articles_count++ < this.articles_max) {
          this.push(input.title + '\n')
      } else {
-         console.error('ignore', this.articles_count)
+         console.error('stop on', this.articles_count)
+         if (this.feedparser) this.feedparser.unpipe(this)
      }
      done()
     }
 }

 let articles_max = Number(process.argv.slice(2)) || 1
-pump(process.stdin, new FeedParser(), new Filter(articles_max), process.stdout,
+let fp = new FeedParser()
+pump(process.stdin, fp, new Filter(articles_max, fp), process.stdout,
      err => err && err.code !== 'EPIPE' && console.error(err))

Test:

$ curl -s https://emacsel.com/mp3.xml | node headlines5.js 2
Episode 7 - Jorgen Schäfer
Episode 6 - Charles Lowell
stop on 3

Grab the gist w/ a final version here.

Friday, March 9, 2018

YA introduction to GNU Make

I don't want to convert this blog into a Make-propaganda outlet, but here's another take of mine on that versatile tool + a bunch of HN comments. Enjoy.

Saturday, February 24, 2018

A shopping hours calculator

Say you have a small mom&pop online shop that sell widgets.

Suppose you have (a) dedicated personnel that call customers on the phone to confirm an order w/ its shipping details, & (b) such a 'department' usually works regular hours & isn't available 24/7.

(This is the exact scheme to which the vast majority of Ukrainian online shops still adhere.)

This is how a shop gets its 1st bad review: it's Friday evening, 7 o'clock; a client places an order for a widget & waits for a call that doesn't come until Monday morning. The enraged client may even try to call the shop during the weekend, not realising it's closed.

One of the possible solutions to this is to have a note (on a page where customers review their orders) that at this very moment the person that can confirm/discuss their order is offline.

How do you calculate that? It's an easy task if you work the same hours every day of the year, but for small online shops it's often not the case. Not only are there a handful of official gov holidays, some holidays have moveable dates (Easter), & sometimes a holiday falls on a weekend & must be transferred to the next working day. What if your customer department has a lunch break like everyone else? A client should be able to see that the call they're so eagerly waiting for is going to come in an hour, not in 2 minutes.

So I wrote a small JS library to help w/ that: https://github.com/gromnitsky/shopping-hours. You can use it on either server or client side. The basic idea is: we fetch a .txt file (a calendar) & check for a status "is the shop open" that simultaneously tells us when the shop will be open/closed. We fetch the cal only once & do the checking however often we care.

The calendar DSL looks like this:

-/-                 9:00-13:00,14:00-18:00
1/1                 0:0-0:0                   o   new year
easter_orthodox     0:0-0:0                   o
fri.4/11            6:30-23:00                -   black friday
sat/-               10:30-17:00
sun/-               0:0-0:0

On the client side, you can test it by adding <script src="shopping_hours.min.js"></script> to the .html & this to the .js:

async function getcal(url) {
    let cal = await fetch(url).then( r => r.text())
    return shopping_hours(cal)  // parse the calendar
}

getcal('calendar1.txt').then(sh => {
    console.log(sh.business())
})

which outputs smthg like {status: "open", next: Sat Feb 24 2018 17:00:00 GMT+0200 (EET)}. See the github page for the details.
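
On the server side (Node), a minimal sketch might look like this (the require path is an assumption; the parsing & business() calls are the same as above):

let fs = require('fs')
let shopping_hours = require('shopping-hours')   // or a path to a local checkout

let sh = shopping_hours(fs.readFileSync('calendar1.txt', 'utf8'))
console.log(sh.business())   // {status: "open", next: ...}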

Thursday, January 18, 2018

There is no price for good advice

From The Design & Evolution of C++ by Bjarne Stroustrup:

'In 1982 when I first planned Cfront, I wanted to use a recursive descent parser because I had experience writing and maintaining such a beast, because I liked such parsers' ability to produce good error messages, and because I liked the idea of having the full power of a general-purpose programming language available when decisions had to be made in the parser.

However, being a conscientious young computer scientist I asked the experts. Al Aho and Steve Johnson were in the Computer Science Research Center and they, primarily Steve, convinced me that writing a parser by hand was most old-fashioned, would be an inefficient use of my time, would almost certainly result in a hard-to-understand and hard-to-maintain parser, and would be prone to unsystematic and therefore unreliable error recovery. The right way was to use an LALR(1) parser generator, so I used Al and Steve's YACC.

For most projects, it would have been the right choice. For almost every project writing an experimental language from scratch, it would have been the right choice. For most people, it would have been the right choice. In retrospect, for me and C++ it was a bad mistake.

C++ was not a new experimental language, it was an almost compatible superset of C - and at the time nobody had been able to write an LALR(1) grammar for C. The LALR(1) grammar used by ANSI C was constructed by Tom Pennello about a year and a half later - far too late to benefit me and C++. Even Steve Johnson's PCC, which was the preeminent C compiler at the time, cheated at details that were to prove troublesome to C++ parser writers. For example, PCC didn't handle redundant parentheses correctly so that int(x); wasn't accepted as a declaration of x.

Worse, it seems that some people have a natural affinity to some parser strategies and others work much better with other strategies. My bias towards topdown parsing has shown itself many times over the years in the form of constructs that are hard to fit into a YACC grammar. To this day [1993], Cfront has a YACC parser supplemented by much lexical trickery relying on recursive descent techniques. On the other hand, it is possible to write an efficient and reasonably nice recursive descent parser for C++. Several modern C++ compilers use recursive descent.'

Saturday, December 23, 2017

Sometimes being stubborn does not pay off

From the GNU coreutils TODO file:

sort: Investigate better sorting algorithms; see Knuth vol. 3.

We tried list merge sort, but it was about 50% slower than the recursive algorithm currently used by sortlines, and it used more comparisons. We're not sure why this was, as the theory suggests it should do fewer comparisons, so perhaps this should be revisited.

List merge sort was implemented in the style of Knuth algorithm 5.2.4L, with the optimization suggested by exercise 5.2.4-22. The test case was 140,213,394 bytes, 426,4424 lines, text taken from the GCC 3.3 distribution, sort.c compiled with GCC 2.95.4 and running on Debian 3.0r1 GNU/Linux, 2.4GHz Pentium 4, single pass with no temporary files and plenty of RAM.

Since comparisons seem to be the bottleneck, perhaps the best algorithm to try next should be merge insertion.

$ git log -S 'sort: Investigate better sorting algorithms' --source --all
commit 93f9ffc6141ad028223a33e5ab0614aebbd4aa2e refs/tags/CPPI-1_11
Author: Jim Meyering 
Date:   Sat Aug 2 06:27:13 2003 +0000

    Document in TODO Paul's desire to make sort faster (and how he
    was foiled this time around).
    
    from Paul Eggert.

Tuesday, October 31, 2017

Flycheck, eslint & "libxml2 library not found" error

The javascript-eslint syntax checker requires an XML parser. It can use the built-in xml.el (a pure Elisp implementation) or it can employ the external libxml2 library. The latter is preferred because it's significantly faster.

The default Windows port of Emacs doesn't include the libxml2 dll. To get a proper version:

  1. Add the bin/ dir from the Emacs distro (where emacs.exe file is) to the PATH.

  2. Download a proper version of emacs-2x-ARCH-deps.zip bundle (i.e., watch for i686 vs. x86_64 builds).

  3. Extract 3 files from bin/ dir of the zip:

    • libiconv-2.dll
    • libxml2-2.dll
    • zlib1.dll

    & copy them to the Emacs bin/ dir.

  4. Restart Emacs.
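
To verify that Emacs picks up the dll, feed libxml a trivial snippet; a non-nil parse tree means the library works (the result below is what I'd expect, not a verbatim capture):

ELISP> (with-temp-buffer
         (insert "<p>hi</p>")
         (libxml-parse-html-region (point-min) (point-max)))
(html nil (body nil (p nil "hi")))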

Tuesday, September 12, 2017

'indent-region' in the Emacs batch mode

I had an old project of mine that was using an archaic tab-based indentation w/ the assumed tab width of 4. The mere opening of its files in editors other than Emacs was causing such pain that I decided to do what you should rarely do--reindent the whole project.

(A side note: if you're setting the tab width to a value other than 8 & are simultaneously using tabs for indentation, you're a bad person.)

The project had ~40 files. Manually using Emacs indent-region for each file would have been too wearisome. Then I remembered that Emacs has the batch mode.

A quick googling gave me a recipe akin to:

$ emacs -Q -batch FILE --eval '(indent-region (point-min) (point-max))' \
  -f save-buffer

It would have worked but it had several drawbacks as a general solution, for it:

  1. modifies a file in-place;
  2. can't read from the stdin;
  3. doesn't work w/ multiple files, forcing a user to use xargs(1) or smthg;
  4. if FILE doesn't exist, Emacs quietly creates an empty file, whereas a user probably expects to see an error message.

Whereas an ideal little script should:

  1. never overwrite the source file;
  2. read the input from the stdin or from its arguments;
  3. spit out the result to the stdout;
  4. handle the missing input reasonably well.

Everything else should be accommodated by the shell, including the item #3 from the 'drawbacks' list above.

Using Emacs for such a task is tempting, for it supports a big number of programming modes, giving us the ability to employ the editor as a universal reindenting tool, for example, in makefiles or Git hooks.

Writing to the stdout

We can slightly modify the above recipe to:

$ emacs -Q -batch FILE --eval '(indent-region (point-min) (point-max))' \
  --eval '(princ (buffer-string))'

In -f stead we are using --eval twice. buffer-string function does exactly what it says: returns the contents of the current buffer as a string.

$ cat 1.html
<p>
hey, <i>
what's up</i>?
</p>

$ emacs -Q -batch 1.html --eval '(indent-region (point-min) (point-max))' \
  --eval '(princ (buffer-string))'
Indenting region...
Indenting region...done
<p>
  hey, <i>
    what's up</i>?
</p>

The "Indenting region" lines come from Emacs message function (which the progress reporter uses). In the batch mode message prints the lines to the stderr.

The solution also addresses the item #4 from the "drawbacks" list--Emacs doesn't create an empty file on the invalid input, although it doesn't indicate the error properly, i.e., it fails w/ the exit code 0.

Processing all files at once

If you know what you're doing & the files you're going to process are under Git, overwriting the source is not a big deal. Perhaps for a quick hack this script will do:

:; exec emacs -Q --script "$0" -- "$@" # -*- emacs-lisp -*-

(defun indent(file)
  (set-buffer "*scratch*")
  (if (not (file-directory-p file))
      (when (and (file-exists-p file) (file-writable-p file))
        (message "Indenting `%s`" file)

        (find-file file)
        (indent-region (point-min) (point-max))
        (save-buffer))))

(setq args (cdr argv))                  ; rm --
(dolist (val args)
  (indent val))

If you put the src above into a file named emacs-indent-overwrite & add executable bits to it, then the shell thinks it's a usual sh-script that doesn't have a shebang line. A colon in sh is a noop, but on stumbling upon exec the shell replaces itself with the command

emacs -Q --script emacs-indent-overwrite -- arg1 arg2 ...

When Emacs reads the script, it doesn't expect it to be a sh one, but hopefully the file masks itself as a true Elisp, for in Elisp a symbol whose name starts with a colon is called a keyword symbol. : is a symbol w/o a name (a more usual constant would be :omg) that passes the check because it satisfies keywordp predicate:

ELISP> (keywordp :omg)
t
ELISP> (keywordp :)
t

Everything else in the file is usual Elisp code. Here's how it works:

$ emacs-indent-overwrite src/*
Indenting ‘src/1.html‘
Indenting region...
Indenting region...done
Indenting ‘src/2.html‘
Indenting region...
Indenting region...done

The only thing worth mentioning about the script is why indent procedure has (set-buffer "*scratch*") call. It's an easy way to switch the current directory to the directory from which the script was started. This is required, for find-file modifies the default directory of the current buffer (via modifying a buffer-local default-directory variable). The other way is to modify args w/ smthg like

(setq args (mapcar 'expand-file-name args))

& just pass the file argument as a full path.

Reading from the stdin

There is next to no info about the stdin in the Emacs Elisp manual. The section about minibuffers hints that for reading from the standard input in the batch mode we ought to seek out functions that normally read kbd input from the minibuffer.

One such function is read-from-minibuffer.

$ cat read-from-minibuffer
:; exec emacs -Q --script "$0"
(message "-%s-" (read-from-minibuffer ""))

$ printf hello | ./read-from-minibuffer
-hello-
$ printf 'hello\nworld\n' | ./read-from-minibuffer
-hello-

The function only reads the input up to the 1st newline, which it also impertinently eats. This leaves us w/ a dilemma: if we read multiple lines in a loop, should we unequivocally assume that the last line contained the newline?

read-from-minibuffer has another peculiarity: on receiving the EOF character it raises an unhelpful error:

$ ./read-from-minibuffer
^D
Error reading from stdin
$ echo $?
255

The simplest Elisp cat program must watch out for that:

$ ./cat.el < cat.el
:; exec emacs -Q --script "$0"
(while (setq line (ignore-errors (read-from-minibuffer "")))
  (princ (concat line "\n")))

Next, if we read the stdin & print the result to the stdout, our "ideal" reindent script cannot rely on find-file anymore, unless we save the input in a tmp file. Or we can leave out "heavy" find-file altogether & just create a temp buffer & copy the input there for the further processing. The latter implies we must manually set the proper major mode for the buffer, otherwise indent-region won't do anything good.

set-auto-mode function does the major mode auto-detection. One of the 1st hints it looks for is the file extension, but the stdin has none. We can ask the user to provide one in the arguments of the script.

:; exec emacs -Q --script "$0" -- "$@" # -*- emacs-lisp -*-

(defun indent(mode text)
  (with-temp-buffer
    (set-visited-file-name mode)
    (insert text)
    (set-auto-mode t)
    (message "`%s` major mode: %s" mode major-mode)

    (indent-region (point-min) (point-max))
    (buffer-string)))

(defun read-file(file)
  (with-temp-buffer
    (insert-file-contents file)
    (buffer-string)))

(defun read-stdin()
  (let (line lines)
    (while (setq line (ignore-errors (read-from-minibuffer "")))
      (push line lines))
    (push "" lines)
    (mapconcat 'identity (reverse lines) "\n")
    ))

 

(setq args (cdr argv))                  ; rm --

(setq mode (car args))
(if (equal "-" mode)
    (progn
      (setq mode (nth 1 args))
      (if (not mode) (error "No filename hint argument, like .js"))
      (setq text (read-stdin)))
  (setq text (read-file mode)))

(princ (indent mode text))

If the 1st argument to the script is '-', we look for the hint in the next argument & then start reading the stdin. If the 1st arg isn't '-', we assume it's a file that we insert into a temp buffer & return its contents as a string.

The

(mapconcat 'identity (reverse lines) "\n")

line in read-stdin procedure is an equivalent of

['world', 'hello'].reverse().join("\n")

in JavaScript.

Some examples:

$ printf "<p>\nhey, what's up?\n</p>" | ./emacs-indent - .html
‘.html‘ major mode: html-mode
Indenting region...
Indenting region...done
<p>
  hey, what's up?
</p>

$ printf "<p>\nhey, what's up?\n</p>" | ./emacs-indent - .txt
‘.txt‘ major mode: text-mode
Indenting region...
Indenting region...done
<p>
hey, what's up?
</p>

$ ./emacs-indent src/2.html 2>/dev/null
<p>
  not much
</p>

Saturday, June 17, 2017

Texinfo is back

I've revived the old project of converting the Nodejs API docs into the Texinfo format.

The orig docs are in markdown, but node folks have made sure that no plain markdown parser would ever suffice for the conversion of the docs into anything other than html.

Each function's description starts w/ a heading w/ the name of the function + its args (which is handy, for we can use the label as an index entry), but after the heading there is an html comment which in reality is a YAML block w/ metadata.

There is no uniformity in the way of writing tables: some sections have github-style tables, others contain html <table> blocks, which are a royal pain to dissect in case the table contains a tiny markup error that is invisible to visual inspection (for browsers are known to be quite liberal at consuming sloppy <table>s) but wholly breaks the assumptions of your parser.

Sometimes there are additional anchors before the headings, like

<a id="FOO"></a>
### foo(a, b, c, [bar], [baz])

that supposedly serve as a quick way of linking to the subsection, i.e., [read this](#FOO) instead of [read this](#module_foo_a_b_c_bar_baz), but, again, force us to apply additional logic during the header parsing.

Texinfo itself can be rather irritating: it doesn't allow one to reliably use full stops or colons within a heading. texi2any bitterly complains about the issue if the output format is info but prints nothing worth mentioning when the output is html.

Yet, the ability to use indices in Emacs outweighs the disadvantages of maintaining a custom markdown->texi util.

Sunday, March 26, 2017

How to prevent Wine from auto-running .exe files in Fedora

After installing Wine on Fedora 25 (for a reason that isn't terribly important here) I've noticed that I can run Windows executables straight from the bash command line:

$ file -b ~/.wine/drive_c/Program\ Files/Internet\ Explorer/iexplore.exe
PE32+ executable (GUI) x86-64, for MS Windows
$ !$

That iexplore.exe file is clearly not an ELF, yet the kernel successfully executes it.

Since the Windows ecosystem is known for its total absence of malware & ransomware, & Wine, in turn, is known for a military-grade, bulletproof sandbox (i.e., it provides no protection whatsoever), I find the notion of auto-running very difficult to reconcile w/ common sense.

How does it work

If a kernel was compiled w/ CONFIG_BINFMT_MISC option, execve(2) (I'm simplifying here a little) obtains an ability to delegate the execution of unknown binaries to external processes. The subsystem that manages the format → interpreter association table is called binfmt_misc.

binfmt_misc maintains its ephemeral database in the procfs. At boot time you mount the /proc/sys/fs/binfmt_misc directory & feed /proc/sys/fs/binfmt_misc/register w/ specially crafted text lines to create association entries (or rules, as systemd docs like to call them).

E.g., in the good old days before systemd, we would have put

echo :win:M:0:MZ::/usr/bin/wine: > /proc/sys/fs/binfmt_misc/register

somewhere in /etc/rc.local.

Such a command creates /proc/sys/fs/binfmt_misc/win file:

enabled
interpreter /usr/bin/wine
flags:
offset 0
magic 4d5a

where the offset & the magic 4d5a (MZ) means the 1st 2 bytes of a typical Windows executable:

$ hexdump -C -n10 iexplore.exe
00000000  4d 5a 40 00 01 00 00 00  06 00                    |MZ@.......|
0000000a

We can be even fancier & make an extension → interpreter association, e.g.:

# echo :xv:E::gif::/usr/bin/xv: > /proc/sys/fs/binfmt_misc/register
$ chmod +x slave-ship-daily-schedules.gif
$ ./!$

creates .gif → xv entry, & starts /usr/bin/xv slave-ship-daily-schedules.gif under the hood when we try to "run" the .gif image.

To delete all the entries, we write -1 to /proc/sys/fs/binfmt_misc/status file or if we want to delete only a particular entry:

# echo -1 > /proc/sys/fs/binfmt_misc/xv

The systemd way

systemd doesn't allow us to mingle w/ the binfmt_misc subsystem directly. We ought to write the text lines in the same format binfmt_misc understands but put them in special .conf files, where a separate program, systemd-binfmt, expects to find them.

If we provide an rpm for our app, we should put my-app.conf file in /usr/lib/binfmt.d/ directory during the installation & run

/usr/lib/systemd/systemd-binfmt my-app.conf

in the post-install hook.

The algo systemd-binfmt uses is fairly straightforward. When run w/o any command line args, it deletes all the existing entries & recreates the ones specified in the .conf files. If provided w/ the name of the .conf file (w/o a directory prefix!), it scans the file to add new entries.

An excerpt from src/binfmt/binfmt.c (commit 1539a65):

if (argc > optind) {
  int i;

  for (i = optind; i < argc; i++) {
    k = apply_file(argv[i], false);
    if (k < 0 && r == 0)
      r = k;
  }
} else {
  _cleanup_strv_free_ char **files = NULL;
  char **f;

  r = conf_files_list_nulstr(&files, ".conf", NULL, conf_file_dirs);
  if (r < 0) {
    log_error_errno(r, "Failed to enumerate binfmt.d files: %m");
    goto finish;
  }

  /* Flush out all rules */
  write_string_file("/proc/sys/fs/binfmt_misc/status", "-1", 0);

  STRV_FOREACH(f, files) {
    k = apply_file(*f, true);
    if (k < 0 && r == 0)
      r = k;
  }
}

It sounds fine, until we remember that a typical user (a) doesn't know much about the binfmt-misc kernel module, (b) doesn't know anything about the standalone systemd-binfmt program, for he uses the systemd-binfmt.service unit, as in

# systemctl restart systemd-binfmt

That unit (/usr/lib/systemd/system/systemd-binfmt.service) incorporates a handful of preconditions. The relevant ones:

ConditionDirectoryNotEmpty=|/lib/binfmt.d
ConditionDirectoryNotEmpty=|/usr/lib/binfmt.d
ConditionDirectoryNotEmpty=|/usr/local/lib/binfmt.d
ConditionDirectoryNotEmpty=|/etc/binfmt.d
ConditionDirectoryNotEmpty=|/run/binfmt.d

When we install Wine, it makes 2 entries in the systemd-binfmt DB. If we (a) remove the offending package wine-systemd that contains the evil /usr/lib/binfmt.d/wine.conf file & (b) dutifully restart the systemd-binfmt service--no binfmt-misc association entries get removed!

# rpm -qf /lib/binfmt.d/wine.conf
wine-systemd-2.3-1.fc25.noarch
# rpm --nodeps -e wine-systemd
# systemctl restart systemd-binfmt
# ls -l /proc/sys/fs/binfmt_misc/
total 0K
--w------- 1 root root 0 Mar 24 20:49 register
-rw-r--r-- 1 root root 0 Mar 25 20:48 status
-rw-r--r-- 1 root root 0 Mar 25 20:58 windows
-rw-r--r-- 1 root root 0 Mar 25 20:58 windowsPE

because systemd will not run systemd-binfmt at all while all the conf directories are empty. The remedy is to run systemd-binfmt by hand:

# /usr/lib/systemd/systemd-binfmt
# ls -l /proc/sys/fs/binfmt_misc/
total 0K
--w------- 1 root root 0 Mar 24 20:49 register
-rw-r--r-- 1 root root 0 Mar 25 20:48 status

You should not manually delete the rpm w/ the evil .conf file, for dnf will reinstall the wine-systemd package during the next Wine update. The recommended systemd-style solution is to create a symlink named wine.conf in /etc/binfmt.d/ that points to /dev/null & then restart the systemd-binfmt service:

# ln -s /dev/null /etc/binfmt.d/wine.conf
# systemctl restart systemd-binfmt

Saturday, March 4, 2017

Bloatware comes when nobody's lookin'

If you write a Chrome extension for distribution outside of Google Webstore, you will inevitably need to pack the extension into a .crx file. It can be done either using the Chrome UI (the "Pack extension..." button) or "manually" via creating a .zip file & prepending a specially crafted header to it.

Obviously, the 2nd variant can be easily automated. If you have several extensions to maintain, a part of your build system that's responsible for the .crx generation could be extracted to some kind of a "plugin" that can be shared between all the extensions you maintain. In my case, the "plugin" consists of 2 files: a small sh script & a tiny makefile (that can be safely included into another makefile).

I was thinking about uploading those 2 files to npm (& hesitating a little, for if a package doesn't contain any JS code whatsoever, should it be on npm?) but then decided to check the registry first.

The npm registry indeed contains multiple packages for dealing w/ .crx files. One of the most popular is called "crx" (at the time of writing it had 192 ★ on Github). Glancing through its code, I failed not to notice the unfortunate difference between the classic Unix approach to such a problem & the JavaScript one. Perhaps the latter is something you may do only ironically, unless you're a grand, hardcore troll.

Before I explain myself further, let's digress for a bit to explain what .crx files are & how to create them.

CRX Package Format

In the ideal world, Alice would package her extension (a set of .js & .json files) into a .zip file, would upload the file to her http://cs.a-nice-university.edu/~alice/ page & would tell her dear friend Bob about it.

In reality, the extension must be protected from tampering. If Eve, the evil sysadmin of the cs.a-nice-university.edu zone, entertains an idea of mingling w/ Alice's extension via adding a hot Bitcoin miner to it, Bob should be able to detect that before the extension gets installed.

One way of doing it is to generate a pair of public & private keys. You sign the extension w/ your private key & users check the downloaded .zip w/ your public key. The trouble is there is no standard way to "sign" a .zip file, for neither does its metadata support such a thing as an embedded signature, nor would any of the "archive managers" know what to do w/ such upgraded metadata.

This is where .crx files come in: they contain a copy of the author's RSA public key + an RSA signature (an encrypted SHA1) of the .zip archive in question. This metadata simply gets prepended to the original .zip file.

http://gromnitsky.users.sourceforge.net/articles/crx.svg

E.g., Alice, after testing her Chrome extension, zips it to a file:

$ touch manifest.json
$ zip foo.zip !$

Next, she generates a private 1024-bit RSA key:

$ openssl genrsa 1024 > private.pem

Then prepends the aforementioned block to foo.zip. As it's a multiple step process, she writes it down in zip2crx script (this is a modified version of the script from developer.chrome.com):

#!/bin/sh

[ $# -ne 2 ] && {
    echo "Usage: `basename $0` file.zip private_key" 1>&2
    exit 1
}

zip=$1
key=$2

name=${zip%.zip}
crx="$name.crx"
pub="$name.pub"
sig="$name.sig"
zip="$name.zip"
trap 'rm -f "$pub" "$sig"' EXIT

# signature
openssl sha1 -sha1 -binary -sign "$key" < "$zip" > "$sig"

# public key
openssl rsa -pubout -outform DER < "$key" > "$pub" 2>/dev/null

# Take "abcdefgh" and return it as "ghefcdab"
byte_swap() {
    echo "${1:6:2}${1:4:2}${1:2:2}${1:0:2}"
}

crmagic_hex="4372 3234" # Cr24
version_hex="0200 0000" # 2
pub_len_hex=$(byte_swap $(printf '%08x\n' $(stat -c %s "$pub")))
sig_len_hex=$(byte_swap $(printf '%08x\n' $(stat -c %s "$sig")))

(
    echo "$crmagic_hex $version_hex $pub_len_hex $sig_len_hex" | xxd -r -p
    cat "$pub" "$sig" "$zip"
) > "$crx"

Her script signs the .zip, automatically derives a public key from private.pem, combines all the necessary data for a .crx file consumer (like the lengths of the public key & the signature, &c) & concatenates the obtained header w/ the .zip:

$ ./zip2crx foo.zip private.pem
$ file foo.crx
foo.crx: Google Chrome extension, version 2
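
As a sanity check, the first 16 bytes of the header should be the Cr24 magic, the format version & the 2 little-endian lengths; w/ Alice's 1024-bit key the DER public key is 162 (0xa2) bytes & the signature is 128 (0x80) bytes:

$ hexdump -C -n16 foo.crx
00000000  43 72 32 34 02 00 00 00  a2 00 00 00 80 00 00 00  |Cr24............|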

The resulting .crx is ready to be used in Chrome. She uploads it to her web page & sends to Bob (via email) her public key (possibly encrypted w/ his GPG public key, but we won't get into that).

Bob, having obtained Alice's public key, downloads foo.crx, extracts from it the embedded public key, the signature & foo.zip & checks (w/ Alice's public key) the validity of foo.zip.

(The whole verification process is a little too convoluted to present it here, but if you're interested, download this script & run it against the provided key:

$ ./crx2zip foo.crx alice.pub
RSA key                 1024-bit
Total header size:      306 bytes
Public key:             alice.pub
Signature status:       Verified OK

where alice.pub would be a public key in the DER-format.)

If Eve, the evil sysadmin, did indeed modify foo.crx in any way, Bob is able to detect that, for although Eve can do all the same operations as Bob did, she doesn't have Alice's private key & thus is unable to re-sign the tampered foo.zip in an undetectable way. All she can hope for is that Bob, being a lazybones, won't bother to do the necessary checks before installing foo.crx into his browser.

The example is somewhat contrived, for if Alice decides to upload foo.zip to Google Webstore instead, by this virtue she exempts herself from managing the crypto keys. The Webstore makes the key pair for her, does all the checks of all the subsequent updates of foo.zip & generates the correct .crx. The final extension is being delivered to Bob via HTTPS, thus leaving Eve, the evil sysadmin, out of luck.

The JavaScript way

Back to our findings from the npm registry.

crx package has dual nature: it's a library & a CLI util. The distinguishing feature of this program is that "It is written purely in JavaScript and does not require OpenSSL!".

The author states it as if the mere act of depending on OpenSSL code were somehow inconvenient or morally repugnant. I must say he's not alone in his view. I may sympathise w/ the overly cautious approach in the case of maintaining a big farm of cloud services like Amazon does, but I cannot reconcile w/ this stance in the case of a local developer machine.

A user of a .crx file doesn't have to interfere w/ OpenSSL; it's solely the developer's job to create the proper .crx. To find a developer machine that doesn't have OpenSSL installed already is a challenge that only a few of us can achieve. (Certainly, we may be so bold; we may think of some poor souls who have to use old versions of Windows, but even they can always download Cygwin.)

Then there is the size. Our zip2crx example is 37 lines long. Anybody can read it (even indolent Bob) & in a minute understand what the script is doing. The crx package, on the other hand,

$ find node_modules/crx/{bin,src} -type f | xargs wc -l
  150 node_modules/crx/bin/crx.js
  294 node_modules/crx/src/crx.js
   47 node_modules/crx/src/resolver.js
  491 total

is ~13 times bigger.

The package also provides some kind of an artisan build system surrogate. It "packs" the source code into different places depending on its command line options. It generates RSA keys.

I know that it's quite fashionable in the JavaScript world to write a replacement for Make every year, but it still amazes me why anyone would do the job in the courageous crx package way, instead of reusing available, well-tested tools on their machine.

The Unix way

Everything except zip2crx is already here. Suppose you have foobar extension:

foobar/
├── Makefile
├── src/
│   ├── hello.js
│   └── manifest.json
└── zip2crx*

The src directory contains the source code for the extension. The Makefile is a primitive, 18-line-long set of shell instructions:

pkg.name := foobar
out := _build
src := $(shell find src -type f)

mkdir = @mkdir -p $(dir $@)     # a canned recipe

.PHONY: crx
crx: $(out)/$(pkg.name).crx

$(out)/$(pkg.name).zip: $(src)
        $(mkdir)
        cd $(dir $<) && zip -qr $(CURDIR)/$@ *

%.crx: %.zip private.pem
        ./zip2crx $< private.pem

private.pem:
        openssl genrsa 2048 > $@

If you type make, it'll generate foobar.zip & foobar.crx in the _build/ directory. If you don't have an RSA key, Make will generate it for you too.

Why would you need JavaScript for that?

I'll conclude by quoting Doug McIlroy:

"A first engineering question to ask is: how often is one likely to have to do this exact task? Not at all often, I contend. It is plausible, though, that similar, but not identical, problems might arise. A wise engineering solution would produce--or better, exploit--reusable parts."

Wednesday, February 1, 2017

A JavaScript client for dreamwidth.org

Following the recent exodus from a hostile LiveJournal, I've noticed that I'm unable to perform 2 things from the command line in the good old DreamWidth:

  • Posting in markdown
  • Uploading the image

Turns out, DW still uses a subset of ancient LJ-like xmlrpc APIs. Instead of reusing some of the available LJ clients, I've decided to hack my own.

The API has a weird auth scheme. The 1st step is to obtain a "challenge", then md5 it w/ a password (yes, you've read it correctly: it's md5, folks! Sonja Henie's tutu.) & send it over w/ all the subsequent requests. The weird part comes when you realise that such a token lives only 60 sec & that for a long-lived session you need to obtain another generated token that doesn't actually work w/ the xmlrpc API & is useful only for manual scraping/posting purposes.
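
If memory serves, the LJ-style challenge-response boils down to smthg like this (a Node sketch; the values are placeholders, not real API output):

let crypto = require('crypto')
let md5 = s => crypto.createHash('md5').update(s).digest('hex')

let challenge = 'c0:1485123600:2064:60:...'   // returned by the getchallenge call
let password = 'hunter2'

// the short-lived token that accompanies every subsequent xmlrpc request
let auth_response = md5(challenge + md5(password))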

The CLI client hides everything under 1 user-visible command dreamwidth-js. To upload an image to the DW cloud, type:

  $ dreamwidth-js img-upload < cat.jpg  

To make a new post:

$ dreamwidth-js entry-post-md
---
subject: On today's proceedings
tags: meeting, an exciting waste of time
security: friends
---

## Agenda for Mon. budget meeting

It was a dark and stormy night; the rain fell in torrents
^D

Wednesday, December 14, 2016

A MITM attack in the reign of Elizabeth I

This is what you end up w/ when you have encryption but no message authentication code:

"Babington and his associates, having laid such a plan [of the assassination of Elizabeth], as, they thought, promised infallible success, were impatient to communicate the design to the queen of Scots [...]

For this service, they employed Gifford, who immediately applied to Walsingham [Sir Francis, the Secretary of State], that the interest of that minister might forward his secret correspondence with Mary. Walsingham proposed the matter to Paulet [...] The letters, by Paulet's connivance, were thrust through a chink in the wall; and answers were returned by the same conveyance.

Babington informed Mary of the design laid for a foreign invasion, the plan of an insurrection at home, the scheme for her deliverance, and the conspiracy for assassinating the usurper, by 6 noble gentlemen [...] Mary replied, that she approved highly of the design; [...]

These letters [...] were carried by Gifford to secretary Walsingham; were decyphered by the art of Philips, his clerk; and copies taken of them.

Walsingham employed another artifice, in order to obtain full insight into the plot: He subjoined to a letter of Mary's a postscript in the same cipher; in which he made her desire Babington to inform her of the names of the conspirators. The indiscretion of Babington furnished Walsingham with still another means of detection, as well as of defence. That gentleman had caused a picture to be drawn, where he himself was represented standing amidst the six assassins; and a motto was subjoined, expressing that their common perils were the band of their confederacy."

(From The History of England by David Hume.)

On her trial, Mary denied the charges of the insurrection & the assassination, stating that she personally did not write those letters in such a form, for all her correspondence was controlled by 2 secretaries, who did the tedious process of (de|en)cryption on her behalf.

Saturday, October 29, 2016

BOM & exec

Abstract

Recently, I’ve stumbled upon a post about an accidental BOM in a shell script file. tl;dr for those who don’t read Ukrainian:

  1. A guy had a typical shell script that got corrupted by some Windows editor by prefixing the first line of the file (the shebang line) with the BOM.
  2. The shell was trying to execute the script.
  3. Everybody got upset.

I got curious why bash tries to run scripts w/ BOM in the first place. I’ve looked into the latest bash-4.3 & tcsh-6.19.00 on Fedora 24. Everywhere in the text below we draw the BOM w/ the replacement character (codepoint U+FFFD): �.

Some findings:

  • I was wrong about the bloody shebang lines for I thought that no shell ever reads them.
  • bash & tcsh don’t use libc properly & both invent their own rigmarole instead of using the provided routine.
  • bash is a mess! (Which is hardly a discovery.)

With shebang

If a file contains a valid shebang line, everything is easy: when you pass the file name to any of execv, execve, execlp, etc. functions, the kernel steps in, reads the shebang line and executes the interpreter, that was mentioned in the shebang, with the file in question as its argument.

This picture falls to pieces, when the file contains the petty BOM, for the kernel fails to recognize that �#!/omg/lol should be (in our naïve mind) an equivalent to #!/omg/lol.

Both tcsh & bash have a backup plan for systems w/o the shebang support in the kernel. Besides the obvious win32 candidate, tcsh lists 2 other systems: os390 & bs2000 (I wonder who on earth still have them). bash uses autoconf & therefore doesn’t have a pre hard-coded build configuration set. Unfortunately, I believe the autoconf test for the shebang line support is bogus:

$ cat ac_sys_interpreter
#! /bin/cat
exit 69

Presumably, the thinking was: if you run it on any modern system, the kernel will run /bin/cat ac_sys_interpreter which will just print the file, but on prehistoric time-sharing machines a simple-minded /bin/sh will execute it as a shell script & then you can test if the exit code == 69. (For why it would do so–read the next section.) The trouble is that such an old system may very well have a /bin/sh that does its own shebang processing in case the kernel doesn’t, alas rendering the test useless & henceforth compiling bash w/o shebang support.

Without shebang

As long as the kernel flops at the invalid first line, the whole commotion becomes the case of a file w/o the shebang.

This is how we were all taught about interpreter files back in the day:

“the shell reads the command and tries to execlp the filename. Because the shell script is an executable file but isn’t a machine executable, an error is returned and execlp assumes that the file is a shell script (which it is). Then /bin/sh is executed with the pathname of the shell script as its argument.”

(from APUE, the 3rd ed)

E.g. suppose we have

$ cat demo2.sh
echo Діти, це їжачок!
ps -p $$                # print the shell the script is running under

If we run it, the shell

  1. checks if the script has executable bits (suppose it has)
  2. tries to exec the file
  3. which fails with ENOEXEC, for it’s not an ELF
  4. [a tcsh/bash dance]
  5. exec again but this time it’s /bin/sh with demo2.sh as an argument

The last item is important & may not be quite apparent, for if you have a csh-script

$ cp demo2.sh demo2.csh

you may expect that tcsh will not run it as sh-one:

$ tcsh -f
> ./demo2.csh
Діти, це їжачок!
   PID TTY          TIME CMD
102213 pts/21   00:00:00 sh

which is false, for tcsh follows the standards here.

Expectations vs. reality

APUE says a shell ought to use execlp, which in turn is supposed to do all the dirty work for us. As it happens, execlp does exactly that, at least in glibc on Linux. Of course, both bash & tcsh ignore the advice & use their own scheme.
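
A minimal C sketch of what APUE describes: glibc’s execlp retries a file that fails w/ ENOEXEC by handing it to /bin/sh (demo2.sh is the shebang-less script from above).

/* run-demo2.c -- rely on execlp()'s ENOEXEC fallback */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* demo2.sh has no shebang: the kernel's execve() fails w/ ENOEXEC &
       glibc retries the file as "/bin/sh ./demo2.sh" */
    execlp("./demo2.sh", "./demo2.sh", (char *)NULL);
    perror("execlp");       /* reached only if even the fallback failed */
    return 127;
}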

tcsh does a plain execv; then, after the failure, it peeks into the first 2 bytes to see (w/ the help of iswprint(3)) if they are “printable”. Here, if tcsh (a) finds the file “acceptable” & (b) tries to run a script with a shebang line in it on a system w/o kernel support for such a line, it processes that line by itself.

If we poison our script with the BOM:

$ uconv --add-signature demo2.sh > demo2.bom.sh
$ chmod +x !$
$ head -c 37 !$ | hexdump -c
0000000 357 273 277   e   c   h   o     320 224 321 226 321 202 320 270
0000010   ,     321 206 320 265     321 227 320 266 320 260 321 207 320
0000020 276 320 272   !  \n                                            
0000025

tcsh doesn’t try to re-execv & aborts:

> ./demo2.bom.sh
./demo2.bom.sh: Exec format error. Wrong Architecture.

bash, on the other hand, tries to be more clever & fails spectacularly. After execve it goes on a journey of figuring out why the exec has failed. It:

  1. opens the file & analyses the shebang line! In the example above we didn’t have one, but if we did, bash would have produced a message:

    $ cat demo3.invalid.awk
    #!/usr/bin/awwwwwwwk -f
    BEGIN { print "this is awk" }
    
    $ ./demo3.invalid.awk
    sh: ./demo3.invalid.awk: /usr/bin/awwwwwwwk: bad interpreter: No such file or directory
    

    tcsh won’t do anything like that & will print ./demo3.invalid.awk: Command not found..

  2. checks if the file has an ELF header & tries to find out what is wrong w/ it;

  3. reports the “success” of the execution, if the file has the length of 0.

  4. checks if the file is “binary”. I use quotes here, for this is an example of how good intentions don’t always turn into reality. Instead of a simple 2-byte check, like it’s done in tcsh, bash reads 80 bytes & calls a certain check_binary_file() function that is a good example of why you should not blindly trust the comments in the code:

    /* Return non-zero if the characters from SAMPLE are not all valid
       characters to be found in the first line of a shell script.  We
       check up to the first newline, or SAMPLE_LEN, whichever comes first.
       All of the characters must be printable or whitespace. */
       
    int
    check_binary_file (sample, sample_len)
         char *sample;
         int sample_len;
    {
      register int i;
      unsigned char c;
       
      for (i = 0; i < sample_len; i++)
        {
          c = sample[i];
          if (c == '\n')
            return (0);
          if (c == '\0')
            return (1);
        }
       
      return (0);
    }
    

    Despite the declared resolution that all of the characters must be printable or whitespace, the function returns 1 only when sample contains the NULL character. Our BOM-example doesn’t have one, thus the script runs, albeit with a somewhat cryptic error if you have no idea about the existence of the BOM in the file:

    $ ./demo2.bom.sh
    ./demo2.bom.sh: line 1: �echo: command not found
       PID TTY          TIME CMD
    115569 pts/26   00:00:00 sh
    

    What if we do have the NULL character?

    $ hexdump -c demo4.null.sh
    0000000   e   c   h   o      \0  \n   e   c   h   o     320 224 321 226
    0000010 321 202 320 270   ,     321 206 320 265     321 227 320 266 320
    0000020 260 321 207 320 276 320 272   !  \n   p   s       -   p       $
    0000030   $  \n                                                        
    0000032
    

    Here NULL is an argument to echo command, which should be totally legal, but not w/ bash!

    $ ./demo4.null.sh
    sh: ./demo4.null.sh: cannot execute binary file: Exec format error
    

    Which of course wouldn’t be an issue had the file had the shebang line.

  5. If bash finds the file “acceptable” on a system w/o kernel support for the shebang line when the file indeed contains one, it does the same thing tcsh does: tries to process it by itself.
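
A quick check of item #3 (a sketch: demo5.empty.sh is a made-up name, & I assume a Linux box where execve of an empty executable fails, so bash falls back to its own handling):

$ touch demo5.empty.sh && chmod +x demo5.empty.sh
$ ./demo5.empty.sh; echo $?
0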

Conclusion

The most popular shells are bloated, bizarre & full of undocumented features.

Some hints:

  • The shebang line isn’t necessary if you target /bin/sh, but the shell does less work if you provide it.
  • To view BOMs, use less(1) or hexdump(1).
  • To test for the BOM, use file(1) (see the example after this list).
  • To remove the BOM manually, use M-x find-file-literally in Emacs.
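
For example (a sketch: demo2.clean.sh is a made-up name, uconv’s --remove-signature is the counterpart of the --add-signature flag used above, & the exact wording of the file(1) output differs between versions):

$ file demo2.bom.sh        # should mention the UTF-8 BOM
$ uconv --remove-signature demo2.bom.sh > demo2.clean.sh
$ head -c 4 demo2.clean.sh | hexdump -c
0000000   e   c   h   o
0000004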

Sunday, October 2, 2016

How GNU Make's patsubst Function Really Works

Abstract

$(patsubst) is a GNU Make internal function that deals with text processing, such as file name transformations. Despite the very simple idea behind it, the peculiar way it is implemented leads to confusion & uncertainty for novice Make users. The function doesn’t return any errors or signal any warnings. It uses its own wildcard mechanism that bears no resemblance to the usual glob or regexp patterns.

For example, why doesn’t this transformation work?

$(patsubst src/%.js, build/%.js, ./src/foo.js)

We expect ./src/foo.js to be converted to build/foo.js, but patsubst leaves the file name untouched.

Extract method

Before we begin, we need a quick way of inspecting the results of patsubst evaluations. GNU Make doesn’t have a REPL. There are primitive hacks around it like ims:

$ rlwrap -S 'ims> ' ./ims
ims> . $(words will cost ten thousand lives this day)
7

that allow you to play with Make functions interactively, but they won’t help you examine Make’s internals, for there is no way to view the source code of a particular function the way you can in irb w/ the method_source gem, for example.

I’ve extracted the patsubst function from Make 4.2.90 into a separate command, gmake-patsubst. After you compile it, just run it from the shell as:

$ ./gmake-patsubst src/%.js build/%.js ./src/foo.js
./src/foo.js

providing exactly 3 arguments as you would do in makefiles, only using the shell quoting/splitting rules instead of Make’s (i.e., use a space as an argument separator instead of a comma).

(A side note about the extract: it’s ≈ 520 lines of imperative code! This is what you get when you program in C.)

If you want to read the algo itself, start from patsubst_expand_pat().

patsubst explained

Let’s recap first what patsubst does.

$(patsubst PATTERN, REPLACEMENT, TEXT)

It is mostly used to transform a list of file names. It operates like a map() over an iterable in JavaScript:

TEXT
    .split(/\s+/)
    .map( (file) => magic_transform(PATTERN, REPLACEMENT, file) )
    .join(' ')

It’s a pure function that returns a new result, leaving its arguments untouched. It treats the file names supplied in TEXT as mere strings; it doesn’t do any IO.

The first thing to remember is that it splits TEXT into chunks before doing any substantial work. All the transformation is done by applying PATTERN to each chunk individually.

For example, say we have a list of .jsx files that we want to transform into a list of .js files. You may think that the simplest way of doing it with patsubst would look like this:

$ ./gmake-patsubst .jsx .js "foo.jsx bar.jsx"
foo.jsx bar.jsx

Well, that didn’t work!

The problem is that patsubst checks whether each chunk matches PATTERN exactly, as a full word, byte for byte. In regex terms this would look like ^\.jsx$. To prove this, we modify our pattern to be exactly foo.jsx:

$ ./gmake-patsubst foo.jsx .js "foo.jsx bar.jsx"
.js bar.jsx

Which works as described but isn’t much help in real makefiles.

Thus patsubst has wildcard support. It is similar to the % character in Make pattern rules, which matches any non-empty string. For example, the % in the %.jsx pattern matches foo in the text foo.jsx. The substring that % matches (foo in the example) is called a stem1.

There can be only one % in a pattern. If you write several, only the first one acts as the wildcard; all the others are treated as regular characters.
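
A contrived check with the extract (file names containing a literal % are rare, but it shows the point: the second % in the pattern below has to be present in the chunk verbatim):

$ ./gmake-patsubst %.min.% bundle "app.min.% app.min.js"
bundle app.min.js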

To return to our example with .jsx files, using % in both the PATTERN & REPLACEMENT arguments yields the desired result:

$ ./gmake-patsubst %.jsx %.js "foo.jsx bar.jsx"
foo.js bar.js

When REPLACEMENT contains a % character, it is replaced by the stem that matched the % in PATTERN.
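
The stem can land anywhere in REPLACEMENT relative to the rest of the replacement text; e.g., w/ the same extract (the build/*.min.js names are made up):

$ ./gmake-patsubst %.jsx build/%.min.js "foo.jsx bar.jsx"
build/foo.min.js build/bar.min.js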

Using the % character in PATTERN only (w/o a matching % in REPLACEMENT) is rarely useful, unless you want to replicate Make’s $(filter-out) function:

$ ./gmake-patsubst %.jsx "" "foo.jsx bar.js"
bar.js

Which is the equivalent of

$(filter-out %.jsx, foo.jsx bar.js)

If there is no % in PATTERN but there is a % in REPLACEMENT, patsubst falls back to the simple, exact substitution that we saw before.

$ ./gmake-patsubst foo.jsx % "foo.jsx bar.jsx"
% bar.jsx

Now, to return to our first example from the Abstract:

$(patsubst src/%.js, build/%.js, ./src/foo.js)

Why didn’t it work out?

Putting together all we’ve learned so far, here is the high-level algorithm of what patsubst does:

  1. It searches for the % in PATTERN & in REPLACEMENT. If found, it cuts off everything before the %. Let’s call these cut-out parts the pattern-prefix (src/) & the replacement-prefix (build/). That leaves us with .js & (again) .js respectively; let’s call those the pattern-suffix & the replacement-suffix.

  2. Splits TEXT into chunks. In our case there is nothing to split, for we have only 1 file name (a string w/o spaces): ./src/foo.js.

  3. If there is no % in PATTERN it does a simple substitution for each chunk & returns the result.

  4. If there indeed was a % in PATTERN, then, for each chunk, it:

    4.1. (a) Makes sure that the chunk starts with pattern-prefix. In JavaScript it would look like:

     CHUNK.slice(0, PATTERN_PREFIX.length) === PATTERN_PREFIX
    

    It’s false in our example, for src/ != ./src/.

    (b) Makes sure that the chunk ends with pattern-suffix. In JavaScript it would look like:

     CHUNK.slice(-PATTERN_SUFFIX.length) === PATTERN_SUFFIX
    

    It’s true in our example, for .js == .js.

    4.2. If either check in subitem #4.1 fails (our case!), it just returns an unmodified copy of the original chunk.

    4.3. Iff2 both (a) & (b) in subitem #4.1 were indeed true, it cuts out pattern-prefix & pattern-suffix from the chunk, turning it into a stem.

    4.4. Concatenates replacement-prefix + stem + replacement-suffix.

  5. Joins all the chunks (modified or unmodified) with a space & returns the result.

As you can see, the algo is simple enough, but it’s probably not quite what you imagined after reading the Make documentation.

In conclusion, hopefully now you can explain the result of the patsubst evaluation below (why only src/baz.js was transformed correctly):

$ ./gmake-patsubst src/%.js build/%.js "./src/foo.js src/bar.jsx src/baz.js"
./src/foo.js src/bar.jsx build/baz.js
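
Knowing the algorithm, the example from the Abstract is easy to fix: either include the leading ./ in the pattern, or normalise the list w/ an extra patsubst pass first (a sketch w/ the same extract):

$ ./gmake-patsubst ./src/%.js build/%.js ./src/foo.js
build/foo.js

$ ./gmake-patsubst ./% % "./src/foo.js src/bar.jsx src/baz.js"
src/foo.js src/bar.jsx src/baz.js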

The nodejs version of patsubst can be found here. Note that it’s a simple example & must not be treated as a reference.

  1. (For non-English speakers like yours truly.) The noun stem means several things: 1) (in linguistics) a form of a word after all affixes are removed; 2) (in botany) a slender structure that supports a plant.

  2. A quote from the Emacs manual: ‘“Iff” means “if and only if”. […] Try to avoid using this term in documentation, since many are unfamiliar with it and mistake it for a typo.’