Friday, May 11, 2018

Writing a podcast client in GNU Make

Why? First, because I wanted parallel downloads & my old podcacher didn't support that. Second, because it sounded like a joke.

The result is gmakepod.

Evidently, some ingredients for such a client are practically impossible to write using plain Make (like an xml parser). Would the client then be considered a true Make program? E.g., there is a clever bash json parser, but when you look in its src you see that it doesn't shy away from awk & grep.

At least we can try to write in Make as many components as possible, even if (when) they become a bottleneck. Again, why? 1

Overview

Using Make means constructing a proper DAG. gmakepod uses 6 vertices: run -> .download.mk -> .files.new -> .files -> .enclosures -> .feeds. All except the 1st one are file targets.

target           desc
.feeds (phony)   parse a config file to extract feed names & urls
.enclosures      fetch & parse each feed to extract enclosure urls
.files           generate a proper output file name for each url
.files.new       check if we've already downloaded a url in the past, filter those out
.download.mk     generate a makefile, where we list all the rules for all the enclosures
run (default)    run the generated makefile

Every time a user runs gmakepod, it remakes all those files anew.

Config file

A user needs to keep a list of feed subscriptions somewhere. The 1st thing that comes to mind is to use a list of newline-separated urls, but what if we want to have diff options for each feed? E.g., an enclosure filter of some sort? We can just add 'options' to the end of the line (Idk, like url=http://example.com!filter.type=audio) but then we need to choose a record sep that isn't a space, which means either escaping the record sep in urls or living w/ the notion 'no ! char is allowed' or similar nonsense.

The next question is: how does a makefile process a single record? It turns out, we can eval foo=bar inside of a recipe, so if we pass

make -f feed_parse.mk 'url=http://example.com!filter.type=audio'

where feed_parse.mk looks like

parse-record = ... # replace ! w/ a newline
%:
	$(eval $(call parse-record,$*))
	@echo $(url)
	@echo $(filter.type)

then every 'option' becomes a variable! This sounds good but is actually a gimcrack.

Make will think that url=http://example.com!filter.type=audio is a variable override & complain about missing targets. To ameliorate that we can prefix the line w/ # or :. Sounds easy, but then we need to slice the line in the parse-record macro. This is the easiest job in any lang except Make--you won't do it correctly w/o invoking awk or any other external tool.

If we use an external tool for parsing a mere config line, why use a self-inflicted parody of the config file instead of a human-readable one?

The ini format would fit perfectly. E.g.,

[JS Party]
url = https://changelog.com/jsparty/feed

lines are self-explanatory to anyone who has seen a computer once. We can use Ruby, for example, to convert the lines to :name=JS_Party!url=http://changelog.com... or even better to

:\{\".name\":\"JS_Party\",\".url\":\"https://changelog.com/jsparty/feed\"\}

(notice the amount of shell escaping) & use Ruby again in the makefile to transform that json to name=val pairs that we eval in the recipe later on.
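The json-to-pairs step could look roughly like the sketch below. The helper name record_to_assignments is made up for this illustration, & how the makefile actually invokes Ruby isn't shown here; the sketch only covers turning one target name (after the shell has already removed the escaping) into lines fit for $(eval ...).

```ruby
require 'json'

# Hypothetical helper: turn a target name such as
#   :{".name":"JS_Party",".url":"https://changelog.com/jsparty/feed"}
# into 'name = value' lines that a recipe could hand to $(eval ...).
def record_to_assignments(target)
  record = JSON.parse(target.sub(/\A:/, ''))   # drop the leading ':' prefix
  record.map do |name, value|
    # Double every '$' so Make doesn't expand it, & escape '#'
    # so Make doesn't treat the rest of the line as a comment.
    safe = value.to_s.gsub('$') { '$$' }.gsub('#') { '\#' }
    "#{name} = #{safe}"
  end.join("\n")
end

puts record_to_assignments(':{".name":"JS_Party",".url":"https://changelog.com/jsparty/feed"}')
```

For the record above this prints one '.name = JS_Party' line & one '.url = ...' line, i.e., the dot-prefixed variables an .ini option is supposed to become.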

How do we pass them to the makefile? If we escape each line correctly, xargs will suffice:

ruby ini-parse.rb subs.ini | xargs make -f feed_parse.mk
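What ini-parse.rb does inside isn't shown here, so the following is only a guess at its shape: a hand-rolled parser for a tiny ini subset ([section] headers & key = value lines) that emits one shell-escaped json record per feed. The function names parse_ini & to_record are assumptions, as is the rule that spaces in a section name turn into underscores.

```ruby
require 'json'
require 'shellwords'

# Parse a tiny subset of the ini format: [section] headers & key = value
# lines. Blank lines & comments (# or ;) are skipped. Every key gets a
# dot prefix to mark it as coming from the .ini.
def parse_ini(text)
  feeds = []
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#', ';')
    if (m = line.match(/\A\[(.+)\]\z/))
      feeds << { '.name' => m[1].strip.gsub(/\s+/, '_') }
    elsif (m = line.match(/\A([^=]+?)\s*=\s*(.*)\z/)) && feeds.any?
      feeds.last[".#{m[1]}"] = m[2]
    end
  end
  feeds
end

# One feed -> one shell-escaped 'target' that xargs can pass to make.
# The leading ':' keeps Make from reading the argument as a variable override.
def to_record(feed)
  ':' + Shellwords.escape(JSON.generate(feed))
end

ini = "[JS Party]\nurl = https://changelog.com/jsparty/feed\n"
puts parse_ini(ini).map { |feed| to_record(feed) }
```

Shellwords.escape puts a backslash in front of every char outside a small safe set, which is where all the \{ & \" in the record come from; xargs removes them again before make sees the argument.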

Parsing xml

Ruby, of course, has a full-fledged rss parser in its stdlib, but do we need it? A fancy podcast client (that tracks your every inhalation & exhalation) would display all the metadata it can obtain from an rss, but I don't want a fancy podcast client; what I want is a program that reliably downloads the last N enclosures from a list of feeds.

Thus the minimal parser looks like

$ curl -s https://emacsel.com/mp3.xml | \
nokogiri -e 'puts $_.css("enclosure,link[rel=\"enclosure\"]").\
map{|e| e["url"] || e["href"]}' \
| head -2
https://cdn.emacsel.com/episodes/emacsel-ep7.mp3
https://cdn.emacsel.com/episodes/emacsel-ep6.mp3

Options

One of the obviously helpful user options is the number of enclosures the user wants to download. E.g., when the user types

$ gmakepod g=emacs e=5

the client produces a .files file that has a list of 5 shell-escaped json 'records'. The e=5 option could also appear in an .ini file. To distinguish options passed from the CL from options read from the .ini, we prefix options from the .ini w/ a dot. The opt macro is used to get the final value:

opt = $(or $($1),$(.$1),$2)

E.g.: $(call opt,e,2) checks the CL opt first, then the .ini opt, &, as a last resort, returns the def value 2.

Output file names

Not every enclosure url has a nice path name. What file name should we assign to an .mp3 from the url below?

https://play.podtrac.com/npr-510289/npr.mc.tritondigital.com/NPR_510289/media/anon.npr-mp3/npr/pmoney/2018/05/20180504_pmoney_pmpod839v2.mp3?orgId=1&d=1606&p=510289&story=608577210&t=podcast&e=608577210&ft=pod&f=510289

Maybe we can use the last segment of the URI path, 20180504_pmoney_pmpod839v2.mp3 in this case. Is it possible to extract it in pure Make?

In the most extreme case, the uri path may not even be unique. Say a feed has 2 entries & each entry has 1 enclosure, then they both may have the same path name:

<entry>
<title>Foo</title>
<link rel="enclosure" type="audio/mpeg" length="1234"
href="http://example.com/podcast?episode=2"/>
<id>2ba2a6ee-52fb-11e8-9176-000c2945132f</id>
</entry>

<entry>
<title>Bar</title>
<link rel="enclosure" type="audio/mpeg" length="5678"
href="http://example.com/podcast?episode=1"/>
<id>3f66b198-52fb-11e8-a2a8-000c2945132f</id>
</entry>

In addition, the output name must be 'safe' in terms of Make. This means no spaces or $, :, %, ?, *, [, ~, \, # chars.

All of this leads us to another use of Ruby in Make's stead. We extract the uri path from the url, strip out the extension, prefix the path w/ the name of the feed (listed in the .ini), append a random string + the extension name, so the output file from the above url looks similar to:

media/NPR_Planet_Money/20180504_pmoney_pmpod839v2.84a8b172.mp3

A homework question: what should we do if a uri path lacks a file extension?
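The naming steps above could be sketched as follows. The function name output_name, the UNSAFE char class & the 'enclosure' fallback are assumptions made for this example, & the url is a shortened version of the NPR one. When the uri path has no extension the sketch simply leaves it off, which dodges the homework question rather than answering it.

```ruby
require 'uri'
require 'securerandom'

# Chars that are unsafe in a Make target name: whitespace & $ : % ? * [ ] ~ \ #
UNSAFE = /[\s$:%?*\[\]~\\#]/

# Build 'media/<feed>/<basename>.<random>.<ext>' for an enclosure url.
# The random suffix keeps names unique even when 2 urls share a uri path.
def output_name(feed_name, url, suffix = SecureRandom.hex(4))
  path = URI(url).path                # the query string is dropped here
  ext  = File.extname(path)           # e.g. ".mp3"; "" if there is none
  base = File.basename(path, ext)
  base = 'enclosure' if base.empty? || base == '/'   # made-up fallback
  clean = ->(s) { s.gsub(UNSAFE, '_') }
  "media/#{clean.call(feed_name)}/#{clean.call(base)}.#{suffix}#{clean.call(ext)}"
end

url = 'https://play.podtrac.com/npr-510289/npr/pmoney/2018/05/20180504_pmoney_pmpod839v2.mp3?orgId=1&d=1606'
puts output_name('NPR Planet Money', url, '84a8b172')
```

W/ the fixed suffix passed in, this prints media/NPR_Planet_Money/20180504_pmoney_pmpod839v2.84a8b172.mp3; w/o the 3rd argument the suffix is 8 random hex chars.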

History

If we successfully downloaded an enclosure, there is rarely a need to download it again. A 'real' podcast client would look at id/guid (the date is usually useless) to determine if the entry has any updated enclosures; our Mickey Mouse parser relies on urls only.

Make certainly doesn't have any key/value store. We could try employing the sqlite CL interface or dig out gdbm or just append a url+'\n' to some history.txt file.

The last one is tempting, for grep is uber-fast. As the history file becomes a shared resource, though, we might get ourselves in trouble during parallel downloads. The lockfile rubygem provides a CL wrapper around a user-specified command, hence can protect our 'db':

rlock history.lock -- ruby -e 'IO.write "history.txt", ARGV[0]+"\n", mode: "a"' 'http://example.com/file.mp3'

It works similarly to flock(1), but supposedly is more portable.
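For comparison, here is the same append-only 'db' written w/ Ruby's own File#flock instead of rlock. This is a swapped-in technique to show the idea, not what gmakepod does, & the helper names history_add & history_include? are invented for the example.

```ruby
# Append a url to the history file while holding an exclusive lock,
# so that parallel jobs can't interleave their writes.
def history_add(file, url)
  File.open(file, 'a') do |f|
    f.flock(File::LOCK_EX)
    f.puts(url)
  end   # closing the file flushes the write & releases the lock
end

# Has this exact url been recorded before? (A linear scan, like grep -Fx.)
def history_include?(file, url)
  File.exist?(file) && File.foreach(file).any? { |line| line.chomp == url }
end
```

A filter for .files.new would then keep only the urls for which history_include? returns false.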

Makefile generation

The last but one step is to generate a makefile named .download.mk. After we've collected all the enclosure urls, we write to the .mk file a set of rules like

media/Foobar_Podcast/file.84a8b172.mp3:
	@mkdir -p $(dir $@)
	curl 'http://example.com/file.mp3' > $@
	@rlock history.lock -- ruby -e 'IO.write "history.txt", ARGV[0]+"\n", mode: "a"' 'http://example.com/file.mp3'
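A generator for rules of this shape might look like the sketch below; the rule function & the HISTORY_CMD constant are names invented for the example. It quotes the url w/ Shellwords.escape (backslashes instead of the single quotes above, same effect for the shell) & doubles every $ so that Make passes it to the shell untouched.

```ruby
require 'shellwords'

# The history-recording command from above, kept verbatim; inside %q{}
# the \n stays a literal backslash + n, which is what the recipe needs.
HISTORY_CMD = %q{rlock history.lock -- ruby -e 'IO.write "history.txt", ARGV[0]+"\n", mode: "a"'}

# Emit one rule: the target is the output file, the recipe downloads
# the url & then records it in the history.
def rule(output, url)
  quoted = Shellwords.escape(url).gsub('$') { '$$' }  # shell-safe, then Make-safe
  [
    "#{output}:",
    "\t@mkdir -p $(dir $@)",
    "\tcurl #{quoted} > $@",
    "\t@#{HISTORY_CMD} #{quoted}"
  ].join("\n") + "\n"
end

puts rule('media/Foobar_Podcast/file.84a8b172.mp3', 'http://example.com/file.mp3')
```

Every recipe line starts w/ a real tab, which is easy to lose when a makefile is assembled from strings.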

Our last step is to run

make -f .download.mk -k -j1 -Oline

The number of jobs is 1 by default, but is controllable via the CL param (gmakepod j=4).


  1. Image src: Fried-Tomato 
