Monday, April 2, 2018

Node FeedParser & Transform streams

FeedParser is itself a Transform stream that operates in object mode. Nevertheless, in the majority of examples it appears at the end of a pipeline, e.g.:

let fp = new FeedParser()
fp.on('readable', () => {
    // get the data: read() returns parsed article objects one by one
    let item
    while ((item = fp.read())) {
        // do smthg w/ item.title, item.link, etc.
    }
})
my_readable_stream.pipe(fp)

Say we want to get the first 2 headlines from an rss:

$ curl -s https://emacsel.com/mp3.xml | node headlines1.js | head -2
Episode 7 - Jorgen Schäfer
Episode 6 - Charles Lowell

Let's use FeedParser as God hath intended: as a transform stream that reads the rss from stdin & writes the articles to another transform stream that grabs only the headlines &, in turn, pipes them to stdout:

$ cat headlines1.js
let Transform = require('stream').Transform
let FeedParser = require('feedparser')

class Filter extends Transform {
    constructor() {
      super()
      this._writableState.objectMode = true // we can eat objects
    }
    _transform(input, encoding, done) {
      this.push(input.title + '\n')
      done()
    }
}

process.stdin.pipe(new FeedParser()).pipe(new Filter()).pipe(process.stdout)

(Type npm i feedparser before running the example.)

This works, although it throws an EPIPE error: the head command exits early & closes the read end of the pipe while the script is still trying to write to its stdout.

You may add smthg like

process.stdout.on('error', (e) => e.code !== 'EPIPE' && console.error(e))

to silence it, but it's better to use the pump module (npm i pump) & catch the errors from all the streams in one place. There's a chance that the module will be added to the node core, so get used to it already.

$ diff -u1 headlines1.js headlines3.js | tail -n+4
 let FeedParser = require('feedparser')
+let pump = require('pump')

@@ -14,2 +15,3 @@

-process.stdin.pipe(new FeedParser()).pipe(new Filter()).pipe(process.stdout)
+pump(process.stdin, new FeedParser(), new Filter(), process.stdout,
+     err => err && err.code !== 'EPIPE' && console.error(err))
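
For reference, here's the resulting headlines3.js in full (reconstructed from the diff above; the Filter class is unchanged):

let Transform = require('stream').Transform
let FeedParser = require('feedparser')
let pump = require('pump')

// ... the Filter class from headlines1.js goes here, verbatim ...

// pump pipes the streams together & invokes the callback w/ the first
// error from any stream in the chain (or w/ nothing on a clean finish)
pump(process.stdin, new FeedParser(), new Filter(), process.stdout,
     err => err && err.code !== 'EPIPE' && console.error(err))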

Now, what if we want to control the exact number of articles our Filter stream receives? I.e., if an rss is many MBs long & we want only n articles from it? First, we add a command-line parameter to our script:

$ cat headlines4.js
let Transform = require('stream').Transform
let FeedParser = require('feedparser')
let pump = require('pump')

class Filter extends Transform {
    constructor(articles_max) {
      super()
      this._writableState.objectMode = true // we can eat objects
      this.articles_max = articles_max
      this.articles_count = 0
    }
    _transform(input, encoding, done) {
      if (this.articles_count++ < this.articles_max) {
          this.push(input.title + '\n')
      } else {
          console.error('ignore', this.articles_count)
      }
      done()
    }
}

let articles_max = Number(process.argv[2]) || 1
pump(process.stdin, new FeedParser(), new Filter(articles_max), process.stdout,
     err => err && err.code !== 'EPIPE' && console.error(err))

Although this works too, it still downloads & parses the articles we don't want:

$ curl -s https://emacsel.com/mp3.xml | node headlines4.js 2
Episode 7 - Jorgen Schäfer
Episode 6 - Charles Lowell
ignore 3
ignore 4
ignore 5
ignore 6
ignore 7

Unfortunately, to be able to 'unpipe' a readable stream (from the Filter's standpoint that's the FeedParser instance) we have to have a ref to it, & I don't know of a way to get such a ref from within a Transform stream, except by explicitly passing a pointer:

$ diff -u headlines4.js headlines5.js | tail -n+4
 let pump = require('pump')

 class Filter extends Transform {
-    constructor(articles_max) {
+    constructor(articles_max, feedparser) {
      super()
      this._writableState.objectMode = true // we can eat objects
      this.articles_max = articles_max
      this.articles_count = 0
+
+     if (feedparser) {
+         this.once('unpipe', () => {
+             this.end()      // ensure 'finish' event gets emitted
+         })
+     }
+     this.feedparser = feedparser
     }
     _transform(input, encoding, done) {
      if (this.articles_count++ < this.articles_max) {
          this.push(input.title + '\n')
      } else {
-         console.error('ignore', this.articles_count)
+         console.error('stop on', this.articles_count)
+         if (this.feedparser) this.feedparser.unpipe(this)
      }
      done()
     }
 }

 let articles_max = Number(process.argv[2]) || 1
-pump(process.stdin, new FeedParser(), new Filter(articles_max), process.stdout,
+let fp = new FeedParser()
+pump(process.stdin, fp, new Filter(articles_max, fp), process.stdout,
      err => err && err.code !== 'EPIPE' && console.error(err))

Test:

$ curl -s https://emacsel.com/mp3.xml | node headlines5.js 2
Episode 7 - Jorgen Schäfer
Episode 6 - Charles Lowell
stop on 3
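
(An aside: one way to avoid passing the pointer explicitly is the 'pipe' event — node emits it on the destination stream w/ the source as the argument, so the Filter could capture the ref itself. A minimal sketch of such a Filter, not tested w/ the rest of the pipeline:)

let Transform = require('stream').Transform

class Filter extends Transform {
    constructor(articles_max) {
      super()
      this._writableState.objectMode = true // we can eat objects
      this.articles_max = articles_max
      this.articles_count = 0
      // src.pipe(dest) emits 'pipe' on dest w/ src as the argument
      this.on('pipe', src => this.src = src)
      this.once('unpipe', () => this.end()) // ensure 'finish' gets emitted
    }
    _transform(input, encoding, done) {
      if (this.articles_count++ < this.articles_max) {
          this.push(input.title + '\n')
      } else if (this.src) {
          console.error('stop on', this.articles_count)
          this.src.unpipe(this)
      }
      done()
    }
}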

Grab the gist w/ a final version here.

Friday, March 9, 2018

YA introduction to GNU Make

I don't want to convert this blog into a Make-propaganda outlet, but here's another take of mine on that versatile tool + a bunch of HN comments. Enjoy.

Saturday, February 24, 2018

A shopping hours calculator

Say you have a small mom&pop online shop that sells widgets.

Suppose you have (a) dedicated personnel that calls customers on the phone to confirm an order w/ its shipping details, & (b) such a 'department' works regular hours & isn't available 24/7.

(This is the exact scheme to which the vast majority of Ukrainian online shops still adhere.)

This is how a shop gets its 1st bad review: it's 7 o'clock on a Friday evening, a client places an order for a widget & waits for a call that doesn't come until Monday morning. The enraged client may even try to call the shop during the weekend, not realising it's closed.

One possible solution is to display a note (on the page where customers review their orders) saying that at this very moment the person who can confirm/discuss their order is offline.

How do you calculate that? It's an easy task if you work the same hours every day of the year, but for small online shops that's often not the case. Not only are there a handful of official gov holidays, some holidays have moveable dates (Easter), & sometimes a holiday falls on a weekend & must be transferred to the next working day. What if your customer department has a lunch break like everyone else? A client should be able to see that the call they're so eagerly waiting for is going to come in an hour, not in 2 minutes.

So I wrote a small JS library to help w/ that: https://github.com/gromnitsky/shopping-hours. You can use it on either the server or the client side. The basic idea: we fetch a .txt file (a calendar) & check for an 'is the shop open' status that simultaneously tells us when the shop will next open or close. We fetch the cal only once & do the checking however often we care.

The calendar DSL looks like this:

-/-                 9:00-13:00,14:00-18:00
1/1                 0:0-0:0                   o   new year
easter_orthodox     0:0-0:0                   o
fri.4/11            6:30-23:00                -   black friday
sat/-               10:30-17:00
sun/-               0:0-0:0
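
(Reading the examples: each line is a date pattern, a list of open intervals, & an optional flag w/ a description. -/- sets the default for any day, 1/1 is an exact date (Jan 1), fri.4/11 is, presumably, the 4th Friday of November, & sat/- matches every Saturday; see the repo for the exact grammar.)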

On the client side, you can test it by adding <script src="shopping_hours.min.js"></script> to your .html & the following to your .js:

async function getcal(url) {
    let cal = await fetch(url).then( r => r.text())
    return shopping_hours(cal)  // parse the calendar
}

getcal('calendar1.txt').then(sh => {
    console.log(sh.business())
})

which outputs smthg like {status: "open", next: Sat Feb 24 2018 17:00:00 GMT+0200 (EET)}. See the github page for the details.
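
The same check on the server side might look like this (a sketch that reads the calendar from disk instead of via fetch(); the npm package & require name shopping-hours is an assumption — check the repo's README):

let fs = require('fs')
let shopping_hours = require('shopping-hours') // the require name is an assumption

let sh = shopping_hours(fs.readFileSync('calendar1.txt', 'utf8')) // parse the calendar
console.log(sh.business()) // e.g. {status: "open", next: <Date>}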

Thursday, January 18, 2018

There is no price for good advice

From The Design & Evolution of C++ by Bjarne Stroustrup:

'In 1982 when I first planned Cfront, I wanted to use a recursive descent parser because I had experience writing and maintaining such a beast, because I liked such parsers' ability to produce good error messages, and because I liked the idea of having the full power of a general-purpose programming language available when decisions had to be made in the parser.

However, being a conscientious young computer scientist I asked the experts. Al Aho and Steve Johnson were in the Computer Science Research Center and they, primarily Steve, convinced me that writing a parser by hand was most old-fashioned, would be an inefficient use of my time, would almost certainly result in a hard-to-understand and hard-to-maintain parser, and would be prone to unsystematic and therefore unreliable error recovery. The right way was to use an LALR(1) parser generator, so I used Al and Steve's YACC.

For most projects, it would have been the right choice. For almost every project writing an experimental language from scratch, it would have been the right choice. For most people, it would have been the right choice. In retrospect, for me and C++ it was a bad mistake.

C++ was not a new experimental language, it was an almost compatible superset of C - and at the time nobody had been able to write an LALR(1) grammar for C. The LALR(1) grammar used by ANSI C was constructed by Tom Pennello about a year and a half later - far too late to benefit me and C++. Even Steve Johnson's PCC, which was the preeminent C compiler at the time, cheated at details that were to prove troublesome to C++ parser writers. For example, PCC didn't handle redundant parentheses correctly so that int(x); wasn't accepted as a declaration of x.

Worse, it seems that some people have a natural affinity to some parser strategies and others work much better with other strategies. My bias towards topdown parsing has shown itself many times over the years in the form of constructs that are hard to fit into a YACC grammar. To this day [1993], Cfront has a YACC parser supplemented by much lexical trickery relying on recursive descent techniques. On the other hand, it is possible to write an efficient and reasonably nice recursive descent parser for C++. Several modern C++ compilers use recursive descent.'