Wednesday, January 6, 2021

Twitter stats using gnuplot, json & make

Twitter lets you download a subset of your activity as a zip archive. Unfortunately, it comes with no useful visualisations of the data, except for a simple list of tweets with date filtering.

For example, here is what I expected to find but saw no sign of:

  1. a graph of activities over time;
  2. a list of:
    1. the most popular tweets;
    2. users to whom I reply the most.

Inside the archive there is a data/tweet.js file that contains an array (assigned to a global variable) of "tweet" objects:

window.YTD.tweet.part0 = [ {
"tweet" : {
"retweeted" : false,
"source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"favorite_count" : "2",
"id" : "12345",
"created_at" : "Sat Jun 23 16:52:42 +0000 2012",
"full_text" : "hello",
"lang" : "en",
...
}
}, ...]

The array itself is already valid JSON, hence it's trivial to convert the file to proper JSON for filtering with the json(1) tool.
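To make the conversion concrete, here is a minimal JavaScript sketch of the same idea: throw away the assignment prefix and parse the rest. The function name parseTweetJs and the sample string are made up for illustration, and cutting at the first "[" is just one way to drop the prefix.

```javascript
// Hypothetical helper: turn the text of data/tweet.js into an array of objects.
function parseTweetJs(text) {
  // Everything before the first "[" is the "window.YTD.tweet.part0 = " prefix.
  const start = text.indexOf("[");
  if (start === -1) throw new Error("no JSON array found");
  return JSON.parse(text.slice(start));
}

// A tiny stand-in for the real file, shaped like the sample above.
const sample = `window.YTD.tweet.part0 = [ {
  "tweet" : {
    "favorite_count" : "2",
    "id" : "12345",
    "full_text" : "hello",
    "lang" : "en"
  }
} ]`;

const tweets = parseTweetJs(sample);
console.log(tweets.length, tweets[0].tweet.lang);
```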

Say we want a list of the top 5 languages in which the tweets were written. A small makefile:

$ cat lang.mk
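# i = path to the archive, passed on the command line (i=1.zip)
lang: tweets.json
	json -a tweet.lang < $< | $(aggregate) | $(sort)
# drop the "window.YTD.tweet.part0 = [ {" line & put a bare "[{" back in its place
tweets.json: $(i)
	unzip -qc $< data/tweet.js | sed 1d | cat <(echo [{) - > $@

# count identical input lines
aggregate = awk '{r[$$0] += 1} END {for (k in r) print k, r[k]}'
sort = sort -k2 -n | column -t
# process substitution & pipefail need bash
SHELL := bash -o pipefail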

yields:

$ make -f lang.mk i=1.zip | tail -5
cs 16
und 286
ru 333
en 460
uk 1075

(1.zip is the archive that Twitter permits us to download.)
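For readers who don't speak awk, the aggregate & sort steps boil down to counting identical values and ordering them by count. A rough JavaScript equivalent, as a sketch only: the aggregate function below is a hypothetical stand-in for the awk one-liner, and the sample data is made up.

```javascript
// Hypothetical helper: count occurrences of each value, like the awk one-liner.
function aggregate(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) || 0) + 1);
  return counts;
}

// Made-up sample data, not taken from a real archive.
const tweets = [
  { tweet: { lang: "uk" } },
  { tweet: { lang: "en" } },
  { tweet: { lang: "uk" } },
  { tweet: { lang: "ru" } },
  { tweet: { lang: "uk" } },
  { tweet: { lang: "en" } },
];

// Sort ascending by count (like `sort -k2 -n`) & keep the last 5 (like `tail -5`).
const top = [...aggregate(tweets.map(t => t.tweet.lang))]
  .sort((a, b) => a[1] - b[1])
  .slice(-5);

for (const [lang, n] of top) console.log(lang, n);
```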

To draw activity bars, the same technique applies: we extract a date from each tweet object & aggregate the results by day:

2020-12-31 5
2021-01-03 10
2021-01-04 5

This can be fed to gnuplot:

$ make -f plot.mk i=1.zip activity.svg

This makefile has an embedded gnuplot script:

$ cat plot.mk
include lang.mk

%.svg: dates.txt
	cat <(echo "$$plotscript") $< | gnuplot - > $@

dates.txt: tweets.json
	json -e 'd = new Date(this.tweet.created_at); p = s => ("0"+s).slice(-2); this.tweet.date = [d.getFullYear(), p(d.getMonth()+1), p(d.getDate())].join`-`' -a tweet.date < $< | $(aggregate) > $@

export define plotscript =
set term svg background "white"
set grid

set xdata time
set timefmt "%Y-%m-%d"
set format x "%Y-%m"

set xtics rotate by 60 right

set style fill solid
set boxwidth 1

plot "-" using 1:2 with boxes title ""
endef
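The date handling buried in the dates.txt recipe is easier to read when pulled out on its own. A minimal sketch: dayKey is a hypothetical name, and unlike the recipe, which uses local-time getters, this version uses the UTC ones so that the result doesn't depend on the machine's timezone. Like the recipe, it relies on the JavaScript engine accepting Twitter's non-ISO date format.

```javascript
// Hypothetical helper: map Twitter's created_at string to a YYYY-MM-DD day key.
function dayKey(createdAt) {
  // Same parsing as in the recipe: let the engine deal with the non-ISO format.
  const d = new Date(createdAt);
  if (Number.isNaN(d.getTime())) throw new Error("unparseable date: " + createdAt);
  const p = n => String(n).padStart(2, "0");
  // Assumption: UTC getters, to keep the output timezone independent.
  return [d.getUTCFullYear(), p(d.getUTCMonth() + 1), p(d.getUTCDate())].join("-");
}

// The created_at value from the sample tweet object.
const day = dayKey("Sat Jun 23 16:52:42 +0000 2012");
console.log(day);
```

Each key then becomes one line of input for the aggregate step, which produces the "date count" pairs that gnuplot reads.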

Listing the users one replies to the most is quite simple:

$ cat users.mk
users: tweets.json
	json -e 'this.users = this.tweet.entities.user_mentions.map( v => v.screen_name).join`\n`' -a users < $< | $(aggregate) | $(sort)

include lang.mk

I'm not much of a tweeter:

$ make -f users.mk i=1.zip | tail -5
<redacted> 41
<redacted> 49
<redacted> 60
<redacted> 210
<redacted> 656

Printing the most popular tweets is more cumbersome. We need to:

  1. calculate the rating of each tweet (by such a complex formula as favorite_count + retweet_count);
  2. sort all the tweet objects;
  3. slice N tweet objects.
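In plain JavaScript the three steps look roughly like this. This is only a sketch, not the recipe from the makefile: topTweets is a hypothetical name, the sample tweets are made up, and I assume retweet_count is stored as a string just like favorite_count in the sample object.

```javascript
// Hypothetical helper: return the n highest-rated tweets.
function topTweets(tweets, n) {
  // Counts are strings in the archive ("2"), so convert them before adding.
  const rating = t => Number(t.tweet.favorite_count) + Number(t.tweet.retweet_count);
  return tweets
    .map(t => ({ rating: rating(t), text: t.tweet.full_text })) // step 1: rate
    .sort((a, b) => b.rating - a.rating)                        // step 2: sort, best first
    .slice(0, n);                                               // step 3: slice N
}

// Made-up sample data, not taken from a real archive.
const tweets = [
  { tweet: { favorite_count: "2", retweet_count: "0", full_text: "hello" } },
  { tweet: { favorite_count: "10", retweet_count: "3", full_text: "gnuplot rocks" } },
  { tweet: { favorite_count: "1", retweet_count: "5", full_text: "make it so" } },
];

const best = topTweets(tweets, 2);
for (const t of best) console.log(t.rating, t.text);
```

Without the Number() conversion, "10" + "3" would concatenate into "103" & quietly wreck the ordering.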

A Make recipe for it is a little too long to show here, but you can grab a makefile that contains the recipe + all the recipes shown above.
