Monday, February 8, 2016

Ruby mail & Base64 Content Transfer Encoding

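If you need to parse emails that for some reason still use prehistoric charsets (like koi8-u), the mail gem fails to decode the bodies of such messages properly: it undoes the base64 transfer encoding but does not convert the resulting bytes from the declared charset.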

$ cat message.koi8u.mbox
From alice@example.com Mon Feb  8 22:26:51 2016
From: alice@example.com
To: bob@example.net
Subject: Kings
Date: Mon, 08 Feb 2016 20:26:51 +0000
MIME-Version: 1.0
Message-Id: <1@example.com>
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=koi8-u

7sHE18/SpiDX1sUg083F0svMzywgpiwg1NjNz8Ag0M/XydTJyiwK5NKmzcGk
LCDT1c3VpCC2pNLV 08HMyc0uCvcgy8XE0s/Xycgg0MHMwdTByCwgzc/XIM7
F08HNz9fJ1MnKLArkwdfJxCDQz8jPxNbB pCCmLCDPIMPB0iDOxdPJ1MnKLA
rzwc0g08/CpiDHz9fP0snU2DogIvEuLi4g7ckg0M/XxczJzSEK
$ irb
2.1.3 :001 > require 'mail'
true
2.1.3 :002 > m = Mail.read 'message.koi8u.mbox'
[...]
2.1.3 :003 > m.body.decoded
"\xEE\xC1\xC4\xD7\xCF\xD2\xA6 [...]\n"
2.1.3 :004 > m.body.decoded.encoding
#<Encoding:ASCII-8BIT>

I.e., the result looks like total garbage: a binary (ASCII-8BIT) string of raw koi8-u bytes rather than readable text.
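
The bytes themselves are not damaged; they are just not labelled with their charset. A minimal sketch in plain Ruby (no mail gem needed), assuming only the first seven bytes of the output above, shows what that means:

```ruby
# First seven bytes of the decoded body from the irb session above.
raw = "\xEE\xC1\xC4\xD7\xCF\xD2\xA6".b   # String#b returns a binary (ASCII-8BIT) copy

# The string carries no charset information, only the binary label.
puts raw.encoding == Encoding::ASCII_8BIT   # true
puts raw.bytesize                           # 7

# Relabelling the bytes as UTF-8 does not help: koi8-u bytes are not a
# valid UTF-8 sequence, so Ruby cannot treat them as text that way.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?                # false
```

So the string has to be relabelled with the charset the message actually declares before it can be transcoded.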

But since the Mail::Message#charset method gives us the charset name, we can convert the string to UTF-8 by hand: force_encoding relabels the raw bytes as koi8-u, and encode then transcodes them to UTF-8:

2.1.3 :005 > m.body.decoded.force_encoding(m.charset).encode 'utf-8'
"Надворі вже смеркло, і, тьмою повитий,\n
Дрімає, сумує Ієрусалим.\n
В кедрових палатах, мов несамовитий,\n
Давид походжає і, о цар неситий,\n
Сам собі говорить: \"Я... Ми повелим!\n"
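
The same two steps can be tried without the mail gem. The sketch below is only an illustration: to_utf8 is a hypothetical helper name (not part of the mail gem), and the input is the first line of the base64 body from message.koi8u.mbox, decoded with core Ruby's String#unpack1.

```ruby
# Hypothetical helper (not part of the mail gem): relabel raw bytes with
# their real charset, then transcode them to UTF-8.
def to_utf8(raw_bytes, charset)
  # force_encoding changes the label in place, so work on a copy.
  raw_bytes.dup.force_encoding(charset).encode('UTF-8')
end

# First line of the base64 body from message.koi8u.mbox.
b64 = '7sHE18/SpiDX1sUg083F0svMzywgpiwg1NjNz8Ag0M/XydTJyiwK5NKmzcGk'
raw = b64.unpack1('m')   # base64-decode; the result is a binary string

text = to_utf8(raw, 'koi8-u')
puts text.encoding       # UTF-8
puts text
```

Copying the string before force_encoding is a design choice here: m.body.decoded in the session above returns a fresh string each time, but a helper that receives an arbitrary string should not silently relabel its caller's data.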
