Using MIME::Parser fails to decode some emails

Question

I'm a Perl novice trying to use MIME::Parser to decode the MIME parts of an email. I mostly have it working, but either my code is deficient or something else is preventing the message from being decoded properly.

These are emails received from the Ubuntu security mailing list. Somehow they produce weird Â characters throughout the text, while reading the email with alpine seems to decode it just fine.

Here is one snippet from the email after it's been decoded:

Â Felix Wilhelm, Fermin J. Serna, Gabriel Campana and Kevin Hamacher
Â discovered that Dnsmasq incorrectly handled DNS requests. A remote
Â attacker could use this issue to cause Dnsmasq to crash, resulting in
Â a denial of service, or possibly execute arbitrary code.Â 
Â (CVE-2017-14491)

Here is the code snippet I'm using for this:

use MIME::Parser;
use MIME::Entity;
use MIME::WordDecoder;
use MIME::Tools;
use MIME::Decoder;
use Email::MIME;
my $parser = MIME::Parser->new;
$parser->extract_uuencode(1);
$parser->extract_nested_messages(1);
$parser->output_to_core(1);
my $buf;
while(<STDIN> ){
        $buf .= $_;
}
my $entity = $parser->parse_data($buf);
my $subject = $entity->head->get('Subject');
my $from = $entity->head->get('From');
my $AdvDate = $entity->head->get('Date');
my @mailData;
my $msg = Email::MIME->new($buf);
 $msg->walk_parts(sub {
     my ($part) = @_;
     #warn($part->content_type . ": " . $part->subparts);
     if (($part->content_type =~ /text\/plain$/i) && !@mailData) { 
        #print $part->body;
        @mailData = split( '\n', $part->body);
     }
     elsif (($part->content_type =~ /text\/plain; charset=\"?utf-8\"?/i) && !@mailData) { 
        #print $part->body;
        @mailData = split( '\n', $part->body);
     }
     elsif (($part->content_type =~ /text\/plain; charset=\"?us-ascii\"?/i) && !@mailData) { 
        #print $part->body;
        @mailData = split( '\n', $part->body);
     }
     elsif (($part->content_type =~ /text\/plain; charset=\"?windows-1252\"?/i) && !@mailData) { 
        #print $part->body;
        @mailData = split( '\n', $part->body);
     }
     elsif (($part->content_type =~ /text\/plain; charset=\"?iso-8859-1\"?/i) && !@mailData) { 
        #print $part->body;
        @mailData = split( '\n', $part->body);
     }
 });

Later I do various operations on $buf before writing it to a database.

I've placed a copy of one of the emails that exhibit this problem here

https://pastebin.com/raw/2csUvWup

Please let me know what other information I can provide to properly decode this email.

score 0 · Answer 1 · answered Oct 05 '17 at 04:26

Unfortunately, the example you link to does not match the example embedded in your question. Your code also does not show where or how the output is produced. In other words, you have not provided a Minimal, Complete, and Verifiable example; the fragments hint at what you are doing but do not actually show it.

Based on this, I can only guess what the problem is; I cannot verify the guess. My guess is that the problem lies in your use of Email::MIME::body instead of Email::MIME::body_str. As documented, body "decodes and returns the body of the object as a byte string" while body_str "decodes both the Content-Transfer-Encoding layer of the body (like the body method) as well as the charset encoding of the body (unlike the body method), returning a Unicode string".

In other words, body gives you the raw octets of the UTF-8 encoded message, whereas body_str gives you the characters. The latter is probably what you actually want.
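
To make the octets-versus-characters distinction concrete, here is a minimal sketch that uses only the core Encode module. It does not call Email::MIME: `$octets` merely stands in for what body would return for a UTF-8 part, and `$chars` for what body_str would return. The sample text and the no-break-space indentation are assumptions. The stray Â in the question is consistent with UTF-8 no-break spaces (bytes C2 A0) being read as Latin-1, but that has not been verified against the actual email.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the text is indented with U+00A0 (NO-BREAK SPACE),
# which UTF-8 encodes as the two bytes 0xC2 0xA0.
my $octets = "\xC2\xA0" . "Felix Wilhelm";   # stand-in for $part->body

# Decoding the charset layer yields characters (stand-in for body_str).
my $chars = decode('UTF-8', $octets);

# Misreading the same octets as Latin-1 turns 0xC2 into U+00C2, i.e. "Â".
my $mojibake = decode('ISO-8859-1', $octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);
printf "first character when decoded as UTF-8:   U+%04X\n", ord($chars);
printf "first character when decoded as Latin-1: U+%04X\n", ord($mojibake);
```

If the first character comes out as U+00C2 somewhere in your pipeline, some step is treating UTF-8 octets as a single-byte charset.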

answered Oct 05 '17 at 04:26

Steffen Ullrich

My apologies. Thanks so much for your offer to help. Here is the proper pastebin. https://pastebin.com/raw/hB7N3h8a I've also now tried using body_str and it also did not work. It looks like I can't now edit my original post. Here is a pastebin to the script I've created. https://pastebin.com/f1pZBvep – Alex Regan Oct 05 '17 at 16:37
@AlexRegan: I have no idea where you got the snippet from, but since the data are utf8 you have to use a utf8 output, i.e. for STDOUT `binmode(STDOUT, ":utf8");`. And you have to read the file then with utf8 capable software. – Steffen Ullrich Oct 05 '17 at 18:03
I don't understand. This is code that I wrote. So as to not confuse the matter even more, I excluded the functions that write this data to a database. It knows nothing about utf8. I thought the purpose of MIME::Parser was to decode the attachment type to standard text. That is what I need. – Alex Regan Oct 05 '17 at 18:22
@AlexRegan: the data are properly decoded. If I understand you correctly, you put the data into a database and then wonder about what ends up inside the database. If the database knows nothing about utf8, you have a problem: it needs to know the encoding of the data (i.e. us-ascii, utf8, latin1, utf-16,...), otherwise it cannot handle the data properly. See also https://stackoverflow.com/questions/983778/how-can-i-handle-unicode-with-perls-dbi – Steffen Ullrich Oct 05 '17 at 18:38
Okay, thank you. So while this particular email was utf8, I can't guarantee that all are in that format. Will that be a problem? – Alex Regan Oct 06 '17 at 01:02
@AlexRegan: by using `body_str`, every encoding of the mail will be properly decoded into Perl's internal character encoding, which is more or less utf-8. But you need to be able to store this properly in your DB. – Steffen Ullrich Oct 06 '17 at 05:32
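
The points made in these comments can be sketched as follows, again with core modules only and hypothetical sample data. The same word arriving in three different charsets decodes to one identical character string, and an explicit output layer then decides which bytes get written. The in-memory filehandle is an assumption standing in for STDOUT or a file; for a database, the driver's own UTF-8 setting would play that role (not shown here).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "cafe" with an e-acute, as raw octets in three charsets.
my %raw = (
    'UTF-8'        => "caf\xC3\xA9",
    'ISO-8859-1'   => "caf\xE9",
    'windows-1252' => "caf\xE9",
);

# Decoding each input with its declared charset yields the same
# Perl character string, whatever the original encoding was.
my %text = map { $_ => decode($_, $raw{$_}) } keys %raw;

# Write the characters through an encoding layer, much as
# binmode(STDOUT, ":utf8") arranges for STDOUT.
my $stored = '';
open(my $out, '>:encoding(UTF-8)', \$stored) or die "open failed: $!";
print {$out} $text{'ISO-8859-1'};
close($out) or die "close failed: $!";

printf "%s -> %d characters\n", $_, length($text{$_}) for sort keys %text;
printf "stored as %d UTF-8 bytes\n", length($stored);
```

The design point is to decode once at the input boundary and encode once at the output boundary, so that everything in between works on characters and does not care which charset a given mail used.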
