I receive an email in HTML format. I need to download it and write its table to a new CSV file, using a semicolon as the field separator.

Example of the email received:

Content-Type: text/html; charset=UTF-8
<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st= yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">DH</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">Mon</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
(UTC)</th></tr><tr><th style=3D"padding: 8px;">Arael Amarel</th><th style=
=3D"padding: 8px;">30549214</th><th style=3D"padding: 8px;">099981496</th><=
th style=3D"padding: 8px;">43</th><th style=3D"padding: 8px;">-</th><th sty=
le=3D"padding: 8px;">2019-07-11T10:06:34.311Z</th></tr><tr><th style=3D"pad=
ding: 8px;background-color: #dddddd">MATIN TARDEI</th><th style=3D"padding=
: 8px;background-color: #dddddd">45159820</th><th style=3D"padding: 8px;bac=
kground-color: #dddddd">094432451</th><th style=3D"padding: 
8px;background-=
color: #dddddd">32</th><th style=3D"padding: 8px;background-color: #dddddd"=
-</th><th style=3D"padding: 8px;background-color: #dddddd">2019-07- 
11T10:2=
8:41.198Z</th></tr>

Needed csv output:

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

If I open this mail in a mail client, the table renders correctly. But I think there is a formatting problem with Procmail: if I put the content saved by Procmail into an .html file and open it, the content is impossible to process. Every line ends with a "=", which causes a lot of problems. Furthermore, there are several problems with the alignment of the table and other things, which make extracting the content a nightmare.

I made a procmailrc with a filter to convert the HTML format to plain text:

MAILDIR=/new/mail/htmlconvert
:0
* ^Content-Type: text/html.*;
{
:0c
$MAILDIR/converted/
:0fwb
| `which html2text`
:0fwh
| `which formail` -i "Content-Type: text/plain; charset=UTF-8"
}

This was attempt number 1, and it didn't work. I thought the converter in use was html2text; if I run html2text directly on the saved file, the result is:

html2text

===============================================================================
 1px solid #dddddd;border-collapse: collapse;text-align: left;">
px;background-color: #cce6ff">NAME
px;background-color: #cce6ff">CI
= px;background-color: #cce6ff">DH
px;backgro= und-color: #cce6ff">FG
px;background-color: #c= ce6ff">Mon
px;background-color: #cce6ff">DATE= (UTC)
px;">Arael Amarel
px;">30549214
px;">099981496
<= th style=3D"padding: 8px;">43
px;">-
px;">2019-07-11T10:06:34.311Z
px;background-color: #dddddd">MATIN TARDEI
 8px;background-color: #dddddd">45159820
px;bac= kground-color: #dddddd">094432451
px;background-= color: #dddddd">32
px;background-color: #dddddd"= >-
px;background-color: #dddddd">2019-07-11T10:2= 8:41.198Z
px;">

I already tried lynx -dump -force-html on the file, and the result is no good for producing the CSV output either.

html2text -nobs (file)

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

Update: I applied tripleee's solution to the procmailrc, but the format of the mail is still the same as the original source; qprint didn't change it. However, running the commands directly on the file works fine. The current solution:

qprint -d -n <1563019338.1197_0.localhost.localdomain |
html2text -style pretty |
awk '/^-------------------------------------------------------------------------------/{p=1}p'

The line of dashes separates the body of the mail from the content before it. This outputs:

-------------------------------------------------------------------------------

NAME         CI       CD   FG  HJ DATE (UTC)
Yaiaa Fereeira        52104575 097325303 20    -     2019-07-12T10:46:24.716Z
Gabtiel Aosta Sclavi   42445135 098322361 42    -     2019-07-12T11:07:36.110Z

Now I need to turn this content into the CSV output. I thought that would be easier than the first part, but I want to automate it in Procmail so that it runs when the mail is downloaded.

After changing the procmailrc, the result from Procmail is a mail whose body still has "=" at the end of each line, and whose headers contain:

Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8 

Update: the resulting email source with qprint in the procmailrc:

Return-Path: 
Delivered-To: 
Return-path: 
Envelope-to: 
Delivery-date: Sat, 13 Jul 2019 08:03:48 -0300
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Date: Sat, 13 Jul 2019 11:03:02 +0000 (UTC)
From: 
Mime-Version: 1.0
To: 
Message-ID: 
Subject:Fri Jul 12 2019
X-Spam-Flag: NO

<b>Fri Jul 12 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">NAME</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">CD</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">HJ</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
 (UTC)</th></tr><tr><th style=3D"padding: 8px;">Yaiaa Fereeira</th><th st=
yle=3D"padding: 8px;">52104575</th><th style=3D"padding: 8px;">097325303</t=
h><th style=3D"padding: 8px;">20</th><th style=3D"padding: 8px;">-</th><th =
style=3D"padding: 8px;">2019-07-12T10:46:24.716Z</th></tr>

I captured the log from the terminal because Procmail can't write to the logfile, as you can see in this log excerpt:

1 message for aaa@aaa.com at aaa.com (25330 octets).
reading message aaa@aaa.com@aaa.com:1 of 1 (25330 octets)........................procmail: Error while writing to "/info/in/log"
procmail: [20191] Mon Jul 15 08:55:34 2019
procmail: Assigning "FORMAIL=/usr/bin/formail"
procmail: Assigning "QPRINT=/usr/local/bin/qprint"
procmail: Match on "^Content-Type: text/html;"
procmail: Assigning "LASTFOLDER=converted/new/1563191734.20191_0.localhost.localdomain"
 Subject: Sun Jul 14 2019
  Folder: converted/new/1563191734.20191_0.localhost.localdomain          24985
procmail: Executing " qprint -d -n | html2text -nobs "
procmail: Executing " formail -I "Content-Type: text/html; charset=UTF-8"
procmail: Skipped "Mail"
procmail: Skipped "/"
From aaaaaa.com@aaa.com  Mon Jul 15 08:55:34 2019
 Subject: Sun Jul 14 2019
  Folder: **Bounced**                                                     24985
fetchmail: MDA returned nonzero status 73
 not flushed
  • Could you please add `LOGFILE=procmail.log` and `VERBOSE=yes` to the `.procmailrc` file and update the question to include the resulting `procmail.log` from processing a representative sample message? See also http://www.iki.fi/era/mail/procmail-debug.html – tripleee Jul 13 '19 at 15:04
  • Do you want the message itself to be updated, or the resulting CSV to be saved to an external file? Your recipe does the former, but the latter would seem significantly more useful. – tripleee Jul 13 '19 at 15:06
  • No, I don't need the message changed to the new format. I only need to extract the data from the complex HTML format into one clean CSV with semicolon-separated fields. – User1234141414 Jul 13 '19 at 15:16
  • Tripleee, I have updated the question with the log. – User1234141414 Jul 15 '19 at 12:15
  • The log indicates that the filtering succeeded but you never saved the message anywhere. There's an unpaired `Mail` which probably should have an `:0` before it. – tripleee Jul 15 '19 at 12:19
  • Fiddling with variables to indicate the full path to your binaries is an antipattern, though not quite as crazy as the `which` stuff. Just make sure you have `/usr/local/bin` and `/usr/bin` in your `PATH` instead. – tripleee Jul 15 '19 at 12:20
  • But as indicated near the end of my answer, you probably want to remove the `fw` flags and instead add `>>output.csv` after `html2text -nobs` – tripleee Jul 15 '19 at 12:22
  • No, the unpaired Mail/ stands alone outside the {} block that contains all the :0 recipes. I will try to remove it. – User1234141414 Jul 15 '19 at 12:27
  • What? The lone `Mail` is a syntax error. If you want it to save to a folder named `Mail` then you need to add a preamble line `:0` before it. – tripleee Jul 15 '19 at 12:28
  • I removed the Mail/ at the end of the rc file. The log no longer shows procmail: Skipped "Mail" and procmail: Skipped "/", but the mail is the same as before; it isn't touched. – User1234141414 Jul 15 '19 at 12:39
  • No, the copies in the converted folder are fine. I don't want to save to this folder. – User1234141414 Jul 15 '19 at 12:40
  • Did you see the second answer I posted? I keep telling you the same things over and over. – tripleee Jul 15 '19 at 12:43
  • I don't understand the Folder: **Bounced** shown by Procmail. This mail is in the "inbox folder". – User1234141414 Jul 15 '19 at 12:43
  • Yes, tripleee, I saw that you told me I may need to remove the fw flags and add >>output.csv, but I want to keep the converted original mail in the file and do the CSV conversion later, maybe in another recipe. – User1234141414 Jul 15 '19 at 12:45
  • The `**Bounced**` message just means you used `procmail -m` and it didn't save the message anywhere. If you take out `-m` it will save to the end of your default inbox, which you probably want to avoid while experimenting. – tripleee Jul 15 '19 at 12:45
  • OK, I'm going to change the -m, thank you. – User1234141414 Jul 15 '19 at 13:06

1 Answer

The sample in your post does not look like a valid email body at all. I'm guessing it's a body part within a MIME message with Content-type: text/html (as vaguely indicated) and Content-transfer-encoding: quoted-printable. The latter is what introduces the = escapes which you regard as problematic. Decoding them is actually fairly trivial, but how exactly to do that from Procmail depends on the overall composition of the containing message, and the utilities available to you. Unfortunately, Procmail itself has no idea about MIME structures, so you'll have to rely on external tools.

As an aside, the `which ...` commands in your recipe are completely redundant. For `which` to work, the utilities you are looking for need to be in your PATH ... which means Procmail can find them without `which`.

If something is not in Procmail's default PATH, simply update PATH near the top of your .procmailrc file. This should also remove the need to use variables like $FORMAIL etc. Just use formail and make sure it's available on Procmail's PATH.

For your recipe to work, the MIME structure needs to be a single-part message. If that is indeed the case, and your html2text is otherwise correct, the only fix you need is to decode the content-transfer-encoding before piping through that. Assuming you have qprint, and with the superfluous which calls removed, that leaves

:0
* ^Content-Type: text/html.*;
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | qprint -d | html2text
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

If in fact the MIME body structure is more complex, perhaps edit your question to include the actual email source instead of your current ad-lib paraphrase.

In other words, and in some more detail, if your input message looks like

From: sender <sender@example.net>
To: you <you@example.org>
Subject: HTML table
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

then the recipe above should basically work. But on the other hand, if your actual message is more like

From: sender <sender@example.net>
To: you <you@example.org>
Subject: HTML table
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=0xdeadbeef

This is a multi-part MIME message.

--0xdeadbeef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

--0xdeadbeef--

then not only will the first condition fail to match (the headers don't contain Content-type: text/html), but the actions inside the block will also need to be updated in several places, because the MIME wrapping around the HTML body part needs to be unwrapped or otherwise restructured. Here is a really quick and dirty attempt at solving this.

:0
* ^Content-Type: multipart/mixed
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | perl -0777 -pe 's/=\n//g; \
    s/=([0-9A-F]{2})/chr(hex($1))/ge; \
    s%</table>.*%%s; \
    s%.*<table[^<>]*>%%s; \
    s%<tr[^<>]*><t[dh][^<>]*>%\n%g; \
    s%<t[dh][^<>]*>%;%g; \
    s%</t[rdh]>%%g; \
    s%^\n+%%;'
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

With minor adaptations, it should work for the single-part variation, too. But you should realize that the Perl script is a really rough cut, not a proper HTML parser.

The f flag causes Procmail to replace the input message with the output from the pipeline. The formail call is then necessary because the original MIME headers are no longer correct after you have replaced the original content with content of a different type and with a different encoding. If you just want to extract the CSV data into an external file instead, the latter can be skipped and the former can be simplified to just

:0
* ^Content-type: text/html
{
  :0c
  converted/
  :0b  # no w flag necessary either once we drop f
  | qprint -d | html2text >>result.csv
}

where again we assume a single-part MIME message as input. Whether to overwrite the output file instead of appending (or perhaps write to a different CSV file each time) will depend on your specific use case, and how often you expect to receive these messages.
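
If a separate file per message turns out to suit the use case better than appending, one possibility is to generate the file name in a small wrapper script. This is only a sketch: `csv_name` and the `table-` prefix are hypothetical, and the name is built from the current time plus the shell's process ID so that two deliveries are unlikely to collide.

```shell
#!/bin/sh
# Sketch only: print a per-message CSV file name inside the directory
# given as the first argument (default: the current directory).
# csv_name and the "table-" prefix are made up for this example.
csv_name() {
  printf '%s/table-%s-%s.csv\n' "${1:-.}" "$(date +%Y%m%dT%H%M%S)" "$$"
}

# Show what a generated name looks like.
csv_name /tmp
```

Inside such a wrapper script, the last step of the pipeline could redirect with `> "$(csv_name somedir)"` to create a fresh file each time, whereas `>>result.csv` keeps accumulating rows in a single file.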


(Not in particular an endorsement of qprint; there are many comparable utilities, but nothing particularly ubiquitous. It's unfortunate that the GNU Coreutils maintainers steadfastly refuse to include a similar utility.)

tripleee
  • tripleee, I think I'm not the only one with this problem. Take a look at https://www.linuxquestions.org/questions/linux-general-1/convert-html-emails-to-plain-text-emails-150252/ where they wrote a script, which I have tested without any positive result. – User1234141414 Jul 12 '19 at 18:10
  • So *that's* where you picked up the `which` craziness. No, you are definitely not alone, but the HTML you are trying to convert from is almost certainly different from what they are processing. – tripleee Jul 12 '19 at 18:16
  • Tripleee, testing your solution, I find that the email at first glance looks "converted", but the body of the mail is the same as the source input. Result: Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8 – User1234141414 Jul 13 '19 at 11:57
  • Fri Jul 12 2019
    NAMECICDFGABDATE= (UTC)
    Yanrnaa Ferrsra
    – User1234141414 Jul 13 '19 at 12:00
  • yle=3D"padding: 8px;">5210457509731540320-2019-07-12T10:46:24.716Z – User1234141414 Jul 13 '19 at 12:00
  • Please [edit] your question to provide an actual sample email. As you can see, posting samples in comments is pretty useless. – tripleee Jul 13 '19 at 12:06
  • Tripleee, I tried qprint -d on the mail file, and it works fine to remove the trailing "=". I ran it before html2text -nobs and that worked fine. I don't know how to make it work with Procmail. – User1234141414 Jul 13 '19 at 12:07
  • It can probably be solved if you can [edit](/posts/57010927/edit) the question to provide the actual source of the sample message. This is the third time I ask you for this. – tripleee Jul 13 '19 at 12:21
  • Sorry, tripleee! I'm going to edit the question right now to show you what I have. Maybe I'll have some luck solving this problem. – User1234141414 Jul 13 '19 at 12:25
  • Tripleee, I have already edited the original post. Sorry, but I'm new to asking on Stack sites. – User1234141414 Jul 13 '19 at 13:19
  • I'm sorry, this update does not help at all. See e.g. https://stackoverflow.com/questions/50248333/how-to-put-some-text-into-procmail-forwarded-e-mail for an example of how to show a minimal but useful message with full headers and body. – tripleee Jul 13 '19 at 13:25
  • Tripleee, thanks, there are very useful tips there, but I don't need to "inject" data or make any changes to the email, because I don't need to forward it or do anything else with it. I only need to write the body of the mail, whose contents arrive as an HTML table, out to one CSV. – User1234141414 Jul 13 '19 at 13:42
  • The linked *question* exhibits an example of how to share email source in a useful way, which you have still not succeeded in. I'm afraid your English skills may not be sufficient to bring this issue to a successful conclusion. – tripleee Jul 13 '19 at 13:44
  • You are right, my English skills are not the best. But there is one partial solution to my problem. I'm afraid that trying it will make the problem more complex, but it's only an attempt. – User1234141414 Jul 13 '19 at 13:50
  • I don't know whether I can redirect standard output with > as on the command line, as I showed in the edited question. I saw in your link that I can use awk in the procmailrc. I understand that a Procmail action can change the contents of the original mail, but it isn't clear to me whether it can write the content to an external file. – User1234141414 Jul 13 '19 at 13:55
  • Tripleee, I'm reviewing the link you left in the other question: http://porkmail.org/era/procmail/mini-faq.html#recipe-block Maybe this is all that I need? – User1234141414 Jul 13 '19 at 14:11
  • In terms of Procmail, yes, that's all you need, and that's what you have already. What remains is to show what your actual message source looks like, so that we can modify the correct part of the email. There are three or four things we have to guess at the moment, but which could be sorted if you finally provided the full source of a representative message. – tripleee Jul 13 '19 at 14:26
  • I'm about to show the result of your solution in the question. But as I said in the question, the format of the mails is untouched by this. I think something in how qprint is executed makes it do nothing to the body; maybe it only acts on the headers? Or maybe it's the Procmail :0fwb definition? I have already tested changing it to :0fwh, without result. The message keeps the "=" at the end of each line. – User1234141414 Jul 13 '19 at 14:35
  • See updated answer now. I hope at the very least it should show you in more detail what to explore going forward. – tripleee Jul 13 '19 at 14:49