0

I'm using a script to gather data from an oracle DB using SQL. (I need to add, that I'm neither the author of the script nor really good at shell scripting or SQL, but I can read, understand and modify the script to an extend.)

The result is put into a file, which has the extension .csv, so the recipients can open the file within Excel.
The result file includes German special Charakters (ä,ö,ü), which are displayed properly if I view it in Linux or open it with a text editor (e.g. Notepad++).
The Problem is, that Excel seems to expect Windows-1252 as format. The result text should be encoded in UTF-8.

checking with file -i I get the following result:
anmeldungen_phonenumber_readable.csv: text/plain charset=utf-8

I'm trying to convert the file to Windows-1252 using iconv:
iconv -f UTF8 -t WINDOWS-1252 < anmeldungen_phonenumber_readable.csv > converted.csv

But that doesn't solves the problem. I'm not sure where I'm doing something wrong and definitely need a hand.

Best, Andrés

This is the script without any conversion.

#!/bin/sh

# TEMP_DIR="/data/cronjobs/phonenumber_readable"
TEMP_DIR="/home/sccmcont/testbed/temp/"


# Loesche die Files im temporären Verzeichnis
   rm -f $TEMP_DIR*

MAIL_EMPFAENGER_LIST="XX@XX"
MAIL_COPY_LIST="XX@XX"

# der Header der CSV-Datei

HEADER="Datum;\
Formular-ID;\
Sitekey;\
Geschlecht;\
Formular-Land;\
Formular-Sprache;\
Campagnen-ID;\
Vorname;\
Nachname;\
Bevorzugte Kontaktart;\
E-Mail;\
Telefon;\
KontaktTelefon;\
KontaktErlaubt;\
Nachricht;"

# echo $HEADER

# Die Connect-Informationen SCHWEB-Datenbank
LOCAL_DATABASE="XXXX"
LOCAL_CONNECT="schemaX/password@${LOCAL_DATABASE}"

# Setze das Oracle-Environment fuer Oracle:
ORACLE_HOME="/home/oracle/oracle/product/11.2.0/client_1" ; export ORACLE_HOME
TNS_ADMIN="$ORACLE_HOME/network/admin" ; export TNS_ADMIN
ORA_NLS10="$ORACLE_HOME/nls/data" ; export ORA_NLS10

ORACLE_SID="$LOCAL_DATABASE" ; export ORACLE_SID
ORA_WORLD="$LOCAL_DATABASE" ; export ORA_WORLD

PATH="$ORACLE_HOME/bin:$PATH" ; export PATH

# Sprachumgebung
NLS_LANG="GERMAN_GERMANY.UTF8" ; export NLS_LANG
LC_ALL="de_DE.UTF8" ; export LC_ALL

# Start
echo "`date`:"
echo "'$0' gestartet"
   # wechsle in das TEMP-Verzeichnis
   # cd $TEMP_DIR

   echo "Mails an $MAIL_EMPFAENGER_LIST"
   echo "Kopie an $MAIL_COPY_LIST"

   # Hole die Daten aus der Datenbank

   # Baue das SQL-Script zusammen
   SQL_SCRIPT=`mktemp -p $TEMP_DIR sql_script.XXXXXXXXXX`
   SQL_SPOOL_FILE=`mktemp -p $TEMP_DIR sql_spool.XXXXXXXXXX`

   # hier landen die Ergebnisse
   RESULT_FILE_NAME="anmeldungen_phonenumber_readable.csv"
   RESULT_FILE="$TEMP_DIR/$RESULT_FILE_NAME"

   echo "... erzeuge SQL-Script zur Selektion aller Formulardaten"

   echo "column TREFFER  format A1000" >> $SQL_SCRIPT
   echo "select " >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(distinct 'TREFFER='||wf.SUBMIT_DATE||'|'||wf.form_id||'|'||wf.SITEKEY)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='gender'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='form_country'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='form_language'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='camp'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='givenname'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='surname'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='contact'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='email'  THEN  PVALUE END)||'|'''||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='phone'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='phone-when'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='phone_mail_post_contact_allowed'  THEN  PVALUE END)||'|'||" >> $SQL_SCRIPT
   echo "WMSYS.WM_CONCAT(CASE WHEN PKEY='message' THEN replace(replace(PVALUE,CHR(13),''),CHR(10),' ') END)||'|' as TREFFER " >> $SQL_SCRIPT

  echo "from m_webform_data wfd1 " >> $SQL_SCRIPT
  echo "JOIN M_WEBFORM wf ON wfd1.form_id = wf.form_id" >> $SQL_SCRIPT
  echo "AND wf.NAME = 'form_de_de_sliding_systems_morethanaview_contact_interest'" >> $SQL_SCRIPT
  echo "group by wfd1.FORM_ID order by  wfd1.FORM_ID;" >> $SQL_SCRIPT

   # Hier passiert die eigentliche Arbeit
   # Verbinde mit der Datenbank und fuehre das SQL-Kommando aus
   echo "... starte SQL-Script"
   #
   sqlplus /nolog << EOF
      connect $LOCAL_CONNECT
      alter session set NLS_DATE_FORMAT = 'dd.mm.yyyy';
      SET ECHO OFF
      SET NEWPAGE 0
      SET SPACE 0
      SET PAGESIZE 0
      SET FEEDBACK off
      SET HEADING OFF
      SET TRIMSPOOL ON
      set linesize 32767
      SET LONG 50000
      SET PAGESIZE 80
      spool $SQL_SPOOL_FILE
      @$SQL_SCRIPT
      spool off
   exit
EOF

   # Formatiere den Output-File
   # ersetze ; durch , und | durch ; usw.
   echo $HEADER > $RESULT_FILE
   grep TREFFER= $SQL_SPOOL_FILE | cut -f2 -d'=' | tr ';' ',' | tr '|' ';' >> $RESULT_FILE


   # Schreibe eine Mail
   MAIL_FILE=`mktemp -p $TEMP_DIR mail_recy.XXXXXXXXXX`
   MAIL_BOUNDARY=`date +'__boundary_%d.%m.%Y-%H:%M__'`

   for EMPFAENGER in $MAIL_EMPFAENGER_LIST
   do
      echo "To: $EMPFAENGER" >> $MAIL_FILE
   done

   for CC_EMPFAENGER in $MAIL_COPY_LIST
   do
      echo "Cc: $CC_EMPFAENGER" >> $MAIL_FILE
   done



   echo "Subject: Anfragen 'More than a view / Interest' `date +'%d.%m.%Y - %H:%M'` Uhr" >> $MAIL_FILE
   echo "MIME-Version: 1.0" >> $MAIL_FILE
   echo "Content-Type: multipart/alternative; boundary=$MAIL_BOUNDARY" >> $MAIL_FILE
   echo "" >> $MAIL_FILE
   echo "--$MAIL_BOUNDARY" >> $MAIL_FILE
   echo "Content-Type: text/plain; charset=utf-8" >> $MAIL_FILE
   echo "" >> $MAIL_FILE
   echo "Sehr geehrte Damen und Herren," >> $MAIL_FILE
   echo "" >> $MAIL_FILE
   echo "als Anlage zu dieser Mail erhalten Sie die aktuellen Anfragen des Webspecial 'More than a view / Interest'." >> $MAIL_FILE
   echo "Sie koennen die Datei mit Excel oeffnen." >> $MAIL_FILE
   echo "" >> $MAIL_FILE
   echo "(diese Mail wurde automatisch erstellt)" >> $MAIL_FILE
   echo "" >> $MAIL_FILE

   # der Dateianhang
   echo "--$MAIL_BOUNDARY" >> $MAIL_FILE
   echo "Content-Type: text/plain" >> $MAIL_FILE
   echo "Content-Disposition: attachement; filename=$RESULT_FILE_NAME" >> $MAIL_FILE
   echo "" >> $MAIL_FILE
   # haenge die Spoolfiles an die Mail

   cat $RESULT_FILE >> $MAIL_FILE
   echo "--${MAIL_BOUNDARY}--" >> $MAIL_FILE

  echo "Files: Mail: $MAIL_FILE, Script: $SQL_SCRIPT, Spool: $SQL_SPOOL_FILE, Resultat: $RESULT_FILE"

   # sende die Mail
   /usr/sbin/sendmail -t < $MAIL_FILE

   echo "Mail wurde verschickt"

# Fertig
echo "`date`:"
echo "'$0' beendet"

EDIT: I have changed part of the encoding to I can switch between 8bit and 64bit

  echo "--$MAIL_BOUNDARY" >> $MAIL_FILE
   echo "Content-Type: text/plain" >> $MAIL_FILE
   echo "Content-Disposition: attachement; filename=$RESULT_FILE_NAME" >> $MAIL_FILE

  #Auswählen mit welcher Kodierung die für den Anhang verwendet wird.

   # 8Bit
    #echo "Content-Transfer-Encoding: 8bit" >> $MAIL_FILE

   # 64Bit
    echo "Content-Transfer-Encoding: base64"
    base64 "$RESULT_FILE" 

Result file on unix, before sending it: "wir benötigen für einen Anbau"
Result file from mail with 8bit enconding opended in notepad++: "wir ben??tigen f??r einen Anbau"
Result file from mail with 8bit enconding opended in Excel: "wir ben??tigen f??r einen Anbau"

Result file on unix, before sending it: "wir benötigen für einen Anbau"
Result file from mail with 64bit enconding opended in notepad++: "wir benötigen für einen Anbau"
Result file from mail with 64bit enconding opended in Excel: "wir benötigen für einen Anbau"

hexdump hexdump -C shows ".." instead of special characters (e.g. ä,ö,ü,ß)
00005250 20 53 63 68 c3 bc 63 6f 2c 20 20 77 69 72 20 62 | Sch..co, wir b|
00005260 65 6e c3 b6 74 69 67 65 6e 20 66 c3 bc 72 20 65 |en..tigen f..r e|
00005270 69 6e 65 6e 20 41 6e 62 61 75 20 75 2e 41 2e 20 |inen Anbau u.A. |

2. EDIT: This is the part with the conversion:

 # Konvertieren zu Windows-1252
   iconv -f UTF8 -t WINDOWS-1252 < $RESULT_FILE > $RESULT_FILE_2
 
 # der Dateianhang
   echo "--$MAIL_BOUNDARY" >> $MAIL_FILE
   echo "Content-Type: text/plain" >> $MAIL_FILE
   echo "Content-Disposition: attachement; filename=$RESULT_FILE_NAME_2" >> $MAIL_FILE

  #Auswählen mit welcher Kodierung die für den Anhang verwendet wird. 

   # 8Bit
    #echo "Content-Transfer-Encoding: 8bit" >> $MAIL_FILE

   # 64Bit
    echo "Content-Transfer-Encoding: base64"
    base64 "$RESULT_FILE_2" > /dev/null
    
   echo "" >> $MAIL_FILE
   # haenge die Spoolfiles an die Mail

   cat $RESULT_FILE_2 >> $MAIL_FILE
   echo "--${MAIL_BOUNDARY}--" >> $MAIL_FILE

Result file on unix, before sending it, but after conversion: "* wir ben366tigen f374r einen Anbau*"
Result file from mail with 8bit enconding opended in notepad++: "wir ben?tigen f?r einen Anbau"
Result file from mail with 8bit enconding opended in Excel: "wir ben?tigen f?r einen Anbau"

Result file on unix, before sending it, but after conversion: "* wir ben366tigen f374r einen Anbau*"
Result file from mail with 64bit enconding opended in notepad++: "wir ben�tigen f�r einen Anbau"
Result file from mail with 64bit enconding opended in Excel: "wir ben�tigen f�r einen Anbau"

hexdump hexdump -C of the converted file shows "." instead of special characters (e.g. ä,ö,ü,ß)
00005140 63 68 fc 63 6f 2c 20 20 77 69 72 20 62 65 6e f6 |ch.co, wir ben.|
00005150 74 69 67 65 6e 20 66 fc 72 20 65 69 6e 65 6e 20 |tigen f.r einen |
00005160 41 6e 62 61 75 20 75 2e 41 2e 20 65 69 6e 20 48 |Anbau u.A. ein H|

  • "But that doesn't solves the problem" Can you expand on that? How does the output you get differ from the output you expect – Joni Jul 08 '20 at 11:20
  • If I can remember, Excel includes a "_File encoding_" selector while opening a `csv` file. See https://superuser.com/questions/280603/how-to-set-character-encoding-when-opening-excel – JosefZ Jul 08 '20 at 17:18
  • @Joni This is an excerpt from the result file. Encode in UTF-8: _Danke und Grüße_ After converting to Windows-1252: _Danke und Gr▒▒e_ Opening the file with Excel will then also result in 'garbage'. Maybe I should try a different encoding? I'll provide as much info as I can, expect customer sensitive data. – Andres Trujillo Jul 09 '20 at 07:43
  • @JosefZ That of course can be a solution, but I'm dealing with people from marketing who have no clue what they are doing and want it as easy as possible. – Andres Trujillo Jul 09 '20 at 07:51
  • You can make a `.CSV` file with a [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8). If you don't know how to to this from a SQL script then prefix the file content with byte sequence 0xEF,0xBB,0xBF _after_ created but _before_ sending it… Then Excel should understand UTF-8 encoding, I'm sure… I'd **avoid converting the file to Windows-1252** because there could be characters out of CP1252 (e.g. Czech ŕ/Ř/č/Č) in global world nowadays… – JosefZ Jul 09 '20 at 08:23
  • The result you see makes no sense, the block drawing character ▒ isn't even present in [windows-1252](https://en.wikipedia.org/wiki/Windows-1252). it sounds like the file becomes corrupted in transit, can you check what *bytes* the word "Grüße" contains with a hex editor in the resulting file? – Joni Jul 09 '20 at 11:49
  • Does this answer your question? [Is it possible to force Excel recognize UTF-8 CSV files automatically?](https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically) – GSerg Jul 09 '20 at 13:00
  • @Joni, the excerpt I gave you was while using _cat_. When I use _less_ it looks like this: Danke und Gr374337e Not sure this helps. I don't have access to an hex editor atm. Which would you suggest? – Andres Trujillo Jul 10 '20 at 10:00
  • The file you're looking at is encoded in windows-1252: 347 and 337 are the octal values for the code units of ü and ß respectively. You must have run less in an UTF-8 environment: this is how it displays byte sequences that it can't decode into characters. Is this from before or after the file has gone though the email system? – Joni Jul 10 '20 at 11:09
  • @Joni: File convertet from UTF-8 to WINDOWS-1252 and send using _Content-Transfer-Encoding: 8bit_ **Before sending**: Unix (with less): Viele Gr374337e (should be Viele Grüße) **After sending**: Windows (with Notepad++): Viele Gr??e (should be Viele Grüße) – Andres Trujillo Jul 10 '20 at 12:49
  • That suggests the file content is correct before it is sent - `less` just doesn't know how to display it (you can confirm by inspecting the bytes with `hexdump -C filename` or `od -t x1 filename`). The next thing to determine is if the file is corrupted by the mail transfer software or by the email client or something else on the receiving end. Does your email client allow inspecting and downloading the "raw" email? Another thing you may try is changing the content transfer encoding to `base64` instead of `8bit` and attach the file with `base64 $RESULT_FILE >> $MAIL_FILE` – Joni Jul 10 '20 at 13:15
  • @Joni, got sick and will not be able to respond for 2-3 more days. Is there a way I can induce a br or LF here to structure my answer a bit better. I would try to send in a comprehensive view of what the status is (8bit with/without conversion, 64bit, etc.). – Andres Trujillo Jul 13 '20 at 07:23
  • You can edit the question to include all details – Joni Jul 13 '20 at 10:31
  • @Joni, added additional information in the question (at the end) – Andres Trujillo Jul 17 '20 at 10:12
  • `c3 b6`and `c3 bc` are ö and ü encoded in UTF8. The text you see in excel also shows the file is in UTF8. Are you attaching the original file or the file converted to cp1252? – Joni Jul 17 '20 at 11:10
  • @Joni, in this case, these are the original (unconverted) files. I hadn't had time to make the same overview with the converted script as I did with the unconverted ones. I will include that later today (most likely this evening) in my question, as I have a lot of catching up to do on my work and sort stuff out. My code is ok? Haven't been doing much unix work since 2003... – Andres Trujillo Jul 17 '20 at 11:24
  • @Joni: Appended the additional information – Andres Trujillo Jul 17 '20 at 21:12
  • The hex dump shows you are sending the right bytes. Your email client is corrupting the file when you download it as an attachment. If it's not the email client, it's something else in the mail system. `�` is the latin-1 rendering of the UTF-8 "replacement character", so *something* read this file assuming it was UTF-8 and then wrote it out as UTF-8 again. It may help to change the content-type header to `text/plain;charset=latin1` or change it to "generic binary file" `application/octet-stream` instead – Joni Jul 18 '20 at 14:09
  • @Joni, I had to insert a BOM: printf '\xEF\xBB\xBF' > $RESULT_FILE – Andres Trujillo Aug 10 '20 at 07:24
  • @JosefZ: printf '\xEF\xBB\xBF' > $RESULT_FILE That did the trick – Andres Trujillo Aug 10 '20 at 07:24

2 Answers2

0

You're not using the content transfer encoding mime header, which means the mail systems default to "7bit". This encoding only allows ASCII text. Most likely some piece of the email infrastructure is corrupting the attachment.

You can add the content transfer encoding header like this:

echo "Content-Transfer-Encoding: 8bit" >> $MAIL_FILE

You may also need to specify the file's encoding in the content type:

echo "Content-Type: text/plain;charset=latin1" >> $MAIL_FILE
Joni
  • 101,441
  • 12
  • 123
  • 178
  • This helps a bit. German special Charakters like (ä,ö,ü and ß) are now shown as '??' _Danke und Grüße = Danke und Gr????e._ So the 7bit encoding of the transfer is one (or the) issue. – Andres Trujillo Jul 10 '20 at 10:38
0

Adding a BOM did the trick. This forces Excel to open the file as given (UTF-8):

printf '\xEF\xBB\xBF' > $RESULT_FILE
thatguy
  • 13,242
  • 6
  • 19
  • 33