2

I am exporting data in a csv file using ssis. In my ssis package i compress the file in zip format and upload it on a linux server using sftp. The problem is that in the destination file system, the csv files include a ^M character which comes from the dos system.

I found three solutions.

  • First i could set the sftp transfer mode to ascii and not zip the file (i later found out this is only supported by ftp). Considering that my unzipped file is > 3Gb that is not efficient, the upload will take ages.

  • Secondly once transferred i could unzip the file and convert it using dos2unix utility, but again dos2unix is not installed and i am not authorized to install it to the target system.

  • Finally i could use a unix editor like sed to remove ^M from the end of lines. My file is consisted of more than 4 million lines and this would again take ages.

Q: Is there any way to preformat my file in ASCII using ssis, then zip and transfer?

Duck Dodgers
  • 3,126
  • 6
  • 25
  • 38
Bonzay
  • 688
  • 8
  • 23
  • 1
    [`unzip -aoq`](https://linux.die.net/man/1/unzip). `-a` - *convert text files*. – jww Feb 04 '19 at 13:20

2 Answers2

1

I didn't try it, but I thought you could do a CR+LF -> LF conversion just when outputing to the csv file. I looked in this link here

Scroll down to the section "Header row delimiter". It seems that if you choose {LF} as a row delimiter, your resulting .zip file will show correctly in your linux box.

BTW, probably you know, but I have to mention that ^M is the representation of CR in a linux / unix box.

BTW2, in most cases the ^M in linux is not a problem, just some annoying thing.

I hope I could help!

1

While searching on this issue i found a very useful links were they described the cause and possible resolutions of this issue:

Cause

File has been transferred between systems of different types with different newline conventions. For example, Windows-based text editors will have a special carriage return character (CR+LF) at the end of lines to denote a line return or newline, which will be displayed incorrectly in Linux (^M). This can be difficult to spot, as some applications or programs may handle the foreign newline characters properly while others do not. Thus some services may crash or not respond correctly. Often times, this is because the file is created or perhaps even edited on a Microsoft Windows machine and then uploaded or transferred to a Linux server. This typically occurs when a file is transferred from MS-DOS (or MS-Windows) without ASCII or text mode.


Possible resolutions

(1) Using dos2unix command

dos2unix includes utilities to convert text files with DOS or MAC line breaks to Unix line breaks and vice versa. It also includes conversion of UTF-16 to UTF-8.

You can use a similar command via Execute Process Task:

dos2unix filename

(2) Data Flow Task

You can create a Data Flow task that transfer data from Flat File Source into a new Flat File Destination were both Flat File Connection mAnager has the same structure except the Row Delimiter property ({CR}{LF} in Source , {LF} in destination)

(3) Using a Script Task - StreamReader/Writer

You can use a script task with a similar code:

string data = null;
//Open and read the file
using (StreamReader srFileName = new StreamReader(FileName))
    {
        data = srFileName.ReadToEnd();
        data = data.Replace("\r\n","\n");
    }

using (StreamWriter swFileName = new StreamWriter(FileName))
    {
        swFileName.Write(data);
    }

(4) Extract using unzip -a

From the following unzip documentation:

-a

convert text files. Ordinarily all files are extracted exactly as they are stored (as ''binary'' files). The -a option causes files identified by zip as text files (those with the 't' label in zipinfo listings, rather than 'b') to be automatically extracted as such, converting line endings, end-of-file characters and the character set itself as necessary. (For example, Unix files use line feeds (LFs) for end-of-line (EOL) and have no end-of-file (EOF) marker; Macintoshes use carriage returns (CRs) for EOLs; and most PC operating systems use CR+LF for EOLs and control-Z for EOF. In addition, IBM mainframes and the Michigan Terminal System use EBCDIC rather than the more common ASCII character set, and NT supports Unicode.) Note that zip's identification of text files is by no means perfect; some ''text'' files may actually be binary and vice versa. unzip therefore prints ''[text]'' or ''[binary]'' as a visual check for each file it extracts when using the -a option. The -aa option forces all files to be extracted as text, regardless of the supposed file type. On VMS, see also -S.

So you can use the following command to extract text files with changing line endings:

unzip -a filename

Credit to @jww comment


Other Useful links

Community
  • 1
  • 1
Hadi
  • 31,125
  • 9
  • 49
  • 111