4

An idiot customer is generating CSV files, but one field (a description field) sometimes has extra commas in it.

Is there a tidy regex to find these bad records and replace the extra commas with something else? A sed command line would be fine.

Example:

A,B,C,This is a description,D,E
F,G,H,This is a description with a comma (,) in it,D,E

I need a sed command that can tell that there are too many commas in the line and remove the extra comma from field 4.

We do not have the luxury of telling the stupid customer to change their code.

Added

I would not object to a solution that just removes one spurious comma that I have to run multiple times.

OldCurmudgeon
  • Two questions: Does each row have a fixed number of fields? And, what fields can have additional commas? – Birei Jan 21 '14 at 18:07
  • I'm taking a wild guess you can't change the separator to pipe or something else? Since, to be honest, descriptions with commas aren't uncommon. – Jonathan Jan 21 '14 at 22:24
  • @Birei - Yes there is a fixed number of fields in every row. Assume for now only one field can have additional commas but if you can make that adjustable then you'll get a +1 or more. – OldCurmudgeon Jan 21 '14 at 22:47
  • @Jonathan - You are correct - we cannot change the file format. We can only post-process the file. – OldCurmudgeon Jan 21 '14 at 22:48
  • Instead of removing the commas, it would be better to escape the description with quotes: http://tools.ietf.org/html/rfc4180. It would require writing some bash script processing the file line by line. – Danubian Sailor Jan 22 '14 at 09:41
  • @ŁukaszL. - That would be acceptable but you still have the problem of finding the start and end of the field that contains an extra comma. – OldCurmudgeon Jan 22 '14 at 09:52
  • It would work only if the number of columns is fixed. For Javascript, I've matched commas in description for something like that: '(?:^(?:[^,]*,){3})(?:(?:[^,]*(,))*[^,]*)(?:(?:,[^,]*){2}$)', but it's write-only code and it would be better to split the text by commas, take away 3 first and 2 last columns and join+quote the remaining part as description, something like that.... – Danubian Sailor Jan 22 '14 at 09:55
  • @ŁukaszL. - That's very close to what I have got to that nearly works. If you post that as an answer with a little narrative I will up-vote. – OldCurmudgeon Jan 22 '14 at 10:42
  • @OldCurmudgeon I updated my answer with two more solutions, allowing you to specify a whole file and delete the comma or to enclose the description in quotes, as Lukasz suggested. – KeyNone Jan 22 '14 at 12:19

4 Answers

3

Solution 1: single-line, delete ,

Here you go with a sed one-liner:

sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\1'"$(sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\2/' <<< "$myInput" | sed 's/,//g')"'\3/' <<< "$myInput"

You have to replace <<< "$myInput" with whatever your actual input is.
As you're working with CSVs you may have to tweak (both occurrences of) the regex to match each line of your CSV sheet.
If your first three and last two fields are longer than one character, replace [^,] with [^,]*.

Explanation:
We use this regex

/([^,],[^,],[^,],)(.*)(,.+,.+)/

which captures the first (F,G,H,), second (.*) and last part (,D,E) of the string for us.
The first and third capture groups will be unchanged, while the second is going to be substituted.
For the substitution we call sed a second (and actually a third) time. First we extract only the second group, then we replace every , in it with nothing (only in that capture group!).
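For readers without sed, the same three-group idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name strip_description_commas is made up, and the pattern assumes exactly three leading fields and two trailing fields, using [^,]* so the outer fields may be longer than one character.

```python
import re

# Assumption: 3 leading fields, then the description, then 2 trailing fields.
# Group 1 = leading fields, group 2 = description, group 3 = trailing fields.
LINE = re.compile(r'^((?:[^,]*,){3})(.*)((?:,[^,]*){2})$')


def strip_description_commas(line):
    """Delete every comma inside the middle group; leave the outer groups alone."""
    def fix(m):
        return m.group(1) + m.group(2).replace(',', '') + m.group(3)

    # Lines that do not match the pattern are returned unchanged.
    return LINE.sub(fix, line)


print(strip_description_commas('F,G,H,This is a description with a comma (,) in it,D,E'))
```

Using a replacement function avoids calling the regex engine three times per line, which is what the nested sed calls do.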

Of course, if there is no unwanted comma, nothing gets replaced.


Solution 2: whole file, line-by-line, delete ,

If you want to specify only a file and the replacement should happen for each line of the file you can use

while IFS= read -r line; do sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\1'"$(sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\2/' <<< "$line" | sed 's/,//g')"'\3/' <<< "$line"; done < input.txt

where input.txt at the end is, obviously, your file.
I just use the sed command from above within a while loop that reads each line of the file. This is necessary because you have to keep track of the line you're reading, as you're calling sed twice on the same input.


Solution 3: whole file, enclose field in "

As @Łukasz L. pointed out in the comments to the OP, according to RFC 4180, which describes the format for CSV files, it would be better to enclose fields that contain a comma in ".
This is simpler than the other solutions:

sed -r 's/([^,],[^,],[^,],)(.*)(,.*,.*)/\1"\2"\3/' input.txt

Again we have the three capturing groups. This allows us to simply wrap the second group in "!
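A rough Python equivalent of the quoting idea, offered as a sketch rather than a drop-in replacement for the sed command. It differs from the one-liner in two assumed ways: it quotes the description only when that is needed, and it doubles any embedded " as RFC 4180 asks. The name quote_description and the fixed 3 + 1 + 2 column layout are assumptions for illustration.

```python
import csv
import re

# Assumption: 3 leading fields, the description, 2 trailing fields.
LINE = re.compile(r'^((?:[^,]*,){3})(.*)((?:,[^,]*){2})$')


def quote_description(line):
    """Wrap the description in double quotes when it contains a comma or a quote."""
    def fix(m):
        desc = m.group(2)
        if ',' not in desc and '"' not in desc:
            return m.group(0)  # nothing to protect, keep the line as it is
        # RFC 4180: a quote inside a quoted field is written as two quotes.
        return m.group(1) + '"' + desc.replace('"', '""') + '"' + m.group(3)

    return LINE.sub(fix, line)


fixed = quote_description('F,G,H,This is a description with a comma (,) in it,D,E')
print(fixed)
# A standard CSV parser should now see the description as a single field.
print(next(csv.reader([fixed])))
```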

KeyNone
  • I like your solution 3 best - I can use the equivalent `((?:[^,],){3})(.*)((?:,.*){2})` in `TextCrawler` and it seems to work fine. – OldCurmudgeon Jan 22 '14 at 13:57
  • @OldCurmudgeon Sure, it's up to you which regex you finally use, but as far as I know `sed` has no support for non-capturing groups, so I try to group as few as possible when working with `sed`. – KeyNone Jan 22 '14 at 14:00
2

If the column count is fixed, we can try to cut out the first three and last two columns with non-capturing groups (?:...) and match the commas inside the rest of the line (which is the description). I've got something like this:

(?:^(?:[^,]*,){3})(?:(?:[^,]*(,))*[^,]*)(?:(?:,[^,]*){2}$)

The [^,]* is the field (without comma), so (?:^(?:[^,]*,){3}) will cut the first 3 columns (including the comma that follows each). The (?:(?:,[^,]*){2}$) will remove the last 2 columns, including the comma that precedes each. (?:(?:[^,]*(,))*[^,]*) matches the inside.

In JavaScript, the whole expression matches the complete line, and the (,) group captures a comma inside the description (the last one, since a repeated group keeps only its final capture). Depending on the regex engine, this makes it possible either to escape and replace the description (if the engine gives the ranges of the matched expressions) or to target the (,) group with replacement syntax.

I don't have the possibility now to run and test with sed, but that regex should be very near to the solution you need.
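As a hedged illustration of the "ranges" idea, here is a sketch in Python rather than JavaScript or sed. Python's re module reports the span of the (,) group, so one spurious comma can be deleted per pass and the pass repeated until none is left. The regex is the one above, unchanged; the function names are made up for the example.

```python
import re

# The regex from this answer, unchanged. Group 1 is the (,) inside the description.
PATTERN = re.compile(r'(?:^(?:[^,]*,){3})(?:(?:[^,]*(,))*[^,]*)(?:(?:,[^,]*){2}$)')


def drop_one_extra_comma(line):
    """Remove the single comma captured by group 1, if that group took part in the match."""
    m = PATTERN.match(line)
    if m is None or m.group(1) is None:
        return line  # wrong column count, or no spurious comma
    start, end = m.span(1)
    return line[:start] + line[end:]


def drop_all_extra_commas(line):
    """Repeat the single-comma pass until the line stops changing."""
    while True:
        fixed = drop_one_extra_comma(line)
        if fixed == line:
            return line
        line = fixed


print(drop_all_extra_commas('F,G,H,This is, a description with, three commas (,) in it,D,E'))
```

This matches the "run it multiple times" fallback mentioned in the question: each pass removes exactly one comma.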

Danubian Sailor
  • My attempt got to `(^(?:[^,]*?,){3})(.*?),(.*?)((?:,[^,\n]*){2})` -> `$1$2$3$4` which is nearly what you have. It peels off `n` columns at start, then requires `xxx,yyy` then `m` columns at the end. It then puts it back together leaving the comma in the middle out. Sadly it seems inconsistent somehow. – OldCurmudgeon Jan 22 '14 at 11:18
0

I tried to solve this with sed, but could not perform the replace inside a match group. Instead, I managed to do it with a ruby one-liner that can be run from the terminal:

cat your_file | ruby -ne '$_.scan(/^(\w+,\w+,\w+,)([^$]+)(,\w,\w)$/).each{|m|puts m[0]+m[1].gsub(",","")+m[2]}'

This assumes that there are always 6 columns and that the 4th is the one that may contain commas.

The code has been tested with ruby 1.8.7, 1.9.1 and 2.1.0.

Cristian Lupascu
  • I'll have to wait for others' votes on whether this is good - I do not have access to Ruby. Any other free s&r tool would be acceptable - I use [TextCrawler](http://www.digitalvolcano.co.uk/textcrawler.html) a lot so a grep in that or any other free tool would be fine. – OldCurmudgeon Jan 21 '14 at 22:53
  • @OldCurmudgeon I'll leave the answer here; it may help someone. I remember having this exact same problem a couple of years ago. – Cristian Lupascu Jan 22 '14 at 07:25
0

My approach is to calculate which commas to remove based on the total number of columns and on which column you have to modify. It accepts three arguments: the input file, the total number of columns, and the number of the weird column with extra commas.

The following script does the calculation. When there are extra commas there will be additional columns, so it works out where those extra columns are positioned and joins them.

#!/usr/bin/env perl 

use warnings;
use strict;
use Text::CSV_XS;

my (@columns);

open my $fh, '<', shift or die;
my ($total_columns, $weird_column) = (shift, shift);

my $csv = Text::CSV_XS->new or die;
while ( my $row = $csv->getline( $fh ) ) { 
    undef @columns;
    if ( @$row == $total_columns ) { 
        @columns = @$row;
        next;
    }   

    my $extra_columns = @$row - $total_columns;
    my $post_columns_index = $weird_column + $extra_columns;
    @columns = ( 
        @$row[0..($weird_column-2)], 
        join( '', @$row[($weird_column-1)..($post_columns_index-1)]),  
        @$row[$post_columns_index..$#$row] 
    );  
}
continue {
    $csv->print( \*STDOUT, \@columns );
    printf "\n";
}

Assuming an input file like:

A,B,C,This is a description,D,E
F,G,H,This is a description with a comma (,) in it,D,E
F,G,H,This is, a description with two commas (,) in it,D,E
F,G,H,This is, a description with, three commas (,) in it,D,E

Run it like:

perl script.pl infile 6 4

That yields:

A,B,C,"This is a description",D,E
F,G,H,"This is a description with a comma () in it",D,E
F,G,H,"This is a description with two commas () in it",D,E
F,G,H,"This is a description with three commas () in it",D,E       

Perhaps it could fail with edge cases (first and last field). I didn't test it in depth, but I hope you get the idea. I tried to make it as general as possible.
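The same index arithmetic can be sketched in Python with only the standard library. Note one deliberate difference from the Perl script: this variant keeps the commas inside the merged field and lets csv.writer quote it, instead of deleting them. It assumes the input contains no quoted fields, so a plain split on commas is safe. The names repair_row and repair_text are hypothetical.

```python
import csv
import io


def repair_row(line, total_columns, weird_column):
    """Merge the surplus fields back into column `weird_column` (1-based)."""
    fields = line.rstrip('\n').split(',')
    extra = len(fields) - total_columns
    if extra <= 0:
        return fields  # already the expected width (or too short to repair)
    first = weird_column - 1   # index of the first piece of the split field
    last = first + extra       # index of its last piece
    merged = ','.join(fields[first:last + 1])
    return fields[:first] + [merged] + fields[last + 1:]


def repair_text(text, total_columns, weird_column):
    """Repair every line and write it back out as CSV, quoting where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for line in text.splitlines():
        writer.writerow(repair_row(line, total_columns, weird_column))
    return out.getvalue()


sample = (
    'A,B,C,This is a description,D,E\n'
    'F,G,H,This is, a description with two commas (,) in it,D,E\n'
)
print(repair_text(sample, 6, 4), end='')
```

Replacing ','.join with ''.join would reproduce the comma-deleting behaviour of the Perl version.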

Birei