How to remove a subcolumn from a nested-csv file?

Question

Given a space-separated file such as:

0.0 1:0.000000 2:1.000000 3:0.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:1.000000 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:0.919033 16:1.000000 17:1.000000 18:1.000000 19:1.000000 20:0.000000 21:0.037771
0.0 1:0.000000 2:1.000000 3:0.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:0.800000 9:0.666667 10:1.000000 11:0.800000 12:0.666667 13:1.000000 14:0.875000 15:0.874574 16:0.848662 17:0.901802 18:0.938795 19:0.903077 20:0.333332 21:0.196682
0.0 1:1.098612 2:1.000000 3:1.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:0.800000 9:0.500000 10:0.000000 11:0.800000 12:0.500000 13:0.000000 14:0.909091 15:0.780985 16:0.792052 17:0.865396 18:0.863982 19:0.832962 20:0.000000 21:0.069470
0.0 1:0.000000 2:1.000000 3:0.000000 4:0.000000 5:1.000000 6:0.000000 7:1.000000 8:0.923077 9:0.909091 10:0.888889 11:0.923077 12:0.909091 13:0.888889 14:0.943396 15:0.923562 16:0.923871 17:0.949357 18:0.950790 19:0.944919 20:0.142857 21:0.140054

The first column is always 0.0, and we want to throw that column away. In every remaining column a colon separates the key from its value, and the goal is to keep only the value.

I can do it like this in Python:

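import io

with io.open(infile, 'r') as fin:  # infile: path to the input file
    for line in fin:
        # Drop the first column, then keep only the value after each colon.
        fields = line.split()[1:]
        print('\t'.join(field.split(':')[1] for field in fields))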

[out]:

0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    0.919033    1.000000    1.000000    1.000000    1.000000    0.000000    0.037771
0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.800000    0.666667    1.000000    0.800000    0.666667    1.000000    0.875000    0.874574    0.848662    0.901802    0.938795    0.903077    0.333332    0.196682
1.098612    1.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.800000    0.500000    0.000000    0.800000    0.500000    0.000000    0.909091    0.780985    0.792052    0.865396    0.863982    0.832962    0.000000    0.069470
0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.923077    0.909091    0.888889    0.923077    0.909091    0.888889    0.943396    0.923562    0.923871    0.949357    0.950790    0.944919    0.142857    0.140054

But how can the same be done on the Unix command line (maybe with sed, awk, perl -e, python -c, or anything else)? Imagine it's a large file, so please don't load the whole file into memory, unless there's an economical reason for that.

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

With GNU sed:

sed 's/^0.0 //;s/[0-9]\+:\([0-9.]\+\)/\1/g' file

Output:

0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.919033 1.000000 1.000000 1.000000 1.000000 0.000000 0.037771
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.666667 1.000000 0.800000 0.666667 1.000000 0.875000 0.874574 0.848662 0.901802 0.938795 0.903077 0.333332 0.196682
1.098612 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.500000 0.000000 0.800000 0.500000 0.000000 0.909091 0.780985 0.792052 0.865396 0.863982 0.832962 0.000000 0.069470
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.923077 0.909091 0.888889 0.923077 0.909091 0.888889 0.943396 0.923562 0.923871 0.949357 0.950790 0.944919 0.142857 0.140054

If you want to edit your file "in place" add sed's option -i.

Two sed commands separated by ;:

s/^0.0 //: match 0.0 followed by a space at the start of the line (^) and replace it with nothing. Strictly, the unescaped . matches any character; 0\.0 would match only a literal dot.

s/[0-9]\+:\([0-9.]\+\)/\1/g: match one or more (\+) digits followed by a :, then one or more (\+) characters that are digits or a ., and replace the whole match with the part captured in the round brackets. \1 is the back-reference to that captured part. g means global: apply the replacement to every match of the regexp on the line, not just the first. In sed's basic regular expressions the special characters (here: +, (, )) have to be escaped with a \.
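For readers more at home in Python, the two substitutions can be sketched with the re module. This is only an illustration of what the sed commands do, not a replacement for them; the function name strip_keys and the sample line are made up for the example, and the dot in the first pattern is escaped here so that it matches literally.

```python
import re

def strip_keys(lines):
    """Yield each line with the leading "0.0 " and every "key:" prefix removed."""
    for line in lines:
        # Counterpart of s/^0.0 // (with the dot escaped to match a literal ".")
        line = re.sub(r'^0\.0 ', '', line)
        # Counterpart of s/[0-9]\+:\([0-9.]\+\)/\1/g: keep only the captured value
        yield re.sub(r'[0-9]+:([0-9.]+)', r'\1', line)

# Hypothetical sample line in the same shape as the input file
sample = ["0.0 1:0.000000 2:1.000000 21:0.037771\n"]
for out in strip_keys(sample):
    print(out, end='')
```

Because strip_keys is a generator, passing it an open file object should process the input one line at a time rather than loading it all into memory.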

Shorter version:

sed 's/^0.0 //;s/[0-9]\+://g' file

See: The Stack Overflow Regular Expressions FAQ

Care to explain the regex a little? – alvas Jan 30 '16 at 07:53

score 2 · Answer 2 · answered Jan 30 '16 at 14:25

awk to the rescue!

$ awk '{gsub("[^ ]*:","");sub("[^ ]* ","")}1' file 

0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.919033 1.000000 1.000000 1.000000 1.000000 0.000000 0.037771
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.666667 1.000000 0.800000 0.666667 1.000000 0.875000 0.874574 0.848662 0.901802 0.938795 0.903077 0.333332 0.196682
1.098612 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.800000 0.500000 0.000000 0.800000 0.500000 0.000000 0.909091 0.780985 0.792052 0.865396 0.863982 0.832962 0.000000 0.069470
0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.923077 0.909091 0.888889 0.923077 0.909091 0.888889 0.943396 0.923562 0.923871 0.949357 0.950790 0.944919 0.142857 0.140054

Cynical · Answer 3 · 2016-01-30T09:35:00.903

awk can do that:

// {
    for(i=2; i<=NF; i++)
    {
        split($i, array, ":")
        printf("%s\t", array[2])
    }
    printf("\n")
}

Explanation: // is an empty pattern, so it matches every line. For each field from the 2nd to the last (NF), split the ith field on the : and store the parts in array (1-based, so array[2] is the second part), then printf that part followed by a tab. At the end of each line, print a newline.
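The same field-by-field logic, written out as a rough Python sketch for comparison. The function name values_only is invented for the example. Two differences are worth noting: Python lists are 0-based, so awk's array[2] becomes parts[1], and joining with a tab avoids the trailing tab that the printf loop above leaves at the end of each line.

```python
def values_only(line):
    """Return the values of all key:value fields after the first column, tab separated."""
    fields = line.split()            # whitespace splitting, like awk's default fields
    values = []
    for field in fields[1:]:         # awk's i=2..NF; Python indexes from 0
        parts = field.split(':')     # awk: split($i, array, ":")
        values.append(parts[1])      # awk's array[2]
    return '\t'.join(values)

# Hypothetical sample line in the same shape as the input file
print(values_only("0.0 1:0.000000 2:1.000000 3:0.000000"))
```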

Edit

This was my first answer, but I skipped the bit on removing the other parts of each column.

You can use the cut command: for example, if you need to remove just the first column, you can write

cut -d' ' -f2- yourfile

Explanation: -d' ' sets the field delimiter to a space, -f lets you choose the fields (columns) you want to keep, and 2- means "from the second on".
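As a rough illustration of what removing just the first column means for this space-separated data, here is a small Python sketch. The helper name drop_first_field is hypothetical. The second print shows, for contrast, what slicing by character position does: it removes only the first character, not the first column.

```python
def drop_first_field(line, delimiter=' '):
    """Return the line without its first delimiter-separated field."""
    head, sep, rest = line.partition(delimiter)
    # A line with no delimiter is returned unchanged, which is also what
    # cut -f does with delimiter-free lines unless -s is given.
    return rest if sep else line

# Hypothetical sample line in the same shape as the input file
line = "0.0 1:0.000000 2:1.000000"
print(drop_first_field(line))   # field-based: the whole first column is dropped
print(line[1:])                 # character-based: only the first character is dropped
```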

score 1 · Answer 4 · answered Jan 30 '16 at 23:12

Why not use the module:

use Text::CSV;

It already provides the functionality to easily parse a CSV file into a structure, and also go the other way to parse a structure into a CSV file.

You can then select which column you want to keep or remove.

Tim Potapov

This I think would be greatly improved by an example - I don't think `Text::CSV` is needed given the dataset, and is quite a heavy weight way of parsing this file. – Sobrique Jan 31 '16 at 21:47

Sobrique · Answer 5 · 2016-01-31T21:46:21.497

Quite easy with perl:

perl -ne 'print join ( "\t", m/:([\d\.]+)/g ),"\n"' file_to_parse

This:

Iterates line by line (-n wraps it in a while ( <> ) { loop)
Extracts the numeric value after a : with a regex m/:([\d\.]+)/g (and does so repeatedly). I've assumed digits and . but actually you could probably just do m/:(\S+)/g if 'any non whitespace' is ok (as in your example).
Skips your first field, because it doesn't contain a :
Prints the extracted values, tab separated, followed by a newline

Output:

0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    0.919033    1.000000    1.000000    1.000000    1.000000    0.000000    0.037771
0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.800000    0.666667    1.000000    0.800000    0.666667    1.000000    0.875000    0.874574    0.848662    0.901802    0.938795    0.903077    0.333332    0.196682
1.098612    1.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.800000    0.500000    0.000000    0.800000    0.500000    0.000000    0.909091    0.780985    0.792052    0.865396    0.863982    0.832962    0.000000    0.069470
0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000    0.923077    0.909091    0.888889    0.923077    0.909091    0.888889    0.943396    0.923562    0.923871    0.949357    0.950790    0.944919    0.142857    0.140054
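The extract-everything-after-a-colon approach carries over to Python almost unchanged. The sketch below is an illustration only, with extract_values as a made-up name; it relies on re.findall returning the captured group of every match when the pattern has exactly one group.

```python
import re

def extract_values(line):
    """Return every value that follows a ':' on the line, tab separated."""
    # The first field has no ':', so it never matches and is dropped.
    return '\t'.join(re.findall(r':([\d.]+)', line))

# Hypothetical sample line in the same shape as the input file
print(extract_values("0.0 1:0.000000 2:1.000000 21:0.037771"))
```

To stream a large file, calling extract_values on each line while iterating over the open file object should keep memory use flat.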
