I have a .tsv file and I need to figure out the frequencies of the values in a specific column and organize that data in descending order. I run a script in C which downloads a buffer and saves it to a .tsv file with a date stamp for a name in the same directory as my code. I then open my Terminal and run the following command, per this awesome SO answer:
cat 2016-09-06T10:15:35Z.tsv | awk -F '\t' '{print $1}' * | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv
To break this apart by pipes, what this does is:
cat the .tsv file to get its contents into the pipe.
awk -F '\t' '{print $1}' * breaks the file's contents up by tab and pushes the contents of the first column into the pipe.
LC_ALL=C sort takes the contents of the pipe and sorts them to have like values next to one another, then pushes that back into the pipe.
LC_ALL=C uniq -c takes the stuff in the pipe and figures out how many times each value occurs, then pushes that back into the pipe (e.g., 3 Max, if the name Max shows up 3 times).
Finally, LC_ALL=C sort -nr sorts the stuff in the pipe again to be in descending order, and then prints it to stdout, which I redirect into a file.
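To make the intended behavior concrete, here is a minimal sketch of those stages run on a tiny made-up sample. The function name count_first_column and the sample rows are hypothetical, and awk is given no filename arguments in this sketch, so it reads only what arrives through the pipe.

```shell
# Hypothetical helper: count how often each first-column value occurs,
# most frequent first. Reads tab-separated rows from standard input.
count_first_column() {
  awk -F '\t' '{print $1}' |   # keep only the first tab-separated field
    LC_ALL=C sort |            # put equal values next to one another
    LC_ALL=C uniq -c |         # collapse runs into "count value"
    LC_ALL=C sort -nr          # order by count, descending
}

# Made-up sample: Max appears 3 times, Ann twice, Bob once.
printf 'Max\t1\nAnn\t2\nMax\t3\nMax\t4\nAnn\t5\nBob\t6\n' | count_first_column
```

This should list 3 Max, then 2 Ann, then 1 Bob. The amount of padding in front of each count depends on the uniq implementation.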
Here is where things get interesting. If I do all of this in the same directory as the C code which downloaded my .tsv file to begin with, I get super wacky results which appear to be a mix of my actual .tsv file, some random corrupted garbage, and the contents of the C code which got it in the first place. Here is an example:
( count ) ( value )
1 fprintf(f, " %s; out meta qt; rel %s; out meta qt; way %s; out meta qt; >; out meta qt;", box_line, box_line, box_line);
1 fclose(f);
1 char* out_file = request_osm("cmd_tmp.txt", true);
1 bag_delete(lines_of_request);
1
1
1
1
1
1??g?
1??g?
1?
1?LXg$E
... etc. Now if you scroll up in that, you also find some correct values from the .tsv I was parsing:
( count ) ( value )
1 312639
1 3065411
1 3065376
1 300459
1 2946076
... etc. And if I move my .tsv into its own folder, and then cd into that folder and run that same command again, it works perfectly:
( count ) ( value )
419362 452999
115770 136420
114149 1380953
72850 93290
51180 587015
45833 209668
31973 64756
31216 97928
30586 1812906
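For anyone who wants to try this without my data, here is a minimal reproduction sketch. It assumes a throwaway directory created with mktemp; the file names data.tsv and main.c, their contents, and the function name repro are all hypothetical stand-ins for my real files. The pipeline has the same shape as my command, including the trailing * after the awk program.

```shell
# Hypothetical reproduction: a small .tsv next to a C source file.
# The function body runs in a subshell, so the caller's directory is unchanged.
repro() (
  workdir=$(mktemp -d) || exit 1
  cd "$workdir" || exit 1

  # Stand-in for the downloaded data: 452999 twice, 136420 once.
  printf '452999\tx\n136420\ty\n452999\tz\n' > data.tsv
  # Stand-in for the C code that lives in the same directory.
  printf 'int main(void) { return 0; }\n' > main.c

  # Same shape as the command in the question, printed instead of saved.
  cat data.tsv | awk -F '\t' '{print $1}' * |
    LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr

  # Clean up the throwaway directory.
  cd / && rm -rf "$workdir"
)

repro
```

In this sketch the line from main.c is counted alongside the values from data.tsv, which looks like a small-scale version of the mixed output above. Removing main.c from the directory before running the pipeline leaves only the .tsv values.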
Obviously I have a functional answer to my problem: just put the file in its own folder before parsing it. But I think that this apparent memory corruption suggests there may be some larger issue at hand that I should fix now, and I'd rather get on top of it than kick it down the road with a temporary symptomatic patch, so to speak.
I should mention that my C code does use system(cmd) sometimes.