
I have a .tsv file and I need to figure out the frequencies of the values in a specific column and organize that data in descending order. I run a script in C which downloads a buffer and saves it to a .tsv file, named with a date stamp, in the same directory as my code. I then open my Terminal and run the following command, per this awesome SO answer:

cat 2016-09-06T10:15:35Z.tsv | awk -F '\t' '{print $1}' * | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv

To break this apart by pipes, what this does is:

  1. cat the .tsv file to get its contents into the pipe

  2. awk -F '\t' '{print $1}' * breaks the file's contents up by tab and pushes the contents of the first column into the pipe

  3. LC_ALL=C sort takes the contents of the pipe and sorts them to have like-values next to one another, then pushes that back into the pipe

  4. LC_ALL=C uniq -c takes the stuff in the pipe and figures out how many times each value occurs and then pushes that back into the pipe (e.g., Max 3, if the name Max shows up 3 times)

  5. Finally, LC_ALL=C sort -nr sorts the stuff in the pipe again to be in descending order, and then prints it to stdout, which I pipe into a file.
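The steps above can be sketched end-to-end on a tiny hand-made .tsv (the sample data and file name here are hypothetical, just for illustration; note there is no trailing `*` on the awk stage, which turns out to matter):

```shell
# Create a small sample .tsv: column 1 is a name, column 2 is an id.
printf 'Max\t1\nAnna\t2\nMax\t3\nMax\t4\nAnna\t5\n' > sample.tsv

# Count occurrences of each value in column 1, most frequent first.
cat sample.tsv \
  | awk -F '\t' '{print $1}' \
  | LC_ALL=C sort \
  | LC_ALL=C uniq -c \
  | LC_ALL=C sort -nr
# prints "Max" with count 3 above "Anna" with count 2
```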

Here is where things get interesting. If I do all of this in the same directory as the C code which downloaded my .tsv file to begin with, I get super wacky results which appear to be a mix of my actual .tsv file, some random corrupted garbage, and the contents of the C code which fetched it in the first place. Here is an example:

( count ) ( value )

 1     fprintf(f, " %s; out meta qt; rel %s; out meta qt; way %s; out meta qt; >; out meta qt;", box_line, box_line, box_line);
   1     fclose(f);
   1     char* out_file = request_osm("cmd_tmp.txt", true);
   1     bag_delete(lines_of_request);
   1 
   1 
   1 
   1 

   1 

   1??g?
   1??g?
   1?
   1?LXg$E

... etc. Now if you scroll up in that, you also find some correct values, from the .tsv I was parsing:

( count ) ( value )

   1 312639
   1 3065411
   1 3065376
   1 300459
   1 2946076

... etc. And if I move my .tsv into its own folder, and then cd into that folder and run that same command again, it works perfectly.

( count ) ( value )

419362 452999
115770 136420
114149 1380953
72850 93290
51180 587015
45833 209668
31973 64756
31216 97928
30586 1812906

Obviously I have a functional answer to my problem - just put the file in its own folder before parsing it. But I think that this memory corruption suggests there may be some larger issue at hand I should fix now, and I'd rather get on top of it than kick it down the road with a temporary symptomatic patch, so to speak.

I should mention that my C code does use system(cmd) sometimes.

asked by Max von Hippel

1 Answer

The second command is the problem:

awk -F '\t' '{print $1}' *

See the asterisk at the end? It tells awk to process all files in the current directory - your .tsv, your C source, and anything else that happens to be there - rather than just reading standard input (the output of the previous pipe stage).
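You can see this behavior in isolation (a small demonstration, not from the original post; the file name is made up):

```shell
# When awk is given a file operand, it ignores stdin entirely.
echo 'from-file' > demo.txt
echo 'from-stdin' | awk '{print}' demo.txt
# prints "from-file", not "from-stdin"
```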

Just remove the asterisk and it should work.
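With that one change, the full command from the question becomes:

```shell
cat 2016-09-06T10:15:35Z.tsv \
  | awk -F '\t' '{print $1}' \
  | LC_ALL=C sort \
  | LC_ALL=C uniq -c \
  | LC_ALL=C sort -nr > tst.tsv
```

As a side note, awk can also read the file directly, which lets you drop the cat: `awk -F '\t' '{print $1}' 2016-09-06T10:15:35Z.tsv | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr > tst.tsv`.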

answered by Codo