
I have some files that are named as follows:

 d_Ca-1_O_7.dat
 d_Ca-1_O_8.dat
 d_Ca-1_O_14.dat
 d_Ca-1_O_16.dat
 d_Ca-1_O_10.dat

In each of these files I have this structure:

 abcA_BCdef  1 G   1     2.4733     4.6738    7 O    0 0 0
 ghiJ_KLmno  1 P   1     2.4811     4.6887    7 O    0 0 0
 pqrS_TLxyz  1 L   1     2.4872     4.7000    7 O    0 0 0
 ... 
 (the same scheme)       

I would like to make a bash script that goes over these files, something like:

for {i = 7, 8, 14, 16} in d_Ca-1_O_i.dat 

and converts each file to this format:

 A.BC     2.4733     #  0 0 0
 J.KL     2.4811     #  0 0 0
 S.TL     2.4872     #  0 0 0
 ... 
 (the same scheme)       

For every line:

1) First column: drop the same number of characters from the beginning and from the end, keeping only the part around the underscore (abcA_BCdef becomes A_BC)

2) First column: replace the _ with a .

3) Remove the 2nd, 3rd, 4th, 6th, 7th and 8th columns

4) Put a # in front of the 9th column

I would very much appreciate some help.

DavidC.

1 Answer


Assuming that your input is tab separated, here is a GNU Awk script:

script.awk:

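BEGIN { OFS=FS="\t" }   # tab-separated input and output
      { # keep one char before and two chars after the "_", joined by "."
        # third argument 1 = replace the first match (an empty string triggers a gawk warning)
        strange = gensub(/^.*(.)_(..).*$/, "\\1.\\2", 1, $1)
        print strange, $5, "#" $9 }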

Call it like this inside the for loop of your bash script: awk -f script.awk yourfile

E.g. something like:

for i in 7 8 14 16 
do 
  awk -f script.awk "d_Ca-1_O_${i}.dat"
done
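
The loop above prints the converted lines to the terminal. If the goal is one converted file per input file, a possible sketch is to redirect each run to a new file. The function name `convert_all` and the `.conv` suffix are made up for illustration; they are not part of the original answer.

```shell
#!/usr/bin/env bash
# convert_all SCRIPT FILE...
# Hypothetical helper: run the awk SCRIPT on each FILE and write the
# result next to it, with ".dat" replaced by ".conv" (made-up suffix).
convert_all() {
  script=$1
  shift
  for f in "$@"; do
    # skip names that do not exist (e.g. a glob that matched nothing)
    [ -f "$f" ] || { echo "skipping missing file: $f" >&2; continue; }
    awk -f "$script" "$f" > "${f%.dat}.conv"
  done
}

# Example call (commented out: it needs script.awk and the data files):
# convert_all script.awk d_Ca-1_O_*.dat
```

Using the glob `d_Ca-1_O_*.dat` instead of a fixed list of numbers would also pick up `d_Ca-1_O_10.dat` from the question's listing; the input files themselves are left untouched.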

For the transformation of the first field, the script takes one char to the left and two chars to the right of an underscore. The underscore is converted to a dot, all other chars from field one are discarded.
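
To see the transformation on one of the sample lines, here is a small sketch under two assumptions that differ from the script above: the input is separated by spaces rather than tabs (so the three zeros are fields `$9`, `$10` and `$11`), and `gensub` is swapped for the POSIX functions `match` and `substr` so that it does not require GNU Awk. One difference to be aware of: with several underscores in the first field, `match` picks the first one, while the greedy regex above picks the last. The function name `convert_line` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch for whitespace-separated input, using POSIX awk only.
convert_line() {
  awk '{
    short = $1
    # find "one char, underscore, two chars"; RSTART is where the match begins
    if (match($1, /._../))
      short = substr($1, RSTART, 1) "." substr($1, RSTART + 2, 2)
    # default field splitting: the three zeros are $9, $10 and $11
    print short, $5, "#", $9, $10, $11
  }' "$@"
}

printf '%s\n' 'abcA_BCdef  1 G   1     2.4733     4.6738    7 O    0 0 0' | convert_line
# prints: A.BC 2.4733 # 0 0 0
```

The output columns are separated by single spaces here; setting `OFS` in a `BEGIN` block would change that.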

Lars Fischer
  • @Lars Fischer: Thank you very much for this, but only the 1st and "#" columns are printed. The 5th column is not printed, and neither are the `0 0 0` columns – DavidC. May 04 '16 at 21:13
  • @DavidC. If the file is not tab separated, delete the `OFS=FS="\t"` and adjust the field numbers as required (e.g. the three "0" will be three fields `$9`, `$10` and `$11`). – Lars Fischer May 04 '16 at 21:44
  • Thank you very much for this, but I would very much like to understand the syntax, otherwise I am not learning anything. Could you please explain what each symbol in `/^.*(.)_(..).*$/,"\\1.\\2","",$1` does? Thank you so much – DavidC. May 06 '16 at 18:16
  • @DavidC. That is called a **regular expression**, a good starting point is http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075 , assume the `\\1` and `\\2` are just `\1` and `\2` (the backslash has to be escaped, but that takes place outside of the regex engine). – Lars Fischer May 06 '16 at 18:33
  • Thanks a lot... I have now marked this answer with a tick and given a +1 to your comment. These days I've been learning awk intensively... please see http://stackoverflow.com/questions/37104901/in-awk-search-for-some-certain-columns-of-a-curent-line for my progress (I believe... see the large script I made), but unfortunately I am stuck again... Thanks a lot! – DavidC. May 08 '16 at 21:20