I am working in bash. I am trying to find unique barcodes within strings in a .txt file. Each string can contain 3 barcodes. I want to identify and label each unique configuration that contains my barcodes of interest.

This is my starting reads.txt file that contains the strings I want to evaluate.

ABCD1
EFGH2
ABGH1
EFCD2

As an example, the barcodes contained in ABCD1 are AB, CD, and 1.

My desired result is to identify only the strings ABCD1 and EFGH2 and to store them as result.1.txt and result.2.txt, respectively.

Below is my attempt.

# Add the barcode sequences to a bash array
declare -a BARCODES1=(AB EF)
declare -a BARCODES2=(CD GH)
declare -a BARCODES3=(1 2)

# Initialize counter
count=1

# Search for the barcode sequences in the reads.txt file
rm ROUND*
rm result*

for barcode in "${BARCODES1[@]}";
    do
    grep "$barcode1" reads.txt > ROUND1_MATCHES.txt

        for barcode2 in "${BARCODES2[@]}";
        do
        grep "$barcode2" ROUND1_MATCHES.txt > ROUND2_MATCHES.txt

           for barcode3 in "${BARCODES3[@]}";
            do
            grep "$barcode3" ROUND2_MATCHES.txt > ROUND3_MATCHES.txt

                if [ -s ROUND3_MATCHES.txt ]
                then
                mv ROUND3_MATCHES.txt result.$count.txt
                fi

            count=`expr $count + 1`
            done
        done
    done

Strangely, this code outputs too many result files. Running head result* gives me the following.

==> result.1.txt <==
ABCD1

==> result.2.txt <==
EFCD2

==> result.3.txt <==
ABGH1

==> result.4.txt <==
EFGH2

==> result.5.txt <==
ABCD1

==> result.6.txt <==
EFCD2

==> result.7.txt <==
ABGH1

==> result.8.txt <==
EFGH2

The desired result would have been:

==> result.1.txt <==
ABCD1

==> result.2.txt <==
EFGH2
Paul
  • Unrelated, but why in bash specifically? – Dave Newton Sep 13 '18 at 14:13
  • I used bash because it works nicely on the Linux cluster environment that I am using and because I am more comfortable writing bash scripts than other languages (I am still a beginner). Certainly something in Python could be made to work. Is there an obvious advantage (speed etc.) that I am missing out on by choosing to use bash? – Paul Sep 13 '18 at 14:21
  • ¯\_(ツ)_/¯ Don't know, just seems overly-complex and disk-heavy to keep grepping/etc instead of using a more general-purpose language with better string support etc. – Dave Newton Sep 13 '18 at 14:24
  • Using `grep` inside a nested inner loop (starting a whole new program, reading the input file from the very beginning, etc) is indeed a serious code smell. That's not necessarily a problem with bash, though, as opposed to a problem with how it's being applied. – Charles Duffy Sep 13 '18 at 15:18

1 Answer

You just want to iterate over the indices of the arrays:

for index in "${!BARCODES1[@]}"; do
    echo "${BARCODES1[index]}${BARCODES2[index]}${BARCODES3[index]}"
done
ABCD1
EFGH2

With 3 nested loops, count gets incremented 2 * 2 * 2 = 8 times, because the increment sits outside the if and runs whether or not anything matched.
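A minimal sketch of how the index-based loop could replace the three nested loops. It assumes the barcodes are meant to be paired by position (AB with CD and 1, EF with GH and 2) and that each read is exactly the three barcodes concatenated; neither is stated outright in the question. It also sidesteps a separate problem in the original script: the outer loop variable is named barcode, but the first grep uses $barcode1, which is unset, so that grep matches every line. The sample reads.txt is recreated here only so the sketch runs on its own.

```shell
#!/usr/bin/env bash
# Sample input from the question, recreated so the sketch is self-contained.
printf '%s\n' ABCD1 EFGH2 ABGH1 EFCD2 > reads.txt

BARCODES1=(AB EF)
BARCODES2=(CD GH)
BARCODES3=(1 2)

# Remove results from earlier runs; -f keeps rm quiet when nothing matches.
rm -f result.*.txt

count=1
# Walk the three arrays in parallel by index
# (assumption: barcodes pair up by position).
for index in "${!BARCODES1[@]}"; do
    pattern="${BARCODES1[index]}${BARCODES2[index]}${BARCODES3[index]}"
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    if grep -xF -- "$pattern" reads.txt > "result.$count.txt"; then
        count=$((count + 1))        # keep the file and move to the next number
    else
        rm -f "result.$count.txt"   # no match: drop the empty file, reuse the number
    fi
done
```

Because count only advances when grep finds something, the result files are numbered without gaps.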


It's a bit unclear what you're trying to do. If you're trying to generate the cross product of (AB,EF), (CD,GH), and (1,2), you can do:

$ printf "%s\n" {AB,EF}{CD,GH}{1,2}
ABCD1
ABCD2
ABGH1
ABGH2
EFCD1
EFCD2
EFGH1
EFGH2

And then, if you are trying to extract the lines in reads.txt that match one of those strings, then

$ grep -xFf <( printf "%s\n" {AB,EF}{CD,GH}{1,2} ) reads.txt
ABCD1
EFGH2
ABGH1
EFCD2
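The cross product above matches all four reads, including ABGH1 and EFCD2, which the question wants to exclude. If only position-paired combinations count (an assumption drawn from the desired output, not something the question states), the same grep -xFf approach might be fed a shorter pattern list and its output split into numbered files. This is a sketch only; the patterns array and the sort -u step are illustrative choices, and reads.txt is recreated so the block runs by itself.

```shell
#!/usr/bin/env bash
# Sample input from the question, recreated so the sketch is self-contained.
printf '%s\n' ABCD1 EFGH2 ABGH1 EFCD2 > reads.txt

BARCODES1=(AB EF)
BARCODES2=(CD GH)
BARCODES3=(1 2)

# Build one fixed-string pattern per index
# (assumption: barcodes pair up by position).
patterns=()
for index in "${!BARCODES1[@]}"; do
    patterns+=("${BARCODES1[index]}${BARCODES2[index]}${BARCODES3[index]}")
done

# Remove results from earlier runs.
rm -f result.*.txt

# A single grep pass over reads.txt; sort -u keeps each configuration once.
count=1
while IFS= read -r match; do
    printf '%s\n' "$match" > "result.$count.txt"
    count=$((count + 1))
done < <(grep -xFf <(printf '%s\n' "${patterns[@]}") reads.txt | sort -u)
```

This reads the input file once instead of once per barcode combination, and it needs no intermediate ROUND*_MATCHES.txt files.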
glenn jackman