2

I have two text files that have similar formatting. The first (732KB):

>lib_1749;size=599;
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGTCACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
--
>lib_2235;size=456;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
>lib_13686;size=69;
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCAGGCAAGTCAGAAGTGAAAGCCCGGGGCTCAACCCCGGGGCTGGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACA
--

The second (5.26GB):

>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
TACGGAGGATGCGAGCGTTATCCGGATTTACTGGGTTTAAAGGGAGCGCAGACGGGACGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCCTAAAACTGCTAGCGGTGAAATGCTTAGATATCGGGAGGAACTCCGGTTGCGAAGGCAGCATACTGGACTGCAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACAGG
--

Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.

Note as one final complication the mapping is not going to be 1-to-1 there will be multiple entries of the form Stool****** for each entry of lib****. This is because the length of the key in the first file was trimmed to have 200 characters but in the second file it can be longer.

For smaller files I would just do something like this in python but I often have trouble because these files are so big and cannot be read into memory at one time. Usually I try unix utilities but in this case I cannot think of how to accomplish this.

Thank you!

jds
  • 554
  • 3
  • 15

2 Answers2

1

In my opinion, the easiest way would be to use BLAST+...

Set up the larger file as a BLAST database and use the smaller file as the query...

Then just write a small script to analyse the output - I.e. Take the top hit or two to create the mapping file.

BTW. You might find SequenceServer (Google it) helpful in setting up a custom Blast database and your BLAST environment...

Ismail Moghul
  • 2,605
  • 2
  • 19
  • 24
0

BioPython should be able to read in large FASTA files.

from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)

for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)

    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)

        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)
BioGeek
  • 19,132
  • 21
  • 75
  • 123