Merge two large text files by common row to one mapping file

Question

I have two text files that have similar formatting. The first (732KB):

>lib_1749;size=599;
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGTCACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
--
>lib_2235;size=456;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
>lib_13686;size=69;
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCAGGCAAGTCAGAAGTGAAAGCCCGGGGCTCAACCCCGGGGCTGGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACA
--

The second (5.26GB):

>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
TACGGAGGATGCGAGCGTTATCCGGATTTACTGGGTTTAAAGGGAGCGCAGACGGGACGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCCTAAAACTGCTAGCGGTGAAATGCTTAGATATCGGGAGGAACTCCGGTTGCGAAGGCAGCATACTGGACTGCAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACAGG
--

Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.

Note as one final complication the mapping is not going to be 1-to-1 there will be multiple entries of the form Stool****** for each entry of lib****. This is because the length of the key in the first file was trimmed to have 200 characters but in the second file it can be longer.

For smaller files I would just do something like this in python but I often have trouble because these files are so big and cannot be read into memory at one time. Usually I try unix utilities but in this case I cannot think of how to accomplish this.

Thank you!

You can read a large file [one line at a time](https://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python) in Python. — Peter Wood, Feb 25 '15 at 20:51
This would be eaily accomplised with something like MySQL or SQLite — heathobrien, Feb 26 '15 at 09:43

score 1 · Answer 1 · answered Feb 27 '15 at 00:01

In my opinion, the easiest way would be to use BLAST+...

Set up the larger file as a BLAST database and use the smaller file as the query...

Then just write a small script to analyse the output - I.e. Take the top hit or two to create the mapping file.

BTW. You might find SequenceServer (Google it) helpful in setting up a custom Blast database and your BLAST environment...

score 0 · Accepted Answer · answered May 26 '15 at 08:18

BioPython should be able to read in large FASTA files.

from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)

for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)

    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)

        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)

Merge two large text files by common row to one mapping file

2 Answers2