4

I'm writing a python script and want to pass the query sequence information into blastn as a string variable rather than a FASTA format file if possible.

I used Biopython's SeqIO to store several transcript names as key and its sequences as the associated value.

So it looks something like this

transcripts = dict()
for record in SeqIO.parse("transcript_sequences.fasta", "fasta"):
transcripts[record.name] = record.seq

print transcripts

So the dictionary looks like this

{'var_F': Seq('CTTCATTCTCGTTTAGCGGCTGCTCGTGGAAATTTCGAAAAAATCTGAAACTAG...TGC', SingleLetterAlphabet())}

Now I want to parse in the sequence information within the dictionary into the blast query and subject.

subprocess.call("blastn -query " + transcript['var_F'] + ".txt" + " -subject " + transcript['var_B'] + " -outfmt 6 > tmp_blast.txt", shell=True)  

I know blast only takes in either filename or string for the file location, but I just wanted to know if there is a workaround on this.

Thanks in advance for reading my first question :P

2 Answers2

4

From the BioPython module for BLAST:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc90

It would appear that you are correct in that biopython BLAST does not support SeqIO object or biological sequence as parameter for BLAST function call, or as you perform with subprocess.call() of the BLAST binary. The only input sequence parameter accepted is a file name. From the tutorial:

>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> help(NcbiblastxCommandline)
...
>>> blastx_cline = NcbiblastxCommandline(query="opuntia.fasta", db="nr", evalue=0.001,
...                                      outfmt=5, out="opuntia.xml")
>>> blastx_cline
NcbiblastxCommandline(cmd='blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta',
db='nr', evalue=0.001)
>>> print(blastx_cline)
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
>>> stdout, stderr = blastx_cline()

So, your only option is to use the actual FASTA file as input. If you want to query one sequence at a time, you would need to save each sequence to a file. However, I would recommend against this, unless you have reason to do so. I think BLAST might perform faster if all query sequences are in the same file. Also you can read the resulting BLAST output using BioPython to iterate over the results for each query, see:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc92

Example taken from the above link:

If instead you ran BLAST some other way, and have the BLAST output (in XML format) in the file my_blast.xml, all you need to do is to open the file for reading:

>>> result_handle = open("my_blast.xml")
>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)

or, if you have lots of results (i.e., multiple query sequences):

>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)

Just like Bio.SeqIO and Bio.AlignIO (see Chapters 5 and 6), we have a pair of input functions, read and parse, where read is for when you have exactly one object, and parse is an iterator for when you can have lots of objects – but instead of getting SeqRecord or MultipleSeqAlignment objects, we get BLAST record objects.

To be able to handle the situation where the BLAST file may be huge, containing thousands of results, NCBIXML.parse() returns an iterator. In plain English, an iterator allows you to step through the BLAST output, retrieving BLAST records one by one for each BLAST search result:

>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
# No further records
Vince
  • 3,080
  • 2
  • 20
  • 40
1

You can pass the sequence as a string in fasta format to standard input when you run the blast command.

query_string = '>' + record.description + '\n' + str(record.seq)
blast_command = Bio.Blast.Applications.NcbiblastnCommandline(cmd='blastn', out=record.id + '.xml', outfmt=5, db=database_name, evalue=0.001)
stdout, stderr = blast_command(stdin=query_string)