-1

I have built a dictionary of contigs and their lengths from file1. I also have file2, which is BLAST output in tabular format. It contains alignments for some (but not all) of those contigs, plus additional information such as where each match starts and ends. To calculate query and subject coverage, I need to attach the lengths from file1 to the corresponding contigs in file2. How can I do that? Thanks.

user3224522

2 Answers

1

Assuming file1 is:

contig1 134
contig2 354
contig3 345

Your script would look like

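import re

# Build a contig -> length lookup from file1 (one "name length" pair per line).
contigDict = {}
with open('file1') as c1:
    for line in c1:
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split()
        contigDict[key] = value

with open('file2') as c2:
    scrambled_text = c2.read()

# Collect every contig name mentioned in file2; repeats collapse in the dict.
contigs = re.findall(r'contig\d+', scrambled_text)
output = {}
for contig in contigs:
    output[contig] = contigDict[contig]

# Write one "name<TAB>length" line per contig.
with open('file3', 'w') as w:
    for key in output:
        w.write(key + '\t' + output[key] + '\n')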
Ashoka Lella
  • Thank you very much, but maybe I didn't express myself well, so let me try again. File 1 is a list of contigs and their lengths: contig1 134, contig2 354, contig3 345, ... contig200000 320. File 2 contains contigs without lengths, unordered and repeated, e.g. contig3, contig3, contig4, contig7, contig65, contig65 and so on. I want to retrieve the lengths from file 1 and attach each one to the corresponding contig in file 2. – user3224522 Jan 22 '14 at 18:15
  • What do you mean by last result only? Isn't it iterating over the whole document? – Ashoka Lella Jan 23 '14 at 11:07
  • For some reason it didn't iterate, but I have it working now. Thank you, it works perfectly! One more question: if instead of 'contig' I have a protein name, e.g. tr|B5TK38|B5TK38_TRIDB (different for each protein, obviously), how can I search for it with re.findall? Is it possible? – user3224522 Jan 23 '14 at 11:28
  • Sure, google for python regex searching – Ashoka Lella Jan 23 '14 at 11:44
  • Perfect, thanks a lot! – user3224522 Jan 23 '14 at 11:58
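
On the protein-name follow-up: the pipe characters in a name like tr|B5TK38|B5TK38_TRIDB are regex metacharacters (alternation), so they have to be escaped as \| in the pattern. Below is a minimal sketch, assuming every name has the same three-part shape as that one example (a two-letter lowercase prefix, then two fields made of letters, digits and underscores). The pattern and the helper name find_protein_ids are illustrative only; adjust them if your names look different.

```python
import re

# Assumed shape, taken from the single example tr|B5TK38|B5TK38_TRIDB:
# two lowercase letters, a literal pipe, a field, a literal pipe, a field.
PROTEIN_ID = re.compile(r'[a-z]{2}\|\w+\|\w+')


def find_protein_ids(text):
    """Return every protein-style name found in text, in order, repeats included."""
    return PROTEIN_ID.findall(text)


if __name__ == '__main__':
    row = 'tr|B5TK38|B5TK38_TRIDB\t98.5\t120'
    print(find_protein_ids(row))
```

If the name is always the whole first tab-separated column, splitting the line on tabs and using the first field as the dictionary key avoids the regex altogether.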
0

This works:

import re

# Build a name -> length lookup from file1.txt.
subjectDict = {}
with open('file1.txt') as c1:
    for line in c1:
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split()
        subjectDict[key] = value

# For each row of file2.txt, look up the contig named in the first
# column and write its length, keeping the order and repeats of file2.
with open('file2.txt') as c2, open('result.txt', 'w') as r:
    for line in c2:
        new_list = re.split(r'\t+', line)
        s_name = new_list[0]
        subjects = re.findall(r'contig\d+', s_name)
        for subject in subjects:
            r.write(subjectDict[subject] + '\n')
user3224522
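
For the coverage calculation mentioned in the question, the same lookup can be done per row without a regex. This is only a sketch under assumptions the thread does not confirm: that file2 uses the default 12-column BLAST tabular layout (query name in column 1, query start and end in columns 7 and 8), and that the lengths in file1 belong to the query sequences. The function names load_lengths and annotate are made up for illustration.

```python
def load_lengths(lines):
    """Build {name: length} from 'name length' lines, skipping blank lines."""
    lengths = {}
    for line in lines:
        if line.strip():
            name, length = line.split()
            lengths[name] = int(length)
    return lengths


def annotate(blast_lines, lengths):
    """Append the query length and query coverage (%) to each BLAST row.

    Assumes the default tabular column order, so fields[6] and fields[7]
    are the query start and end. Rows whose name is missing from
    `lengths` are skipped.
    """
    rows = []
    for line in blast_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            continue  # not a full alignment row
        qlen = lengths.get(fields[0])
        if qlen is None:
            continue  # no length known for this name
        qstart, qend = int(fields[6]), int(fields[7])
        coverage = 100.0 * (abs(qend - qstart) + 1) / qlen
        rows.append(fields + [str(qlen), f'{coverage:.1f}'])
    return rows


if __name__ == '__main__':
    with open('file1.txt') as f1:
        lengths = load_lengths(f1)
    with open('file2.txt') as f2, open('result.txt', 'w') as out:
        for row in annotate(f2, lengths):
            out.write('\t'.join(row) + '\n')
```

Subject coverage would follow the same pattern with the subject name (column 2) and the subject start and end (columns 9 and 10), given a length table for the subject sequences.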