I have created a dictionary of contigs and their lengths from file1. I also have file2, which is BLAST output in tabular format; it contains alignments for some (but not all) of the contigs, plus additional information such as where each match starts and ends. To calculate query and subject coverage, I need to associate the lengths from file1 with the corresponding contigs in file2. How can I do that? Thanks.
An example of input and desired output, please. – Guy Gavriely Jan 22 '14 at 17:19
It would be great if you could post samples of your file1 and file2 to give a better idea. – Ashoka Lella Jan 22 '14 at 17:33
2 Answers
1
Assuming file1 is:
contig1 134
contig2 354
contig3 345
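Your script would look like:
import re

# Build a lookup of contig name -> length from file1 ("name length" per line).
contigDict = {}
with open('file1') as c1:
    for line in c1:
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split()
        contigDict[key] = value

# Collect every contig name mentioned anywhere in file2.
with open('file2') as c2:
    scrambled_text = c2.read()
contigs = re.findall(r'contig\d+', scrambled_text)

# Keep only the contigs whose length is known from file1.
output = {}
for contig in contigs:
    if contig in contigDict:
        output[contig] = contigDict[contig]

# Write "name<TAB>length" for each contig found.
with open('file3', 'w') as out:
    for key in output:
        out.write(key + '\t' + output[key] + '\n')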
Ashoka Lella
Thank you very much, but maybe I didn't express myself well, so I'll try again. File 1 is a list of contigs and their lengths: contig1 134, contig2 354, contig3 345, ... contig200000 320. File 2 contains contigs without lengths, unordered and repeated, e.g. contig3, contig3, contig4, contig7, contig65, contig65, and so on. I want to retrieve the lengths from file 1 and associate each one with the corresponding contig in file 2. – user3224522 Jan 22 '14 at 18:15
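To keep every row of file 2 in its original order, repeats included, one option is to look each name up as it is read instead of collecting the names into a dictionary first. This is only a minimal sketch, assuming the contig name is the first whitespace-separated field on each line of file 2. The function name `annotate_lengths`, the `NA` placeholder for contigs missing from file 1, and the sample lines are made up for illustration:

```python
def annotate_lengths(length_lines, name_lines, missing="NA"):
    """Return 'name<TAB>length' for every name, keeping order and repeats."""
    # Map contig name -> length (kept as a string) from "name length" lines.
    lengths = {}
    for line in length_lines:
        fields = line.split()
        if len(fields) >= 2:
            lengths[fields[0]] = fields[1]

    annotated = []
    for line in name_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[0]
        # Hypothetical choice: unknown contigs get a placeholder, not an error.
        annotated.append(name + "\t" + lengths.get(name, missing))
    return annotated


# Illustrative data only; with real files, pass the open file objects instead.
file1_lines = ["contig1 134", "contig2 354", "contig3 345"]
file2_lines = ["contig3", "contig3", "contig1", "contig7"]
result = annotate_lengths(file1_lines, file2_lines)
for row in result:
    print(row)
```

Because the lookup happens once per input row, a contig that appears twice in file 2 is written twice, which a dictionary keyed on the contig name would not do.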
What do you mean by last result only? Isn't it iterating over the whole document? – Ashoka Lella Jan 23 '14 at 11:07
For some reason it didn't iterate, but I have it working now. Thank you, it works perfectly! One more question: if instead of 'contig' I have protein names such as tr|B5TK38|B5TK38_TRIDB (different for each protein, obviously), how can I search for them with re.findall? Is that possible? – user3224522 Jan 23 '14 at 11:28
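On the protein-name question, `re.findall` can match such identifiers as long as the `|` characters are escaped, since an unescaped `|` means "or" in a regular expression. The sketch below assumes every name has the shape `tr|ACCESSION|ENTRY_NAME` or `sp|ACCESSION|ENTRY_NAME`, generalised from the single example in the comment; the pattern and the sample line are illustrative, not a complete description of real identifiers. If the name is always the first tab-separated column, splitting the line avoids the regex entirely:

```python
import re

# Assumed ID shape: db|ACCESSION|ENTRY_NAME, e.g. tr|B5TK38|B5TK38_TRIDB.
# Each '|' is escaped because a bare '|' is the alternation operator.
PROTEIN_ID = re.compile(r'(?:sp|tr)\|[A-Z0-9]+\|[A-Z0-9]+_[A-Z0-9]+')

# Made-up tab-separated line with the protein name in the first column.
line = "tr|B5TK38|B5TK38_TRIDB\tcontig3\t98.5\t120"

# Option 1: pull every matching identifier out of the line.
ids_by_regex = PROTEIN_ID.findall(line)

# Option 2: no regex, just take the first tab-separated column as the key.
id_by_split = line.rstrip("\n").split("\t")[0]

print(ids_by_regex)
print(id_by_split)
```

Either result can then be used as the dictionary key in place of the `contig\d+` matches.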
0
This is working:
import re

# Build a lookup of subject name -> length from file1.txt.
subjectDict = {}
with open('file1.txt') as c1:
    for line in c1:
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split()
        subjectDict[key] = value

# For each row of file2.txt, take the name in the first column and write
# its length to result.txt (a KeyError is raised if it is not in file1.txt).
with open('file2.txt') as c2, open('result.txt', 'w') as r:
    for line in c2:
        new_list = re.split(r'\t+', line)
        s_name = new_list[0]
        subjects = re.findall(r'contig\d+', s_name)
        for subject in subjects:
            r.write(subjectDict[subject] + '\n')
user3224522
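Once the lengths are available, the coverage calculation mentioned in the question could look roughly like the sketch below. It assumes file2 uses the default 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that coverage is wanted per alignment row rather than merged across several hits to the same sequence. The function name `hsp_coverage`, the two length dictionaries, and the sample row are made up for illustration:

```python
def hsp_coverage(row, query_lengths, subject_lengths):
    """Return (query_cov, subject_cov) in percent for one tabular BLAST row."""
    fields = row.rstrip("\n").split("\t")
    qseqid, sseqid = fields[0], fields[1]
    # Assumed column positions: qstart, qend, sstart, send are columns 7-10.
    qstart, qend = int(fields[6]), int(fields[7])
    sstart, send = int(fields[8]), int(fields[9])
    # abs() because start can exceed end for hits on the reverse strand.
    q_aligned = abs(qend - qstart) + 1
    s_aligned = abs(send - sstart) + 1
    query_cov = 100.0 * q_aligned / query_lengths[qseqid]
    subject_cov = 100.0 * s_aligned / subject_lengths[sseqid]
    return query_cov, subject_cov


# Illustrative values only; lengths read from file1 need int() first.
query_lengths = {"contig1": 200}
subject_lengths = {"subjA": 400}
row = "contig1\tsubjA\t98.00\t100\t2\t0\t1\t100\t300\t201\t1e-50\t180"
q_cov, s_cov = hsp_coverage(row, query_lengths, subject_lengths)
print(q_cov, s_cov)
```

If file2 was produced with a custom column order, the indices would need to be adjusted to match.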