0

I have a pandas dataframe, and a list of lists (each sublist has 3 items [name, seq, qual]). I want to check if the first item in the list of lists matches the name column of the df, and if it does I want to create new columns with item[1] and item[2] added to the dataframe.

To set it up:

reads = [['read1', 'ACTG', 'FFFF'], ['read2', 'TTTT', 'FF:F'], ['read3', 'ATGC', 'F:FF']]
df = pd.DataFrame(reads, columns=['ReadName', 'Sequence1', 'Qual1'])
reads2 = [['read3', 'CGCG', 'F::F'], ['read1', 'TGTG', 'F:FF'], ['read2', 'AAAA', 'FFFF']]

What I've tried:

for item in reads2:
    if item[0] in df['ReadName']:
        df['Sequence2'] = item[1]
        df['Qual2'] = item[2]

but the resultant df looks like:

  ReadName Sequence1 Qual1 Sequence2 Qual2
0    read1      ACTG  FFFF      CGCG  F::F
1    read2      TTTT  FF:F      CGCG  F::F
2    read3      ATGC  F:FF      CGCG  F::F

So it is only adding the first item from the list of lists to all the rows in the df. I would expect it to look like:

  ReadName Sequence1 Qual1 Sequence2 Qual2
0    read1      ACTG  FFFF      TGTG  F:FF
1    read2      TTTT  FF:F      AAAA  FFFF
2    read3      ATGC  F:FF      CGCG  F::F
d_kennetz
  • 4,555
  • 5
  • 16
  • 35
  • 2
    Each time the condition `if` is met, you assign the values to the whole columns with e.g. `df['Sequence2'] = item[1]`. what you want to do looks like a merge. try `df.merge(pd.DataFrame(reads2, columns=['ReadName', 'Sequence2', 'Qual2']))` – Ben.T Jun 20 '19 at 19:05
  • 1
    Does `df.merge(pd.DataFrame(reads2, columns=df.columns), on='ReadName')` not do what you want? – cs95 Jun 20 '19 at 19:14
  • 1
    @cs95 It's been a long day at the office. I need to go home. I like the dup target! – d_kennetz Jun 20 '19 at 19:49

2 Answers2

1

use .loc to update row

for item in reads2:

    if item[0] in list(df['ReadName']):

        df.loc[df['ReadName']==item[0],'Sequence2'] = item[1]
        df.loc[df['ReadName']==item[0],'Qual2'] = item[2]
tawab_shakeel
  • 3,345
  • 5
  • 23
1

Use a left join:

df2 = pd.DataFrame(reads2, columns = ['ReadName', 'Sequence2', 'Qual2'])
df.merge(df2, on='ReadName')

Looks like:

    ReadName    Sequence1   Qual1   Sequence2   Qual2
0   read1   ACTG    FFFF    TGTG    F:FF
1   read2   TTTT    FF:F    AAAA    FFFF
2   read3   ATGC    F:FF    CGCG    F::F
Brian Tompsett - 汤莱恩
  • 5,195
  • 62
  • 50
  • 120
ludan
  • 26
  • 5