
I have a text file like this:

444537110                         3 11112111022002200022022111121222002...

The final field in the input file is 50k characters long and only ever contains the characters 0, 1, and 2. I want a one-hot-encoded version of this final field, so my expected result is a dataframe like this:

id          chip   g1_0 g1_1 g1_2 g2_0 g2_1 g2_2 g3_0 g3_1 g3_2 g4_0 ... 
444537110   3      0    1    0    0    1    0    0    1    0    0

I have created an initial dataframe by reading in the input file:

df = pd.read_csv('test.txt', index_col=0, sep='\s+', header=None, names = ['chip', 'genos'])

This creates a dataframe with 3 columns as:

id        chip  genos
444537110    3  1111211102200220000022022111121222000200022002...

I thought I might be able to create individual columns using something like the code below and then use the pandas get_dummies function for the one-hot encoding, but I have been unable to create the individual columns. I have tried

[c for c in df['genos'].str]

but this does not separate the characters.

I have looked at a similar question and answer here: How can I one hot encode in Python?

but that only deals with the one-hot encoding and does not deal with the added complication of splitting a very large column.

daragh
  • guessing you might need `df['genos'].str.get_dummies()`, not sure with the data provided though – anky Jun 24 '19 at 09:18
  • Just tried that suggestion and it didn't work. It returned a dataframe with one column with the genos as the column title and just one value – daragh Jun 24 '19 at 09:24
  • @daragh could you pretend your last column is only 3 chars long instead of 50k and then post some multi-line sample inputs and the full desired OHE output? This will make your intentions much clearer. Because as it stands, it sounds like you want to OHE a field that could contain 3 to the power of 50k different values, which seems like a bad idea (i.e. waaaaay too many columns to be useful) – Dan Jun 24 '19 at 09:35
  • @Dan I am feeding the resulting dataframe to a neural network so I do expect 150k columns – daragh Jun 24 '19 at 09:46
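Taking up Dan's suggestion, here is a minimal, hypothetical 3-character version of the field (the data and names are made up) encoded by splitting each string into a list and calling get_dummies:

```python
import pandas as pd

# Hypothetical 3-character version of the 'genos' field, per @Dan's comment.
df = pd.DataFrame({'chip': [3], 'genos': ['102']},
                  index=pd.Index([444537110], name='id'))

# Split each string into one column per character, then one-hot encode.
df1 = pd.DataFrame([list(x) for x in df['genos']], index=df.index).add_prefix('g')
out = pd.get_dummies(df1)
print(out)
```

Note that for this single row get_dummies only emits the categories actually present (g0_1, g1_0, g2_2); the reindex step in the first answer below is what fills in the missing combinations.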

3 Answers


First create a DataFrame by converting each string to a list, then call get_dummies:

df1 = pd.DataFrame([list(x) for x in df['genos']], index=df.index).add_prefix('g')
df2 = pd.get_dummies(df1)

If you need to add missing columns to the original (it is possible some combinations are absent), use DataFrame.reindex: split the column names on _ and build all combinations with MultiIndex.from_product:

df1 = pd.DataFrame([list(x) for x in df.pop('genos')], index=df.index).add_prefix('g')
df2 = pd.get_dummies(df1)

splitted = df2.columns.str.split('_')
df2.columns = [splitted.str[0].str[1:].astype(int) + 1, splitted.str[1].astype(int)]

mux = pd.MultiIndex.from_product([df2.columns.levels[0], [0, 1, 2]])
df2 = df2.reindex(mux, axis=1, fill_value=0)
df2.columns = [f'g{a}_{b}' for a, b in df2.columns]
print (df2)
   g1_0  g1_1  g1_2  g2_0  g2_1  g2_2  g3_0  g3_1  g3_2  g4_0  ...  g32_2  \
0     0     1     0     0     1     0     0     1     0     0  ...      1   

   g33_0  g33_1  g33_2  g34_0  g34_1  g34_2  g35_0  g35_1  g35_2  
0      1      0      0      1      0      0      0      0      1  

[1 rows x 105 columns]
jezrael
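Since the real field is 50k characters, a vectorised NumPy variant of the split step may be worth considering. This is only a sketch, assuming the string is pure ASCII digits; the names `codes` and `onehot` are mine, not from the answer:

```python
import numpy as np

# Interpret the digit string as ASCII bytes, subtract ord('0') to get the
# integer codes, then compare against 0/1/2 to build the one-hot matrix
# in a single vectorised step.
s = '11102'
codes = np.frombuffer(s.encode('ascii'), dtype=np.uint8) - ord('0')
onehot = (codes[:, None] == np.arange(3)).astype(np.uint8)  # shape (len(s), 3)
flat = onehot.ravel()  # ordered g1_0, g1_1, g1_2, g2_0, ...
```

The flat array can then be wrapped in a one-row DataFrame with whatever column labels you need.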

Bearing in mind @Dan's comment on your question regarding the fact that you would end up with 50k*3 columns, you could get your desired output by doing the following:

string ="444537110 3 11112111022002200022022111121222002"
df = pd.DataFrame([string.split(" ")],columns=['id','chip','genos'])
max_number_of_genes = int(df.genos.str.len().max())

#Create columns 
for gene in range(1,max_number_of_genes+1):
    for y in range(4):
        df['g{}_{}'.format(gene, y)] = 0

#Iterating over genos values 
for row_number, row in df.iterrows():
    genos = [int(x) for x in df.iloc[row_number, 2]]
    for gene_number, gene in enumerate(genos):     
        df.loc[row_number, 'g{}_{}'.format(gene_number+1, gene)] = 1 

print(df)

Output

+----+------------+-------+--------------------------------------+-------+-------+-------+-------+-------+-------+-------+------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+
|    |    id      | chip  |                genos                 | g1_0  | g1_1  | g1_2  | g1_3  | g2_0  | g2_1  | g2_2  | ...  | g33_2  | g33_3  | g34_0  | g34_1  | g34_2  | g34_3  | g35_0  | g35_1  | g35_2  | g35_3 |
+----+------------+-------+--------------------------------------+-------+-------+-------+-------+-------+-------+-------+------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+
| 0  | 444537110  |    3  | 11112111022002200022022111121222002  |    0  |    1  |    0  |    0  |    0  |    1  |    0  | ...  |     0  |     0  |     1  |     0  |     0  |     0  |     0  |     0  |     1  |     0 |
+----+------------+-------+--------------------------------------+-------+-------+-------+-------+-------+-------+-------+------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-------+
Sebastien D

If you're only splitting 50k characters, you could go raw Python (for readability):

>>> a,b,c = zip(*[{0:(1,0,0),1:(0,1,0),2:(0,0,1)}[int(c)] for c in df['genos'][0]])
>>> a,b,c
((0, 0, 0, 0, 0, 0, ...), (1, 1, 1, 1, 0, 1, ...), (0, 0, 0, 0, 1, 0, ...))
Jonas Byström
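If you go this route, the three tuples can be interleaved back into the g1_0, g1_1, g1_2, ... layout the question asks for. A hypothetical sketch with a toy 4-character input (the variable names are mine):

```python
import pandas as pd

genos = '1102'  # toy stand-in for the 50k-character field
a, b, c = zip(*[{0: (1, 0, 0), 1: (0, 1, 0), 2: (0, 0, 1)}[int(ch)] for ch in genos])

# zip(a, b, c) restores one (is_0, is_1, is_2) triple per character,
# which flattens into the g1_0, g1_1, g1_2, g2_0, ... column order.
row = [v for triple in zip(a, b, c) for v in triple]
cols = [f'g{i}_{j}' for i in range(1, len(genos) + 1) for j in range(3)]
out = pd.DataFrame([row], columns=cols)
print(out)
```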