I have a text file like this:
444537110 3 11112111022002200022022111121222002...
The final field in the input file is 50k characters long and only ever contains the characters 0, 1 or 2. I want a one-hot encoded version of this final field, so my expected result is a dataframe like this:
id chip g1_0 g1_1 g1_2 g2_0 g2_1 g2_2 g3_0 g3_1 g3_2 g4_0 ...
444537110 3 0 1 0 0 1 0 0 1 0 0
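To make the target layout concrete, here is a minimal pure-Python sketch of how one genotype string maps to the g<position>_<value> columns. The 4-character string and the helper name `one_hot_row` are made up for illustration; the real field is 50k characters long.

```python
# Toy illustration of the target column layout; not meant for the full 50k-character field.
def one_hot_row(genos, values="012"):
    """Return (column_names, flags) for a single genotype string."""
    names, flags = [], []
    for position, char in enumerate(genos, start=1):
        for value in values:
            names.append(f"g{position}_{value}")
            flags.append(1 if char == value else 0)
    return names, flags


names, flags = one_hot_row("1112")  # made-up 4-character example
print(names)
print(flags)
```

Each character expands into three flags, exactly one of which is 1, so a string of length n yields 3n columns.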
I have created an initial dataframe by reading in the input file:
df = pd.read_csv('test.txt', index_col=0, sep=r'\s+', header=None, names=['chip', 'genos'])
This creates a dataframe with the first field (id) as the index and two columns, chip and genos:
id chip genos
444537110 3 1111211102200220000022022111121222000200022002...
I thought I could first split the string into individual columns, one per character, and then use the pandas get_dummies function for the one-hot encoding, but I have been unable to create the individual columns. I have tried:
[c for c in df['genos'].str]
but this does not separate the characters.
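For reference, a minimal sketch of one way the split and the encoding might be done on toy data. It assumes every string has the same length and contains only 0, 1 and 2. The frame `toy`, its made-up short strings, and the second id are hypothetical stand-ins for the real data. Note that it swaps get_dummies for NumPy indexing into an identity matrix, which avoids creating one intermediate column per character.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: short made-up strings instead of 50k characters.
toy = pd.DataFrame(
    {"chip": [3, 3], "genos": ["1102", "2011"]},
    index=pd.Index([444537110, 444537111], name="id"),
)

# Split each string into single characters and convert them to integers.
# Result: one row per id, one column per position (requires equal-length strings).
codes = np.array([list(s) for s in toy["genos"]]).astype(int)

# Row k of the 3x3 identity matrix is the one-hot vector for value k,
# so indexing with `codes` gives an array of shape (rows, positions, 3).
flags = np.eye(3, dtype=np.uint8)[codes]

# Flatten to (rows, positions * 3) and name the columns g<position>_<value>.
cols = [f"g{i}_{v}" for i in range(1, codes.shape[1] + 1) for v in range(3)]
onehot = pd.DataFrame(flags.reshape(len(toy), -1), index=toy.index, columns=cols)

result = toy[["chip"]].join(onehot)
print(result)
```

With 50k positions this produces 150k columns per row, so a small dtype such as uint8 is probably worth keeping; whether the full result fits in memory depends on the number of rows.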
I have looked at a similar question and answer here: How can I one hot encode in Python?
However, it only covers the one-hot encoding itself and not the added complication of splitting a very long string column into individual characters.