3

I have a survey to analyze that was completed by participants on SurveyMonkey. Unfortunately, the way the data are organized is not ideal, in that each categorical response for each question has its own column.

Here, for example, are the first few lines of one of the responses in the dataframe:

     How long have you been participating in the Garden Awards Program?  \
0                                           One year                   
1                                                NaN                   
2                                                NaN                   
3                                                NaN                   
4                                                NaN                   

  Unnamed: 10 Unnamed: 11      Unnamed: 12  \
0   2-3 years   4-5 years  5 or more years   
1         NaN         NaN              NaN   
2         NaN   4-5 years              NaN   
3   2-3 years         NaN              NaN   
4         NaN         NaN  5 or more years   

  How did you initially learn of the Garden Awards Program?  \
0              I nominated my garden to be evaluated          
1                                                NaN          
2              I nominated my garden to be evaluated          
3                                                NaN          
4                                                NaN          

                                         Unnamed: 14  etc...
0  A friend or family member nominated my garden ...  
1  A friend or family member nominated my garden ...  
2                                                NaN  
3                                                NaN  
4                                                NaN  

This question, How long have you been participating in the Garden Awards Program?, has valid responses: one year, 2-3 years, etc., and are all found on the first row as a key to which column holds which value. This is the first problem. (Similarly for How did you initially learn of the Garden Awards Program?, where valid responses are: I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.).

The second problem is that the attached columns for each categorical response all are Unnamed: N, where N is as many columns as there are categories associated with all questions.

Before I start remapping and flattening/collapsing the columns into a single one per question, I was wondering if there was any other way of dealing with survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.

I am guessing that I will need to flatten the columns, and thus, if anyone could suggest a method, that would be great. I'm thinking that there is a way to keep grabbing all columns belonging to a categorical response by grabbing an adjacent column until Unnamed is no longer in the column name, but I am clueless how to do this.

horcle_buzz
  • 1,741
  • 1
  • 20
  • 45
  • 1
    Could you post the first few lines of the DataFrame as text instead of an image? – Peter Leimbigler Mar 30 '18 at 20:35
  • Done as requested. A major issue is that there are a lot of these grouped categorical response columns that each span several columns per question and thus remapping these would be a royal PITA. – horcle_buzz Mar 30 '18 at 22:45

1 Answers1

4

I will use the following DataFrame (which can be downloaded as CSV from here):

     Q1 Unnamed: 2 Unnamed: 3    Q2 Unnamed: 5 Unnamed: 6    Q3 Unnamed: 7 Unnamed: 8
0  A1-A       A1-B       A1-C  A2-A       A2-B       A2-C  A3-A       A4-B       A3-C
1  A1-A        NaN        NaN   NaN       A2-B        NaN   NaN        NaN       A3-C
2   NaN       A1-B        NaN  A2-A        NaN        NaN   NaN       A4-B        NaN
3   NaN        NaN       A1-C   NaN       A2-B        NaN  A3-A        NaN        NaN
4   NaN       A1-B        NaN   NaN        NaN       A2-C   NaN        NaN       A3-C
5  A1-A        NaN        NaN   NaN       A2-B        NaN  A3-A        NaN        NaN

Key assumptions:

  1. Every column whose name DOES NOT start with Unnamed is actually the title of a question
  2. The columns between question titles represent options for the question on the left end of the column interval

Solution overview:

  1. Find indices of where each question starts and ends
  2. Flatten each question to a single column (pd.Series)
  3. Merge the question columns back together

Implementation (part 1):

indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

You can see that iterating for the over the slices like below you get a single DataFrame corresponding to each question:

for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter

Implementation (part 2-3):

def parse_response(s):
    try:
        return s[~s.isnull()][0]
    except IndexError:
        return np.nan

data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions

Output:

     Q1    Q2    Q3
1  A1-A  A2-B  A3-C
2  A1-B  A2-A  A4-B
3  A1-C  A2-B  A3-A
4  A1-B  A2-C  A3-C
5  A1-A  A2-B  A3-A
Gustavo Bezerra
  • 7,184
  • 1
  • 32
  • 41
  • Nice! I was thinking along these lines, but this is perfect! – horcle_buzz Mar 31 '18 at 14:59
  • Only issue is that assumption 3 does not necessarily hold. I am getting an error of `('index out of bounds', 'occurred at index 1')` in the first step of your `implementation (part 2-3)` on both my data set and can reproduce in yours, when all values for a response in a row are NaN, IOW, there is no non-null response for a particular survey item on a particular survey. – horcle_buzz Mar 31 '18 at 15:48
  • I'll try using an `np.where` clause in the step where you do the list comprehension, so that if all items are `null`, then I will set the value to something like `no response` else then it would be `~s.isnull()][0]` or something along those lines... – horcle_buzz Mar 31 '18 at 16:03
  • 1
    @horcle_buzz: Instead of using `np.where`, you can just make a more generic function. I edited the post, replacing the lambda function with a regular named function `parse_response` that covers the case where no valid response is available. – Gustavo Bezerra Apr 01 '18 at 23:59
  • Gustavo Bezzerra, thanks for this. I did come up with a completely different solution that I will post later (not using `np.where`). – horcle_buzz Apr 02 '18 at 00:37
  • Given that your modification is a ton faster than mine, I will leave yours as the sole answer. In a nutshell, what I did was to create a list of dataframes, iterate through each dataframe looking for rows where all column values are all null using `nans = lambda df: df[df.isnull().all(axis=1)]` and then update the first column/row cell, by index of null row, to 'No data. It was very inefficient, by `O(n^3)` – horcle_buzz Apr 02 '18 at 16:45