Home > Net >  How to clean survey data in pandas
How to clean survey data in pandas

Time:07-17

Input:

enter image description here Output:

enter image description here

here's the data:

d = {'Morning': ["Didn't answer", "Didn't answer", "Didn't answer", 'Morning', "Didn't answer"], 'Afternoon': ["Didn't answer", 'Afternoon', "Didn't answer", 'Afternoon', "Didn't answer"], 'Night': ["Didn't answer", 'Night', "Didn't answer", 'Night', 'Night'], 'Sporadic': ["Didn't answer", "Didn't answer", 'Sporadic', "Didn't answer", "Didn't answer"], 'Constant': ["Didn't answer", "Didn't answer", "Didn't answer", 'Constant', "Didn't answer"]}

         Morning      Afternoon          Night       Sporadic       Constant
0  Didn't answer  Didn't answer  Didn't answer  Didn't answer  Didn't answer
1  Didn't answer      Afternoon          Night  Didn't answer  Didn't answer
2  Didn't answer  Didn't answer  Didn't answer       Sporadic  Didn't answer
3        Morning      Afternoon          Night  Didn't answer       Constant
4  Didn't answer  Didn't answer          Night  Didn't answer  Didn't answer

I want the output to be:

d = {"Time of the day": ["Didn't answer", "['Afternoon', 'Night']", "Sporadic", "['Morning', 'Afternoon', 'Night', 'Constant']", "Night"]}

                                 Time of the day
0                                  Didn't answer
1                         ['Afternoon', 'Night']
2                                       Sporadic
3  ['Morning', 'Afternoon', 'Night', 'Constant']
4                                          Night

so if there's no answer in every column in a row, the value in the new data frame would be "Didn't answer" and if there's at least one answer like "night", the value in the new data frame would be "night" and if there are multiple answers like " Morning", "Night" the value in the new data frame would be a list of the answers

CodePudding user response:

There are probably prettier ways to do this but this would be one way:

def extract_valid_cols(row):
    # collect all non-"Didn't answer"-values in row
    valid_cols = [col for col in row if col != "Didn't answer"]  
    # return those, except when there are none, then return "Didn't answer"
    return valid_cols if valid_cols else ["Didn't answer"]

# apply the defined function row-wise
df["Time of the day"] = df.apply(extract_valid_cols, axis=1)

>          Morning  ...                        Time of the day
  0  Didn't answer  ...                        [Didn't answer]
  1  Didn't answer  ...                     [Afternoon, Night]
  2  Didn't answer  ...                             [Sporadic]
  3        Morning  ...  [Morning, Afternoon, Night, Constant]
  4  Didn't answer  ...                                [Night]

Putting everything in lists for consistency but if you really want single values then you can add a check if length of list is 1 and if it is just return that item.

CodePudding user response:

You can use:

df["ToD"] = (df.replace("Didn't answer", np.nan).stack().groupby(level=0)
               .apply(lambda x: [i for i in x] if len(x) > 1 else x.iloc[0])
               .reindex(df.index, fill_value="Didn't answer"))

Output:

>>> df["ToD"]
0                            Didn't answer
1                       [Afternoon, Night]
2                                 Sporadic
3    [Morning, Afternoon, Night, Constant]
4                                    Night
Name: ToD, dtype: object

CodePudding user response:

here is one way

data = pd.DataFrame(d)

def summarize(x):
    if len(x.unique())==1:
        if x.unique()==["Didn't answer"]:
            return "Didn't answer"
    else:
        x = list(x.unique())
        x.remove("Didn't answer")
        return x

data.apply(summarize, axis  = 1)

Then you get:

0                            Didn't answer
1                       [Afternoon, Night]
2                               [Sporadic]
3    [Morning, Afternoon, Night, Constant]
4                                  [Night]
dtype: object

CodePudding user response:

You can use pd.apply to get the data in the format you expect :

def get_values(row):
    row_set = set(row.values)
    if len(row_set) == 1:
        return (list(row_set)[0])
    elif("Didn't answer" in row_set):
        row_set.remove("Didn't answer")
        if(len(row_set) == 1):
            return (list(row_set)[0])
    return(list(row_set))

df["Time of the day"] = df.apply(get_values,axis=1)

This gives us the expected output :

0                            Didn't answer
1                       [Night, Afternoon]
2                                 Sporadic
3    [Night, Morning, Afternoon, Constant]
4                                    Night
  • Related