Pandas Merge to Determine if Value Exists

I have checklist form data coming into a spreadsheet, and for each response I want to determine which values were not checked and report on them. My first thought was to build a master list/DataFrame of all possible form values, then do a left/right merge against each response to find the values that are missing.

Sample data and script below. I expect NaN/NA for the missing Address response in the second submission, giving 6 rows instead of the original 5.

import pandas as pd
df_responses = pd.DataFrame({'Timestamp' : ['2022-09-21 10:39:40.676','2022-09-22 10:28:57.753'],
                            'Email': ['[email protected]', '[email protected]'],
                            'Responses' : [['Name', 'Address', 'City'],['Name','City']]})

df_responses_expanded = df_responses.explode('Responses')

possible_responses = ['Name','Address','City']

df_possible_responses = pd.DataFrame({'responses_all':possible_responses})

df_check_responses = pd.merge(
    df_possible_responses,
    df_responses_expanded,
    how='left',
    left_on='responses_all',
    right_on='Responses',
)

CodePudding user response：

You can do this by constructing your own MultiIndex and performing a .reindex operation.

import pandas as pd

index = pd.MultiIndex.from_product(
    [df_responses.index, possible_responses], names=[None, 'Responses']
)

out = (
    df_responses_expanded
    .set_index('Responses', append=True)
    .reindex(index)
)

print(out)
                           Timestamp        Email
  Responses                                      
0 Name       2022-09-21 10:39:40.676  [email protected]
  Address    2022-09-21 10:39:40.676  [email protected]
  City       2022-09-21 10:39:40.676  [email protected]
1 Name       2022-09-22 10:28:57.753  [email protected]
  Address                        NaN          NaN
  City       2022-09-22 10:28:57.753  [email protected]
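
The reindex approach above can be put together as one self-contained sketch, including one possible way to read off which responses were not checked. The email addresses are placeholders, and the names `missing_pairs` and `checked` are introduced here only for illustration; they are not part of the original answer.

```python
import pandas as pd

df_responses = pd.DataFrame({
    'Timestamp': ['2022-09-21 10:39:40.676', '2022-09-22 10:28:57.753'],
    'Email': ['a@example.com', 'b@example.com'],  # placeholder addresses (assumption)
    'Responses': [['Name', 'Address', 'City'], ['Name', 'City']],
})
possible_responses = ['Name', 'Address', 'City']

# Every (submission row, possible response) combination.
index = pd.MultiIndex.from_product(
    [df_responses.index, possible_responses], names=[None, 'Responses']
)

out = (
    df_responses.explode('Responses')
    .set_index('Responses', append=True)
    .reindex(index)
)

# Rows that reindex had to create contain NaN in every column,
# so a NaN Timestamp marks a response that was not checked.
out['checked'] = out['Timestamp'].notna()  # hypothetical helper column
missing_pairs = out.index[~out['checked'].to_numpy()].tolist()
print(missing_pairs)
```

Each entry in `missing_pairs` is a `(row label, response)` tuple, so the row label can be used to look up the Timestamp and Email of the submission in `df_responses`.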

CodePudding user response：

Here is one way to do it:

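# identify the missing expected values per row and assign them to a separate DataFrame
# (df_responses is the DataFrame from the question)
df2 = df_responses['Responses'].apply(
    lambda x: list(set(possible_responses).difference(x))
).to_frame()

# explode the list values; rows with nothing missing become NaN
df2 = df2.explode('Responses')

# concat the two DataFrames, dropping the rows that had nothing missing
pd.concat([df_responses.explode('Responses'),
           df2.loc[df2['Responses'].notna()]], axis=0)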

    Timestamp                   Email           Responses
0   2022-09-21 10:39:40.676     [email protected]     Name
0   2022-09-21 10:39:40.676     [email protected]     Address
0   2022-09-21 10:39:40.676     [email protected]     City
1   2022-09-22 10:28:57.753     [email protected]     Name
1   2022-09-22 10:28:57.753     [email protected]     City
1          NaN                  NaN             Address
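
For comparison, here is a minimal merge-based sketch that stays closer to the idea in the question. It is not taken from either answer. It assumes pandas 1.2 or later (for `how='cross'`), uses placeholder email addresses, and introduces the names `submissions`, `expected`, `actual`, `checked` and `unchecked` purely for illustration.

```python
import pandas as pd

df_responses = pd.DataFrame({
    'Timestamp': ['2022-09-21 10:39:40.676', '2022-09-22 10:28:57.753'],
    'Email': ['a@example.com', 'b@example.com'],  # placeholder addresses (assumption)
    'Responses': [['Name', 'Address', 'City'], ['Name', 'City']],
})
possible_responses = ['Name', 'Address', 'City']

# One row per submission, with the original row label kept as a key column.
submissions = (
    df_responses.drop(columns='Responses')
    .rename_axis('submission')
    .reset_index()
)

# Every (submission, possible response) pair.
expected = submissions.merge(
    pd.DataFrame({'Responses': possible_responses}), how='cross'
)

# The responses that were actually checked, one per row.
actual = (
    df_responses['Responses'].explode()
    .rename_axis('submission')
    .reset_index()
)

# A left merge with indicator=True marks pairs with no match as 'left_only'.
checked = expected.merge(
    actual, on=['submission', 'Responses'], how='left', indicator=True
)
checked['checked'] = checked['_merge'].eq('both')

unchecked = checked.loc[~checked['checked'], ['submission', 'Email', 'Responses']]
print(unchecked)
```

Unlike the concat approach, the unchecked rows here keep the Timestamp and Email of the submission they belong to, which may be convenient when reporting on them.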