I have following dataframe in python
meddra_id meddra_label soc cross_ref soc_term
2 10000081 Abdominal pain 10017947 http://snomed.info/id/21522001 Gastrointestinal disorders
3 10017999 Gastrointestinal pain 10017947 http://snomed.info/id/21522001 Gastrointestinal disorders
15 10000340 Abstains from alcohol 10041244 http://snomed.info/id/105542008 Social circumstances
35 10001022 Acute psychosis 10037175 http://snomed.info/id/69322001 Psychiatric disorders
36 10061920 Psychotic disorder 10037175 http://snomed.info/id/69322001 Psychiatric disorders
I would like to aggregate the values in 'meddra_id, meddra_label, soc and soc_term' columns using group by another column 'cross_ref' (and exclude the rows where there is single 'meddra_id' associated to 'cross_ref').
The expected output is:
meddra_id meddra_label soc cross_ref soc_term
10000081,10017999 Abdominal pain,Gastrointestinal pain 10017947 http://snomed.info/id/21522001 Gastrointestinal disorders
10001022,10061920 Acute psychosis,Psychotic disorder 10037175 http://snomed.info/id/69322001 Psychiatric disorders
I am trying following lines of code.
df_terms = df.groupby('cross_ref').filter(lambda g: len(g) > 1).drop_duplicates(subset=['meddra_id', 'meddra_label', 'soc', 'soc_term'], keep="first")
#aggregate the values
df_terms = df_terms.groupby('cross_ref')['meddra_id', 'meddra_label', 'soc', 'soc_term'].agg(' , '.join).reset_index()
When I try to aggegate the value, the column 'soc_term' is not showed up in the new dataframe (df_terms)
Any help is highly appreciated.
CodePudding user response:
Use agg
to join the values in the different columns:
df_grouped = df.groupby('cross_ref') #group as you did
df_filtered = df_grouped.filter(lambda g: len(g['meddra_id'].unique()) > 1) # filter it for single values
df_aggregated = df_filtered.groupby('cross_ref').agg({
'meddra_id': ', '.join,
'meddra_label': ', '.join,
'soc': lambda x: ', '.join(map(str, x)), # convert float values to strings
'soc_term': lambda x: ', '.join(map(str, x)) # convert float values to strings
}).reset_index() #aggregate to join values in the different columns via a comma