Pandas dataframe: group by some columns and create lists from the remaining ones

My sample dataframe contains:

id1	id2	text	metadata
HP:001	DO:001	text_1	metadata_1
HP:001	DO:001	text_2	metadata_2

I am trying to generate a new table, grouped by id1 and id2, where text and metadata will contain a list of terms from their respective records:

id1	id2	text	metadata
HP:001	DO:001	['text_1', 'text_2']	['metadata_1', 'metadata_2']

I tried groupby with apply (intending to call reset_index afterwards):

import pandas as pd

df = pd.DataFrame(data={"id1": ["HP:001", "HP:001"],
                   "id2": ["DO:001", "DO:001"],
                   "text": ["text_1", "text_2"],
                   "metadata": ["metadata_1", "metadata_2"]})

outcome = df.groupby(["id1", "id2"]).apply(list)

The outcome is:

id1     id2   
HP:001  DO:001    [id1, id2, text, metadata]
dtype: object

I can turn the outcome into a DataFrame with reset_index, but I don't understand why I get a list of column names rather than the column contents.

CodePudding user response:

Change apply to agg, which calls list on each remaining column within each group:

outcome = df.groupby(["id1", "id2"]).agg(list)  # append .reset_index() to turn id1 and id2 back into columns

outcome
Out[372]: 
                           text                  metadata
id1    id2                                               
HP:001 DO:001  [text_1, text_2]  [metadata_1, metadata_2]
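
To get id1 and id2 back as ordinary columns, as in the table the question asks for, reset_index can be chained after agg. The following is a minimal sketch that reuses the question's sample data; the variable name flat is only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(data={"id1": ["HP:001", "HP:001"],
                        "id2": ["DO:001", "DO:001"],
                        "text": ["text_1", "text_2"],
                        "metadata": ["metadata_1", "metadata_2"]})

# agg(list) collects each non-key column into one list per group;
# reset_index() moves id1 and id2 from the index back into columns
flat = df.groupby(["id1", "id2"]).agg(list).reset_index()

print(flat)
```

The result has one row per (id1, id2) pair, with text and metadata each holding a Python list.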

CodePudding user response:

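df.groupby(["id1", "id2"]).apply(list) is equivalent to df.groupby(["id1", "id2"]).apply(lambda group: list(group)), where group is the sub-DataFrame for each group. Iterating over a DataFrame yields its column labels, so list(group) returns the column names rather than the values. agg(list), by contrast, calls list on each column separately, so you get the values.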
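
The difference can be checked without groupby at all. This short sketch, using the question's sample data, compares list on a whole DataFrame with list on a single column; the names column_names and text_values are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(data={"id1": ["HP:001", "HP:001"],
                        "id2": ["DO:001", "DO:001"],
                        "text": ["text_1", "text_2"],
                        "metadata": ["metadata_1", "metadata_2"]})

# Iterating over a DataFrame yields its column labels
column_names = list(df)
print(column_names)  # ['id1', 'id2', 'text', 'metadata']

# Iterating over a single column (a Series) yields its values
text_values = list(df["text"])
print(text_values)  # ['text_1', 'text_2']
```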