My sample dataframe contains:
id1 | id2 | text | metadata |
---|---|---|---|
HP:001 | DO:001 | text_1 | metadata_1 |
HP:001 | DO:001 | text_2 | metadata_2 |
I am trying to generate a new table, grouped by id1 and id2, where text and metadata each contain a list of the values from their respective records:
id1 | id2 | text | metadata |
---|---|---|---|
HP:001 | DO:001 | ['text_1', 'text_2'] | ['metadata_1', 'metadata_2'] |
I tried to use groupby with apply and reset_index:
import pandas as pd

df = pd.DataFrame(data={"id1": ["HP:001", "HP:001"],
                        "id2": ["DO:001", "DO:001"],
                        "text": ["text_1", "text_2"],
                        "metadata": ["metadata_1", "metadata_2"]})
outcome = df.groupby(["id1", "id2"]).apply(list)
The outcome is:
id1     id2
HP:001  DO:001    [id1, id2, text, metadata]
dtype: object
I can parse the outcome into a DataFrame with reset_index, but I don't understand why I get a list of column names rather than their content as the outcome.
CodePudding user response:
Change apply to agg, which applies list to each column separately:
outcome = df.groupby(["id1", "id2"]).agg(list)  # .reset_index()
outcome
Out[372]:
                           text                  metadata
id1    id2
HP:001 DO:001  [text_1, text_2]  [metadata_1, metadata_2]
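To get id1 and id2 back as ordinary columns, as in the desired table from the question, the commented-out reset_index() can be applied. The following is a minimal sketch using the question's sample data; passing as_index=False to groupby should be an equivalent alternative (the name outcome_alt is only for illustration):

```python
import pandas as pd

df = pd.DataFrame(data={"id1": ["HP:001", "HP:001"],
                        "id2": ["DO:001", "DO:001"],
                        "text": ["text_1", "text_2"],
                        "metadata": ["metadata_1", "metadata_2"]})

# agg(list) collects the values of each non-key column into a list per group;
# reset_index() turns the (id1, id2) index back into regular columns.
outcome = df.groupby(["id1", "id2"]).agg(list).reset_index()

# Alternative: keep the group keys as columns from the start.
outcome_alt = df.groupby(["id1", "id2"], as_index=False).agg(list)

print(outcome)
```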
CodePudding user response:
You can view df.groupby(["id1", "id2"]).apply(list) as df.groupby(["id1", "id2"]).apply(lambda group: list(group)), where group is a DataFrame. Iterating over a DataFrame yields its column labels, so list(DataFrame) returns the column names rather than the values.
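A small sketch of that difference, using the question's sample data (the variable names column_labels and text_values are only for illustration): calling list on a DataFrame gives the column labels, while calling it on a single column (a Series) gives that column's values, which is what agg(list) does for each column.

```python
import pandas as pd

df = pd.DataFrame(data={"id1": ["HP:001", "HP:001"],
                        "id2": ["DO:001", "DO:001"],
                        "text": ["text_1", "text_2"],
                        "metadata": ["metadata_1", "metadata_2"]})

# Iterating over a DataFrame yields its column labels.
column_labels = list(df)
print(column_labels)  # ['id1', 'id2', 'text', 'metadata']

# Iterating over a Series yields its values.
text_values = list(df["text"])
print(text_values)  # ['text_1', 'text_2']
```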