I aggregate a columns and a get the sum of array column.
df2 = pd.DataFrame([[1,'IT', np.array([2, 5, 3])],
[1, 'IT', np.array([2, 5, 3])],
[1,'Sport', np.array([2, 5, 3, 5, 3])],
[2,'Sport', np.array([2, 5, 3])],
[2, 'IT', np.array([2, 5, 3])],
[2, 'Sport',np.array([2, 5, 3, 5, 3])]
],
columns=['doc_id','type', 'topic_dist'])
grouped = df2.groupby(['doc_id','type'])
aggregate = list((k, v["topic_dist"].apply(pd.Series).sum().to_list()) for k, v in grouped)
df_results = pd.DataFrame(aggregate, columns=['grouped_columns','topic_dist'])
and a get this result.
grouped_columns topic_dist
0 (1, IT) [4, 10, 6]
1 (1, Sport) [2, 5, 3, 5, 3]
2 (2, IT) [2, 5, 3]
3 (2, Sport) [4.0, 10.0, 6.0, 5.0, 3.0]
expected result
doc_id type topic_dist
0 1 IT [4, 10, 6]
1 1 Sport [2, 5, 3, 5, 3]
2 2 IT [2, 5, 3]
3 2 Sport [4.0, 10.0, 6.0, 5.0, 3.0]
any ideas to split the grouped Columns?
CodePudding user response:
You can achieve this by setting the index using pd.MultiIndex.from_tuples
as follows:
df.index = pd.MultiIndex.from_tuples(df.grouped_columns, names=['doc_id', 'type'])
df = df.drop('grouped_columns', axis=1)
Or, if you would like them to be regular columns instead of a multiindex:
df[['doc_id', 'type']] = df.grouped_columns.tolist()
df = df.drop('grouped_columns', axis=1)
Note that I am using df
as shorthand here for df_results
to improve readability.
CodePudding user response:
just add index from k tuple
thanks