Home > OS >  Split grouped columns datarame
Split grouped columns datarame

Time:06-18

I aggregate a columns and a get the sum of array column.

df2 = pd.DataFrame([[1,'IT',  np.array([2, 5, 3])],
                   [1, 'IT', np.array([2, 5, 3])],
                   [1,'Sport', np.array([2, 5, 3, 5, 3])],
                   [2,'Sport',  np.array([2, 5, 3])],
                   [2, 'IT', np.array([2, 5, 3])],
                   [2, 'Sport',np.array([2, 5, 3, 5, 3])]
                   ],
                 columns=['doc_id','type', 'topic_dist'])
grouped = df2.groupby(['doc_id','type'])
aggregate = list((k, v["topic_dist"].apply(pd.Series).sum().to_list()) for k, v in grouped) 
df_results = pd.DataFrame(aggregate, columns=['grouped_columns','topic_dist'])

and a get this result.

  grouped_columns                  topic_dist
0         (1, IT)                  [4, 10, 6]
1      (1, Sport)             [2, 5, 3, 5, 3]
2         (2, IT)                   [2, 5, 3]
3      (2, Sport)  [4.0, 10.0, 6.0, 5.0, 3.0]

expected result

  doc_id  type       topic_dist
0    1     IT        [4, 10, 6]
1    1     Sport     [2, 5, 3, 5, 3]
2    2     IT        [2, 5, 3]
3    2     Sport     [4.0, 10.0, 6.0, 5.0, 3.0]

any ideas to split the grouped Columns?

CodePudding user response:

You can achieve this by setting the index using pd.MultiIndex.from_tuples as follows:

df.index = pd.MultiIndex.from_tuples(df.grouped_columns, names=['doc_id', 'type'])
df = df.drop('grouped_columns', axis=1)

Or, if you would like them to be regular columns instead of a multiindex:

df[['doc_id', 'type']] = df.grouped_columns.tolist()
df = df.drop('grouped_columns', axis=1)

Note that I am using df as shorthand here for df_results to improve readability.

CodePudding user response:

just add index from k tuple

thanks

  • Related