After adding cluster topics to a dataframe in pandas I get a result which looks like this:
[{'Document': 'Lorem', 'Topic': 0},
{'Document': 'Ipsum', 'Topic': 0},
{'Document': 'Dolor', 'Topic': 0},
{'Document': 'Sit', 'Topic': 1},
{'Document': 'Amet', 'Topic': 1},
As a table, Document and Topic are the headers:
| Document | Topic |
|---------------------|------------------|
| Lorem | 0 |
| Ipsum | 0 |
What I would like to do is have the unique topics be the headers and the Documents the values, for example:
| 0 | 1 |
|---------------------|------|
| Lorem | Sit |
| Ipsum | Amet |
I've tried a lot of hacky solutions to this, and all of them involve using for loops and leaving pandas, so I would really like to know what the correct way of doing this in pandas would be.
The closest I've gotten in pandas itself is this:
df.groupby("Topic").agg(list)
However this groups each Document in an array according to the topic so like this:
| Document | Topic |
|---------------------|---------|
| [Lorem, Ipsum] | 0 |
| [Sit, Amet] | 1 |
Which is not much better than what I started with.
Thanks so much!
CodePudding user response:
Assuming a list of dictionaries as input, use pd.json_normalize
combined with pivot
:
(pd.json_normalize(l)
.assign(row=lambda d: d.groupby('Topic').cumcount())
.pivot('row', 'Topic', 'Document')
.dropna()
)
output:
Topic 0 1
row
0 Lorem Sit
1 Ipsum Amet