Home > Software design >  How do I group topic clustered data in pandas by topic?
How do I group topic clustered data in pandas by topic?

Time:09-27

After adding cluster topics to a dataframe in pandas I get a result which looks like this:

[{'Document': 'Lorem', 'Topic': 0},
 {'Document': 'Ipsum', 'Topic': 0},
 {'Document': 'Dolor', 'Topic': 0},
 {'Document': 'Sit', 'Topic': 1},
 {'Document': 'Amet', 'Topic': 1},

As a table, Document and Topic are the headers:

|      Document       |     Topic        |
|---------------------|------------------|
|          Lorem      |         0        |
|          Ipsum      |         0        |

What I would like to do is have the unique topics be the headers and the Documents the values, for example:

|      0              |   1  |
|---------------------|------|
|          Lorem      | Sit  |
|          Ipsum      | Amet |

I've tried a lot of hacky solutions to this, and all of them involve using for loops and leaving pandas, so I would really like to know what the correct way of doing this in pandas would be.

The closest I've gotten in pandas itself is this:

df.groupby("Topic").agg(list)

However this groups each Document in an array according to the topic so like this:

|      Document       |   Topic |
|---------------------|---------| 
|  [Lorem, Ipsum]     | 0       |
|  [Sit, Amet]        | 1       |

Which is not much better than what I started with.

Thanks so much!

CodePudding user response:

Assuming a list of dictionaries as input, use pd.json_normalize combined with pivot:

(pd.json_normalize(l)
   .assign(row=lambda d: d.groupby('Topic').cumcount())
   .pivot('row', 'Topic', 'Document')
   .dropna()
)

output:

Topic      0     1
row               
0      Lorem   Sit
1      Ipsum  Amet
  • Related