Add column to groupby dataframe based on length of sliced column

Time:05-20

I've looked at other questions on similar topics, but I have yet to find an answer. I have a dataframe that looks as follows:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[1, 1, 1, 1, 2, 2, 2], [0, 0, 0, 1, 0, 0, 1], ['some text', 'other text', 'more text', 'new text', 'text sample', 'sample', 'sample text'], ['kw1, kw2', 'kw1, kw2, kw3', 'kw1', 'kw1, kw2, kw3, kw4', 'kw1', 'kw1, kw2, kw3', 'kw1, kw2']]).T, columns=['value', 'cluster', 'text', 'keywords'])
>>> result = df.groupby(['value', 'cluster', 'text']).keywords.sum().to_frame()
>>> result
value      cluster      text      keywords
  1           0      some text       kw1, kw2
                     other text      kw1, kw2, kw3
                     more text       kw1
              1      new text        kw1, kw2, kw3, kw4
  2           0      text sample     kw1
                     sample          kw1, kw2, kw3
              1      sample text     kw1, kw2

What I would like to do next is add a column to this existing dataframe from a list of lists, matching its items on the index levels value and cluster rather than on the text or keywords columns.

>>> summary = [['some, summary'], ['this, too, summ'], ['kws, of, summ'], ['summ, based, kw']]

I basically want to match each item of the summary list to a cluster. So it would look like this:

value      cluster   summary            text       keywords
  1           0      some, summary    some text       kw1, kw2
                                      other text      kw1, kw2, kw3
                                      more text       kw1
              1      this, too, summ   new text        kw1, kw2, kw3, kw4
  2           0      kws, of, summ    text sample     kw1
                                      sample          kw1, kw2, kw3
              1      summ, based, kw  sample text     kw1, kw2

Obviously, simply assigning the list as a new column does not work, because the list of lists has 4 items while the dataframe has 7 rows:

result['summary'] = summary
ValueError: Length of values (4) does not match length of index (7)
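
The mismatch can be checked directly. The following is a minimal sketch, assuming the frame is rebuilt from a plain dict (so value and cluster are integers here, unlike the np.array version above, where every column is a string): the grouped frame has one row per (value, cluster, text) combination, but only one group per (value, cluster) pair.

```python
import pandas as pd

# Same data as in the question, built from a dict for readability (an assumption).
df = pd.DataFrame(
    {
        "value": [1, 1, 1, 1, 2, 2, 2],
        "cluster": [0, 0, 0, 1, 0, 0, 1],
        "text": ["some text", "other text", "more text", "new text",
                 "text sample", "sample", "sample text"],
        "keywords": ["kw1, kw2", "kw1, kw2, kw3", "kw1", "kw1, kw2, kw3, kw4",
                     "kw1", "kw1, kw2, kw3", "kw1, kw2"],
    }
)
result = df.groupby(["value", "cluster", "text"]).keywords.sum().to_frame()

n_rows = len(result)  # one row per (value, cluster, text)
n_groups = result.groupby(["value", "cluster"]).ngroups  # one group per (value, cluster)

print(n_rows, n_groups)  # 7 4
```

So a 4-item list lines up with the groups, not with the rows, which is why it has to be expanded to one entry per row before it can be assigned.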

I also tried to use the following method, but it just adds together all keywords from one cluster, which is not exactly what I am looking for.

result['summary'] = result.groupby(['value', 'cluster']).transform('sum')

Is there any way to solve this?

Answer:

Use Index.get_level_values to read the value and cluster index levels, join them into a label prefixed with 'summary ', and append that label to the MultiIndex:

a = result.index.get_level_values('value').astype(str)
b = result.index.get_level_values('cluster').astype(str)
result['summary'] = 'summary ' + a + '.' + b

result = result.set_index('summary', append=True).reorder_levels([0,1,3,2])
print(result)
                                                 keywords
value cluster summary     text                           
1     0       summary 1.0 more text                   kw1
                          other text        kw1, kw2, kw3
                          some text              kw1, kw2
      1       summary 1.1 new text     kw1, kw2, kw3, kw4
2     0       summary 2.0 sample            kw1, kw2, kw3
                          text sample                 kw1
      1       summary 2.1 sample text            kw1, kw2
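
As a sanity check on this approach, here is a small self-contained sketch (again assuming a dict-built frame with integer value and cluster, which is not how the question constructs it). Because the label is derived only from the two index levels, every (value, cluster) group should end up with exactly one distinct label.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "value": [1, 1, 1, 1, 2, 2, 2],
        "cluster": [0, 0, 0, 1, 0, 0, 1],
        "text": ["some text", "other text", "more text", "new text",
                 "text sample", "sample", "sample text"],
        "keywords": ["kw1, kw2", "kw1, kw2, kw3", "kw1", "kw1, kw2, kw3, kw4",
                     "kw1", "kw1, kw2, kw3", "kw1, kw2"],
    }
)
result = df.groupby(["value", "cluster", "text"]).keywords.sum().to_frame()

# Build the label from the index levels; string concatenation is element-wise.
value_labels = result.index.get_level_values("value").astype(str)
cluster_labels = result.index.get_level_values("cluster").astype(str)
result["summary"] = "summary " + value_labels + "." + cluster_labels

# Each (value, cluster) group should carry a single distinct label.
labels_per_group = result.groupby(["value", "cluster"])["summary"].nunique()
one_label_each = bool((labels_per_group == 1).all())

labels = result["summary"].tolist()
```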

EDIT: If the list has as many items as there are (value, cluster) groups, it is possible to map the list items to the groups. First check the lengths:

print(len(summary) == result.groupby(['value','cluster']).ngroup().max() + 1)
True

Create a dict with enumerate, taking the first (and only) element of each one-element list:

d = {k: v[0] for k, v in enumerate(summary)}
print (d)
{0: 'some, summary', 1: 'this, too, summ', 2: 'kws, of, summ', 3: 'summ, based, kw'}

Use GroupBy.ngroup to number the groups, map those numbers to the summaries with Series.map, then use DataFrame.set_index with DataFrame.reorder_levels:

result['summary'] = result.groupby(['value','cluster']).ngroup().map(d)
result = result.set_index('summary', append=True).reorder_levels([0,1,3,2])
print(result)
                                                     keywords
value cluster summary         text                           
1     0       some, summary   more text                   kw1
                              other text        kw1, kw2, kw3
                              some text              kw1, kw2
      1       this, too, summ new text     kw1, kw2, kw3, kw4
2     0       kws, of, summ   sample            kw1, kw2, kw3
                              text sample                 kw1
      1       summ, based, kw sample text            kw1, kw2
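
Put together, the steps from the EDIT can be sketched as one runnable script. This is only an illustration under two assumptions: the frame is built from a dict (integer value and cluster), and the items in summary are listed in the same order in which groupby sorts the (value, cluster) groups, since ngroup numbers the groups in that sorted order. The names group_ids and lookup are introduced here for readability.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "value": [1, 1, 1, 1, 2, 2, 2],
        "cluster": [0, 0, 0, 1, 0, 0, 1],
        "text": ["some text", "other text", "more text", "new text",
                 "text sample", "sample", "sample text"],
        "keywords": ["kw1, kw2", "kw1, kw2, kw3", "kw1", "kw1, kw2, kw3, kw4",
                     "kw1", "kw1, kw2, kw3", "kw1, kw2"],
    }
)
result = df.groupby(["value", "cluster", "text"]).keywords.sum().to_frame()
summary = [["some, summary"], ["this, too, summ"], ["kws, of, summ"], ["summ, based, kw"]]

# Number each (value, cluster) group 0..n-1 in sorted group order, one number per row.
group_ids = result.groupby(["value", "cluster"]).ngroup()

# Guard: the mapping only makes sense with exactly one summary per group.
if len(summary) != group_ids.max() + 1:
    raise ValueError("summary must contain one item per (value, cluster) group")

# Group number -> summary string (first element of each one-element list).
lookup = {i: item[0] for i, item in enumerate(summary)}
result["summary"] = group_ids.map(lookup)

# Move summary into the index, between cluster and text.
result = result.set_index("summary", append=True).reorder_levels([0, 1, 3, 2])
print(result)
```

If the list is not in sorted group order, the positional mapping silently pairs summaries with the wrong groups, so it is worth verifying the order before relying on it.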