pandas - groupby a column and get the max length of another string column with nulls-CodePudding

I have a pandas DataFrame like this:

  source  text_column
0   a      abcdefghi
1   a      abcde
2   b      qwertyiop
3   c      plmnkoijb
4   a      NaN
5   c      abcde
6   b      qwertyiop
7   b      qazxswedcdcvfr

and I would like to get the length of text_column after grouping source column, like below:

  source  something
    a        9
    b       14
    c        9

Here's what I have tried till now and all of them generate error:

>>> # first creating the group by object
>>> text_group = mydf.groupby(by=['source'])
>>> # now try to get the max length of "text_column" by each "source"
>>> text_group['text_column'].map(len).max()
>>> text_group['text_column'].len().max()
>>> text_group['text_column'].str.len().max()

How do I get the max length of text_column with another column grouped by.

and to avoid creating new question, how do I also get the 2nd biggest length and the respective values(1st and 2nd largest sentences in text_column).

CodePudding user response：

Just get the lengths first with assign and str.len:

df.assign(text_column=df['text_column'].str.len()).groupby('source', as_index=False).max()

  source  text_column
0      a          9.0
1      b         14.0
2      c          9.0
>>>

CodePudding user response：

The easiest solution to me looks sth like this (tested) - you do not actually need a groupby:

df['str_len'] = df.text_column.str.len()
df.sort_values(['str_len'], ascending=False)\
    .drop_duplicates(['source'])\
    .drop(columns='text_column')

  source  str_len
7      b     14.0
0      a      9.0
3      c      9.0

regarding your 2nd question, I think a groupby serves you well:

top_x = 2
df.groupby('source', as_index=False)\
    .apply(lambda sourcedf: sourcedf.sort_values('str_len').nlargest(top_x, columns='str_len', keep='all'))\
    .drop(columns='text_column')

CodePudding user response：

First idea is use lambda function with Series.str.len and max:

df = (df.groupby('source')['text_column']
       .agg(lambda x: x.str.len().max())
       .reset_index(name='something'))
print (df)
  source  something
0      a        9.0
1      b       14.0
2      c        9.0

Or you can first use Series.str.len and then aggregate max:

df = (df['text_column'].str.len()
                       .groupby(df['source'])
                       .max()
                       .reset_index(name='something'))
print (df)

Also if need integers first use DataFrame.dropna:

df = (df.dropna(subset=['text_column'])
        .assign(text_column=lambda x: x['text_column'].str.len())
        .groupby('source', as_index=False)['text_column']
        .max())
print (df)

  source  text_column
0      a            9
1      b           14
2      c            9

EDIT: for first and second top values use DataFrame.sort_values with GroupBy.head:

df1 = (df.dropna(subset=['text_column'])
        .assign(something=lambda x: x['text_column'].str.len())
        .sort_values(['source','something'], ascending=[True, False])
        .groupby('source', as_index=False)
        .head(2))
print (df1)
  source     text_column  something
0      a       abcdefghi          9
1      a           abcde          5
7      b  qazxswedcdcvfr         14
2      b       qwertyiop          9
3      c       plmnkoijb          9
5      c           abcde          5

Alternative solution with SeriesGroupBy.nlargest, obviously slowier:

df1 = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .groupby('source')['something']
         .nlargest(2)
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)

  source  something
0      a          9
1      a          5
2      b         14
3      b          9
4      c          9
5      c          5

Last solution for new columns by top1, top2:

df=df.dropna(subset=['text_column']).assign(something=lambda x: x['text_column'].str.len())

df = df.sort_values(['source','something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)

df = (df[df['g'].le(2)].pivot('source','g','something')
      .add_prefix('top') 
      .rename_axis(index=None, columns=None))
print (df)
   top1  top2
a     9     5
b    14     9
c     9     5