I have a pandas DataFrame like this:
  source     text_column
0      a       abcdefghi
1      a           abcde
2      b       qwertyiop
3      c       plmnkoijb
4      a             NaN
5      c           abcde
6      b       qwertyiop
7      b  qazxswedcdcvfr
and I would like to get the max length of text_column after grouping by the source column, like below:
source  something
a               9
b              14
c               9
Here's what I have tried so far; all of them raise errors:
>>> # first creating the group by object
>>> text_group = mydf.groupby(by=['source'])
>>> # now try to get the max length of "text_column" by each "source"
>>> text_group['text_column'].map(len).max()
>>> text_group['text_column'].len().max()
>>> text_group['text_column'].str.len().max()
How do I get the max length of text_column grouped by another column? And, to avoid creating a new question: how do I also get the 2nd-largest length and the respective values (the 1st and 2nd longest strings in text_column)?
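For reference, here is a runnable reconstruction of the frame (values copied from the table above). It shows that all three attempts fail the same way, before any computation happens, because a SeriesGroupBy object has no map, len, or str attribute:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
mydf = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', 'qwertyiop', 'qazxswedcdcvfr'],
})

text_group = mydf.groupby(by=['source'])

# Each attempt dies on attribute lookup:
# a SeriesGroupBy has no .map, .len or .str accessor
try:
    text_group['text_column'].str.len().max()
except AttributeError as exc:
    print(exc)
```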
CodePudding user response:
Just get the lengths first with assign and str.len:
df.assign(text_column=df['text_column'].str.len()).groupby('source', as_index=False).max()
  source  text_column
0      a          9.0
1      b         14.0
2      c          9.0
CodePudding user response:
The easiest solution to me looks something like this (tested) - you do not actually need a groupby:
df['str_len'] = df.text_column.str.len()
df.sort_values(['str_len'], ascending=False)\
.drop_duplicates(['source'])\
.drop(columns='text_column')
   source  str_len
7       b     14.0
0       a      9.0
3       c      9.0
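As a side note (my addition, not part of the original answer): the same one-row-per-group result can be spelled with GroupBy.idxmax, which returns the row label of the longest string per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', 'qwertyiop', 'qazxswedcdcvfr'],
})
df['str_len'] = df.text_column.str.len()

# idxmax skips NaN and yields one row label per group;
# .loc then selects those rows in a single step
longest = df.loc[df.groupby('source')['str_len'].idxmax(),
                 ['source', 'str_len']]
print(longest)
```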
Regarding your 2nd question, I think a groupby serves you well:
top_x = 2
df.groupby('source', as_index=False)\
.apply(lambda sourcedf: sourcedf.sort_values('str_len').nlargest(top_x, columns='str_len', keep='all'))\
.drop(columns='text_column')
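Run end to end, that looks like the following (a self-contained sketch; I pass group_keys=False, an addition of mine, to keep the original row labels, and note that keep='all' retains ties, which is why 'b' contributes three rows with lengths 14, 9, 9):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', 'qwertyiop', 'qazxswedcdcvfr'],
})
df['str_len'] = df.text_column.str.len()

top_x = 2
# nlargest sorts by itself, and keep='all' keeps boundary ties
out = (df.groupby('source', group_keys=False)
         .apply(lambda sourcedf: sourcedf.nlargest(top_x,
                                                   columns='str_len',
                                                   keep='all'))
         .drop(columns='text_column'))
print(out)
```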
CodePudding user response:
First idea is to use a lambda function with Series.str.len and max:
df = (df.groupby('source')['text_column']
.agg(lambda x: x.str.len().max())
.reset_index(name='something'))
print (df)
  source  something
0      a        9.0
1      b       14.0
2      c        9.0
Or you can first use Series.str.len and then aggregate max:
df = (df['text_column'].str.len()
.groupby(df['source'])
.max()
.reset_index(name='something'))
print (df)
Also, if you need integers, first use DataFrame.dropna:
df = (df.dropna(subset=['text_column'])
.assign(text_column=lambda x: x['text_column'].str.len())
.groupby('source', as_index=False)['text_column']
.max())
print (df)
  source  text_column
0      a            9
1      b           14
2      c            9
EDIT: for the first and second top values use DataFrame.sort_values with GroupBy.head:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.sort_values(['source','something'], ascending=[True, False])
.groupby('source', as_index=False)
.head(2))
print (df1)
  source     text_column  something
0      a       abcdefghi          9
1      a           abcde          5
7      b  qazxswedcdcvfr         14
2      b       qwertyiop          9
3      c       plmnkoijb          9
5      c           abcde          5
Alternative solution with SeriesGroupBy.nlargest, which is obviously slower:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.groupby('source')['something']
.nlargest(2)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
  source  something
0      a          9
1      a          5
2      b         14
3      b          9
4      c          9
5      c          5
Last solution, for new columns top1 and top2:
df=df.dropna(subset=['text_column']).assign(something=lambda x: x['text_column'].str.len())
df = df.sort_values(['source','something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)
df = (df[df['g'].le(2)].pivot(index='source', columns='g', values='something')
          .add_prefix('top')
          .rename_axis(index=None, columns=None))
print (df)
   top1  top2
a     9     5
b    14     9
c     9     5
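The same top1/top2 table can also be built without the helper g column, via SeriesGroupBy.nlargest plus named aggregation - a sketch that assumes every source group has at least two rows, so the max and min of the two kept values are exactly top1 and top2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', 'qwertyiop', 'qazxswedcdcvfr'],
})

out = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .groupby('source')['something']
         .nlargest(2)                   # MultiIndex: (source, original row)
         .groupby(level=0)
         .agg(top1='max', top2='min')   # the two kept values per group
         .rename_axis(index=None))
print(out)
```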