Imagine you have the following data:
date items
0 2022-03-04 [6, 7]
1 2022-03-04 [4, 8]
1 2022-03-04 [4, 8]
1 2022-03-04 [4, 8]
1 2022-03-04 [4, 8]
... ... ...
1082 2022-03-03 1
1082 2022-03-03 1
1082 2022-03-03 1
1083 2022-03-11 1
1083 2022-03-11 1
I'd like to end up with the maximum length of items per day:
date len
0 2022-03-04 2
2 2022-03-05 3
... ... ...
1083 2022-03-11 1
and so on. I don't care about duplicates; I just want the largest list size for each day.
I've tried the following:
df["len"] = max(df.items, key=len)
but this seems to compute over the whole column rather than per row, as I get the following error:
*** ValueError: Length of values (3) does not match length of index (4002)
Appreciate any assistance you can give
CodePudding user response:
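Use Series.str.len with an aggregate max. If a value is not a list (for example an integer), df['items'].str.len() returns a missing value for it. Select the column as df['items'], because df.items refers to the DataFrame.items method rather than the column:
df = df['items'].str.len().groupby(df['date']).max().reset_index(name='len')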
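For context on the error in the question, here is a minimal sketch on a small made-up frame (the data is an assumption, not the asker's 4002 rows). The built-in max with key=len returns a single object, the longest list in the column, and pandas then tries to spread that list's elements over the rows, so the lengths have to match. Series.str.len instead returns one length per row.

```python
import pandas as pd

# Hypothetical sample data, much smaller than the frame in the question.
df = pd.DataFrame({
    "date": ["2022-03-04", "2022-03-04", "2022-03-05", "2022-03-05"],
    "items": [[6, 7], [4, 8, 1], [4], [4, 8]],
})

# max(..., key=len) returns ONE object: the longest list in the column.
longest = max(df["items"], key=len)
print(longest)  # [4, 8, 1]

# Assigning that 3-element list to a 4-row frame raises a ValueError.
try:
    df["len"] = longest
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Series.str.len works element-wise and gives one length per row.
row_lengths = df["items"].str.len()
print(row_lengths.tolist())  # [2, 3, 1, 2]
```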
If some values are integers that should count as lengths themselves, replace the missing values with fillna:
df = (df['items'].str.len().fillna(df['items'])
        .groupby(df['date'])
        .max()
        .reset_index(name='len'))
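As a rough end-to-end check, the chain above can be run on a small frame that mixes lists and plain integers. The sample values below are invented for illustration, and the intermediate name lengths is only there to make the steps visible.

```python
import pandas as pd

# Hypothetical sample data: lists for most rows, a bare integer for one.
df = pd.DataFrame({
    "date": ["2022-03-04", "2022-03-04", "2022-03-05", "2022-03-03"],
    "items": [[6, 7], [4, 8, 1], [4, 8], 1],
})

# Length of each list; the integer row becomes a missing value here.
lengths = df["items"].str.len()

# Fall back to the integer itself wherever the length is missing.
lengths = lengths.fillna(df["items"])

# Largest value per date, returned as a two-column frame (date, len).
out = lengths.groupby(df["date"]).max().reset_index(name="len")
print(out)
```

The result is sorted by date because groupby sorts its keys by default.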
EDIT: If you need the rows with the maximal length of items per date:
print (df)
date items
0 2022-03-04 [6,7]
1 2022-03-04 [4]
1 2022-03-05 [4,8,4,4,7]
1 2022-03-05 [4,8,4]
1 2022-03-06 [4,8,5,7]
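# reset the duplicated index so each label returned by idxmax selects exactly one row
df = df.reset_index(drop=True)
# list length per row; non-list values (e.g. integers) are kept as they are
df['len'] = df['items'].str.len().fillna(df['items'])
# keep the row with the largest len within each date
df = df.loc[df.groupby('date')['len'].idxmax()]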
print (df)
date items len
0 2022-03-04 [6, 7] 2
2 2022-03-05 [4, 8, 4, 4, 7] 5
4 2022-03-06 [4, 8, 5, 7] 4
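To try this locally, the sample frame has to be built first. The sketch below reconstructs it from the printed output above (the constructor call is my assumption about how such a frame could be created) and then applies the same three steps.

```python
import pandas as pd

# Rebuild the sample frame, including its repeated index labels.
df = pd.DataFrame(
    {
        "date": ["2022-03-04", "2022-03-04", "2022-03-05",
                 "2022-03-05", "2022-03-06"],
        "items": [[6, 7], [4], [4, 8, 4, 4, 7], [4, 8, 4], [4, 8, 5, 7]],
    },
    index=[0, 1, 1, 1, 1],
)

# Make the index unique so idxmax labels map to single rows.
df = df.reset_index(drop=True)

# Length of each list (integers, if any, would be kept as they are).
df["len"] = df["items"].str.len().fillna(df["items"])

# Select, for each date, the row whose list is longest.
result = df.loc[df.groupby("date")["len"].idxmax()]
print(result)
```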