pandas groupby with length of lists-CodePudding

I need display in dataframe columns both the user_id and length of content_id which is a list object. But struggling to do using groupby. Please help in both groupby as well as my question asked at the bottom of this post (how do I get the results along with user_id in dataframe?)

Dataframe types:

df.dtypes

output:

user_id       object
content_id    object
dtype: object

Sample Data:

    user_id     content_id
0   user_18085  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
1   user_16044  [cont_2738_2_49, cont_4482_2_19, cont_4994_18_...
2   user_13110  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
3   user_18909  [cont_3170_2_28]
4   user_15509  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...

Pandas query:

df.groupby('user_id')['content_id'].count().reset_index()

df.groupby(['user_id'])['content_id'].apply(lambda x: get_count(x))

output:

    user_id     content_id
0   user_10013  1
1   user_10034  1
2   user_10042  1

When I tried without grouping, I am getting fine as below -

df['content_id'].apply(lambda x: len(x))


0       11
1        9
2       11
3        1

But, how do I get the results along with user_id in dataframe? Like I want in below format -

user_id   content_id
some xxx  11
some yyy  6

CodePudding user response：

pandas.Groupby returns a grouper element not the contents of each cell. As such it is not possible (without alot of workarounding) to do what you want. Instead you need to simply rewrite the columns (as suggested by @ifly6)

Using

df_agg = df.copy()
df_agg.content_id = df_agg.content_id.apply(len)
df_agg = df_agg.groupby('user_id').sum()

will result in the same dataframe as the Groupby you described.

For completeness sake the instruction for a single groupby would be

df.groupby('user_id').agg(lambda x: x.apply(len).sum())

CodePudding user response：

try converting content_id to a string, split it by comma, then reassemble as a list of lists then count the list items.

data="""index  user_id     content_id
0   user_18085  [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
1   user_16044  [cont_2738_2_49,cont_4482_2_19,cont_4994_18_]
2   user_13110  [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
3   user_18909  [cont_3170_2_28]
4   user_15509  [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
"""
df = pd.read_csv(StringIO(data), sep='\s ')

def convert_to_list(x):
    x=re.sub(r'[\[\]]', '', x)
    lst=list(x.split(','))
    return lst

df['content_id2']= [list() for x in range(len(df.index))]
for key,item in df.iterrows():
    lst=convert_to_list(str(item['content_id']))
    for item in lst:
        df.loc[key,'content_id2'].append(item)
    
def count_items(x):
    return len(x)

df['count'] = df['content_id2'].apply(count_items)
df.drop(['content_id'],axis=1,inplace=True)
df.rename(columns={'content_id2':'content_id'},inplace=True)
print(df)

output:

 index     user_id                                       content_id  count
0      0  user_18085  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19]      3
1      1  user_16044  [cont_2738_2_49, cont_4482_2_19, cont_4994_18_]      3
2      2  user_13110  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19]      3
3      3  user_18909                                 [cont_3170_2_28]      1
4      4  user_15509  [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19]      3