I need display in dataframe columns both the user_id and length of content_id which is a list object. But struggling to do using groupby. Please help in both groupby as well as my question asked at the bottom of this post (how do I get the results along with user_id in dataframe?)
Dataframe types:
df.dtypes
output:
user_id object
content_id object
dtype: object
Sample Data:
user_id content_id
0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_...
2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
Pandas query:
df.groupby('user_id')['content_id'].count().reset_index()
df.groupby(['user_id'])['content_id'].apply(lambda x: get_count(x))
output:
user_id content_id
0 user_10013 1
1 user_10034 1
2 user_10042 1
When I tried without grouping, I am getting fine as below -
df['content_id'].apply(lambda x: len(x))
0 11
1 9
2 11
3 1
But, how do I get the results along with user_id in dataframe? Like I want in below format -
user_id content_id
some xxx 11
some yyy 6
CodePudding user response:
pandas.Groupby
returns a grouper element not the contents of each cell. As such it is not possible (without alot of workarounding) to do what you want. Instead you need to simply rewrite the columns (as suggested by @ifly6)
Using
df_agg = df.copy()
df_agg.content_id = df_agg.content_id.apply(len)
df_agg = df_agg.groupby('user_id').sum()
will result in the same dataframe as the Groupby
you described.
For completeness sake the instruction for a single groupby would be
df.groupby('user_id').agg(lambda x: x.apply(len).sum())
CodePudding user response:
try converting content_id to a string, split it by comma, then reassemble as a list of lists then count the list items.
data="""index user_id content_id
0 user_18085 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
1 user_16044 [cont_2738_2_49,cont_4482_2_19,cont_4994_18_]
2 user_13110 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
"""
df = pd.read_csv(StringIO(data), sep='\s ')
def convert_to_list(x):
x=re.sub(r'[\[\]]', '', x)
lst=list(x.split(','))
return lst
df['content_id2']= [list() for x in range(len(df.index))]
for key,item in df.iterrows():
lst=convert_to_list(str(item['content_id']))
for item in lst:
df.loc[key,'content_id2'].append(item)
def count_items(x):
return len(x)
df['count'] = df['content_id2'].apply(count_items)
df.drop(['content_id'],axis=1,inplace=True)
df.rename(columns={'content_id2':'content_id'},inplace=True)
print(df)
output:
index user_id content_id count
0 0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
1 1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_] 3
2 2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
3 3 user_18909 [cont_3170_2_28] 1
4 4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3