I have a dataframe like this:
df:
                         Collection  ID
0  [{'tom': 'one'}, {'tom': 'two'}]  10
1                 [{'nick': 'one'}]  10
2                [{'julie': 'one'}]  14
When the 'ID' column contains duplicated values, I want to add a new column 'status' that is 1 for the row whose 'Collection' list is the longest among the rows sharing that ID, and 0 for the other rows. A row whose ID is not duplicated should get 1.
The resulting df should look like this:
                         Collection  ID  status
0  [{'tom': 'one'}, {'tom': 'two'}]  10       1
1                 [{'nick': 'one'}]  10       0
2                [{'julie': 'one'}]  14       1
I tried np.where, which was the closest match to my problem that I found on Stack Overflow, but I cannot find an alternative to df['Collection'].str.len() that gives me the length of the list:
df['status']=np.where(df["Collection"].str.len() > 1, 1, 0)
Thanks in advance.
The df as a dict:
{'Collection': {0: [{'tom': 'one'}, {'tom': 'two'}],
1: [{'nick': 'one'}],
2: [{'julie': 'one'}]},
'ID': {0: 10, 1: 10, 2: 14}}
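For anyone who wants to reproduce this, the dict above can be passed straight to the DataFrame constructor. A minimal sketch; the variable name data is just a placeholder:

```python
import pandas as pd

# The dict shown above, keyed as column -> {row label -> value}.
data = {
    'Collection': {0: [{'tom': 'one'}, {'tom': 'two'}],
                   1: [{'nick': 'one'}],
                   2: [{'julie': 'one'}]},
    'ID': {0: 10, 1: 10, 2: 14},
}

# Rebuild the frame; 'Collection' becomes an object column holding lists.
df = pd.DataFrame(data)
print(df)
```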
CodePudding user response:
IIUC, you can do:
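# 'l' holds the list length; idxmax gives the row label of the longest list per ID
df.loc[df.assign(l=df['Collection'].apply(len)).groupby('ID')['l'].idxmax(), 'status'] = 1
# rows that were not selected are NaN at this point, so fill them with 0
df['status'] = df['status'].fillna(0).astype(int)
Selecting the 'l' column before calling idxmax() keeps the list-valued 'Collection' column out of the aggregation. If you instead call idxmax() on the whole groupby and pick ['l'] afterwards, later versions of pandas will probably need numeric_only=True passed to idxmax().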
Output:
                         Collection  ID  status
0  [{'tom': 'one'}, {'tom': 'two'}]  10       1
1                 [{'nick': 'one'}]  10       0
2                [{'julie': 'one'}]  14       1
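One thing to check against your real data: idxmax returns only the first row label per group, so if two rows with the same ID hold lists of equal length, only the first one gets 1. A small sketch of that behaviour; the frame tied is made-up data for illustration, not taken from the question:

```python
import pandas as pd

# Hypothetical data: both rows for ID 10 hold a one-element list (a tie).
tied = pd.DataFrame({
    'Collection': [[{'a': 'one'}], [{'b': 'one'}], [{'c': 'one'}]],
    'ID': [10, 10, 14],
})

# Length of each list, then the label of the first maximum within each ID.
lengths = tied['Collection'].str.len()
first_max = lengths.groupby(tied['ID']).idxmax()

# Start from 0 everywhere and flag only the rows idxmax picked.
tied['status'] = 0
tied.loc[first_max, 'status'] = 1
print(tied['status'].tolist())
```

If tied rows should all be flagged, compare each length with the group maximum instead of using idxmax.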
CodePudding user response:
A possible solution:
df['status'] = df['Collection'].map(len)
df['status'] = (df.groupby('ID', sort=False)
                .apply(lambda g: 1*g['status'].eq(max(g['status'])))
                .reset_index(drop=True))
Output:
                         Collection  ID  status
0  [{'tom': 'one'}, {'tom': 'two'}]  10       1
1                 [{'nick': 'one'}]  10       0
2                [{'julie': 'one'}]  14       1
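A variant of the same idea that keeps the original index and does not rely on reset_index is to use groupby(...).transform('max'). This is only a sketch and not the code from the answers above; the names lengths and group_max are mine, and the frame is rebuilt from the dict in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Collection': {0: [{'tom': 'one'}, {'tom': 'two'}],
                   1: [{'nick': 'one'}],
                   2: [{'julie': 'one'}]},
    'ID': {0: 10, 1: 10, 2: 14},
})

# Length of each list in 'Collection'.
lengths = df['Collection'].map(len)

# Broadcast the per-ID maximum length back onto every row of that ID.
group_max = lengths.groupby(df['ID']).transform('max')

# 1 where the row's length equals its group's maximum, else 0.
df['status'] = lengths.eq(group_max).astype(int)
print(df)
```

Because transform returns a result aligned to the original index, this should work whether or not rows with the same ID sit next to each other. Rows that tie for the longest list all get 1.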