I have the below sample data frame
In [1]: df = pd.DataFrame({'Name': {0: 'ahmed',
1: 'ahmed',
2: 'mohamed',
3: 'mohamed',
4: 'hassan',
5: 'ali',
6: 'nice',
7: 'bad'},
'Location': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C', 6: 'D', 7: 'D'},
'Prof': {0: 'doc',
1: 'doc',
2: 'eng',
3: 'doc',
4: 'eng',
5: 'eng',
6: 'doc',
7: 'eng'}})
In [2]: df
Out[2]:
Name Location Prof
0 ahmed A doc
1 ahmed A doc
2 mohamed B eng
3 mohamed B doc
4 hassan C eng
5 ali C eng
6 nice D doc
7 bad D eng
I would like to add a new column Relations
which will contain 1-to-1, 1-to-M, M-to-1 or M-to-M relation between profession and both Name and Location combination
I want this
In [3]: df
Out[3]:
Name Location Prof relation
0 ahmed A doc 1_to_1
1 mohamed B eng, doc 1_to_M
2 hassan C eng M_to_1
3 ali C eng M_to_1
4 nice D doc M_to_M
5 bad D eng M_to_M
beside the Relations
column, I only want unique values in ['Prof']
;
df.groupby(['Name', 'Location'])['Prof'].agg(','.join)
Using the above line of code resulted in doc,doc
for the first record, which is something I don't want and I can't remove duplicates before as I have many other columns in the original data that makes both 1st rows not duplicate.
CodePudding user response:
You can use set
in lambda after groupby
like below then you can use merge two DataFrame and get what you want like below:
>>> relation_list = []
>>> for col in ['Name', 'Prof']:
... relation_list.append(dict(df.groupby('Location')[col].apply(lambda x : '1' if len(set(x))==1 else 'M')))
>>> df_rel = pd.DataFrame(relation_list).apply(lambda x : '_to_'.join(x)).to_frame().reset_index().rename({'index': 'Location'}, axis=1)
>>> df_rel
Location 0
0 A 1_to_1
1 B 1_to_M
2 C M_to_1
3 D M_to_M
>>> df_prof = df.groupby(['Name', 'Location'])['Prof'].apply(lambda x : ','.join(set(x))).to_frame().reset_index()
>>> df_prof
Name Location Prof
0 ahmed A doc
1 ali C eng
2 bad D eng
3 hassan C eng
4 mohamed B eng,doc
5 nice D doc
>>> df_final = df_prof.merge(df_rel, on='Location').rename({0: 'relation'}, axis=1)
>>> df_final.sort_values('Location')
Name Location Prof relation
0 ahmed A doc 1_to_1
5 mohamed B eng,doc 1_to_M
1 ali C eng M_to_1
2 hassan C eng M_to_1
3 bad D eng M_to_M
4 nice D doc M_to_M
CodePudding user response:
Step 1: You can set the relation
for Location
as follows:
Get the unique counts of Name
and Prof
for each Location
by .groupby()
and nunique()
.
Set up relation
using np.select()
# Get the unique counts of `Name` and `Prof` for each `Location`
df_count = df.groupby('Location', as_index=False)[['Name', 'Prof']].nunique()
# Set up `relation` using `np.select()`
cond_list = [(df_count['Name'] == 1) & (df_count['Prof'] == 1),
(df_count['Name'] == 1) & (df_count['Prof'] >= 2),
(df_count['Name'] >= 2) & (df_count['Prof'] == 1),
(df_count['Name'] >= 2) & (df_count['Prof'] >= 2)
]
choice_list = ['1_to_1',
'1_to_M',
'M_to_1',
'M_to_M'
]
df_count['relation'] = np.select(cond_list, choice_list)
Step 2: Set the Prof
as follows:
We use .unique()
to get the unique values of Prof
for join
sort=False
in .groupby()
to keep original sequence.
df_prof = (df.groupby(['Name', 'Location'], as_index=False, sort=False)['Prof']
.agg(lambda x: ', '.join(x.unique())))
Step 3: Merge them together
df_final = df_prof.merge(df_count[['Location', 'relation']], on='Location')
Results:
df_count
print(df_count)
Location Name Prof relation
0 A 1 1 1_to_1
1 B 1 2 1_to_M
2 C 2 1 M_to_1
3 D 2 2 M_to_M
df_prof
print(df_prof)
Name Location Prof
0 ahmed A doc
1 mohamed B eng, doc
2 hassan C eng
3 ali C eng
4 nice D doc
5 bad D eng
df_final
print(df_final)
Name Location Prof relation
0 ahmed A doc 1_to_1
1 mohamed B eng, doc 1_to_M
2 hassan C eng M_to_1
3 ali C eng M_to_1
4 nice D doc M_to_M
5 bad D eng M_to_M