I have 2 dataframe as below:
df.head(10)
key program
0 A emp
1 A dep
2 A emp
3 A dep
4 A dep
5 B emp
6 B dep
7 B emp
8 B emp
9 B emp
df1.head()
key program value1 value2
0 A emp 10000 100000
1 A dep 5000 30000
2 B emp 20000 40000
3 B dep 3000 6000
then I merge 2 df by 'key' and 'program'
df_merge = df.merge(df1,how='left',left_on=['key','program'],right_on=['key','program'])
df_merge.head(10)
key program value1 value2
0 A emp 10000 100000
1 A dep 5000 30000
2 A emp 10000 100000
3 A dep 5000 30000
4 A dep 5000 30000
5 B emp 20000 40000
6 B dep 3000 6000
7 B emp 20000 40000
8 B emp 20000 40000
9 B emp 20000 40000
I would like to keep unique value in column 'value1' and 'values' base 'key' and 'program', could you please assist how I can do that ? output expected like below:
key program value1 value2
0 A emp 10000 100000
1 A dep 5000 30000
2 A emp
3 A dep
4 A dep
5 B emp 20000 40000
6 B dep 3000 6000
7 B emp
8 B emp
9 B emp
CodePudding user response:
You can modify your merge
by creating a new index column:
df_merge = (
df.merge(df1, how='left',
left_on=['key', 'program', df.groupby(['key', 'program']).cumcount()],
right_on=['key', 'program', df1.groupby(['key', 'program']).cumcount()])
.drop(columns='key_2')
)
Output:
>>> df_merge
key program value1 value2
0 A emp 10000.0 100000.0
1 A dep 5000.0 30000.0
2 A emp NaN NaN
3 A dep NaN NaN
4 A dep NaN NaN
5 B emp 20000.0 40000.0
6 B dep 3000.0 6000.0
7 B emp NaN NaN
8 B emp NaN NaN
9 B emp NaN NaN
CodePudding user response:
Use DataFrame.duplicated
with set values of columns value1, value2
to empty strings:
df_merge.loc[df_merge.duplicated(['key','value1','value2']), ['value1','value2']] = ''
print (df_merge)
key program value1 value2
0 A emp 10000 100000
1 A dep 5000 30000
2 A emp
3 A dep
4 A dep
5 B emp 20000 40000
6 B dep 3000 6000
7 B emp
8 B emp
9 B emp