I would like to achieve what it's described here: stackoverflow question, but only using standard pandas.
I have two dataframes: Fist
first_employee target_employee relationship
0 Andy Claude 0
1 Andy Frida 20
2 Andy Georgia -10
3 Andy Joan 30
4 Andy Lee -10
5 Andy Pablo -10
6 Andy Vincent 20
7 Claude Frida 0
8 Claude Georgia 90
9 Claude Joan 0
10 Claude Lee 0
11 Claude Pablo 10
12 Claude Vincent 0
13 Frida Georgia 0
14 Frida Joan 0
15 Frida Lee 0
16 Frida Pablo 50
17 Frida Vincent 60
18 Georgia Joan 0
19 Georgia Lee 10
20 Georgia Pablo 0
21 Georgia Vincent 0
22 Joan Lee 70
23 Joan Pablo 0
24 Joan Vincent 10
25 Lee Pablo 0
26 Lee Vincent 0
27 Pablo Vincent -20
Second:
first_employee target_employee book_count
0 Vincent Frida 2
1 Vincent Pablo 1
2 Andy Claude 1
3 Andy Joan 1
4 Andy Pablo 1
5 Andy Lee 1
6 Andy Frida 1
7 Andy Georgia 1
8 Claude Georgia 3
9 Joan Lee 3
10 Pablo Frida 2
I want to join the two dataframes such that my final dataframe is identical to the first one, but it has also the book_count
column with the corresponding values (and NaN if not available).
I already wrote something like: joined_df = first_df.merge(second_df, on = ['first_employee', 'target_employee'], how = 'outer')
and I get:
first_employee target_employee relationship book_count
0 Andy Claude 0.0 1.0
1 Andy Frida 20.0 1.0
2 Andy Georgia -10.0 1.0
3 Andy Joan 30.0 1.0
4 Andy Lee -10.0 1.0
5 Andy Pablo -10.0 1.0
6 Andy Vincent 20.0 NaN
7 Claude Frida 0.0 NaN
8 Claude Georgia 90.0 3.0
9 Claude Joan 0.0 NaN
10 Claude Lee 0.0 NaN
11 Claude Pablo 10.0 NaN
12 Claude Vincent 0.0 NaN
13 Frida Georgia 0.0 NaN
14 Frida Joan 0.0 NaN
15 Frida Lee 0.0 NaN
16 Frida Pablo 50.0 NaN
17 Frida Vincent 60.0 NaN
18 Georgia Joan 0.0 NaN
19 Georgia Lee 10.0 NaN
20 Georgia Pablo 0.0 NaN
21 Georgia Vincent 0.0 NaN
22 Joan Lee 70.0 3.0
23 Joan Pablo 0.0 NaN
24 Joan Vincent 10.0 NaN
25 Lee Pablo 0.0 NaN
26 Lee Vincent 0.0 NaN
27 Pablo Vincent -20.0 NaN
28 Vincent Frida NaN 2.0
29 Vincent Pablo NaN 1.0
30 Pablo Frida NaN 2.0
And it is somewhat close to what I would like to achieve. However, the ordering of the values in the first_employee
and target_employee
it's not relevant, so if in the first dataframe I have (Frida,Vincent)
and in the second (Vincent, Frida)
, these twos should be merged together (what matters are the values, not the column-wise order).
In my resulting dataframe i get three extra rows:
28 Vincent Frida NaN 2.0
29 Vincent Pablo NaN 1.0
30 Pablo Frida NaN 2.0
which are the result of my merging that considers "ordered" values columns-wise to make the join: these 3 extra rows should be merged on the already available couples (Frida, Vincent)
(Pablo, Vincent)
and (Frida, Pablo)
.
Is there a way to do so using only standard pandas
functions? (the question I cited at the beginning uses sqldf
)
CodePudding user response:
I believe this is what you are looking for. Using np.sort
will change the order of the first two columns for each row so it is alphabetical, allowing the merge to work correctly.
cols = ['first_employee','target_employee']
df[cols] = np.sort(df[cols].to_numpy(),axis=1)
df2[cols] = np.sort(df2[cols].to_numpy(),axis=1)
ndf = pd.merge(df,df2,on = cols,how='left')
CodePudding user response:
Create a key
as a sorted tuple with first and target employee then merge on it:
create_key = lambda x: tuple(sorted([x['first_employee'], x['target_employee']]))
out = pd.merge(df1.assign(_key=df1.apply(create_key, axis=1)),
df2.assign(_key=df2.apply(create_key, axis=1)),
on='_key', suffixes=('', '_key'), how='outer') \
.loc[:, lambda x: ~x.columns.str.endswith('_key')]
print(out)
# Output:
first_employee target_employee relationship book_count
0 Andy Claude 0 1.0
1 Andy Frida 20 1.0
2 Andy Georgia -10 1.0
3 Andy Joan 30 1.0
4 Andy Lee -10 1.0
5 Andy Pablo -10 1.0
6 Andy Vincent 20 NaN
7 Claude Frida 0 NaN
8 Claude Georgia 90 3.0
9 Claude Joan 0 NaN
10 Claude Lee 0 NaN
11 Claude Pablo 10 NaN
12 Claude Vincent 0 NaN
13 Frida Georgia 0 NaN
14 Frida Joan 0 NaN
15 Frida Lee 0 NaN
16 Frida Pablo 50 2.0
17 Frida Vincent 60 2.0
18 Georgia Joan 0 NaN
19 Georgia Lee 10 NaN
20 Georgia Pablo 0 NaN
21 Georgia Vincent 0 NaN
22 Joan Lee 70 3.0
23 Joan Pablo 0 NaN
24 Joan Vincent 10 NaN
25 Lee Pablo 0 NaN
26 Lee Vincent 0 NaN
27 Pablo Vincent -20 1.0