I have 2 dataframes
the first is df_my that looks like this
Seq Grp Lvl Distance
0 1 A 1 71
1 2 A 2 50
2 3 F 1 35
...
19 20 A 1 68
20 21 C 3 25
The 2nd dataset is df_Out
Seq Grp Lvl Distance
1 2 J 9 50
2 3 K 6 35
...
20 21 S 8 25
All records with the same Seq in df_Out are in df_My
except Seq for records with Disance > 55
I want to put these 2 dataframes together considering
If record (Seq) in df_Out then it is the to be in my final dataframe
If record (Seq) is not in df_Out but in df_My then the one from df_My is in the final data frame
df_Final will be
Seq Grp Lvl Distance
0 1 A 1 71
1 2 J 9 50
2 3 K 6 35
...
19 20 A 1 68
20 21 S 8 25
How to do that?
CodePudding user response:
One approach could be to append
first dataframe (df_my) into second (df_Out) and then remove duplicates
by keeping the row from the second dataframe (df_Out).
df_Out.append(df_my).drop_duplicates(['Seq'], keep='first').sort_values('Seq')
CodePudding user response:
It should work if you concat them and then drop duplicates, keeping the first:
pd.concat([df_out, df_my]).drop_duplicates(subset='Seq', keep='first').sort_index()
Output:
Seq Grp Lvl Distance
0 1 A 1 71
1 2 J 9 50
2 3 K 6 35
19 20 A 1 68
20 21 S 8 25
CodePudding user response:
The easiest way is to compose the algorithm of the commands that you just described.
- calculate the set of values of Seq that are covered by
dataframe 2
, because you base your decision on which dataframe to use in the merge on this criteria - join the mutually exclusive (w.r.t. values of Seq column) subsets of
dataframe 1
anddataframe 2
- sort if necessary
Code (not teseted):
import numpy as np
seq_values_that_present_in_both_dfs_arr = np.intersect1d(df1['Seq'].unique(), df2['Seq'].unique())
result_df = pd.concat([df1.query('Seq not in @seq_values_that_present_in_both_dfs_arr'),
df2.query('Seq in @seq_values_that_present_in_both_dfs_arr')]).reset_index().sort_values(by=[' Seq'])
I prefer to be explicit with what I want from the result in the code, than rely on obscure implementations of certain commands.