Merging dataframe with its sub dataframe in python-CodePudding

I have 2 dataframes

the first is df_my that looks like this

     Seq   Grp    Lvl   Distance
0    1     A       1    71
1    2     A       2    50
2    3     F       1    35
...
19   20    A       1    68
20   21    C       3    25

The 2nd dataset is df_Out

     Seq   Grp    Lvl   Distance
1    2     J       9    50
2    3     K       6    35
...
20   21    S       8    25

All records with the same Seq in df_Out are in df_My

except Seq for records with Disance > 55

I want to put these 2 dataframes together considering

If record (Seq) in df_Out then it is the to be in my final dataframe

If record (Seq) is not in df_Out but in df_My then the one from df_My is in the final data frame

df_Final will be

     Seq   Grp    Lvl   Distance
0    1     A       1    71
1    2     J       9    50
2    3     K       6    35
...
19   20    A       1    68
20   21    S       8    25

How to do that?

CodePudding user response：

One approach could be to append first dataframe (df_my) into second (df_Out) and then remove duplicates by keeping the row from the second dataframe (df_Out).

df_Out.append(df_my).drop_duplicates(['Seq'], keep='first').sort_values('Seq')

CodePudding user response：

It should work if you concat them and then drop duplicates, keeping the first:

pd.concat([df_out, df_my]).drop_duplicates(subset='Seq', keep='first').sort_index()

Output:

    Seq Grp  Lvl  Distance
0     1   A    1        71
1     2   J    9        50
2     3   K    6        35
19   20   A    1        68
20   21   S    8        25

CodePudding user response：

The easiest way is to compose the algorithm of the commands that you just described.

calculate the set of values of Seq that are covered by dataframe 2, because you base your decision on which dataframe to use in the merge on this criteria
join the mutually exclusive (w.r.t. values of Seq column) subsets of dataframe 1 and dataframe 2
sort if necessary

Code (not teseted):

import numpy as np
seq_values_that_present_in_both_dfs_arr = np.intersect1d(df1['Seq'].unique(), df2['Seq'].unique())
result_df = pd.concat([df1.query('Seq not in @seq_values_that_present_in_both_dfs_arr'),
                       df2.query('Seq     in @seq_values_that_present_in_both_dfs_arr')]).reset_index().sort_values(by=[' Seq'])

I prefer to be explicit with what I want from the result in the code, than rely on obscure implementations of certain commands.