Home > database >  How to outer merge 3 or more datasets based on an id and compare them using one column?
How to outer merge 3 or more datasets based on an id and compare them using one column?

Time:12-15

Let's say we have three datasets with three different years:

ID Text Year
101 abc 1990
102 abd 1990
103 a 1990

And the second dataset that could (or not) contain the IDs from the first year:

ID Text Year
104 bc 1991
101 abc 1991
102 abe 1991

And the third dataset:

ID Text Year
104 bc 1992
105 a 1992

I want somehow to merge these three dataframes add a new column to see if the text between the common IDs (and uncommon IDs) from those consecutive years is the same or not also update the year if we notice there's similarity between text from Year1 and Year2.

Here is the result I want:

ID Text Year Similar
101 abc 1991 true
102 abe 1991 false
103 a 1990 false
104 bc 1992 true
105 a 1992 false

So I also want to include those new IDs corresponding to the new years, but also to keep the IDs of the previous year, but without having a match in terms of ID that similar column. The merge should not be inner (because we also want to integrate those IDs that are not present in the second/third dataframe) and the year should be updated if the Similar column is true (let's say if text from ID 104, year 1991 is the same as the text from ID 104, year 1992).

Thanks

CodePudding user response:

You can use a groupby.agg with lambda g: g.nunique()!=len(g) (if there at least one non duplicate) or lambda g: g.duplicated(keep=False).all() for the column Similar:

dfs = [df1, df2, df3]

out = (pd.concat(dfs)
         .groupby('ID', as_index=False)
         .agg(**{'Text': ('Text', 'last'),
                 'Year': ('Year', 'last'),
                 'Similar': ('Text', lambda g: g.duplicated(keep=False).all())
                })
      )

Output:

    ID Text  Year  Similar
0  101  abc  1991     True
1  102  abe  1991    False
2  103    a  1990    False
3  104   bc  1992     True
4  105    a  1992    False

CodePudding user response:

So, here I am first merging the all df using pandas concat function, in second stage group the id and third applying the lambda function and filtering the last position by -1.

Try following code:

pd.concat([df1,df2,df3]).groupby('id').apply(lambda x: x.iloc[[-1]]).reset_index(drop=True)
  • Related