How to outer merge 3 or more datasets based on an ID and compare them using one column?

Let's say we have three datasets with three different years:

ID	Text	Year
101	abc	1990
102	abd	1990
103	a	1990

The second dataset may or may not contain the IDs from the first year:

ID	Text	Year
104	bc	1991
101	abc	1991
102	abe	1991

And the third dataset:

ID	Text	Year
104	bc	1992
105	a	1992

I want to merge these three dataframes and add a new column that shows whether the text for an ID is the same across consecutive years. IDs that appear in only one year should be kept as well. The year should also be updated to the later one when the text from Year1 and Year2 is the same.

Here is the result I want:

ID	Text	Year	Similar
101	abc	1991	true
102	abe	1991	false
103	a	1990	false
104	bc	1992	true
105	a	1992	false

So I want to include the new IDs that appear in later years and also keep the IDs from earlier years that have no match later on; for those, the Similar column should be false. The merge should not be an inner merge, because IDs that are missing from the second or third dataframe must still be included. The year should be updated when Similar is true (for example, when the text for ID 104 in 1991 is the same as the text for ID 104 in 1992).

Thanks

CodePudding user response：

You can use groupby.agg. For the Similar column, use either lambda g: g.nunique() != len(g) (True if at least one Text value is repeated within the ID) or lambda g: g.duplicated(keep=False).all() (True only if every Text value for the ID is repeated):

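import pandas as pd

# df1, df2, df3 are the three yearly dataframes from the question
dfs = [df1, df2, df3]

# Stack the frames, then reduce each ID to a single row
out = (pd.concat(dfs)
         .groupby('ID', as_index=False)
         .agg(**{'Text': ('Text', 'last'),   # most recent text
                 'Year': ('Year', 'last'),   # most recent year
                 # True only if every text for this ID occurs at least twice
                 'Similar': ('Text', lambda g: g.duplicated(keep=False).all())
                })
      )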

Output:

    ID Text  Year  Similar
0  101  abc  1991     True
1  102  abe  1991    False
2  103    a  1990    False
3  104   bc  1992     True
4  105    a  1992    False
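
To make the difference between the two lambdas concrete, here is a minimal, self-contained sketch. The frames are rebuilt from the tables in the question; the helper names any_repeated and all_repeated and the three-row series at the end are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample frames copied from the question's tables.
df1 = pd.DataFrame({"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": 1990})
df2 = pd.DataFrame({"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": 1991})
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": 1992})


def any_repeated(g):
    # True when at least one Text value occurs more than once in the group.
    return g.nunique() != len(g)


def all_repeated(g):
    # True only when every Text value in the group occurs at least twice.
    return g.duplicated(keep=False).all()


out = (pd.concat([df1, df2, df3])
         .groupby("ID", as_index=False)
         .agg(Text=("Text", "last"),
              Year=("Year", "last"),
              Similar=("Text", all_repeated)))
print(out)

# Hypothetical group with three rows (not from the question):
# the two checks disagree because "b" is not repeated.
g = pd.Series(["a", "a", "b"])
print(any_repeated(g), all_repeated(g))
```

With at most two rows per ID, as in the sample data, both checks give the same answer. They can only diverge once an ID appears in three or more years, so which one fits depends on how "similar" should be defined in that case.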

CodePudding user response：

Here I first combine all the dataframes with the pandas concat function, then group by ID, and finally apply a lambda that keeps only the last row of each group using iloc[[-1]]. Note that this returns the latest row per ID but does not add the Similar column.

Try the following code:

pd.concat([df1,df2,df3]).groupby('ID').apply(lambda x: x.iloc[[-1]]).reset_index(drop=True)
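
A possible way to extend the keep-the-last-row idea so that it also yields a Similar flag is sketched below. It is an assumption about how the two steps could be combined, not the answerer's code: the names combined and latest are made up for illustration, and drop_duplicates is swapped in for the groupby/apply step because it selects the same last row per ID.

```python
import pandas as pd

# Sample frames copied from the question's tables.
df1 = pd.DataFrame({"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": 1990})
df2 = pd.DataFrame({"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": 1991})
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": 1992})

# Stack the frames in chronological order so "last" means "most recent".
combined = pd.concat([df1, df2, df3], ignore_index=True)

# Flag every row whose (ID, Text) pair occurs in more than one row.
# This must be computed before any rows are dropped.
combined["Similar"] = combined.duplicated(subset=["ID", "Text"], keep=False)

# Keep only the most recent row per ID, then order by ID.
latest = (combined.drop_duplicates(subset="ID", keep="last")
                  .sort_values("ID")
                  .reset_index(drop=True))
print(latest)
```

In this sketch Similar is true when the text of the most recent row also occurs in an earlier row for the same ID, which matches the desired result for the sample data.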