Let's say we have three datasets, each covering a different year:
ID | Text | Year |
---|---|---|
101 | abc | 1990 |
102 | abd | 1990 |
103 | a | 1990 |
The second dataset may or may not contain the IDs from the first year:
ID | Text | Year |
---|---|---|
104 | bc | 1991 |
101 | abc | 1991 |
102 | abe | 1991 |
And the third dataset:
ID | Text | Year |
---|---|---|
104 | bc | 1992 |
105 | a | 1992 |
I want to merge these three dataframes and add a new column showing whether the text for each ID is the same across those consecutive years. IDs that are not common to all years should be kept too. I also want to update the year when the text from Year1 and Year2 is the same.
Here is the result I want:
ID | Text | Year | Similar |
---|---|---|---|
101 | abc | 1991 | true |
102 | abe | 1991 | false |
103 | a | 1990 | false |
104 | bc | 1992 | true |
105 | a | 1992 | false |
So I want to include the new IDs that appear in the later years, and also keep the IDs from the previous year that have no matching ID later on, with the Similar column set to false for them. The merge should not be an inner join (because IDs that are missing from the second/third dataframe must also be included), and the year should be updated if the Similar column is true (for example, if the text for ID 104 in 1991 is the same as the text for ID 104 in 1992).
Thanks
CodePudding user response:
You can use groupby.agg, with lambda g: g.nunique()!=len(g) (true if at least one value is duplicated) or lambda g: g.duplicated(keep=False).all() (true only if every value is duplicated) for the column Similar:
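import pandas as pd

dfs = [df1, df2, df3]  # must be listed in chronological order
out = (pd.concat(dfs)
         .groupby('ID', as_index=False)
         .agg(**{'Text': ('Text', 'last'),   # text from the most recent year
                 'Year': ('Year', 'last'),   # most recent year for this ID
                 # True only if every text for this ID occurs more than once
                 'Similar': ('Text', lambda g: g.duplicated(keep=False).all())
                })
      )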
Output:
ID Text Year Similar
0 101 abc 1991 True
1 102 abe 1991 False
2 103 a 1990 False
3 104 bc 1992 True
4 105 a 1992 False
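For anyone who wants to reproduce this, here is a self-contained sketch. The three frames are rebuilt from the tables in the question (the names df1, df2, df3 and the constructor calls are assumptions, since the question does not show how the data was loaded), and the aggregation is the same as above, written with keyword arguments instead of a dict.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question (assumed construction).
df1 = pd.DataFrame({"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": 1990})
df2 = pd.DataFrame({"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": 1991})
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": 1992})

out = (
    pd.concat([df1, df2, df3])          # frames must be in chronological order
    .groupby("ID", as_index=False)
    .agg(
        Text=("Text", "last"),          # text from the most recent year
        Year=("Year", "last"),          # most recent year for this ID
        # True only if every text for this ID occurs more than once
        Similar=("Text", lambda g: g.duplicated(keep=False).all()),
    )
)
print(out)
```

Note that an ID seen in a single year gets Similar = False, because a one-row group has no duplicates.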
CodePudding user response:
Here I first combine all the dataframes with the pandas concat function, then group by ID, and finally apply a lambda function that keeps only the last row of each group (position -1). Note that this keeps the latest row for each ID but does not add the Similar column.
Try the following code:
pd.concat([df1,df2,df3]).groupby('ID').apply(lambda x: x.iloc[[-1]]).reset_index(drop=True)
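If the Similar column is needed as well, one possible extension of the "keep the last row" idea is sketched below. It is not part of the answer above: it assumes the frames are concatenated in chronological order and compares each row's text with the previous appearance of the same ID, which is only the previous year when the ID is present in consecutive years. The sample frames are rebuilt from the question, and the names combined, prev_text and latest are made up for illustration.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question (assumed construction).
df1 = pd.DataFrame({"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": 1990})
df2 = pd.DataFrame({"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": 1991})
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": 1992})

combined = pd.concat([df1, df2, df3], ignore_index=True)

# Text of the same ID in its previous appearance (NaN for a first appearance).
prev_text = combined.groupby("ID")["Text"].shift()

# Comparing against NaN yields False, so first appearances are never "similar".
combined["Similar"] = combined["Text"].eq(prev_text)

# Keep only the most recent row per ID, then order by ID.
latest = (
    combined.drop_duplicates("ID", keep="last")
    .sort_values("ID")
    .reset_index(drop=True)
)
print(latest)
```

On the sample data this gives the same five rows as the expected result in the question. Unlike the groupby.agg answer, it only looks at the two most recent appearances of each ID rather than all of them.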