How can i take rows with same id and where status has changed in pandas?-CodePudding

I have two dataframes, lets call them df1 and df2. I need to make a third dataframe where are rows from df1 filtered with the next conditions: By comparing the id's of both columns I want the third dataframe to contain the rows and column from df1 where in the df2 rows with same id has different status. I also want the new status as its own column so i can see the old status and the new status

So the dataframes would look something like this:

df1

id	col1	status
1	123456	1
2	123456	1
3	124567	1
4	123456	1
5	543210	1

df2

id	col1	status
1	1234356	1
2	1234456	2
3	1245567	1
4	1234516	2
5	5423210	2

df3(new dataframe)

id	col1	status	newstatus
2	123456	1	2
4	123456	1	2
5	543210	1	2

So as you can see, third dataframe would have columns and data from the df1 and then the new status from df2 as new column. The rows where id's match but the status has not changed, would be ignored.

I know you can merge the dataframes and then some way compare the columns but is that the wisest way or is some other method easier/better?

CodePudding user response：

If need append multiple columns from df2 use DataFrame.merge and then filter by DataFrame.query:

df3 = (df1.merge(df2.drop('col1', axis=1).rename(columns={'status':'newstatus'}), on='id')
          .query('status != newstatus'))

Or if need append only column status use Series.map:

df3 = (df1.assign(newstatus = df1['id'].map(df2.set_index('id')['status']))
          .query('status != newstatus'))
print (df3)
   id    col1  status  newstatus
1   2  123456       1          2
3   4  123456       1          2
4   5  543210       1          2

EDIT: Error InvalidIndexError: Reindexing only valid with uniquely valued Index objects means there are duplicated id values converted to index by .set_index('id'):

print (df2)
   id     col1  status
0   1  1234356       1
1   2  1234456       2
2   3  1245567       1
3   3  1245567      10 < added duplicated id=3 for test
4   4  1234516       2
5   5  5423210       2

First solution use both id:

df3 = (df1.merge(df2.drop('col1', axis=1).rename(columns={'status':'newstatus'}), on='id')
          .query('status != newstatus'))
print (df3)
   id    col1  status  newstatus
1   2  123456       1          2
3   3  124567       1         10
4   4  123456       1          2
5   5  543210       1          2

For mapping need unique id, here is used DataFrame.drop_duplicates:

print (df2.drop_duplicates('id'))
   id     col1  status
0   1  1234356       1
1   2  1234456       2
2   3  1245567       1
4   4  1234516       2
5   5  5423210       2

print (df2.drop_duplicates('id').set_index('id')['status'])
id
1    1
2    2
3    1
4    2
5    2
Name: status, dtype: int64

So now solution working:

df3 = (df1.assign(newstatus = df1['id'].map(df2.drop_duplicates('id').set_index('id')['status']))
          .query('status != newstatus'))
print (df3)
   id    col1  status  newstatus
1   2  123456       1          2
3   4  123456       1          2
4   5  543210       1          2