I have two dataframes, lets call them df1 and df2. I need to make a third dataframe where are rows from df1 filtered with the next conditions: By comparing the id's of both columns I want the third dataframe to contain the rows and column from df1 where in the df2 rows with same id has different status. I also want the new status as its own column so i can see the old status and the new status
So the dataframes would look something like this:
df1
id | col1 | status |
---|---|---|
1 | 123456 | 1 |
2 | 123456 | 1 |
3 | 124567 | 1 |
4 | 123456 | 1 |
5 | 543210 | 1 |
df2
id | col1 | status |
---|---|---|
1 | 1234356 | 1 |
2 | 1234456 | 2 |
3 | 1245567 | 1 |
4 | 1234516 | 2 |
5 | 5423210 | 2 |
df3(new dataframe)
id | col1 | status | newstatus |
---|---|---|---|
2 | 123456 | 1 | 2 |
4 | 123456 | 1 | 2 |
5 | 543210 | 1 | 2 |
So as you can see, third dataframe would have columns and data from the df1 and then the new status from df2 as new column. The rows where id's match but the status has not changed, would be ignored.
I know you can merge the dataframes and then some way compare the columns but is that the wisest way or is some other method easier/better?
CodePudding user response:
If need append multiple columns from df2
use DataFrame.merge
and then filter by DataFrame.query
:
df3 = (df1.merge(df2.drop('col1', axis=1).rename(columns={'status':'newstatus'}), on='id')
.query('status != newstatus'))
Or if need append only column status
use Series.map
:
df3 = (df1.assign(newstatus = df1['id'].map(df2.set_index('id')['status']))
.query('status != newstatus'))
print (df3)
id col1 status newstatus
1 2 123456 1 2
3 4 123456 1 2
4 5 543210 1 2
EDIT: Error InvalidIndexError: Reindexing only valid with uniquely valued Index objects
means there are duplicated id
values converted to index by .set_index('id')
:
print (df2)
id col1 status
0 1 1234356 1
1 2 1234456 2
2 3 1245567 1
3 3 1245567 10 < added duplicated id=3 for test
4 4 1234516 2
5 5 5423210 2
First solution use both id:
df3 = (df1.merge(df2.drop('col1', axis=1).rename(columns={'status':'newstatus'}), on='id')
.query('status != newstatus'))
print (df3)
id col1 status newstatus
1 2 123456 1 2
3 3 124567 1 10
4 4 123456 1 2
5 5 543210 1 2
For mapping need unique id
, here is used DataFrame.drop_duplicates
:
print (df2.drop_duplicates('id'))
id col1 status
0 1 1234356 1
1 2 1234456 2
2 3 1245567 1
4 4 1234516 2
5 5 5423210 2
print (df2.drop_duplicates('id').set_index('id')['status'])
id
1 1
2 2
3 1
4 2
5 2
Name: status, dtype: int64
So now solution working:
df3 = (df1.assign(newstatus = df1['id'].map(df2.drop_duplicates('id').set_index('id')['status']))
.query('status != newstatus'))
print (df3)
id col1 status newstatus
1 2 123456 1 2
3 4 123456 1 2
4 5 543210 1 2