I have two df's, one of which I want to only keep the most recent data, i.e df1 acts as a rolling record, and df2 feeds information to it.
If there is an ID appearing in df1 which isn't appearing in df2, I want to drop it from the new iteration of df1. All new ID's appearing in df2 should be added, and the duplicates should be kept.
Furthermore, I have a running count of the amount of appearances an ID has.
I have been trying to use df.join, I thought the outer method would have been appropriate.
Ideally I want something that's the opposite to df.drop_duplicates
This is my code so far:
grp1 = {'ID': ['1','2','3','4','5'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
}
grp2 = {'ID': ['3','4','5','6','7'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
}
df1 = pd.DataFrame(grp1, columns= ['ID','Shape'])
df2 = pd.DataFrame(grp2, columns= ['ID','Shape'])
df1.loc[df1["ID"].isin(df2["ID"]), 'count'] = 1
df1 = pd.concat([df1, df2]).drop_duplicates('ID').reset_index(drop=True)
The code I have written so far is this, however it is giving me the wrong output
df1 = pd.merge(df1,df2,on='ID',how='right')
output:
ID Shape_x count_x Shape_y count_y
0 3 Square 1.0 Rectangle 1
1 4 Rectangle 1.0 Rectangle 1
2 5 Square 1.0 Square 1
3 6 NaN NaN Rectangle 1
4 7 NaN NaN Square 1
What I am trying to achieve:
df1
ID Shape count
0 1 Rectangle 1
1 2 Rectangle 1
2 3 Square 1
3 4 Rectangle 1
4 5 Square 1
df2
ID Shape count
0 3 Rectangle 1
1 4 Rectangle 1
2 5 Square 1
3 6 Rectangle 1
4 7 Square 1
ideal output:
ID Shape count
0 3 Rectangle 2
1 4 Rectangle 2
2 5 Square 2
3 6 Rectangle 1
4 7 Square 1
CodePudding user response:
You can use the similar .isin()
code for df2
, as follows:
df2.loc[df2["ID"].isin(df1["ID"]), 'count'] = 1
Data Setup
grp1 = {'ID': ['1','2','3','4','5'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
}
grp2 = {'ID': ['3','4','5','6','7'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
}
df1 = pd.DataFrame(grp1, columns= ['ID','Shape'])
df2 = pd.DataFrame(grp2, columns= ['ID','Shape'])
df1['count'] = 1
df2['count'] = 1
print(df1)
ID Shape count
0 1 Rectangle 1
1 2 Rectangle 1
2 3 Square 1
3 4 Rectangle 1
4 5 Square 1
print(df2)
ID Shape count
0 3 Rectangle 1
1 4 Rectangle 1
2 5 Square 1
3 6 Rectangle 1
4 7 Square 1
Then, run the new code:
df2.loc[df2["ID"].isin(df1["ID"]), 'count'] = 1
Result:
print(df2)
ID Shape count
0 3 Rectangle 2
1 4 Rectangle 2
2 5 Square 2
3 6 Rectangle 1
4 7 Square 1
Edit
If you want to approach from df1
, you can use:
grp1 = {'ID': ['1','2','3','4','5'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
}
grp2 = {'ID': ['3','4','5','6','7'],
'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
}
df1 = pd.DataFrame(grp1, columns= ['ID','Shape'])
df2 = pd.DataFrame(grp2, columns= ['ID','Shape'])
df1['count'] = 1
df2['count'] = 1
df1.loc[df1["ID"].isin(df2["ID"]), 'count'] = 1
df1 = pd.concat([df1, df2]).drop_duplicates('ID').reset_index(drop=True)
This yield df1
, as follows:
print(df1)
ID Shape count
0 1 Rectangle 1
1 2 Rectangle 1
2 3 Square 2
3 4 Rectangle 2
4 5 Square 2
5 6 Rectangle 1
6 7 Square 1
Then, to filter out the rows where
an ID appearing in df1 which isn't appearing in df2
we further use our handy friend .isin()
again:
df1.loc[df1['ID'].isin(df2['ID'])]
This filters to keep only those ID
in df1
that is also in df2
.
Result:
ID Shape count
2 3 Square 2
3 4 Rectangle 2
4 5 Square 2
5 6 Rectangle 1
6 7 Square 1
CodePudding user response:
In case you data in count varies, you can do it with aggregation:
res = pd.concat([df1, df2]).groupby(['ID', 'Shape']).sum().reset_index()
res = res.loc[res.ID.isin(df2.ID)]
It gives the following frame:
ID Shape count
2 3 Rectangle 1
3 3 Square 1
4 4 Rectangle 2
5 5 Square 2
6 6 Rectangle 1
7 7 Square 1