I have two Pandas DataFrames:
a =
ID Index Value
1 275 0 00000005
2 1024 27 01
3 1024 23 01
b =
ID Index Value_x Value_y
1 1024 27 01 02
2 1024 23 01 02
I want to get only the different values based on the first three columns, but only preserving the columns of a
- thereby resulting in this:
ID Index Value
1 275 0 00000005
I've tried dropping columns first, and doing a diff that way, but that ends up creating unnecessary copies of the DataFrames (real world arrays are 100k rows).
CodePudding user response:
What you need is anti_join
:
>>> from datar.all import f, anti_join, tribble
>>> a = tribble(
... f.ID, f.Index, f.Value,
... 275, 0, "00000005",
... 1024, 27, "01",
... 1024, 23, "01",
... )
>>> b = tribble(
... f.ID, f.Index, f.Value_x, f.Value_y,
... 1024, 23, "01", "02",
... 1024, 27, "01", "02",
... )
>>>
>>> a >> anti_join(b, by={"ID": "ID", "Index": "Index", "Value": "Value_x"})
ID Index Value
<int64> <int64> <object>
0 275 0 00000005
I am the author of datar
package, which is backed by pandas. Let me know if you have any questions or issues.
CodePudding user response:
You can play with a MultiIndex
to perform this without needing to create temporary DataFrame
or unnecessarily copying data:
# 1
idx_a = pd.MultiIndex.from_frame(a.iloc[:, :3])
idx_b = pd.MultiIndex.from_frame(b.iloc[:, :3])
# 2
idx_diff = idx_a.difference(idx_b)
# 3
result = a.set_index(idx_a).reindex(idx_diff).reset_index(drop=True)
print(result)
ID Index Value
0 275 0 00000005
- Create
MultiIndex
objects from the first 3 columns of eachDataFrame
- Get the index values that appear in
idx_a
but not inidx_b
- Use this difference index to subset our original
DataFrame
(a
)
CodePudding user response:
You could use merge
with indicator=True
and slice on "left_only":
cols = ['ID', 'Index']
c = (a.merge(b[cols], on=cols, how='outer', indicator=True)
.loc[lambda x: x.pop('_merge').eq('left_only')])
output:
ID Index Value
0 275 0 5
CodePudding user response:
Here is one possible way. You can join both dataframes to form one dataframe and then drop the duplicate values.
import pandas as pd
a = [
[275, 0, '00000005'],
[283, 0, '00F0'],
[1024, 27, '01']
]
b = [
[283, 0, '00F0', '00E1'],
[1024, 27, '01', '02']
]
# create dataframe for a and b
df_a = pd.DataFrame(a, columns=['ID', 'Index', 'Value'])
df_b = pd.DataFrame(b, columns=['ID', 'Index', 'Value_x', 'Value_y'])
# get difference of df_a and df_b based on ID and Index
df_diff = pd.concat([df_a[['ID', 'Index']], df_b[['ID', 'Index']]]).drop_duplicates(
subset=['ID', 'Index'], keep=False)
print(df_diff)