Say you have two dataframes with the same columns.
Dataframe A has 10 rows and dataframe B has 100 rows, and the 10 rows of A all appear somewhere in B, though not necessarily at the same row positions.
How do we determine that those 10 rows in df A are fully contained in df B?
For example, say we have this for df A (using only 1 row):
A | B | C
1 | 2 | 3
and df B is:
A | B | C
2 | 5 | 5
3 | 2 | 7
1 | 2 | 3
5 | 1 | 5
How do we check that df A is contained in df B? Assume the rows are always unique, in the sense that each combination of A and B values occurs only once.
CodePudding user response:
Checking whether one DataFrame is a subset of another:
You can solve this with a merge followed by a comparison.
If the second dataframe is a superset of the first, the inner join of the two has the same shape as the smaller dataframe.
import pandas as pd
# df1 - smaller dataframe, df2 - larger dataframe
df1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df2 = pd.DataFrame({'A': [2, 3, 1, 5], 'B': [5, 2, 2, 1], 'C': [5, 7, 3, 5]})
df1.merge(df2).shape == df1.shape
True
If either dataframe may contain duplicate rows, drop the duplicates before comparing:
df1.merge(df2).drop_duplicates().shape == df1.drop_duplicates().shape
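If you also want to know which rows are missing, a left merge with indicator=True is one way to get them. The sketch below is only an illustration: the helper name missing_rows and the extra (9, 9, 9) row are made up for the example, and it assumes both dataframes share exactly the same columns.

```python
import pandas as pd

def missing_rows(small: pd.DataFrame, large: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `small` that have no exact match in `large`."""
    # Deduplicate `large` so each row of `small` matches at most once.
    merged = small.merge(large.drop_duplicates(), how="left", indicator=True)
    # The "_merge" column is "left_only" for rows not found in `large`.
    return merged.loc[merged["_merge"] == "left_only", small.columns]

# Hypothetical data: the second row of df1 does not occur in df2.
df1 = pd.DataFrame({"A": [1, 9], "B": [2, 9], "C": [3, 9]})
df2 = pd.DataFrame({"A": [2, 3, 1, 5], "B": [5, 2, 2, 1], "C": [5, 7, 3, 5]})

missing = missing_rows(df1, df2)
print(missing)        # shows the (9, 9, 9) row
print(missing.empty)  # False, so df1 is not fully contained in df2
```

An empty result means every row of the smaller dataframe was found in the larger one.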
CodePudding user response:
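Convert df2 into a dictionary and use isin to check. Note that isin with a dictionary tests each column independently, so it can also return True when the values of a row occur in df2 but never together in one row:
df1.isin({key: value.array for key, value in df2.items()}).all(axis=1).all()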
True
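One thing to watch with the dictionary form of isin is that membership is checked per column, not per row. The small sketch below uses a made-up dataframe called mixed (not from the question) and df2.to_dict("list") as an equivalent way to build the dictionary, and compares the result with the merge-based check.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [2, 3, 1, 5], "B": [5, 2, 2, 1], "C": [5, 7, 3, 5]})
# Hypothetical row: 2, 1 and 7 each occur in df2, but never in the same row.
mixed = pd.DataFrame({"A": [2], "B": [1], "C": [7]})

# Column-wise check: each value only has to appear somewhere in its column.
columnwise = mixed.isin(df2.to_dict("list")).all(axis=1).all()
# Row-wise check: the whole row has to match a row of df2.
rowwise = len(mixed.merge(df2)) == len(mixed)

print(columnwise)  # True
print(rowwise)     # False
```

So the isin approach is only safe when matching column by column is enough for your data; otherwise a row-wise method such as merge is the more reliable test.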
Another option would be to convert both dataframes to MultiIndexes and use isin or intersection. I suspect this may be more expensive computationally than the first option:
A = pd.MultiIndex.from_frame(df1)
B = pd.MultiIndex.from_frame(df2)
A.isin(B).all()
True
# via intersection: every row of A must survive (assumes the rows of A are unique)
len(A.intersection(B)) == len(A)
True
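For the 10-row case from the question, the MultiIndex idea could be wrapped in a small helper. This is a sketch under assumptions: the name rows_contained and the optional keys argument are invented here, with keys letting you compare only the A and B columns that the question says are unique.

```python
import pandas as pd

def rows_contained(small, large, keys=None):
    """True if every row of `small` (restricted to `keys`) occurs in `large`."""
    # Compare on all columns of `small` unless specific key columns are given.
    cols = list(small.columns) if keys is None else list(keys)
    a = pd.MultiIndex.from_frame(small[cols])
    b = pd.MultiIndex.from_frame(large[cols])
    # isin returns one boolean per row of `small`; all of them must be True.
    return bool(a.isin(b).all())

df2 = pd.DataFrame({"A": [2, 3, 1, 5], "B": [5, 2, 2, 1], "C": [5, 7, 3, 5]})
inside = pd.DataFrame({"A": [1, 5], "B": [2, 1], "C": [3, 5]})
# Hypothetical variant: same A and B values, but C differs in the second row.
changed = pd.DataFrame({"A": [1, 5], "B": [2, 1], "C": [3, 0]})

print(rows_contained(inside, df2))                    # True
print(rows_contained(changed, df2))                   # False
print(rows_contained(changed, df2, keys=["A", "B"]))  # True
```

Because isin works row by row, the order of the rows in either dataframe does not matter.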