I have one dataframe(df1) which is my raw data from which i want to filter or extract a part of the data. I have another dataframe(df2) which have my filter conditions. The catch here is my filter condition column if blank should skip tht column condition and move to the other column conditions
Example below:
DF1:
City | District | Town | Country | Continent |
---|---|---|---|---|
NY | WASHIN | DC | US | America |
CZCH | SEATLLE | DC | CZCH | Europe |
NY | NJ | DC | US | S America |
NY | WASHIN | NY | US | America |
DF2:(sample filter condition table - this table can have multiple conditions)
City | District | Town | Country | Continent |
---|---|---|---|---|
NY | DC |
Notice that i have left the district, country and continent column blank. As I may or may not use it later. I cannot delete these columns.
OUTPUT DF: should look like this
City | District | Town | Country | Continent |
---|---|---|---|---|
NY | WASHIN | DC | US | America |
NY | NJ | DC | US | S America |
So basically i need a filter condition table which will extract information from the raw data for fields i input in the filter tables. I cannot change/delete columns in DF2. I can only leave the column blank if i dont require the filter condition.
Thanks in advance, Nitz
CodePudding user response:
If DF2
has always one row:
df = df1.merge(df2.dropna(axis=1))
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
1 NY NJ DC US S America
If multiple rows with missing values:
print (df2)
City District Town Country Continent
0 NY NaN DC NaN NaN
1 BO NaN None NaN NaN
First remove missing values with reshape by DataFrame.stack
:
print (df2.stack())
0 City NY
Town DC
1 City BO
dtype: object
Then for each group compare df1
columns if exist in columns names and value from df2
:
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
for k, v in df2.stack().groupby(level=0)]
print (m)
[0 True
1 False
2 True
3 False
dtype: bool, 0 False
1 False
2 False
3 True
dtype: bool]
Use logical_or.reduce
and filter in boolean indexing
:
print (np.logical_or.reduce(m))
[ True False True True]
df = df1[np.logical_or.reduce(m)]
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
2 NY NJ DC US S America
3 BO WASHIN NY US America
CodePudding user response:
Another possible solution, using numpy broadcasting
(it works even when df2
has more than one row):
df1.loc[np.sum(np.sum(
df1.values == df2.values[:, None], axis=2) ==
np.sum(df2.notna().values, axis=1)[:,None], axis=0) == 1]
Output:
City District Town Country Continent
0 NY WASHIN DC US America
2 NY NJ DC US S America