Filter of one dataframe using multiple columns of other dataframe in python

Time:01-27

I have one dataframe (df1), which is my raw data, from which I want to filter or extract part of the data. I have another dataframe (df2) that holds my filter conditions. The catch here is: if a filter-condition column is blank, that column's condition should be skipped and the other column conditions applied.

Example below:

DF1:

City   District   Town   Country   Continent
NY     WASHIN     DC     US        America
CZCH   SEATLLE    DC     CZCH      Europe
NY     NJ         DC     US        S America
NY     WASHIN     NY     US        America

DF2: (sample filter-condition table; this table can have multiple conditions)

City   District   Town   Country   Continent
NY                DC

Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.

OUTPUT DF should look like this:

City   District   Town   Country   Continent
NY     WASHIN     DC     US        America
NY     NJ         DC     US        S America

So basically I need a filter-condition table that extracts, from the raw data, the rows matching the fields I fill in. I cannot change or delete columns in DF2; I can only leave a column blank if I don't require that filter condition.

Thanks in advance, Nitz

CodePudding user response:

If DF2 always has exactly one row, drop its blank columns and inner-merge on the remaining ones:

df = df1.merge(df2.dropna(axis=1))
print (df)
  City District Town Country  Continent
0   NY   WASHIN   DC      US    America
1   NY       NJ   DC      US  S America
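The merge approach can be reproduced end to end with the sample data from the question (a minimal sketch; the dataframe construction here is an assumption added for illustration, not part of the original answer):

```python
import pandas as pd

# Raw data (df1) from the question
df1 = pd.DataFrame({
    "City":      ["NY", "CZCH", "NY", "NY"],
    "District":  ["WASHIN", "SEATLLE", "NJ", "WASHIN"],
    "Town":      ["DC", "DC", "DC", "NY"],
    "Country":   ["US", "CZCH", "US", "US"],
    "Continent": ["America", "Europe", "S America", "America"],
})

# Single-row filter table (df2): only City and Town are filled in
df2 = pd.DataFrame({
    "City": ["NY"], "District": [None], "Town": ["DC"],
    "Country": [None], "Continent": [None],
})

# dropna(axis=1) removes the blank filter columns, so the inner merge
# joins only on the columns that actually carry a condition
df = df1.merge(df2.dropna(axis=1))
print(df)
```

Note that merge resets the row index, which is why the printed result shows indices 0 and 1 rather than df1's original 0 and 2.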

If there are multiple filter rows with missing values:

print (df2)
  City  District  Town  Country  Continent
0   NY       NaN    DC      NaN        NaN
1   BO       NaN  None      NaN        NaN

First remove the missing values and reshape with DataFrame.stack:

print (df2.stack())
0  City    NY
   Town    DC
1  City    BO
dtype: object

Then, for each filter row (each group), compare the df1 columns that appear in that group's index against the corresponding values from df2:

m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
      for k, v in df2.stack().groupby(level=0)]


print (m)
[0     True
1    False
2     True
3    False
dtype: bool, 0    False
1    False
2    False
3     True
dtype: bool]

Combine the per-row masks with np.logical_or.reduce (this needs import numpy as np) and filter with boolean indexing:

print (np.logical_or.reduce(m))
[ True False  True  True]


df = df1[np.logical_or.reduce(m)]
print (df)
  City District Town Country  Continent
0   NY   WASHIN   DC      US    America
2   NY       NJ   DC      US  S America
3   BO   WASHIN   NY      US    America
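Putting the multi-row steps together, here is a runnable sketch. The dataframes are reconstructed to match this answer's printed output (the last row's City is "BO" here, unlike the question's sample, so that the second filter row has something to match):

```python
import numpy as np
import pandas as pd

# Raw data as used in this answer (last City changed to "BO")
df1 = pd.DataFrame({
    "City":      ["NY", "CZCH", "NY", "BO"],
    "District":  ["WASHIN", "SEATLLE", "NJ", "WASHIN"],
    "Town":      ["DC", "DC", "DC", "NY"],
    "Country":   ["US", "CZCH", "US", "US"],
    "Continent": ["America", "Europe", "S America", "America"],
})

# Two filter rows: (City=NY, Town=DC) and (City=BO)
df2 = pd.DataFrame({
    "City": ["NY", "BO"], "District": [None, None],
    "Town": ["DC", None], "Country": [None, None],
    "Continent": [None, None],
})

# stack() drops the NaN cells; groupby(level=0) yields one group per
# filter row, and each group becomes a boolean mask over df1's rows
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
     for k, v in df2.stack().groupby(level=0)]

# A row is kept if it satisfies at least one filter row
df = df1[np.logical_or.reduce(m)]
print(df)
```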

CodePudding user response:

Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):

df1.loc[np.sum(np.sum(
        df1.values == df2.values[:, None], axis=2) == 
    np.sum(df2.notna().values, axis=1)[:,None], axis=0) == 1]

Output:

  City District Town Country  Continent
0   NY   WASHIN   DC      US    America
2   NY       NJ   DC      US  S America
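A self-contained sketch of the broadcasting approach, using the question's sample data (the dataframe construction is an assumption added for illustration):

```python
import numpy as np
import pandas as pd

# Raw data (df1) from the question
df1 = pd.DataFrame({
    "City":      ["NY", "CZCH", "NY", "NY"],
    "District":  ["WASHIN", "SEATLLE", "NJ", "WASHIN"],
    "Town":      ["DC", "DC", "DC", "NY"],
    "Country":   ["US", "CZCH", "US", "US"],
    "Continent": ["America", "Europe", "S America", "America"],
})

# Single-row filter table; blank cells are None/NaN
df2 = pd.DataFrame({
    "City": ["NY"], "District": [None], "Town": ["DC"],
    "Country": [None], "Continent": [None],
})

# df1.values == df2.values[:, None] compares every df1 row against every
# df2 row cell by cell (shape: n_filters x n_rows x n_cols).
# Summing over axis=2 counts matching cells per pair; a df1 row satisfies
# a filter row when that count equals the number of non-NaN filter cells.
# The final sum over axis=0 keeps rows matched by exactly one filter row.
out = df1.loc[np.sum(np.sum(
        df1.values == df2.values[:, None], axis=2) ==
    np.sum(df2.notna().values, axis=1)[:, None], axis=0) == 1]
print(out)
```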