Filter of one dataframe using multiple columns of other dataframe in python

Time:01-27

I have one dataframe (df1), which is my raw data, from which I want to filter or extract part of the data. I have another dataframe (df2) that holds my filter conditions. The catch here is: if a filter-condition column is blank, that column's condition should be skipped and the other column conditions applied.

Example below:

DF1:

City   District   Town   Country   Continent
NY     WASHIN     DC     US        America
CZCH   SEATLLE    DC     CZCH      Europe
NY     NJ         DC     US        S America
NY     WASHIN     NY     US        America

DF2: (sample filter-condition table; this table can have multiple conditions)

City   District   Town   Country   Continent
NY                DC

Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.

OUTPUT DF should look like this:

City   District   Town   Country   Continent
NY     WASHIN     DC     US        America
NY     NJ         DC     US        S America

So basically I need a filter-condition table that extracts, from the raw data, the rows matching the fields I fill in. I cannot change or delete columns in DF2; I can only leave a column blank if I don't require that filter condition.

Thanks in advance, Nitz

CodePudding user response:

If DF2 always has exactly one row, drop its blank columns and inner-merge on the remaining ones:

df = df1.merge(df2.dropna(axis=1))
print (df)
  City District Town Country  Continent
0   NY   WASHIN   DC      US    America
1   NY       NJ   DC      US  S America
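The merge approach can be reproduced end to end with the sample data from the question (a minimal sketch; the dataframe construction here is an assumption added for illustration, not part of the original answer):

```python
import pandas as pd

# Raw data (df1) from the question
df1 = pd.DataFrame({
    "City":      ["NY", "CZCH", "NY", "NY"],
    "District":  ["WASHIN", "SEATLLE", "NJ", "WASHIN"],
    "Town":      ["DC", "DC", "DC", "NY"],
    "Country":   ["US", "CZCH", "US", "US"],
    "Continent": ["America", "Europe", "S America", "America"],
})

# Single-row filter table (df2): only City and Town are filled in
df2 = pd.DataFrame({
    "City": ["NY"], "District": [None], "Town": ["DC"],
    "Country": [None], "Continent": [None],
})

# dropna(axis=1) removes the blank filter columns, so the inner merge
# joins only on the columns that actually carry a condition
df = df1.merge(df2.dropna(axis=1))
print(df)
```

Note that merge resets the row index, which is why the printed result shows indices 0 and 1 rather than df1's original 0 and 2.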

If there are multiple filter rows with missing values:

print (df2)
  City  District  Town  Country  Continent
0   NY       NaN    DC      NaN        NaN
1   BO       NaN  None      NaN        NaN

First remove the missing values and reshape with DataFrame.stack:

print (df2.stack())
0  City    NY
   Town    DC
1  City    BO
dtype: object

Then, for each filter row (each group), compare the df1 columns that appear in that group's index against the corresponding values from df2:

m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
      for k, v in df2.stack().groupby(level=0)]


print (m)
[0     True
1    False
2     True
3    False
dtype: bool, 0    False
1    False
2    False
3     True
dtype: bool]

Combine the per-row masks with np.logical_or.reduce (this needs import numpy as np) and filter with boolean indexing:

print (np.logical_or.reduce(m))
[ True False  True  True]


df = df1[np.logical_or.reduce(m)]
print (df)
  City District Town Country  Continent
0   NY   WASHIN   DC      US    America
2   NY       NJ   DC      US  S America
3   BO   WASHIN   NY      US    America
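Putting the multi-row steps together, here is a runnable sketch. The dataframes are reconstructed to match this answer's printed output (the last row's City is "BO" here, unlike the question's sample, so that the second filter row has something to match):

```python
import numpy as np
import pandas as pd

# Raw data as used in this answer (last City changed to "BO")
df1 = pd.DataFrame({
    "City":      ["NY", "CZCH", "NY", "BO"],
    "District":  ["WASHIN", "SEATLLE", "NJ", "WASHIN"],
    "Town":      ["DC", "DC", "DC", "NY"],
    "Country":   ["US", "CZCH", "US", "US"],
    "Continent": ["America", "Europe", "S America", "America"],
})

# Two filter rows: (City=NY, Town=DC) and (City=BO)
df2 = pd.DataFrame({
    "City": ["NY", "BO"], "District": [None, None],
    "Town": ["DC", None], "Country": [None, None],
    "Continent": [None, None],
})

# stack() drops the NaN cells; groupby(level=0) yields one group per
# filter row, and each group becomes a boolean mask over df1's rows
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
     for k, v in df2.stack().groupby(level=0)]

# A row is kept if it satisfies at least one filter row
df = df1[np.logical_or.reduce(m)]
print(df)
```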

CodePudding user response:

Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):

df1.loc[np.sum(np.sum(
        df1.values == df2.values[:, None], axis=2) == 
    np.sum(df2.notna().values, axis=1)[:,None], axis=0) == 1]

Output:

  City District Town Country  Continent
0   NY   WASHIN   DC      US    America
2   NY       NJ   DC      US  S America
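A self-contained sketch of the broadcasting approach, using the question's sample data (the dataframe construction is an assumption added for illustration):

```python
import numpy as np
import pandas as pd

# Raw data (df1) from the question
df1 = pd.DataFrame({
    "City":      ["NY", "CZCH", "NY", "NY"],
    "District":  ["WASHIN", "SEATLLE", "NJ", "WASHIN"],
    "Town":      ["DC", "DC", "DC", "NY"],
    "Country":   ["US", "CZCH", "US", "US"],
    "Continent": ["America", "Europe", "S America", "America"],
})

# Single-row filter table; blank cells are None/NaN
df2 = pd.DataFrame({
    "City": ["NY"], "District": [None], "Town": ["DC"],
    "Country": [None], "Continent": [None],
})

# df1.values == df2.values[:, None] compares every df1 row against every
# df2 row cell by cell (shape: n_filters x n_rows x n_cols).
# Summing over axis=2 counts matching cells per pair; a df1 row satisfies
# a filter row when that count equals the number of non-NaN filter cells.
# The final sum over axis=0 keeps rows matched by exactly one filter row.
out = df1.loc[np.sum(np.sum(
        df1.values == df2.values[:, None], axis=2) ==
    np.sum(df2.notna().values, axis=1)[:, None], axis=0) == 1]
print(out)
```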