Home > other >  Elegant way to query dataframe based on nested OR and nested AND
Elegant way to query dataframe based on nested OR and nested AND

Time:06-09

I have a dataframe like as shown below

ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ASEAN,ABQ,2021,1
5,ASEAN,ABE,2021,2
6,ASEAN,ABQ,2021,3
7,UK,ABW,2021,8
8,UK,ABO,2020,1
9,UK,ABR,2019,0

I would like to do the below

a) Filter the dataframe based on Region = UK and (Supplier = ABW or output >=1 or year = 2021)

b) Filter the dataframe based on Region = ANZ and (Supplier = ABC or output >1 or year = 2021)

c) Filter the dataframe based on Region = ASEAN and (Supplier = ABE or output >1 or year = 2021)

So, I tried the below

df_ANZ = df[(df['Region']=='ANZ') & ((df['Supplier']=='ABC') | (df['output']>1) | (df['year']==2021))]
df_UK = df[(df['Region']=='UK') & ((df['Supplier']=='ABW') | (df['output']>=1) | (df['year']==2021))]
df_ASEAN = df[(df['Region']=='ASEAN') & ((df['Supplier']=='ABE') | (df['output']>1) | (df['year']==2021))]
df_ANZ.append(df_UK).append(df_ASEAN)

But the problem is, I have similar such criteria for around 10 regions. Writing 10 lines for each region may not be elegant.

Is there any efficient and elegant way to do this for a big dataframe with 5 million rows?

I expect my output to be like below

ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ASEAN,ABQ,2021,1
5,ASEAN,ABE,2021,2
6,ASEAN,ABQ,2021,3
7,UK,ABW,2021,8

CodePudding user response:

Create tuples for Region with Supplier, so possible first filter in list comprehension and then join mask by OR in np.logical_or.reduce:

tups = [('ANZ','ABC'),('UK','ABW'),('ASEAN','ABE')]

m = [(df['Region']==a) & ((df['Supplier']==b) | (df['output']>1) | (df['year']==2021)) 
      for a, b in tups]

df = df[np.logical_or.reduce(m)]

print (df)
   ID Region Supplier  year  output
0   1    ANZ       AB  2021       1
1   2    ANZ      ABC  2022       1
2   3    ANZ      ABC  2022       1
3   4  ASEAN      ABQ  2021       1
4   5  ASEAN      ABE  2021       2
5   6  ASEAN      ABQ  2021       3
6   7     UK      ABW  2021       8

CodePudding user response:

Another approach using query (you could create your list of tuple directly as well instead of using zip):

regions = ['ANZ', 'UK', 'ASEAN']
suppliers = ['ABC', 'ABW', 'ABE']
min_outputs = [1, 1, 1]
years = [2021, 2021, 2021]

query = '|'.join([f'(Region=="{reg}" & Supplier=="{sup}" | output > {o} | year=={y})'
                    for reg, sup, o, y in zip(regions, suppliers, min_outputs, years)])

df.query(query)
  • Related