I have a dataframe like as shown below
ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ASEAN,ABQ,2021,1
5,ASEAN,ABE,2021,2
6,ASEAN,ABQ,2021,3
7,UK,ABW,2021,8
8,UK,ABO,2020,1
9,UK,ABR,2019,0
I would like to do the below
a) Filter the dataframe based on Region = UK
and (Supplier = ABW
or output >=1 or year = 2021)
b) Filter the dataframe based on Region = ANZ
and (Supplier = ABC
or output >1 or year = 2021)
c) Filter the dataframe based on Region = ASEAN
and (Supplier = ABE
or output >1 or year = 2021)
So, I tried the below
df_ANZ = df[(df['Region']=='ANZ') & ((df['Supplier']=='ABC') | (df['output']>1) | (df['year']==2021))]
df_UK = df[(df['Region']=='UK') & ((df['Supplier']=='ABW') | (df['output']>=1) | (df['year']==2021))]
df_ASEAN = df[(df['Region']=='ASEAN') & ((df['Supplier']=='ABE') | (df['output']>1) | (df['year']==2021))]
df_ANZ.append(df_UK).append(df_ASEAN)
But the problem is, I have similar such criteria for around 10 regions. Writing 10 lines for each region may not be elegant.
Is there any efficient and elegant way to do this for a big dataframe with 5 million rows?
I expect my output to be like below
ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ASEAN,ABQ,2021,1
5,ASEAN,ABE,2021,2
6,ASEAN,ABQ,2021,3
7,UK,ABW,2021,8
CodePudding user response:
Create tuples for Region
with Supplier
, so possible first filter in list comprehension and then join mask by OR
in np.logical_or.reduce
:
tups = [('ANZ','ABC'),('UK','ABW'),('ASEAN','ABE')]
m = [(df['Region']==a) & ((df['Supplier']==b) | (df['output']>1) | (df['year']==2021))
for a, b in tups]
df = df[np.logical_or.reduce(m)]
print (df)
ID Region Supplier year output
0 1 ANZ AB 2021 1
1 2 ANZ ABC 2022 1
2 3 ANZ ABC 2022 1
3 4 ASEAN ABQ 2021 1
4 5 ASEAN ABE 2021 2
5 6 ASEAN ABQ 2021 3
6 7 UK ABW 2021 8
CodePudding user response:
Another approach using query
(you could create your list of tuple directly as well instead of using zip
):
regions = ['ANZ', 'UK', 'ASEAN']
suppliers = ['ABC', 'ABW', 'ABE']
min_outputs = [1, 1, 1]
years = [2021, 2021, 2021]
query = '|'.join([f'(Region=="{reg}" & Supplier=="{sup}" | output > {o} | year=={y})'
for reg, sup, o, y in zip(regions, suppliers, min_outputs, years)])
df.query(query)