Home > OS >  Pandas groupby, assign and to_excel - on loop/repeat
Pandas groupby, assign and to_excel - on loop/repeat

Time:03-17

I have a dataframe like as shown below

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                    'year':[2017,2017,2018,2019,2018],
                   'customer': rng.choice(list('ACD'),size=(5)),
                   'region': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target_at50': rng.choice([0,1],size=(5)),
                   'target_at60': rng.choice([1,1],size=(5)),
                   'target_at70': rng.choice([0,0],size=(5))})

My objective is to do the below

a) Group columns based on multiple criteria (as shown in below code)

b) Assign a default value based on target column. (ex: if target_at50, then assign value 50, if target_at60, then assign 60. if target_at70, then assign 70)

b) Repeat the same group by criteria for different target column (target_at60, target_at70)

c) write the result of each group statement for each target to a new sheet in the single excel file.

I was trying something like below but it's not efficient or elegant. You can see that am using the same code but repeated thrice for different target columns (target_at50. target_at60, target_at70)

cdf.groupby(['region','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet1')
cdf.groupby(['region','customer','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet2')
cdf.groupby(['region','dumeel','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet3')
cdf.groupby(['region','year','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet4')
cdf.groupby(['region','year','customer','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet5')
cdf.groupby(['region','year','dumeel','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet6')


cdf.groupby(['region','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet7')
cdf.groupby(['region','customer','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet8')
cdf.groupby(['region','dumeel','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet9')
cdf.groupby(['region','year','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet10')
cdf.groupby(['region','year','customer','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet11')
cdf.groupby(['region','year','dumeel','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet12')


cdf.groupby(['region','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet13')
cdf.groupby(['region','customer','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet14')
cdf.groupby(['region','dumeel','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet15')
cdf.groupby(['region','year','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet16')
cdf.groupby(['region','year','customer','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet17')
cdf.groupby(['region','year','dumeel','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet18')

Though all the above group by statements works fine, I would like to know is there any efficient and better approach to perform the above task?

I expect my output to just have an excel with multiple sheets (18 sheets) (which will have the output from 18 groupby statements)

CodePudding user response:

First generate all combinations passed to groupby:

from  itertools import combinations

targets = cdf.filter(like='target').columns
cols = ['customer', 'year', 'dumeel']

tups = [('region', *c, t) for t in targets 
                          for i in range(0, len(cols) 1)
                          for c in combinations(cols, i)]
print (tups)

Ant then write aggregate values in loop:

with pd.ExcelWriter('Values.xlsx') as writer: 
    for i, val in enumerate(tups, 1):
        threshold = re.
        df = (cdf.groupby(val).size().reset_index(name='Count')
                 .sort_values(by=['region','Count'],ascending=False)
                 .assign(threshold=int(re.findall('\d ',val[-1])[0]))

        df.to_excel(writer,sheet_name=f'sheet{i}')
  • Related