Split a dataframe based on values in a column-CodePudding

I want to split a dataframe into quartiles of a specific column.

I have data from 800 companies. One row displays a specific score which ranges from 0 to 100.

I want to split the dataframe in 4 groups (quartiles) with same size (Q1 to Q4, Q4 should contain the companies with the highest scores). So each group should contain 200 companies. How can I divide the companies into 4 equal sized groups according to their score of a specific column (here the last column "ESG Combined Score 2011")? I want to extract the groups to separate sheets in excel (Q1 in a sheet named Q1, Q2 in a sheet named Q2 and so on).

Here is an extract of the data:

 df1

   
    Company Common Name Company Market Capitalization   ESG Combined Score 2011
0   SSR Mining Inc              3.129135e 09             32.817325
1   Fluor Corp                  3.958424e 09             69.467729
2   CBRE Group Inc              2.229251e 10             59.632423
3   Assurant Inc                8.078239e 09             46.492803
4   CME Group Inc               6.269954e 10             42.469682
5   Peabody Energy Corp         3.842130e 09             73.374671

And as an additional question: How can I turn off the scientific notation of the column in the middle? I want it to display with separators.

Thanks for your help

CodePudding user response：

Suppose your dataframe sorted by some values already.

import numpy as np
import pandas as pd

writer = pd.ExcelWriter('splited_df.xlsx', engine='xlsxwriter')

# determine you want to divide into how many parts
n_groups= 4

# make the dataframe slicing list
separator= list(map(int, np.linspace(0,len(df1),n_groups 1)))

for idx in range(len(separator)):
    if idx>=len(separator)-2:
        df1.iloc[separator[idx]:,:].to_excel(writer, sheet_name=f'Sheet{idx 1}')
        break
        
    df1.iloc[separator[idx]:separator[idx 1],:].to_excel(writer, sheet_name=f'Sheet{idx 1}')
    
writer.save()
writer.close()

And to suppress scientific notation, you can refer to this Stackoverflow post

I hope this can satisfy your question.

CodePudding user response：

You will need to sort the dataframe and partition it based on indices corresponding to quantiles:

def partition_quantiles(df, by, quantiles):
    num_samples = len(df)
    df = df.sort_values(by)
    q_idxs = [0, *(int(num_samples * q) for q in quantiles), num_samples   1]

    for q_start, q_end in zip(q_idxs[:-1], q_idxs[1:]):
        yield df.iloc[q_start:q_end]

It will work as follows:

from random import choices
from string import ascii_letters

import numpy as np
import pandas as pd

num_rows = 12
companies = ["".join(choices(ascii_letters, k=10)) for _ in range(num_rows)]
capitalizations = np.random.rand(num_rows) * 1e6
scores = np.random.rand(num_rows) * 1e2

df = pd.DataFrame(
    {
        "company": companies,
        "capital": capitalizations,
        "score": scores,
    }
)

for partition in partition_quantiles(df, "score", [0.25, 0.5, 0.75]):
    print("-" * 40)
    print(partition)

which prints:

----------------------------------------
       company        capital      score
7   QVdnUUiaSV  599523.318607   0.506453
2   CahcnFEMlB  247175.132381  11.201345
10  OpvllkCfWp  203289.934774  36.328395
----------------------------------------
       company        capital      score
6   YzqHvWewOC  774025.801826  49.618631
4   taDrQHvHoB  354491.773921  60.153841
11  JrZmmTvwyD  248947.408524  62.414680
----------------------------------------
      company        capital      score
8  nvkomHSjtP  139345.993291  63.949291
9  soigFZMVjo  666688.879067  64.449568
5  LQSInRRnZd  691896.831968  85.375991
----------------------------------------
      company        capital      score
0  wNMoypFeXN   12712.591339  85.638396
1  XNDqUMDrTb  858545.389446  92.531258
3  okUNZChvsJ  697386.417437  95.398392

CodePudding user response：

You can use numpy's array_split for that:

import numpy as np

dfs = np.array_split(df.sort_values(by=['ESG Combined Score 2011']), 4)
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')

for index, df in enumerate(dfs):
    df.to_excel(writer, sheet_name=f'Sheet{index 1}')

CodePudding user response：

You can use pandas.qcut

Overall solution (edited, to use partial of @RJ Adriaansen solution):

df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')

for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{1}')

Input:

df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'], 'score': [1, 2, 3, 4]})


company score
0   A   1
1   B   2
2   C   3
3   D   4

Script:

df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]

Output:

 company score  categories
0   A     1     low
1   B     2     low-mid
2   C     3     mid-high
3   D     4     high

Then separate to different Excel sheets (edited, to use partial of @RJ Adriaansen solution):

writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
        df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{1}')