I want to split a dataframe into quartiles of a specific column.
I have data from 800 companies. One row displays a specific score which ranges from 0 to 100.
I want to split the dataframe in 4 groups (quartiles) with same size (Q1 to Q4, Q4 should contain the companies with the highest scores). So each group should contain 200 companies. How can I divide the companies into 4 equal sized groups according to their score of a specific column (here the last column "ESG Combined Score 2011")? I want to extract the groups to separate sheets in excel (Q1 in a sheet named Q1, Q2 in a sheet named Q2 and so on).
Here is an extract of the data:
df1
Company Common Name Company Market Capitalization ESG Combined Score 2011
0 SSR Mining Inc 3.129135e 09 32.817325
1 Fluor Corp 3.958424e 09 69.467729
2 CBRE Group Inc 2.229251e 10 59.632423
3 Assurant Inc 8.078239e 09 46.492803
4 CME Group Inc 6.269954e 10 42.469682
5 Peabody Energy Corp 3.842130e 09 73.374671
And as an additional question: How can I turn off the scientific notation of the column in the middle? I want it to display with separators.
Thanks for your help
CodePudding user response:
Suppose your dataframe sorted by some values already.
import numpy as np
import pandas as pd
writer = pd.ExcelWriter('splited_df.xlsx', engine='xlsxwriter')
# determine you want to divide into how many parts
n_groups= 4
# make the dataframe slicing list
separator= list(map(int, np.linspace(0,len(df1),n_groups 1)))
for idx in range(len(separator)):
if idx>=len(separator)-2:
df1.iloc[separator[idx]:,:].to_excel(writer, sheet_name=f'Sheet{idx 1}')
break
df1.iloc[separator[idx]:separator[idx 1],:].to_excel(writer, sheet_name=f'Sheet{idx 1}')
writer.save()
writer.close()
And to suppress scientific notation, you can refer to this Stackoverflow post
I hope this can satisfy your question.
CodePudding user response:
You will need to sort the dataframe and partition it based on indices corresponding to quantiles:
def partition_quantiles(df, by, quantiles):
num_samples = len(df)
df = df.sort_values(by)
q_idxs = [0, *(int(num_samples * q) for q in quantiles), num_samples 1]
for q_start, q_end in zip(q_idxs[:-1], q_idxs[1:]):
yield df.iloc[q_start:q_end]
It will work as follows:
from random import choices
from string import ascii_letters
import numpy as np
import pandas as pd
num_rows = 12
companies = ["".join(choices(ascii_letters, k=10)) for _ in range(num_rows)]
capitalizations = np.random.rand(num_rows) * 1e6
scores = np.random.rand(num_rows) * 1e2
df = pd.DataFrame(
{
"company": companies,
"capital": capitalizations,
"score": scores,
}
)
for partition in partition_quantiles(df, "score", [0.25, 0.5, 0.75]):
print("-" * 40)
print(partition)
which prints:
----------------------------------------
company capital score
7 QVdnUUiaSV 599523.318607 0.506453
2 CahcnFEMlB 247175.132381 11.201345
10 OpvllkCfWp 203289.934774 36.328395
----------------------------------------
company capital score
6 YzqHvWewOC 774025.801826 49.618631
4 taDrQHvHoB 354491.773921 60.153841
11 JrZmmTvwyD 248947.408524 62.414680
----------------------------------------
company capital score
8 nvkomHSjtP 139345.993291 63.949291
9 soigFZMVjo 666688.879067 64.449568
5 LQSInRRnZd 691896.831968 85.375991
----------------------------------------
company capital score
0 wNMoypFeXN 12712.591339 85.638396
1 XNDqUMDrTb 858545.389446 92.531258
3 okUNZChvsJ 697386.417437 95.398392
CodePudding user response:
You can use numpy's array_split
for that:
import numpy as np
dfs = np.array_split(df.sort_values(by=['ESG Combined Score 2011']), 4)
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for index, df in enumerate(dfs):
df.to_excel(writer, sheet_name=f'Sheet{index 1}')
CodePudding user response:
You can use pandas.qcut
Overall solution (edited, to use partial of @RJ Adriaansen solution):
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{1}')
Input:
df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'], 'score': [1, 2, 3, 4]})
company score
0 A 1
1 B 2
2 C 3
3 D 4
Script:
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
Output:
company score categories
0 A 1 low
1 B 2 low-mid
2 C 3 mid-high
3 D 4 high
Then separate to different Excel sheets (edited, to use partial of @RJ Adriaansen solution):
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{1}')