For example, suppose I have a one-column pandas DataFrame:
A 20
B 20
C 15
D 10
E 10
F 8
G 7
H 5
I 5
I want to split the index labels by cumulative share of the total, so that the labels making up the biggest 75%, the next 15%, and the last 10% are:
75%  15%  10%
A    F    H
B    G    I
C
D
E
Is there a pandas function that can produce this summary quickly? Do I need to turn the index into a column? I ask because I got these values from df.value_counts() on my DataFrame df.
CodePudding user response:
The exact input and expected output are not fully clear, but assuming this DataFrame as input:
col
A 20
B 20
C 15
D 10
E 10
F 8
G 7
H 5
I 5
You can get a dictionary of the indices using:
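import numpy as np
import pandas as pd
target = [75, 15, 10]  # bucket sizes in percent, largest values first
# cumsum() works as a percentage here only because the values sum to 100
group = pd.cut(df['col'].cumsum(), bins=np.r_[0, np.cumsum(target)], labels=target)
df.index.groupby(group)  # maps each label in target to its index values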
output: {75: ['A', 'B', 'C', 'D', 'E'], 15: ['F', 'G'], 10: ['H', 'I']}
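If the counts do not happen to sum to 100, the cumulative sum can be converted to a percentage first. The following is a minimal sketch of that variation, not a built-in pandas feature: the helper name split_by_cumulative_share is hypothetical, and it assumes the input is a Series already sorted in descending order, which is what value_counts() returns by default. Because value_counts() already gives a Series with the labels as its index, the index does not need to be turned into a column.

```python
import numpy as np
import pandas as pd


def split_by_cumulative_share(counts, target=(75, 15, 10)):
    """Group index labels by cumulative percentage share.

    Hypothetical helper: `counts` is assumed to be a Series sorted in
    descending order, and `target` is assumed to sum to 100.
    """
    # Cumulative share of the total, in percent (multiply first so that
    # integer counts stay exact before the division).
    pct = counts.cumsum() * 100 / counts.sum()
    # Bin edges 0, 75, 90, 100 for the default target.
    edges = np.r_[0, np.cumsum(target)]
    group = pd.cut(pct, bins=edges, labels=list(target))
    # Collect the index labels that fall into each bucket.
    return {
        label: counts.index[(group == label).to_numpy()].tolist()
        for label in target
    }


# Sample data from the question; in practice this could be the Series
# returned by value_counts().
counts = pd.Series(
    [20, 20, 15, 10, 10, 8, 7, 5, 5],
    index=list("ABCDEFGHI"),
)
result = split_by_cumulative_share(counts)
print(result)
# {75: ['A', 'B', 'C', 'D', 'E'], 15: ['F', 'G'], 10: ['H', 'I']}
```

Scaling every count by the same factor (for example counts * 2) gives the same grouping, since only the relative shares matter.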