Home > Software engineering >  Select the rows with top values until the sum value reach the 30% of total value in Python Pandas
Select the rows with top values until the sum value reach the 30% of total value in Python Pandas

Time:12-30

I'm using Python pandas and have a data frame that is pulled from my CSV file:

ID          Value
123         10
432         14
213         12
'''
214         2
999         43

I was advised using the following code can randomly select some rows with the condition that the sum of the selected values = 30% of the total value. ("close to 30%" works)

out = df.sample(frac=1).loc[lambda d: d['Value'].cumsum().le(d['Value'].sum()*0.3)]

Now I want to sort the rows based on the value and select the top rolls until they add up to 30% of the total value.

Please advise.

CodePudding user response:

You can use pandas.DataFrame.sort_values and then find the mask that are True and select rows from df_sorted with pandas.DataFrame.loc.

df_sorted = df.sort_values('Value').reset_index(drop=True)
m = df_sorted['Value'].cumsum().le(df['Value'].sum()*0.3)
print(df_sorted.loc[m])

Output:

    ID  Value
0  214      2
1  123     10
2  213     12

CodePudding user response:

A possible solution:

df.iloc[
    df['Value']
    .sort_values()
    .cumsum()[lambda x: x <= 0.3 * df['Value'].sum()]
    .index]

Output:

    ID  Value
3  214      2
0  123     10
2  213     12
  • Related