I have a dataframe:
state city score
CA San Francisco 80
CA San Francisco 90
...
NC Raleigh 44
NY New York City 22
I want to do a groupby.head(), but instead of an integer count, I want to keep the top 80% of rows, sorted by score, within each state-city combo.
So if CA, San Francisco has 100 rows and NC, Raleigh has 20 rows, the final dataframe would have the 80 highest-score rows for CA, San Francisco, and the 16 highest-score rows for NC, Raleigh.
So the final result code might look something like:
df.sort_values('score', ascending=False).groupby(['State', 'City']).head(80%)
Thanks!
CodePudding user response:
from io import StringIO
import pandas as pd
# sample data
s = """state,city,score
CA,San Francisco,80
CA,San Francisco,90
CA,San Francisco,30
CA,San Francisco,10
CA,San Francisco,70
CA,San Francisco,60
CA,San Francisco,50
CA,San Francisco,40
NC,Raleigh,44
NC,Raleigh,54
NC,Raleigh,64
NC,Raleigh,14
NY,New York City,22
NY,New York City,12
NY,New York City,32
NY,New York City,42
NY,New York City,52"""
df = pd.read_csv(StringIO(s))
sample = .8 # 80%
# sort the values and create a groupby object
g = df.sort_values('score', ascending=False).groupby(['state', 'city'])
# use a list comprehension to iterate over each group:
# for each group, compute 80% of its length (len(data) * 0.8)
# and truncate with int to round down to a whole number
new_df = pd.concat([data.head(int(len(data)*sample)) for _,data in g])
state city score
1 CA San Francisco 90
0 CA San Francisco 80
4 CA San Francisco 70
5 CA San Francisco 60
6 CA San Francisco 50
7 CA San Francisco 40
10 NC Raleigh 64
9 NC Raleigh 54
8 NC Raleigh 44
16 NY New York City 52
15 NY New York City 42
14 NY New York City 32
12 NY New York City 22
CodePudding user response:
Use nlargest
and compute the number of rows to keep per group from its length, i.e. int(0.8 * len(group)). Note the column names are lowercase in the sample data:
res = (
    df.groupby(['state', 'city'], group_keys=False)
    .apply(lambda g: g.nlargest(int(0.8 * len(g)), 'score'))
)
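A third option, not shown in the answers above, avoids apply entirely: groupby.rank with pct=True assigns each row its within-group rank as a fraction of the group size, so a boolean mask pct <= 0.8 keeps the top 80% (rounded down, matching the int() truncation used above). A minimal sketch on a small made-up sample:

```python
import pandas as pd

# hypothetical sample data in the same shape as the question
df = pd.DataFrame({
    'state': ['CA'] * 5 + ['NC'] * 5,
    'city': ['San Francisco'] * 5 + ['Raleigh'] * 5,
    'score': [80, 90, 30, 10, 70, 44, 54, 64, 14, 24],
})

# rank scores within each (state, city) group, highest first,
# expressed as a fraction of the group size (1/n, 2/n, ..., 1.0)
pct = (
    df.groupby(['state', 'city'])['score']
      .rank(method='first', ascending=False, pct=True)
)

# keep rows in the top 80% of their group, then sort for display
top = (
    df[pct <= 0.8]
      .sort_values(['state', 'city', 'score'],
                   ascending=[True, True, False])
)
```

Because ranks are integers divided by the group size, rank/n <= 0.8 keeps exactly int(0.8 * len(group)) rows per group, so the result matches the head/nlargest approaches while staying fully vectorized.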