I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums the value counts of the n=3 last rows up, deletes them and then adds them as one row to the original value counts. Also it would be nice to have everything as percents. Any ideas how to implement this? Cheers!
CodePudding user response:
I would use tail(-3)
to get the last values except for the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
CodePudding user response:
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need. Also in a nice format;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%
CodePudding user response:
For DataFrame with percentages use Series.iloc
with indexing, crate DataFrame by Series.to_frame
, add new row and new column filled by percentages:
s = df['category'].value_counts()
n= 3
out = s.iloc[:-n].to_frame('count')
out.loc['Other (3)'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%