Pandas sum last n rows of df.count() results into one row-CodePudding

I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:

>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})

>> df['category'].value_counts()
z    4
x    4
y    3
u    2
v    1
w    1

>> ??
           count  pct
z              4  27%
x              4  27%
y              3  20%
Other (3)      4  27%

The result sums the value counts of the n=3 last rows up, deletes them and then adds them as one row to the original value counts. Also it would be nice to have everything as percents. Any ideas how to implement this? Cheers!

CodePudding user response：

I would use tail(-3) to get the last values except for the first 3:

counts = df['category'].value_counts()
others = counts.tail(-3)

counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)

counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())

Output:

            count   pct
z               4  27.0
x               4  27.0
y               3  20.0
Others (3)      4  27.0

CodePudding user response：

This snippet

df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)

will give you what you need. Also in a nice format;)

           Count Percent
z              4  26.67%
x              4  26.67%
y              3  20.00%
Other (3)      4  26.67%

CodePudding user response：

For DataFrame with percentages use Series.iloc with indexing, crate DataFrame by Series.to_frame, add new row and new column filled by percentages:

s = df['category'].value_counts()

n= 3
out = s.iloc[:-n].to_frame('count')
out.loc['Other (3)'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
           count  pct
z              4  27%
x              4  27%
y              3  20%
Other (3)      4  27%