Home > Net >  Obtain the percentage of groupby() count(), save it and plot it
Obtain the percentage of groupby() count(), save it and plot it

Time:01-10

I have a dataframe like the following,

    Time                    Value   Cluster
1   2020-08-11 06:09:59     0       0
2   2020-08-11 06:14:59     0       12
3   2020-08-11 06:19:59     1       103
4   2020-08-11 06:24:59     0       0
5   2020-08-11 06:29:59     0       12

Basically there are 3 columns, the first one is date and time (Time), the second one refers to the group to which the data belongs (Cluster) and the third one (Value) is the result (in binary).

Therefore, what I would like is to be able to make a statistic according to the group that I am in this instant, the probability that the next instant the result is 1 or 0.

I have thought of the following,

  1. Move the result column (Value) one position (one instant of time).
  2. Group the data and count them.

This is the code I have made, for a single cluster,

test_df[(test_df['Cluster'] == 2)].groupby('ValueT1')['Cluster'].count()

With this out:

ValueT1
0    7406
1    7787
Name: Cluster, dtype: int64

My question is, how do you get the percentage from here? Could you make a double bar chart (one next to the other with the set of clusters)?

For the above example I would like to have as a result something similar to the following,

ValueT1
0 48.7%
1 51.2%
Name: Cluster, dtype: int64

And in addition, I would like to have a graph that gathers these percentages for each of the clusters.

Thank you very much.

CodePudding user response:

IIUC, you want to know the percentage of 0 and 1 by cluster. You can use groupby / value_counts to do that.

Setup a MRE

import pandas as pd
import numpy as np

rng = np.random.default_rng(2023)
df = pd.DataFrame({'Value': rng.choice([0, 1], 1000),
                   'Cluster': rng.choice(range(12), 1000)})

Output:

stats = df.groupby('Cluster')['Value'].value_counts(normalize=True).mul(100)
print(stats)

# Output
Cluster  Value
0        1        51.219512
         0        48.780488
1        0        54.545455
         1        45.454545
2        0        54.320988
         1        45.679012
3        1        53.658537
         0        46.341463
4        0        57.647059
         1        42.352941
5        0        51.428571
         1        48.571429
6        0        51.086957
         1        48.913043
7        1        55.128205
         0        44.871795
8        0        50.632911
         1        49.367089
9        1        51.485149
         0        48.514851
10       0        51.250000
         1        48.750000
11       0        58.536585
         1        41.463415
Name: Value, dtype: float64

Update:

import plotly.graph_objects as go

stats = stats.rename('PCT').reset_index()

f1 = stats[stats['Value'] == 0]
f2 = stats[stats['Value'] == 1]
clusters = stats['Cluster'].unique()

fig = go.Figure()
fig.add_trace(go.Bar(x=clusters, y=f1['PCT'], name='0'))
fig.add_trace(go.Bar(x=clusters, y=f2['PCT'], name='1'))
fig.show()

enter image description here

  • Related