Home > Software engineering >  How do I plot the frequency of an event overtime with pandas?I
How do I plot the frequency of an event overtime with pandas?I

Time:10-21

I was trying to plot some data from a pandas dataframe. My table contains 10000ish films and for each of them two info: the year it got published, and a rating from 0 to 3. I am having a hard time trying to plot a graph with the pandas library that shows the number of films that received a particular rating (3 in my case) every year.

I have tried to use .value_counts(), but it didn’t work as i hoped, since I can’t isolate a single value, maintaining the rating linked to its year.

I really hoped i decently explained my problem, since it is the first time i ask help on stack overflow.

This is the code i used to get my dataframe, if it is useful in any way.

import json
import requests
import pandas as pd
import numpy as np

request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json()
data = pd.DataFrame(request)

P.S. Thank you for the precious help!

CodePudding user response:

You can filter by rating and use Series.value_counts:

s = data.loc[data['rating'].eq(3), 'year'].value_counts()

But there is many years of films:

print (len(s))
108

So for plot I filter only counts greatwer like 30, it is 40 years here:

print (s.gt(30).sum())
40

So filter again and plot:

s[s.gt(30)].plot.bar()

EDIT: Solution with percentages:

s=data.loc[data['rating'].eq(3),'year'].value_counts(normalize=True).sort_index().mul(100)
print (s)  
1899    0.018218
1910    0.018218
1916    0.054655
1917    0.054655
1918    0.054655
  
2018    3.169976
2019    3.188195
2020    2.040445
2021    1.840044
2022    0.765167
Name: year, Length: 108, dtype: float64


print (s[s.gt(3)])  
2007    3.042449
2009    3.588996
2010    3.825833
2011    4.299508
2012    4.153762
2013    4.937147
2014    4.335945
2015    3.771179
2016    3.752960
2017    3.388595
2018    3.169976
2019    3.188195
Name: year, dtype: float64

s[s.gt(3)].plot.bar()

EDIT1: Here is solution for count years vs ratings:

df = pd.crosstab(data['year'], data.rating)
print (df)

rating   0   1   2    3
year                   
1874     1   0   0    0
1877     1   0   0    0
1878     2   0   0    0
1881     1   0   0    0
1883     1   0   0    0
    ..  ..  ..  ...
2018    19  44  24  174
2019    16  47  18  175
2020    10  17  11  112
2021    11  22  13  101
2022     3  14   5   42

[141 rows x 4 columns]

EDIT2:

df = pd.crosstab(data['year'], data.rating, normalize='index').mul(100)
print (df)
rating           0          1         2          3
year                                              
1874    100.000000   0.000000  0.000000   0.000000
1877    100.000000   0.000000  0.000000   0.000000
1878    100.000000   0.000000  0.000000   0.000000
1881    100.000000   0.000000  0.000000   0.000000
1883    100.000000   0.000000  0.000000   0.000000
           ...        ...       ...        ...
2018      7.279693  16.858238  9.195402  66.666667
2019      6.250000  18.359375  7.031250  68.359375
2020      6.666667  11.333333  7.333333  74.666667
2021      7.482993  14.965986  8.843537  68.707483
2022      4.687500  21.875000  7.812500  65.625000

[141 rows x 4 columns]

There is alot of values, here is e.g. filter for column 3 for more like 60% values:

print (df[3].gt(60).sum())
26

df[df[3].gt(60)].plot.bar()
  • Related