I was trying to plot some data from a pandas dataframe. My table contains 10000ish films and for each of them two info: the year it got published, and a rating from 0 to 3. I am having a hard time trying to plot a graph with the pandas library that shows the number of films that received a particular rating (3 in my case) every year.
I have tried to use .value_counts(), but it didn’t work as i hoped, since I can’t isolate a single value, maintaining the rating linked to its year.
I really hoped i decently explained my problem, since it is the first time i ask help on stack overflow.
This is the code i used to get my dataframe, if it is useful in any way.
import json
import requests
import pandas as pd
import numpy as np
request = requests.get("http://bechdeltest.com/api/v1/getAllMovies").json()
data = pd.DataFrame(request)
P.S. Thank you for the precious help!
CodePudding user response:
You can filter by rating
and use Series.value_counts
:
s = data.loc[data['rating'].eq(3), 'year'].value_counts()
But there is many years of films:
print (len(s))
108
So for plot I filter only counts greatwer like 30, it is 40
years here:
print (s.gt(30).sum())
40
So filter again and plot:
s[s.gt(30)].plot.bar()
EDIT: Solution with percentages:
s=data.loc[data['rating'].eq(3),'year'].value_counts(normalize=True).sort_index().mul(100)
print (s)
1899 0.018218
1910 0.018218
1916 0.054655
1917 0.054655
1918 0.054655
2018 3.169976
2019 3.188195
2020 2.040445
2021 1.840044
2022 0.765167
Name: year, Length: 108, dtype: float64
print (s[s.gt(3)])
2007 3.042449
2009 3.588996
2010 3.825833
2011 4.299508
2012 4.153762
2013 4.937147
2014 4.335945
2015 3.771179
2016 3.752960
2017 3.388595
2018 3.169976
2019 3.188195
Name: year, dtype: float64
s[s.gt(3)].plot.bar()
EDIT1: Here is solution for count years
vs ratings
:
df = pd.crosstab(data['year'], data.rating)
print (df)
rating 0 1 2 3
year
1874 1 0 0 0
1877 1 0 0 0
1878 2 0 0 0
1881 1 0 0 0
1883 1 0 0 0
.. .. .. ...
2018 19 44 24 174
2019 16 47 18 175
2020 10 17 11 112
2021 11 22 13 101
2022 3 14 5 42
[141 rows x 4 columns]
EDIT2:
df = pd.crosstab(data['year'], data.rating, normalize='index').mul(100)
print (df)
rating 0 1 2 3
year
1874 100.000000 0.000000 0.000000 0.000000
1877 100.000000 0.000000 0.000000 0.000000
1878 100.000000 0.000000 0.000000 0.000000
1881 100.000000 0.000000 0.000000 0.000000
1883 100.000000 0.000000 0.000000 0.000000
... ... ... ...
2018 7.279693 16.858238 9.195402 66.666667
2019 6.250000 18.359375 7.031250 68.359375
2020 6.666667 11.333333 7.333333 74.666667
2021 7.482993 14.965986 8.843537 68.707483
2022 4.687500 21.875000 7.812500 65.625000
[141 rows x 4 columns]
There is alot of values, here is e.g. filter for column 3
for more like 60% values:
print (df[3].gt(60).sum())
26
df[df[3].gt(60)].plot.bar()