My code starts this way: it takes data from HERE and I want to extract al the rows that contain "fascia_anagrafica" equal to "20-29". In italian "fascia_anagrafica" means "age range". That was relatively simple, as you see below, and I dropped some unimportant values.
import pandas as pd
import json
import numpy
import sympy
from numpy import arange,exp
from scipy.optimize import curve_fit
from matplotlib import pyplot
import math
import decimal
df = pd.read_csv('https://raw.githubusercontent.com/italia/covid19-opendata-
vaccini/master/dati/somministrazioni-vaccini-latest.csv')
df = df[df["fascia_anagrafica"] == "20-29"]
df01=df.drop(columns= ["fornitore","area","sesso_maschile","sesso_femminile","seconda_dose","pregressa_infezione","dose_aggiuntiva","codice_NUTS1","codice_NUTS2","codice_regione_ISTAT","nome_area"])
now dataframe looks like this:IMAGE
as you see, for every date there is the "20-29 age range" and for every line you may find the value "prima_dose" which stands for "first_dose".
Now the problem: If you take into consideration the date "2020-12-27" you will notice that it is repeated about 20 times (with 20 different values) since in italy there are 21 regions, then the same applies for the other dates. Unfortunately they are not always 21, because in certain regions they didn't put any values in some days so the dataframe is NOT periodic.
I want to add a column in the dataframe that makes a sum of the values that has same date fo all dates in the dataframe. An exaple here:
Date.................prima_dose...........sum_column
2020-8-9.............. 1.......................13 <----this is (1 3 4 5 in the day 2020-8-9)
2020-8-9..............3........................8 <----this is (2 5 1 in the day 2020-8-10)
2020-8-9.............. 4..............and so on...
2020-8-9.............. 5
2020-8-10.............. 2
2020-8-10.............. 5
2020-8-10.............. 1
thanks!
CodePudding user response:
If you just want to sum all the values of 'prima_dose' for each date and get the result in a new dataframe, you could use groupby.sum()
:
result = df01.groupby('data_somministrazione')['prima_dose'].sum().reset_index()
Prints:
>>> result
data_somministrazione prima_dose
0 2020-12-27 700
1 2020-12-28 171
2 2020-12-29 87
3 2020-12-30 486
4 2020-12-31 2425
.. ... ...
289 2021-10-12 11583
290 2021-10-13 12532
291 2021-10-14 15347
292 2021-10-15 13689
293 2021-10-16 9293
[294 rows x 2 columns]
This will change the structure of your current dataframe, and return a unique row per date
If you want to add a new column in your existing dataframe without altering it's structure, you should use groupby.transform()
:
df01['prima_dose_per_date'] = df01.groupby('data_somministrazione')['prima_dose'].transform('sum')
Prints:
>>> df01
data_somministrazione fascia_anagrafica prima_dose prima_dose_per_date
0 2020-12-27 20-29 2 700
7 2020-12-27 20-29 9 700
12 2020-12-27 20-29 60 700
17 2020-12-27 20-29 59 700
23 2020-12-27 20-29 139 700
... ... ... ...
138475 2021-10-16 20-29 533 9293
138484 2021-10-16 20-29 112 9293
138493 2021-10-16 20-29 0 9293
138502 2021-10-16 20-29 529 9293
138515 2021-10-16 20-29 0 9293
[15595 rows x 4 columns]
This will keep the current structure of your dataframe and return a new column with the sum of prima_dose per each date.