SUM in dataframe of rows that has the same date and ADD new column-CodePudding

My code starts this way: it takes data from HERE and I want to extract al the rows that contain "fascia_anagrafica" equal to "20-29". In italian "fascia_anagrafica" means "age range". That was relatively simple, as you see below, and I dropped some unimportant values.

import pandas as pd
import json
import numpy
import sympy
from numpy import arange,exp
from scipy.optimize import curve_fit
from matplotlib import pyplot
import math
import decimal

df = pd.read_csv('https://raw.githubusercontent.com/italia/covid19-opendata- 
vaccini/master/dati/somministrazioni-vaccini-latest.csv') 

df = df[df["fascia_anagrafica"] == "20-29"]

df01=df.drop(columns= ["fornitore","area","sesso_maschile","sesso_femminile","seconda_dose","pregressa_infezione","dose_aggiuntiva","codice_NUTS1","codice_NUTS2","codice_regione_ISTAT","nome_area"])

now dataframe looks like this:IMAGE

as you see, for every date there is the "20-29 age range" and for every line you may find the value "prima_dose" which stands for "first_dose".

Now the problem: If you take into consideration the date "2020-12-27" you will notice that it is repeated about 20 times (with 20 different values) since in italy there are 21 regions, then the same applies for the other dates. Unfortunately they are not always 21, because in certain regions they didn't put any values in some days so the dataframe is NOT periodic.

I want to add a column in the dataframe that makes a sum of the values that has same date fo all dates in the dataframe. An exaple here:

Date.................prima_dose...........sum_column

2020-8-9.............. 1.......................13 <----this is (1 3 4 5 in the day 2020-8-9)

2020-8-9..............3........................8 <----this is (2 5 1 in the day 2020-8-10)

2020-8-9.............. 4..............and so on...

2020-8-9.............. 5

2020-8-10.............. 2

2020-8-10.............. 5

2020-8-10.............. 1

thanks!

CodePudding user response：

If you just want to sum all the values of 'prima_dose' for each date and get the result in a new dataframe, you could use groupby.sum():

result = df01.groupby('data_somministrazione')['prima_dose'].sum().reset_index()

Prints:

>>>  result

    data_somministrazione  prima_dose
0              2020-12-27         700
1              2020-12-28         171
2              2020-12-29          87
3              2020-12-30         486
4              2020-12-31        2425
..                    ...         ...
289            2021-10-12       11583
290            2021-10-13       12532
291            2021-10-14       15347
292            2021-10-15       13689
293            2021-10-16        9293

[294 rows x 2 columns]

This will change the structure of your current dataframe, and return a unique row per date

If you want to add a new column in your existing dataframe without altering it's structure, you should use groupby.transform():

df01['prima_dose_per_date'] = df01.groupby('data_somministrazione')['prima_dose'].transform('sum')

Prints:

>>> df01

       data_somministrazione fascia_anagrafica  prima_dose  prima_dose_per_date
0                 2020-12-27             20-29           2                  700
7                 2020-12-27             20-29           9                  700
12                2020-12-27             20-29          60                  700
17                2020-12-27             20-29          59                  700
23                2020-12-27             20-29         139                  700
                     ...               ...         ...                  ...
138475            2021-10-16             20-29         533                 9293
138484            2021-10-16             20-29         112                 9293
138493            2021-10-16             20-29           0                 9293
138502            2021-10-16             20-29         529                 9293
138515            2021-10-16             20-29           0                 9293

[15595 rows x 4 columns]

This will keep the current structure of your dataframe and return a new column with the sum of prima_dose per each date.