How to convert csv file into Python dictionary and sum/operate values based on certain conditions?-CodePudding

I have a huge csv file containing information about COVID-19 cases and deaths for every single county in the United States.

To give you a general idea of the information contained in this file, here are the first 10 lines of it:

date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alabama,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alabama,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alabama,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alabama,01017,3460,113,1720,73,1740,40

Anyways, I want to create a Python dictionary for this data with each of the keys being a tuple consisting of the state and county names and the each of the values being a list of integers: the first int representing the number of confirmed cases and the second one representing the number of confirmed deaths.

Basically, I want output like this:

dic = {(“state”,“county”):[confirmed_cases, confirmed_deaths]}

How would I generate a python dictionary like the one above for all the counties in the csv file? Please use csv.reader.

Additionally, I need to find the sum of all the confirmed deaths for a particular state. How would I, for example, sum up the values in 'confirmed deaths' for all of the rows where 'state' is 'Alabama'?

I also want to find number of counties in the entire file where the number of confirmed cases is between certain intervals. How would I go about doing that in Python?

Additionally, I have another csv file with all the population values for each state for the past couple of years. To get an idea of the info this file contains, here are the first 10 lines:

table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts),,,,,,,,,,,,

.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749

The most recent population estimate is represented by the last value for each state. For example, for Alabama, '4903185' is the most recent population estimate.

I want to create a python dictionary with the keys being strings representing each state name and the values being an integer representing only the last number for each state in the csv file.

Basically, {“state”:population”}

And to add on to that, I would like to calculate the death density rate for each state by dividing the number of confirmed deaths in each state and dividing them by the most recent population figure for each state. How would I program that?

And finally, I want to calculate the us death density rate (entire country) by adding all of the values in the confirmed deaths column and dividing that sum by the sum of all of the values in the latest population column. How would I do that?

Thank you!

CodePudding user response：

I highly recommend checking out pandas. It is a wonderful library based all around tabular data. There are specific commands in it to read a csv, then send it to json, making it easily manipulatable in Python.

It would go something like:

import pandas as pd

df = pd.read_csv("filename.csv")
json_variable = df.to_json()
print(json_variable)

Of course, you can format this with pretty print, but if you already have the CSV, all you should need to do now is access that "json_variable".

CodePudding user response：

Better use pandas.DataFrame instead of dictionary - it will be more useful.

I use io.StringIO only to simulate file in memory.

I changed few states in example data to have more then one state

Using groupby(['state', 'county']) I could get expected values and create dictionary but it is more useful to use directly values.

text = '''date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alaska,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alaska,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alaska,01017,3460,113,1720,73,1740,40'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text))
#df = pd.read_csv("filename.csv")

for name, group in df.groupby(['state', 'county']):
    print('\n---', name, '---\n')
    #print(group)
    print('confirmed_cases :', group['confirmed_cases'].values[-1])
    print('confirmed_deaths:', group['confirmed_deaths'].values[-1])

Result:

--- ('Alabama', 'Autauga') ---

confirmed_cases: 5557
confirmed_deaths: 85

--- ('Alabama', 'Baldwin') ---

confirmed_cases: 14329
confirmed_deaths: 220

--- ('Alabama', 'Barbour') ---

confirmed_cases: 1225
confirmed_deaths: 37

--- ('Alabama', 'Bibb') ---

confirmed_cases: 2031
confirmed_deaths: 35

--- ('Alabama', 'Blount') ---

confirmed_cases: 4901
confirmed_deaths: 109

--- ('Alaska', 'Bullock') ---

confirmed_cases: 1059
confirmed_deaths: 29

--- ('Alaska', 'Butler') ---

confirmed_cases: 1888
confirmed_deaths: 60

--- ('Alaska', 'Calhoun') ---

confirmed_cases: 10608
confirmed_deaths: 242

--- ('Alaska', 'Chambers') ---

confirmed_cases: 1720
confirmed_deaths: 73

The same way you can groupby('state') and calculate sum() of confirmed_deaths

for name, group in df.groupby('state'):
    print('\n---', name, '---\n')
    print('confirmed_deaths:', group['confirmed_deaths'].sum())

Result:

--- Alabama ---

confirmed_deaths: 486

--- Alaska ---

confirmed_deaths: 404

Or shorter

df_state = df.groupby('state').sum()
print(df_state)

And you have new DataFrame

         fips  cases  deaths  ...  confirmed_deaths  probable_cases  probable_deaths
state                         ...                                                   
Alabama  5025  37828     632  ...               486            9785              146
Alaska   4056  20859     519  ...               404            5584              115

which uses state as row's index so you could use it to join with other CSV.

text2 = '''.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
'''

import pandas as pd
import io

df2 = pd.read_csv(io.StringIO(text2), header=None)

# convert first column into index - and remove `.`
df2.index = df2[0].str.replace('.', '')

# get name of last column
last_column = df2.columns[-1]

print(df2[[last_column]])

Result:

                            12
0                             
Alabama                4903185
Alaska                  731545
Arizona                7278717
Arkansas               3017804
California            39512223
Colorado               5758736
Connecticut            3565287
Delaware                973764
District of Columbia    705749

And join both

df3 = df_state.join(df2[[last_column]])

print(df3)

to get

         fips  cases  deaths  ...  probable_cases  probable_deaths       12
state                         ...                                          
Alabama  5025  37828     632  ...            9785              146  4903185
Alaska   4056  20859     519  ...            5584              115   731545

And then I can calculate density for every state

density = df3['confirmed_deaths']/df3.iloc[:,-1]

print('density:', density)

Result:

Alabama    0.000099
Alaska     0.000552

And the same for all country

density = df3['confirmed_deaths'].sum() / df3.iloc[:,-1].sum()

print('density:', density)

Result:

density: 0.00015794900554241286

With pandas.DataFrame I could do this without for-loops.

EDIT:

Full example with small changes

text1 = '''date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alaska,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alaska,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alaska,01017,3460,113,1720,73,1740,40'''

text2 = '''.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text1))

# ---

for name, group in df.groupby(['state', 'county']):
    print('\n---', name, '---\n')
    #print(group)
    print('confirmed_cases:', group['confirmed_cases'].values[0])
    print('confirmed_deaths:', group['confirmed_deaths'].values[0])

for name, group in df.groupby('state'):
    print('\n---', name, '---\n')
    print('confirmed_deaths:', group['confirmed_deaths'].sum())

print('\n--- df_state ---\n')
df_state = df.groupby('state').sum()
print(df_state)

# ---

df_population = pd.read_csv(io.StringIO(text2), header=None)

# ---

df_population.index = df_population[0].str.replace('.', '', regex=False)
# get last column
last_column = df_population.columns[-1]
# rename last column
df_population = df_population.rename(columns={last_column: 'last_population'})

# ---

print('\n--- df_population ---\n')
print(df_population['last_population'])

df_joined = df_state.join(df_population['last_population'])
print('\n--- df_joined ---\n')
print(df_joined)

density_states = df_joined['confirmed_deaths']/df_joined['last_population']
print('\n--- density states ---\n')
print(density_states)

density_country = df_joined['confirmed_deaths'].sum() / df_joined['last_population'].sum()
print('\n--- density country ---\n')
print(density_country)