I have a huge csv file containing information about COVID-19 cases and deaths for every single county in the United States.
To give you a general idea of the information contained in this file, here are the first 10 lines of it:
date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alabama,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alabama,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alabama,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alabama,01017,3460,113,1720,73,1740,40
Anyways, I want to create a Python dictionary for this data with each of the keys being a tuple consisting of the state and county names and the each of the values being a list of integers: the first int representing the number of confirmed cases and the second one representing the number of confirmed deaths.
Basically, I want output like this:
dic = {(“state”,“county”):[confirmed_cases, confirmed_deaths]}
How would I generate a python dictionary like the one above for all the counties in the csv file? Please use csv.reader.
Additionally, I need to find the sum of all the confirmed deaths for a particular state. How would I, for example, sum up the values in 'confirmed deaths' for all of the rows where 'state' is 'Alabama'?
I also want to find number of counties in the entire file where the number of confirmed cases is between certain intervals. How would I go about doing that in Python?
Additionally, I have another csv file with all the population values for each state for the past couple of years. To get an idea of the info this file contains, here are the first 10 lines:
table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts),,,,,,,,,,,,
.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
The most recent population estimate is represented by the last value for each state. For example, for Alabama, '4903185' is the most recent population estimate.
I want to create a python dictionary with the keys being strings representing each state name and the values being an integer representing only the last number for each state in the csv file.
Basically, {“state”:population”}
And to add on to that, I would like to calculate the death density rate for each state by dividing the number of confirmed deaths in each state and dividing them by the most recent population figure for each state. How would I program that?
And finally, I want to calculate the us death density rate (entire country) by adding all of the values in the confirmed deaths column and dividing that sum by the sum of all of the values in the latest population column. How would I do that?
Thank you!
CodePudding user response:
I highly recommend checking out pandas. It is a wonderful library based all around tabular data. There are specific commands in it to read a csv, then send it to json, making it easily manipulatable in Python.
It would go something like:
import pandas as pd
df = pd.read_csv("filename.csv")
json_variable = df.to_json()
print(json_variable)
Of course, you can format this with pretty print, but if you already have the CSV, all you should need to do now is access that "json_variable".
CodePudding user response:
Better use pandas.DataFrame
instead of dictionary
- it will be more useful.
I use io.StringIO
only to simulate file in memory.
I changed few states in example data to have more then one state
Using groupby(['state', 'county'])
I could get expected values and create dictionary but it is more useful to use directly values.
text = '''date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alaska,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alaska,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alaska,01017,3460,113,1720,73,1740,40'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.read_csv("filename.csv")
for name, group in df.groupby(['state', 'county']):
print('\n---', name, '---\n')
#print(group)
print('confirmed_cases :', group['confirmed_cases'].values[-1])
print('confirmed_deaths:', group['confirmed_deaths'].values[-1])
Result:
--- ('Alabama', 'Autauga') ---
confirmed_cases: 5557
confirmed_deaths: 85
--- ('Alabama', 'Baldwin') ---
confirmed_cases: 14329
confirmed_deaths: 220
--- ('Alabama', 'Barbour') ---
confirmed_cases: 1225
confirmed_deaths: 37
--- ('Alabama', 'Bibb') ---
confirmed_cases: 2031
confirmed_deaths: 35
--- ('Alabama', 'Blount') ---
confirmed_cases: 4901
confirmed_deaths: 109
--- ('Alaska', 'Bullock') ---
confirmed_cases: 1059
confirmed_deaths: 29
--- ('Alaska', 'Butler') ---
confirmed_cases: 1888
confirmed_deaths: 60
--- ('Alaska', 'Calhoun') ---
confirmed_cases: 10608
confirmed_deaths: 242
--- ('Alaska', 'Chambers') ---
confirmed_cases: 1720
confirmed_deaths: 73
The same way you can groupby('state')
and calculate sum()
of confirmed_deaths
for name, group in df.groupby('state'):
print('\n---', name, '---\n')
print('confirmed_deaths:', group['confirmed_deaths'].sum())
Result:
--- Alabama ---
confirmed_deaths: 486
--- Alaska ---
confirmed_deaths: 404
Or shorter
df_state = df.groupby('state').sum()
print(df_state)
And you have new DataFrame
fips cases deaths ... confirmed_deaths probable_cases probable_deaths
state ...
Alabama 5025 37828 632 ... 486 9785 146
Alaska 4056 20859 519 ... 404 5584 115
which uses state
as row's index so you could use it to join with other CSV
.
text2 = '''.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
'''
import pandas as pd
import io
df2 = pd.read_csv(io.StringIO(text2), header=None)
# convert first column into index - and remove `.`
df2.index = df2[0].str.replace('.', '')
# get name of last column
last_column = df2.columns[-1]
print(df2[[last_column]])
Result:
12
0
Alabama 4903185
Alaska 731545
Arizona 7278717
Arkansas 3017804
California 39512223
Colorado 5758736
Connecticut 3565287
Delaware 973764
District of Columbia 705749
And join both
df3 = df_state.join(df2[[last_column]])
print(df3)
to get
fips cases deaths ... probable_cases probable_deaths 12
state ...
Alabama 5025 37828 632 ... 9785 146 4903185
Alaska 4056 20859 519 ... 5584 115 731545
And then I can calculate density
for every state
density = df3['confirmed_deaths']/df3.iloc[:,-1]
print('density:', density)
Result:
Alabama 0.000099
Alaska 0.000552
And the same for all country
density = df3['confirmed_deaths'].sum() / df3.iloc[:,-1].sum()
print('density:', density)
Result:
density: 0.00015794900554241286
With pandas.DataFrame
I could do this without for
-loops.
EDIT:
Full example with small changes
text1 = '''date,county,state,fips,cases,deaths,confirmed_cases,confirmed_deaths,probable_cases,probable_deaths
2021-03-18,Autauga,Alabama,01001,6483,96,5557,85,926,11
2021-03-18,Baldwin,Alabama,01003,20263,295,14329,220,5934,75
2021-03-18,Barbour,Alabama,01005,2199,54,1225,37,974,17
2021-03-18,Bibb,Alabama,01007,2512,58,2031,35,481,23
2021-03-18,Blount,Alabama,01009,6371,129,4901,109,1470,20
2021-03-18,Bullock,Alaska,01011,1193,39,1059,29,134,10
2021-03-18,Butler,Alaska,01013,2069,66,1888,60,181,6
2021-03-18,Calhoun,Alaska,01015,14137,301,10608,242,3529,59
2021-03-18,Chambers,Alaska,01017,3460,113,1720,73,1740,40'''
text2 = '''.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
.Colorado,5029196,5029319,5047349,5121108,5192647,5269035,5350101,5450623,5539215,5611885,5691287,5758736
.Connecticut,3574097,3574147,3579114,3588283,3594547,3594841,3594524,3587122,3578141,3573297,3571520,3565287
.Delaware,897934,897937,899593,907381,915179,923576,932487,941252,948921,956823,965479,973764
.District of Columbia,601723,601767,605226,619800,634924,650581,662328,675400,685815,694906,701547,705749
'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text1))
# ---
for name, group in df.groupby(['state', 'county']):
print('\n---', name, '---\n')
#print(group)
print('confirmed_cases:', group['confirmed_cases'].values[0])
print('confirmed_deaths:', group['confirmed_deaths'].values[0])
for name, group in df.groupby('state'):
print('\n---', name, '---\n')
print('confirmed_deaths:', group['confirmed_deaths'].sum())
print('\n--- df_state ---\n')
df_state = df.groupby('state').sum()
print(df_state)
# ---
df_population = pd.read_csv(io.StringIO(text2), header=None)
# ---
df_population.index = df_population[0].str.replace('.', '', regex=False)
# get last column
last_column = df_population.columns[-1]
# rename last column
df_population = df_population.rename(columns={last_column: 'last_population'})
# ---
print('\n--- df_population ---\n')
print(df_population['last_population'])
df_joined = df_state.join(df_population['last_population'])
print('\n--- df_joined ---\n')
print(df_joined)
density_states = df_joined['confirmed_deaths']/df_joined['last_population']
print('\n--- density states ---\n')
print(density_states)
density_country = df_joined['confirmed_deaths'].sum() / df_joined['last_population'].sum()
print('\n--- density country ---\n')
print(density_country)