I have a dictionary that groups US state abbreviations into regions, and a Pandas DataFrame column containing state abbreviations. I'm trying to loop through the dictionary values to find a match and append the region name (the dictionary key) to a list. It seems to work, but some abbreviations are missing from my dictionary, so the total list length is only 463. However, when I add an else-statement to account for the missing states, the list length jumps to 16500.
Can somebody explain what's going on here?
user_df['State'] contains the following abbreviations:
user_df['State'].unique()
>>> ['GA', 'WA', 'NV', 'OK', 'TX', 'CA', 'MI', 'FL', 'OH', 'IL-IN-WI', 'TN', 'NY-NJ-PA', 'PA', 'DC-VA-MD-WV', 'IN', 'NE-IA', 'PA-NJ-DE-MD', 'AL', 'NC-SC', 'CO', 'NM', 'MA-NH', 'AZ', 'OR-WA', 'OH-KY-IN', 'SC', 'NY', 'TN-MS-AR', 'KY-IN', 'RI-MA', 'UT', 'HI', 'CT', 'LA', 'VA-NC', 'MD', 'WI', 'VA', 'MO-IL', 'MN-WI', 'MO-KS', 'NC']
region_dict = {}
region_dict['northeast'] = ['NY-NJ-PA', 'PA-NJ-DE-MD', 'NY', 'CT', 'MA-NH','RI-MA', 'DC-VA-MD-WV']
region_dict['midwest'] = ['OH', 'OK', 'MI', 'IL-IN-WI', 'NE-IA', 'MO-KS', 'MN-WI']
region_dict['southeast'] = ['TX','NC-SC', 'NC-VA', 'GA', 'FL', 'TN', 'KY-IN', 'TN-MS-AR', 'LA', 'IN' ]
region_dict['pacific'] = ['CA', 'NV', 'OR', 'WA', 'HI', 'OR-WA']
region_dict['intermountain'] = ['NM', 'CO', 'AZ']
region_list = []
for row in user_df['State']:
    for k, v in region_dict.items():
        for el in v:
            if el == row:
                region_list.append(k)
            # Until here it works for 463 but when I add the line below
            # to fill in for missing values in my dictionary I get 16500
            else:
                region_list.append('Unknown')
CodePudding user response:
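For each row of user_df, you are appending a list item for each of the 33 abbreviations listed across the 5 regions of region_dict: the region name when the abbreviation matches, 'Unknown' when it does not. Thus, the length of the result will be <number of rows of user_df> * <number of abbreviations in region_dict>, which matches your 16500 if user_df has 500 rows (500 * 33).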
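The effect is easy to reproduce on a small scale. The following is a minimal sketch with made-up data (a hypothetical two-region dictionary and a plain list standing in for user_df['State']), not the original DataFrame:

```python
# Hypothetical miniature of the question's data: 2 regions, 3 abbreviations in total.
region_dict = {
    'pacific': ['CA', 'NV'],
    'intermountain': ['NM'],
}
states = ['CA', 'NM', 'ZZ', 'NV']  # 'ZZ' is deliberately missing from region_dict

region_list = []
for row in states:
    for k, v in region_dict.items():
        for el in v:
            if el == row:
                region_list.append(k)
            else:
                # Runs on every failed comparison, not once per unmatched row
                region_list.append('Unknown')

total_abbreviations = sum(len(v) for v in region_dict.values())
print(len(region_list), len(states) * total_abbreviations)
```

Both printed numbers are equal: every comparison appends exactly one item, so the list grows by the total number of abbreviations for each row.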
As suggested by @balderman, inverting region_dict so that each abbreviation maps to its region would spare you a for-loop. In fact it spares two: once the dictionary is inverted (a quick step whose cost does not depend on the size of user_df), a single loop over the rows is enough. The code gets shorter and it is easier to follow what is going on.
To save typing, let Python invert the dictionary for you, for example:
region_dict_inverted = {}
for region in region_dict.keys():
    for abbr in region_dict[region]:
        region_dict_inverted[abbr] = region
Your loop then becomes:
region_list = []
for state in user_df['State']:
    if state in region_dict_inverted:
        region_list.append(region_dict_inverted[state])
    else:
        region_list.append('unknown')
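The same lookup can be written more compactly with dict.get, which returns a default value when the key is missing. This is only a sketch on made-up data: the small region_dict and the states list below are hypothetical stand-ins for the real dictionary and for user_df['State'].

```python
# Hypothetical stand-ins for the real region_dict and user_df['State']
region_dict = {
    'pacific': ['CA', 'NV'],
    'intermountain': ['NM'],
}
states = ['CA', 'NM', 'ZZ', 'NV']  # 'ZZ' has no region assigned

# Invert once: abbreviation -> region
region_dict_inverted = {
    abbr: region
    for region, abbrs in region_dict.items()
    for abbr in abbrs
}

# Exactly one entry per state; 'unknown' when the abbreviation is not in the dictionary
region_list = [region_dict_inverted.get(state, 'unknown') for state in states]
print(region_list)
```

The result has exactly one entry per state, so its length always equals the number of rows. On the DataFrame itself, user_df['State'].map(region_dict_inverted).fillna('unknown') should give the same values as a Series.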