I'm taking a course and I need to solve the following assignment: "In this part, you should write a for loop, updating the df_users dataframe.
Go through each user, and update their zip code, to Safe Harbor specifications:
If the user is from a zip code for the which the “Geographic Subdivision” is less than equal to 20,000, change the zip code in df_users to ‘0’ (as a string) Otherwise, zip should be only the first 3 numbers of the full zip code Do all this by directly updating the zip column of the df_users DataFrame Hints:
This will be several lines of code, looping through the DataFrame, getting each zip code, checking the geographic subdivision with the population in zip_dict, and setting the zip_code accordingly. Be very aware of your variable types when working with zip codes here."
Here you can find all the data necessary to understand the context:
https://raw.githubusercontent.com/DataScienceInPractice/Data/master/
assignment: 'A4'
data_files: user_dat.csv, zip_pop.csv
After cleaning the data from user_dat.csv leaving only the columns: 'age', 'zip' and 'gender', and creating a dictionary from zip_pop.csv that contains the population of the first 3 digits from all the zipcodes; I wrote this code:
# Loop through the dataframe's to get each zipcode
for zipcode in df_users['zip']:
# check if the zipcode's 3 first numbers from the dataframe, correspond to a population of more or less than 20.000 people
if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
# if less, change zipcode value to string zero.
df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
else:
# If more, preserve only the first 3 digits of the zipcode.
df_users.loc[df_users['zip'] == zipcode, 'zip'] = zipcode[:len(zipcode) - 2]
This code works halfways and I don't understand why. It changes the zipcode to 0 if the population is less than 20.000 people, and also changes the first zipcodes (up until the ones that start with '078') but then it returns this error message:
KeyError Traceback (most recent call last)
/var/folders/95/4vh4zhc1273fgmfs4wyntxn00000gn/T/ipykernel_44758/1429192050.py in < module >
1 for zipcode in df_users['zip']:
----> 2 if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
3 df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
4 else:
5 df_users.loc[df_users['zip'] == zipcode, 'zip'] = str(zipcode[:len(zipcode) - 2])
KeyError: '0'
I get that the problem is in the last line of code, because I've been doing every line at a time and each of them worked, until I put that last one. And if I just print the zipcodes instead of that last line, it also works!
Can anyone can help me understand why my code is wrong?
CodePudding user response:
You're modifying a collection of values (i.e. df_users['zip']
) whilst you're iterating over it. This is a common anti pattern. If a loop is absolutely required, then you could consider iterating over df_users['zip'].unique()
instead. That creates a copy of all the unique zip codes, solving your current error, and it means that you aren't redoing work when you encounter a duplicate zipcode.
If a loop is not required, then there are better (more pandas style) ways to go about your problem. I would suggest something like (untested):
zip_start = df_users['zip'].str[:-2]
df_users['zip'] = zip_start.where(zip_start.map(zip_dict) > 20000, other="0")