Imagine I have a dataset that is like so:
cust_id cust_state prod_title total_sales trans_timestamp
0 1001015 NY All Veggie Yummies 72.99 2021-03-30 10:06:05.108653
1 1001019 NJ Ball and String 18.95 2021-03-30 10:07:01.746945
2 1022098 NY Cat Cave 28.45 2021-03-30 10:10:41.387170
3 1022157 PA Chewie Dental 24.95 2021-03-30 10:20:11.571311
4 1022189 Illinois Chomp-a Plush 60.99 2021-03-30 10:33:11.289467
5 1002664 Kansas Feline Fix Mix 65.99 2021-03-30 10:37:55.446798
6 1002666 Florida Fetch Blaster 9.95 2021-03-30 10:39:36.488829
7 1002175 Hawaii Foozy Mouse 45.99 2021-03-30 10:43:21.490817
8 1002666 Florida Kitty Climber 35.99 2021-03-30 10:51:31.510563
9 1022189 Illinois Purr Mix 32.99 2021-03-30 11:01:50.082440
10 1011924 Indiana Fetch Blaster 19.90 2021-03-30 11:11:15.944726
11 1022236 AK Purr Mix 98.97 2021-03-30 11:15:39.390134
12 1022189 IL Cat Cave 56.90 2021-03-30 11:21:10.167505
13 1002137 Wisconsin Purrfect Puree 54.95 2021-03-30 11:27:51.133318
14 1002159 KY Foozy Mouse 91.98 2021-03-30 11:29:00.292890
15 1002175 Hawaii Reddy Beddy 21.95 2021-03-30 11:30:22.928818
16 1002108 Maine Cat Cave 85.83 2021-03-30 11:34:35.776578
17 1002186 MI Scratchy Post 48.95 2021-03-30 10:39:28.604007
18 1002175 HI Snack-em Fish 15.99 2021-03-30 10:39:40.349882
19 1002261 CO Snoozer Essentails 99.95 2021-03-30 10:51:31.510563
20 1002666 Florida Scratchy Post 48.95 2021-03-30 11:01:28.024109
21 1002159 Kentucky Purrfect Puree 219.80 2021-03-30 11:11:15.944726
22 1002680 SC Chewie Dental 49.90 2021-03-30 11:15:39.390134
23 1002678 Texas Reddy Beddy 65.85 2021-03-30 11:21:10.167505
24 1013769 OK The New Bone 71.96 2021-03-30 11:27:17.147159
25 1013859 KY Reddy Beddy 109.75 2021-03-30 11:28:48.669564
Fix the inconsistencies in the column cust_state
.
So for this one I created a dictionary for all the abbreviations to the actual state names, then I applied it as a map to the actual cust_state
column. However when I did that it changed all the abbreviations to NaN values. How do I change all the states values to one style? I'll post my code beneath.
state_to_abr = dict({'AL':'Alabama','AK':'Alaska','AZ':'Arizona','AR':'Arkansas',
'CA':'California','CO':'Colorado','CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho','IL':'Illinois','IN':'Indiana','IA':'Iowa',
'KS':'Kansas','KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine','MD':'Maryland','MA':'Massachusetts','MI':'Michigan','MN':'Minnesota','MS':'Mississippi','MO':'Missouri','MT':'Montana',
'NE':'Nebraska','NV':'Nevada','NH':'New Hampshire','NJ':'New Jersey','NM':'New Mexico','NY':'New York','NC':'North Carolina','ND':'North Dakota',
'OH':'Ohio','OK':'Oklahoma','OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'South Carolina':'SC',
'TN':'Tennessee','TX':'Texas',
'UT':'Utah',
'VT':'Vermont','VI':'Virgin Islands',
'WA':'Washington','WV':'West Virginia','WI':'Wisconsin','WY':'Wyoming'})
df_cleaned['cust_state'] = df_cleaned['cust_state'].map(state_to_abr)
df_cleaned
CodePudding user response:
This will let you map to the dictionary you wrote (I had to change the order of SC though)
state_to_abr = dict({'AL':'Alabama','AK':'Alaska','AZ':'Arizona','AR':'Arkansas',
'CA':'California','CO':'Colorado','CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho','IL':'Illinois','IN':'Indiana','IA':'Iowa',
'KS':'Kansas','KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine','MD':'Maryland','MA':'Massachusetts','MI':'Michigan','MN':'Minnesota','MS':'Mississippi','MO':'Missouri','MT':'Montana',
'NE':'Nebraska','NV':'Nevada','NH':'New Hampshire','NJ':'New Jersey','NM':'New Mexico','NY':'New York','NC':'North Carolina','ND':'North Dakota',
'OH':'Ohio','OK':'Oklahoma','OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'SC':'South Carolina',
'TN':'Tennessee','TX':'Texas',
'UT':'Utah',
'VT':'Vermont','VI':'Virgin Islands',
'WA':'Washington','WV':'West Virginia','WI':'Wisconsin','WY':'Wyoming'})
df_states = pd.DataFrame.from_dict(state_to_abr,orient='index', columns = ['State_Name'])
df['cust_state'] = np.where(df['cust_state'].str.len() <= 2, df['cust_state'].map(df_states['State_Name']), df['cust_state'])
df
CodePudding user response:
you seemed to have a spaces in the abbreviation, so stripping the spaces will yield the result. secondly, since the cust_states has both abbreviations and full name, you may want to map when the mapping doesn't result in a null values, hence the use of the mask
df['cust_state']=df['cust_state'].mask(
df['cust_state'].str.strip().map(state_to_abr).notna(),
df['cust_state'].str.strip().map(state_to_abr))
df
OR
check the length, when its two, do a map, else leave value as is
df['cust_state']=df['cust_state'].mask(
df['cust_state'].str.strip().str.len()==2,
df['cust_state'].str.strip().map(state_to_abr))
df
cust_id cust_state prod_title total_sales trans_timestamp
0 1001015 New York All Veggie Yummies 72.99 2021-03-30 10:06:05.108653
1 1001019 New Jersey Ball and String 18.95 2021-03-30 10:07:01.746945
2 1022098 New York Cat Cave 28.45 2021-03-30 10:10:41.387170
3 1022157 Pennsylvania Chewie Dental 24.95 2021-03-30 10:20:11.571311
4 1022189 Illinois Chomp-a Plush 60.99 2021-03-30 10:33:11.289467
5 1002664 Kansas Feline Fix Mix 65.99 2021-03-30 10:37:55.446798
6 1002666 Florida Fetch Blaster 9.95 2021-03-30 10:39:36.488829
7 1002175 Hawaii Foozy Mouse 45.99 2021-03-30 10:43:21.490817
8 1002666 Florida Kitty Climber 35.99 2021-03-30 10:51:31.510563
9 1022189 Illinois Purr Mix 32.99 2021-03-30 11:01:50.082440
10 1011924 Indiana Fetch Blaster 19.90 2021-03-30 11:11:15.944726
11 1022236 Alaska Purr Mix 98.97 2021-03-30 11:15:39.390134
12 1022189 Illinois Cat Cave 56.90 2021-03-30 11:21:10.167505
13 1002137 Wisconsin Purrfect Puree 54.95 2021-03-30 11:27:51.133318
14 1002159 Kentucky Foozy Mouse 91.98 2021-03-30 11:29:00.292890
15 1002175 Hawaii Reddy Beddy 21.95 2021-03-30 11:30:22.928818
16 1002108 Maine Cat Cave 85.83 2021-03-30 11:34:35.776578
17 1002186 Michigan Scratchy Post 48.95 2021-03-30 10:39:28.604007
18 1002175 Hawaii Snack-em Fish 15.99 2021-03-30 10:39:40.349882
19 1002261 Colorado Snoozer Essentails 99.95 2021-03-30 10:51:31.510563
20 1002666 Florida Scratchy Post 48.95 2021-03-30 11:01:28.024109
21 1002159 Kentucky Purrfect Puree 219.80 2021-03-30 11:11:15.944726
22 1002680 SC Chewie Dental 49.90 2021-03-30 11:15:39.390134
23 1002678 Texas Reddy Beddy 65.85 2021-03-30 11:21:10.167505
24 1013769 Oklahoma The New Bone 71.96 2021-03-30 11:27:17.147159
25 1013859 Kentucky Reddy Beddy 109.75 2021-03-30 11:28:48.669564