Home > Mobile >  Python Validate County State
Python Validate County State

Time:06-07

I have a dataframe that contains county and state information (ie: Bucks, PA). I am trying to validate that my dataframe is correct and if not (as it often isn't spelled correctly), fix the mistakes. I have a separate list that contains all county and state information for the USA to use for against my dataframe for the validation.

This is what I have accomplished so far:

df_CountyState_List= ['Floyd, IA', 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ] 
df_results = ['Franklin, IA', 'Freemont, IA', 'Green, IA'] 
desired df_results=[ 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ]

for updatelocation in df_CountyState_List['Location'].to_list():
  df_results.loc[ df_results['Location'].str.contains(updatelocation), 'Location' ] ==updatelocation

CodePudding user response:

You can iterate over each line or use np equivalent to check every country (|state) and if it's not in the list, you can try to use difflib to find the closest word in yours lists :

How to find the most similar word in a list in python

CodePudding user response:

If you only want to validate, i.e. to check whether the rows are OK or not, you can use isin().

Let's say your column is called "County", then:

df_results.County.isin(df_CountyState_List[0])

0     True
1    False
2    False
Name: County, dtype: bool

If you want also to correct the mistakes, you can use difflib as suggested in another answer. difflib requires a list, not a dataframe, so if df_CountyState_List is a DataFrame (as the 'df' in the name suggest):

countyState_List = df_CountyState_List[0].tolist()
df_validated = df_results.County.apply(lambda row: difflib.get_close_matches(row, countyState_List)[0])
  • Related