Python Validate County State-CodePudding

I have a dataframe that contains county and state information (ie: Bucks, PA). I am trying to validate that my dataframe is correct and if not (as it often isn't spelled correctly), fix the mistakes. I have a separate list that contains all county and state information for the USA to use for against my dataframe for the validation.

This is what I have accomplished so far:

df_CountyState_List= ['Floyd, IA', 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ] 
df_results = ['Franklin, IA', 'Freemont, IA', 'Green, IA'] 
desired df_results=[ 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ]

for updatelocation in df_CountyState_List['Location'].to_list():
  df_results.loc[ df_results['Location'].str.contains(updatelocation), 'Location' ] ==updatelocation

CodePudding user response：

You can iterate over each line or use np equivalent to check every country (|state) and if it's not in the list, you can try to use difflib to find the closest word in yours lists :

How to find the most similar word in a list in python

CodePudding user response：

If you only want to validate, i.e. to check whether the rows are OK or not, you can use isin().

Let's say your column is called "County", then:

df_results.County.isin(df_CountyState_List[0])

0     True
1    False
2    False
Name: County, dtype: bool

If you want also to correct the mistakes, you can use difflib as suggested in another answer. difflib requires a list, not a dataframe, so if df_CountyState_List is a DataFrame (as the 'df' in the name suggest):

countyState_List = df_CountyState_List[0].tolist()
df_validated = df_results.County.apply(lambda row: difflib.get_close_matches(row, countyState_List)[0])