I have a dataframe that contains county and state information (ie: Bucks, PA). I am trying to validate that my dataframe is correct and if not (as it often isn't spelled correctly), fix the mistakes. I have a separate list that contains all county and state information for the USA to use for against my dataframe for the validation.
This is what I have accomplished so far:
df_CountyState_List= ['Floyd, IA', 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ]
df_results = ['Franklin, IA', 'Freemont, IA', 'Green, IA']
desired df_results=[ 'Franklin, IA', 'Fremont, IA', 'Greene, IA' ]
for updatelocation in df_CountyState_List['Location'].to_list():
df_results.loc[ df_results['Location'].str.contains(updatelocation), 'Location' ] ==updatelocation
CodePudding user response:
You can iterate over each line or use np equivalent to check every country (|state) and if it's not in the list, you can try to use difflib to find the closest word in yours lists :
How to find the most similar word in a list in python
CodePudding user response:
If you only want to validate, i.e. to check whether the rows are OK or not, you can use isin().
Let's say your column is called "County", then:
df_results.County.isin(df_CountyState_List[0])
0 True
1 False
2 False
Name: County, dtype: bool
If you want also to correct the mistakes, you can use difflib
as suggested in another answer. difflib requires a list, not a dataframe, so if df_CountyState_List
is a DataFrame (as the 'df' in the name suggest):
countyState_List = df_CountyState_List[0].tolist()
df_validated = df_results.County.apply(lambda row: difflib.get_close_matches(row, countyState_List)[0])