I am creating an app that loads a DataFrame from spreadsheet. Sometimes the names are not the same as before (someone changed a column name slightly) and are just barely different. What would be the best way to "look" for these closely related column names in new spreadsheets when loading the df? If I as python to look for "col_1" but in the users spreadsheet the column is "col1" then python won't find it.
Example:
import pandas as pd
df = pd.read_excel('data.xlsx')
Here are the column names I am looking for, the rest of the columns load just fine and the column names that a just barely different get skipped and the data never gets loaded. How can I make sure that if the name is close to the name that python is looking for it will get loaded in the df?
Names I'm looking for:
'Water Consumption' | 'Female Weight' | 'Uniformity' |
---|
data.xlsx, incoming data column names that are slightly different:
'Water Consumed Actual' | 'Body Weight Actual' | 'Unif %' |
---|
CodePudding user response:
The builtin function difflib.get_close_matches
can help you with column names that are slightly wrong. Using that with the usecols
argument of pd.read_excel
should get you most of the way there.
You can do something like:
import difflib
import pandas as pd
desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']
def column_checker(col_name):
if difflib.get_close_matches(col_name, desired_columns):
return True:
else:
return False
df = pd.read_excel('data.xlsx', usecols=column_checker)
You can mess with the parameters of get_close_matches
to make it more or less sensitive too.