Looking for names that somewhat match-CodePudding

I am creating an app that loads a DataFrame from spreadsheet. Sometimes the names are not the same as before (someone changed a column name slightly) and are just barely different. What would be the best way to "look" for these closely related column names in new spreadsheets when loading the df? If I as python to look for "col_1" but in the users spreadsheet the column is "col1" then python won't find it.

Example:


import pandas as pd
df = pd.read_excel('data.xlsx')

Here are the column names I am looking for, the rest of the columns load just fine and the column names that a just barely different get skipped and the data never gets loaded. How can I make sure that if the name is close to the name that python is looking for it will get loaded in the df?

Names I'm looking for:

'Water Consumption'	'Female Weight'	'Uniformity'

data.xlsx, incoming data column names that are slightly different:

'Water Consumed Actual'	'Body Weight Actual'	'Unif %'

CodePudding user response：

The builtin function difflib.get_close_matches can help you with column names that are slightly wrong. Using that with the usecols argument of pd.read_excel should get you most of the way there.

You can do something like:

import difflib
import pandas as pd

desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']

def column_checker(col_name):
    if difflib.get_close_matches(col_name, desired_columns):
        return True:
    else:
        return False

df = pd.read_excel('data.xlsx', usecols=column_checker)

You can mess with the parameters of get_close_matches to make it more or less sensitive too.