csv import - how to ingeniously check that the name of columns are "correct"?-CodePudding

I'm trying to model electrical grid from data manually written in csv. I have as, an example, a column that is supposed to be called 'DEPART 1'. I often can find 'Départ 1', 'DEP1','depart 1',' DEPART 1 ' or many other possibilities...

Right know, i'm importing it with :

import_net_data = pd.read_excel(path_file, sheet_name=None)

I would like to be able to identify the columns that are close to the "official name" (maybe by ignoring spaces, maj ...)

Is there a way to proper way to :

replace any of thoses incorrect string (whithout giving all of possibilites) by the correct one
Check if there is only one occurence of thoses column name

CodePudding user response：

I suggest you using regular expressions to identify a pattern suitable between these column names, and them replace them with the official name.

You can use the re library to do so. Combine it with regex101 website to find the best regular expression to fit for all cases.

Here is a small code sample to solve this particular case:

import re

official_name = "depart 1"

column_names = [
    "Départ 1",
    "DEP1",
    "depart 1",
    " DEPART 1 ",
    " depart      1"]
    
regex = "\s*[d^D][e^E^é^É][p^P]\D*\s*1\s*"

for name in column_names:
    print(name)
    result = re.search(regex, name)
    if result:
        print("Replace with {0}".format(official_name))
    else:
        print("Could not find the regex pattern")

It outputs this:

Départ 1
Replace with depart 1
DEP1
Replace with depart 1
depart 1
Replace with depart 1
 DEPART 1 
Replace with depart 1
 depart      1
Replace with depart 1

CodePudding user response：

You need to use fuzzy string matching here. For python, as an option, you can look at the thefuzz package it's calculate Levenshtein distance for strings.

As an example:

from thefuzz import fuzz


st = 'DEPART 1'
strs = [ 'Départ 1', 'DEP1','depart 1',' DEPART 1 ']

for s in strs:
    l_d= fuzz.ratio(st.lower(), s.lower()) # Levenshtein distance
    print(st, s, '|', 'Levenshtein distance: ', l_d, 'is the same: ', l_d > 60)

Output:

DEPART 1 Départ 1 | Levenshtein distance:  88   is the same:  True
DEPART 1 DEP1     | Levenshtein distance:  67   is the same:  True
DEPART 1 depart 1 | Levenshtein distance:  100  is the same:  True
DEPART 1 DEPART 1 | Levenshtein distance:  89   is the same:  True

Check more info: https://www.datacamp.com/community/tutorials/fuzzy-string-python

Using it you can achieve your goals.

"replace any of thoses incorrect string":

import pandas as pd
from thefuzz import fuzz

st = 'DEPART 1'

df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])
print(df)

cols = []
for column in df.columns:
    if fuzz.ratio(st.lower(), column.lower()) > 60:
        cols.append(st)
    else:
        cols.append(column)

df.columns = cols

print(df)

Output:

Columns: [DEPART 1, DEP1, depart 1, depart 1, not even close]
Columns: [DEPART 1, DEPART 1, DEPART 1, DEPART 1, not even close]

"check occurences of column names":

import pandas as pd
import collections

df = pd.DataFrame(columns=['DEPART 1','DEP1','depart 1','depart 1','not even close'])

print(collections.Counter(df.columns))

Output:

Counter({'depart 1': 2, 'DEPART 1': 1, 'DEP1': 1, 'not even close': 1})