Data cleaning: first names and last names have a weird structure with symbols

Time:01-11

I'm trying to clean an SQL retail database, but I'm confused by the structure of the first name column. Ideally I would like a clean set of names.

What I attempted was:

        # change the datatype of first_name and last_name to string
        user_dataframe['first_name'] = user_dataframe['first_name'].astype('string')
        user_dataframe['last_name'] = user_dataframe['last_name'].astype('string')

This only changed the data type from object to string, and now I am not sure how to search for the strings that I do not want.

The dirty strings come in this format:

Hans Jürgen
Bärbel
Süleyman
Sören
Klaus-Jürgen
2GU3G97VI1
I7IJDAPMIM
Gülten
DD0K0FUDRY

I am thinking of using a regular expression to drop any rows that match the pattern of a character followed by a number, but I'm not sure what some of the symbols in the dirty data mean.
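
The regex idea could be sketched roughly as follows. Judging only from the three samples above, the junk entries look like ten-character strings of uppercase letters and digits containing at least one digit. That pattern is an assumption, and `is_random_code` is a hypothetical helper name, not something from the original code.

```python
import re

# Assumption (based on the three samples only): junk values are exactly ten
# characters long, all uppercase ASCII letters or digits, with at least one digit.
RANDOM_CODE = re.compile(r"(?=.*\d)[A-Z0-9]{10}")

def is_random_code(name: str) -> bool:
    """Return True when the whole string matches the assumed junk pattern."""
    return RANDOM_CODE.fullmatch(name) is not None

names = ["Klaus-Jürgen", "2GU3G97VI1", "I7IJDAPMIM", "Gülten", "DD0K0FUDRY"]

# Keep only the entries that do not look like random codes.
kept = [n for n in names if not is_random_code(n)]
print(kept)
```

On the DataFrame itself, the same check might be applied with `user_dataframe[~user_dataframe['first_name'].map(is_random_code)]`, assuming the column has no missing values. Note that this only removes the code-like rows; the symbols in the remaining names are a separate problem.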

CodePudding user response:

The problem is the encoding used when the data is read in Python/pandas. Try changing the encoding when reading the data. See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html and potentially https://docs.python.org/3/library/codecs.html#standard-encodings.
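
To illustrate why this points to an encoding problem: the symbols in the question are what UTF-8 bytes look like when they are decoded with a single-byte codec. A minimal sketch, assuming the wrong codec was Latin-1 (cp1252 gives the same result for these particular letters):

```python
# UTF-8 stores "ü" as two bytes (0xC3 0xBC). A single-byte codec turns each
# of those bytes into its own character, which produces the "Ã¼" pair.
raw = "Jürgen".encode("utf-8")

wrong = raw.decode("latin-1")  # assumed wrong codec: gives "Jürgen"
right = raw.decode("utf-8")    # correct codec: gives "Jürgen"

# Text that was already decoded wrongly can be reversed the same way,
# as long as the assumption about the wrong codec holds.
repaired = wrong.encode("latin-1").decode("utf-8")

print(wrong)
print(right)
print(repaired)
```

When loading with pandas, the codec is chosen through the `encoding` argument of `read_csv`, as described in the documentation linked above.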

Also see the following answer: Converting special characters such as ü and à back to their original, Latin alphabet counterparts in C#

CodePudding user response:

As someone has mentioned, the encoding was the cause of the issue. Reading the data with the utf-8-sig encoding fixed the weird characters.
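
The exact read call is not shown in this thread, so the following is only a sketch of what `utf-8-sig` does differently from plain `utf-8`: it decodes UTF-8 and also strips a leading byte-order mark (BOM) if one is present. The sample bytes are made up for illustration.

```python
import codecs

# Made-up sample: a UTF-8 file that starts with a BOM, as some tools write it.
data = codecs.BOM_UTF8 + "first_name\nBärbel\n".encode("utf-8")

plain = data.decode("utf-8")      # BOM survives as "\ufeff" before the header
clean = data.decode("utf-8-sig")  # BOM is stripped, text is otherwise identical

print(plain.startswith("\ufeff"))      # True
print(clean.startswith("first_name"))  # True
print(clean.splitlines()[1])           # Bärbel
```

With pandas, the equivalent would presumably be `pd.read_csv(path, encoding="utf-8-sig")`, where `path` is a placeholder for the actual file.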
