I'm trying to clean an SQL retail database, but I'm confused by the contents of the first name column. Ideally I would like a clean set of names.
What I attempted was:
#change the datatype of first_name to str
user_dataframe['first_name'] = user_dataframe['first_name'].astype('string')
user_dataframe['last_name'] = user_dataframe['last_name'].astype('string')
This only changed the data type from object to string, and now I am not sure how to search for the strings that I do not want.
The dirty strings come in this format:
Hans Jürgen
Bärbel
Süleyman
Sören
Klaus-Jürgen
2GU3G97VI1
I7IJDAPMIM
Gülten
DD0K0FUDRY
I am thinking of using a regular expression to drop any rows that match the pattern of a character followed by a number, but I'm not sure what some of the symbols in the dirty data mean.
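For the rows that look like random codes, a filter along these lines might be enough. This is only a sketch: it assumes that any value containing an ASCII digit is junk, which holds for the samples above but may not hold for the whole table, and the small DataFrame here is made up from those samples.

```python
import pandas as pd

# Hypothetical sample built from the values shown in the question.
user_dataframe = pd.DataFrame(
    {"first_name": ["Bärbel", "2GU3G97VI1", "Sören", "I7IJDAPMIM", "DD0K0FUDRY"]}
)
user_dataframe["first_name"] = user_dataframe["first_name"].astype("string")

# Flag every value that contains at least one ASCII digit (0-9).
has_digit = user_dataframe["first_name"].str.contains(r"[0-9]", na=False)

# Keep only the rows without digits and renumber the index.
clean_names = user_dataframe[~has_digit].reset_index(drop=True)
print(clean_names)
```

Note that this only removes the code-like rows. The names with odd symbols are left in place, since those look like a separate encoding problem rather than bad rows.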
CodePudding user response:
The problem is the encoding that Python/pandas uses when the data is read, not the names themselves. Try changing the encoding when reading the data. See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html and potentially https://docs.python.org/3/library/codecs.html#standard-encodings.
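As a rough illustration of the effect, the snippet below reads the same bytes twice with different `encoding` arguments. It assumes the data arrives as a UTF-8 encoded CSV, which the question does not confirm, and the in-memory `raw` bytes stand in for the real file.

```python
import io

import pandas as pd

# Stand-in for the real file: two names stored as UTF-8 bytes (assumption).
raw = "first_name\nJürgen\nBärbel\n".encode("utf-8")

# Decoding UTF-8 bytes as latin-1 turns each umlaut into two odd characters.
garbled = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

# Decoding with the encoding the bytes were written in gives the real names.
correct = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

print(garbled["first_name"].tolist())
print(correct["first_name"].tolist())
```

If the first list looks like the dirty values in the question, the source is probably UTF-8 that was read with a single-byte encoding.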
Also see the following answer: Converting special charactes such as ü and à back to their original, latin alphbet counterparts in C#
CodePudding user response:
As someone has mentioned, the encoding was what caused the issue. Reading the data with the utf-8-sig encoding fixed the weird characters.
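For anyone wondering what utf-8-sig changes, here is a small sketch using plain Python byte decoding. The byte string is a made-up example of UTF-8 data that starts with a byte order mark (BOM). Whether the original export actually had one is an assumption.

```python
# Made-up bytes: a UTF-8 BOM (EF BB BF) followed by "Jürgen" in UTF-8.
raw = b"\xef\xbb\xbfJ\xc3\xbcrgen"

# Plain utf-8 decodes the name correctly but keeps the BOM as a hidden character.
with_bom = raw.decode("utf-8")

# utf-8-sig decodes the same way and drops a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom))
print(repr(without_bom))
```

The same encoding name can be passed as the `encoding` argument when reading the data with pandas.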