Is there a way to remove only BAD characters from a string in Python/pandas?

I am trying to read a PDF using the Camelot library and store it in a DataFrame. The resulting DataFrame has garbled/bad characters in its string fields.

Eg: 123Rise â€“ Tower & Troe's Mechâ€“

I want to remove ONLY the Garbled characters and keep everything else including symbols.

I tried regexes such as [^\w.,&,'-\s] to keep only the desirable characters, but that means adding every special character I want to keep to the pattern. I also cannot drop the Camelot library.

Is there a way to solve this?

CodePudding user response：

You could try using the unicodedata library to normalize your data, for example:

import unicodedata

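def formatString(value, allow_unicode=False):
    value = str(value)
    if allow_unicode:
        # Keep non-ASCII characters, but fold compatibility forms together.
        value = unicodedata.normalize('NFKC', value)
    else:
        # Decompose accented characters, then drop anything outside ASCII.
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    return value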

print(formatString("123Rise â€“ Tower & Troe's Mechâ€“"))

Result:

123Rise a Tower & Troe's Mecha
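
The stray `a` in that result comes from how NFKD works: an accented letter is split into its base letter plus a combining mark, and only the mark is dropped by the ASCII step. Here is a minimal sketch of that behaviour. The helper name `ascii_fold` is made up for illustration, and the characters are written as escapes on the assumption that the garbled run starts with U+00E2 and U+20AC.

```python
import unicodedata

def ascii_fold(text):
    """NFKD-decompose the text, then drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# U+00E2 (a with circumflex) decomposes into "a" + U+0302 (combining circumflex).
decomposed = unicodedata.normalize("NFKD", "\u00e2")
print([hex(ord(c)) for c in decomposed])  # ['0x61', '0x302']

# The base letter survives the ASCII step; U+20AC (euro sign) has no
# ASCII decomposition, so it disappears entirely.
print(ascii_fold("\u00e2\u20ac"))  # a
```

So this approach is probably a better fit for folding real accented text to ASCII than for stripping garbled runs, since part of the garbage can survive as a plain letter.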

CodePudding user response：

One way to achieve that is to remove the non-ASCII characters.

my_text = "123Rise â€“ Tower & Troe's Mechâ€“"
my_text = ''.join(char for char in my_text if ord(char) < 128)
print(my_text)

Result:

123Rise  Tower & Troe's Mech

You can also use this website as a reference for the standard and extended ASCII characters.
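
The same filter can be sketched with a regular expression, which also makes it easy to tidy up the double space left behind in the result above. This is only an illustration: the function name `strip_non_ascii` is made up, the whitespace collapsing is an extra step not in the original answer, and the sample string spells the garbled run with escapes on the assumption that it is U+00E2, U+20AC, U+201C.

```python
import re

# Matches one or more characters outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text):
    """Remove non-ASCII characters, then collapse the runs of spaces left behind."""
    cleaned = _NON_ASCII.sub("", text)
    return re.sub(r" {2,}", " ", cleaned).strip()

sample = "123Rise \u00e2\u20ac\u201c Tower & Troe's Mech\u00e2\u20ac\u201c"
print(strip_non_ascii(sample))  # 123Rise Tower & Troe's Mech
```

For a DataFrame column of strings, a function like this could be applied with something like `df["name"].map(strip_non_ascii)`, where `name` is a placeholder column name.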

CodePudding user response：

Another approach I commonly use for filtering out non-ASCII garbage, which may (or may not) be relevant here, is:

# Your "messy" data in question.
string = "123Rise â€“ Tower & Troe's Mechâ€“"

# Iterate over each character, and filter by only ord(c) < 128.
clean = "".join([c for c in string if ord(c) < 128])

What is ord? ord returns the Unicode code point of a character as an integer. Code points below 128 are exactly the ASCII range, so keeping only characters with ord(c) < 128 (as above) limits your text to basic ASCII and drops everything else, without having to work with messy encodings.
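
To make that concrete, here is a small sketch that prints the value of ord for a few characters and whether each one passes the filter. The helper name `is_ascii_char` is made up for illustration, and the non-ASCII characters are written as escapes.

```python
def is_ascii_char(c):
    """True if the character's code point falls in the ASCII range (0-127)."""
    return ord(c) < 128

# "A", "&" and "'" are ASCII; U+00E2 and U+20AC are not.
for ch in ["A", "&", "'", "\u00e2", "\u20ac"]:
    print(repr(ch), ord(ch), is_ascii_char(ch))
```

Symbols such as `&` and `'` fall below 128, which is why they are kept while the garbled characters are dropped.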

Hope that helps!