I am trying to reduce the contents of a CSV file based on the first_name column. By "reduce" I mean keeping only the rows whose value contains accented Latin (non-ASCII) characters.
My data looks like this:
A_ID ID_NUMBER DT25 DT45
abcd 0001 Condé and Geoff Shallard
abcd 555248817 Rändi & John Fay
abcd 54786 Randy john
abcd 006299 László and Virginia Csernohorszky-Hope
abcd 000323 Kim Jonh
abcd 01012 Larry will
I just want to create a DataFrame containing all the rows that have special/accented Latin characters in DT25.
The expected output is something like:
A_ID ID_NUMBER DT25 DT45
abcd 0001 Condé and Geoff Shallard
abcd 555248817 Rändi & John Fay
abcd 006299 László and Virginia Csernohorszky-Hope
I tried this:
import string
import pandas as pd
df = pd.read_csv(filename)
pattern = "^[a-zA-Z-'&.]*$"
alphabet = string.ascii_letters + string.punctuation
#first_name_df = df[~df['DT25'].str.contains(alphabet, na = False)]
first_name_df = df[~df['DT25'].str.contains(pattern, na = False)]
print(first_name_df)
This just gives me back the original DataFrame. Can a pandas expert help me with this, please?
CodePudding user response:
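You can use the regular expression [^\t-\r -~], which matches any character outside the ASCII whitespace range (tab through carriage return) and the printable ASCII range (space through tilde). Passing na=False treats missing values as non-matches:
filtered = df[df['DT25'].str.contains(r'[^\t-\r -~]', na=False)]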
Output:
>>> filtered
A_ID ID_NUMBER DT25
0 abcd 1 Condé and Geoff Shallard
1 abcd 555248817 Rändi & John Fay
3 abcd 6299 László and Virginia Csernohorszky-Hope
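For anyone who wants to try this without the CSV, here is a minimal self-contained sketch. The DataFrame is built by hand from the sample rows in the question (the column layout is an assumption, since the original file is not shown), and it also checks one likely reason the original attempt returned every row: the question's character class has no space in it, so no multi-word name matches and the negated mask keeps everything.

```python
import pandas as pd

# Sample rows modelled on the question's data; the exact column layout
# of the real CSV is an assumption.
df = pd.DataFrame({
    "A_ID": ["abcd"] * 6,
    "ID_NUMBER": ["0001", "555248817", "54786", "006299", "000323", "01012"],
    "DT25": [
        "Condé and Geoff Shallard",
        "Rändi & John Fay",
        "Randy john",
        "László and Virginia Csernohorszky-Hope",
        "Kim Jonh",
        "Larry will",
    ],
})

# Keep rows whose DT25 has at least one character outside ASCII
# whitespace (\t-\r) and printable ASCII (space through ~).
non_ascii = df["DT25"].str.contains(r"[^\t-\r -~]", na=False)
filtered = df[non_ascii]

# The question's pattern: its character class contains no space, so it
# matches none of these multi-word names and the negation is True everywhere.
question_mask = ~df["DT25"].str.contains("^[a-zA-Z-'&.]*$", na=False)

print(filtered)
print(question_mask.all())
```

If ID_NUMBER should keep its leading zeros when loading the real file, reading it with pd.read_csv(filename, dtype={"ID_NUMBER": str}) is one option.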