I am trying to detect values that contain specific characters (e.g. ?, /). Below is a small sample of the data.
import pandas as pd
import numpy as np
data = {
'artificial_number':['000100000','000010000','00001000/1','00001000?','0?00/10000'],
}
df1 = pd.DataFrame(data, columns=['artificial_number'])
Now I want to detect the values that contain characters other than digits ('00001000/1', '00001000?', '0?00/10000').
I tried the lines below:
import re
clean = re.sub(r'[^a-zA-Z0-9\._-]', '', df1['artificial_number'])
But this code does not work as I expected. Can anybody help me solve this problem?
CodePudding user response:
# Replace every non-digit character with an empty string
df1['artificial_number'].str.replace(r'([^\d])','', regex=True)
0 000100000
1 000010000
2 000010001
3 00001000
4 00010000
Name: artificial_number, dtype: object
If you want to list the rows that contain non-digit characters:
df1.loc[df1['artificial_number'].str.extract(r'([^\d])')[0].notna()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
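The same rows can probably be selected a little more directly with a boolean mask from str.contains. The snippet below is only a sketch that rebuilds the sample frame from the question; the names has_non_digit and flagged are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    'artificial_number': ['000100000', '000010000', '00001000/1',
                          '00001000?', '0?00/10000'],
})

# True for every row that contains at least one non-digit character
has_non_digit = df1['artificial_number'].str.contains(r'\D', regex=True)

# Keep only the rows flagged by the mask
flagged = df1.loc[has_non_digit]
print(flagged)
```

Because the mask is an ordinary boolean Series, it can also be negated with ~ to keep only the clean rows.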
CodePudding user response:
Assuming a number in your case is an integer, you can find the values that contain non-digit characters by counting the digits in each string and comparing that count with the string's length:
rows = [len(re.findall('[0-9]', s)) != len(s) for s in df1.artificial_number]
df1.loc[rows]
# artificial_number
#2 00001000/1
#3 00001000?
#4 0?00/10000
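The same count-and-compare idea can also be written with pandas string methods instead of a list comprehension. This is a sketch under the assumption that the column holds only strings; digit_count and rows are illustrative names.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    'artificial_number': ['000100000', '000010000', '00001000/1',
                          '00001000?', '0?00/10000'],
})

col = df1['artificial_number']

# Number of ASCII digits in each string
digit_count = col.str.count(r'[0-9]')

# A value is flagged when not every character is a digit
rows = digit_count != col.str.len()

print(df1.loc[rows])
```

As with the list comprehension, an empty string would not be flagged here, since both counts are zero.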
CodePudding user response:
To detect values that contain anything other than numeric characters, you can also use str.isnumeric:
df1.loc[~df1.artificial_number.str.isnumeric()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
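str.isnumeric already excludes values such as 10.0, because the decimal point is not a numeric character. str.isdigit gives the same result for data like this and is stricter only for Unicode numeric characters such as fractions (e.g. ½):
df1.iloc[0,0] = '000100000.0'
df1.loc[~df1.artificial_number.str.isdigit()]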
artificial_number
0 000100000.0
2 00001000/1
3 00001000?
4 0?00/10000
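To see where these checks actually differ, it may help to compare the underlying Python string methods, which the pandas .str versions apply element by element. The sample strings below are illustrative and not taken from the question.

```python
# Illustrative inputs: plain digits, a decimal string,
# VULGAR FRACTION ONE HALF (U+00BD) and SUPERSCRIPT TWO (U+00B2)
samples = ['123', '10.0', '\u00bd', '\u00b2']

# For each sample, record (isnumeric, isdigit, isdecimal)
results = {s: (s.isnumeric(), s.isdigit(), s.isdecimal()) for s in samples}

for s, flags in results.items():
    print(repr(s), flags)
```

All three methods reject '10.0' because of the decimal point. They only disagree on Unicode characters: the fraction passes isnumeric alone, and the superscript passes isnumeric and isdigit but not isdecimal.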