I am trying to extract only the numbers from a pandas column.
I used .str.extract('([?:\s\d ]+)')
and it seemed to work well, but when I checked the data, I found a row where the result does not match what I expected.
Row contains: 86531 86530 86529PIP 91897PIP
Result: 86531 86530 86529
As you can see, the last value, 91897PIP, has not been processed. I do not understand why.
What I need is to extract every number in the cell, regardless of whether it has letters, spaces, or special characters at the beginning or end.
CodePudding user response:
Your regex doesn't do what you think it does. What you have is a character class, which matches any single character in the set ?: \t\r\n\f\v0-9 (inside square brackets, ?: is two literal characters, not a non-capturing group). So when the regex encounters the first non-matching character (P for your sample data), the match stops. It's probably simpler to use replace to remove every character that is neither whitespace nor a digit:
import pandas as pd
df = pd.DataFrame({'data': ['86531 86530 86529PIP 91897PIP']})
# Remove every character that is neither whitespace nor a digit
df['data'].str.replace(r'[^\s\d]', '', regex=True)
For your data, this gives:
86531 86530 86529 91897
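To see the difference between the two approaches, it may help to run them side by side. The following is only a minimal sketch: it assumes the original pattern ended with a `+` quantifier, and the variable names `first_run` and `cleaned` are made up for illustration.

```python
import pandas as pd

s = pd.Series(["86531 86530 86529PIP 91897PIP"])

# Character class with extract(): only the first run of allowed characters
# is returned, so the match ends at the first 'P'.
first_run = s.str.extract(r"([?:\s\d]+)", expand=False)

# replace(): strip every character that is neither whitespace nor a digit,
# which keeps all four numbers.
cleaned = s.str.replace(r"[^\s\d]", "", regex=True)

print(repr(first_run[0]))  # '86531 86530 86529'
print(repr(cleaned[0]))    # '86531 86530 86529 91897'
```

The first result reproduces the truncated output from the question, while the second keeps the trailing 91897.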
CodePudding user response:
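You can pass .extract() a regular expression that matches one or more digits. Note that .extract() returns only the first match in each row, so this example puts each value in its own row: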
import pandas as pd
df = pd.DataFrame({"data": ["86531", "86530", "86529PIP", "91897PIP"]})
# \d+ captures the first run of one or more digits in each row
df["data"] = df["data"].str.extract(r"(\d+)")
print(df)
This outputs:
    data
0  86531
1  86530
2  86529
3  91897
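If the values stay in a single cell, as in the question's sample row, one possible alternative is `.str.findall()`, which returns every match instead of only the first. This is a sketch under that assumption; the column names `numbers` and `joined` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"data": ["86531 86530 86529PIP 91897PIP"]})

# findall() returns every run of digits in each cell as a list of strings.
df["numbers"] = df["data"].str.findall(r"\d+")

# Optionally join each list back into one space-separated string.
df["joined"] = df["numbers"].str.join(" ")

print(df["numbers"][0])  # ['86531', '86530', '86529', '91897']
print(df["joined"][0])   # 86531 86530 86529 91897
```

Keeping the list form is convenient if the numbers need further processing. The joined string matches the output of the replace-based approach.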