Extract only numbers from a string with Python

I am trying to extract numbers only from a pandas column

I used .str.extract('([?:\s\d+]+)')

and it seemed to work well, but when I checked the data, I found a row where the result does not match what I expected.

Row contains: 86531 86530 86529PIP 91897PIP

Result: 86531 86530 86529

As you can see, the last value, 91897PIP, has not been processed. I do not understand why.

What I need is for every run of digits to be extracted, regardless of whether it has letters, spaces, or special characters before or after it.

CodePudding user response：

Your regex doesn't do what you think it does. What you have is a character class, which matches any of the characters in the set ?:+ \t\r\n\f\v0-9. Inside the brackets, ?, : and + are literal characters, not a non-capturing group or a quantifier. So when the regex encounters the first non-matching character (P for your sample data) it stops, and .str.extract returns only that first match. It's probably simpler to use replace to remove every character that is neither whitespace nor a digit:

import pandas as pd

df = pd.DataFrame({'data': ['86531 86530 86529PIP 91897PIP']})
df['data'].str.replace(r'[^\s\d]', '', regex=True)

Which for your data will give:

86531 86530 86529 91897
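
As a rough illustration of the point above, the sketch below first reproduces the truncated match and then shows one possible alternative that collects every run of digits. It assumes the pattern in the question was [?:\s\d+]+ (with + as the quantifier), and the column name "numbers" is made up for this example; .str.findall followed by .str.join is just one way to do it, not necessarily what either poster intended.

```python
import re

import pandas as pd

row = "86531 86530 86529PIP 91897PIP"

# Assumed original pattern: a single character class, repeated. It matches a
# run of '?', ':', '+', whitespace, or digits and stops at the first other
# character (the 'P' in 86529PIP).
first_run = re.search(r"[?:\s\d+]+", row).group()

# Alternative sketch: collect every run of digits in the string, then join
# the pieces back together with single spaces.
df = pd.DataFrame({"data": [row]})
df["numbers"] = df["data"].str.findall(r"\d+").str.join(" ")

print(repr(first_run))
print(df["numbers"].iloc[0])
```

Unlike the replace approach, this normalises the separators between the numbers to single spaces, which may or may not be what you want.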

CodePudding user response：

You can pass in a regular expression to .extract() that matches one or more digits:

import pandas as pd

df = pd.DataFrame({"data": ["86531", "86530", "86529PIP", "91897PIP"]})
df["data"] = df["data"].str.extract(r"(\d+)")
print(df)

This outputs: