I have a problem. I would like to remove all numbers that have more than 2 digits from a text column, while keeping shorter numbers. What is the best way to do this in pandas?
   customerId                            text
0           1  Hello you should call 46232348
1           2                      What is 42
2           3       Is this a number or 23213
3           4               1 person is there
4           5                    It is 4x4 cm
import pandas as pd
d = {
    "customerId": [1, 2, 3, 4, 5],
    "text": ["Hello you should call 46232348",
             "What is 42",
             "Is this a number or 23213",
             '1 person is there',
             'It is 4x4 cm'],
}
df = pd.DataFrame(data=d)
print(df)
df['text_without_number'] = df['text'].str.replace(r'\d+', '', regex=True)
print(df)
What I got
   customerId                            text     text_without_number
0           1  Hello you should call 46232348  Hello you should call
1           2                      What is 42                What is
2           3       Is this a number or 23213    Is this a number or
3           4               1 person is there         person is there
4           5                    It is 4x4 cm              It is x cm
What I want
   customerId                            text    text_without_number
0           1  Hello you should call 46232348  Hello you should call
1           2                      What is 42             What is 42
2           3       Is this a number or 23213    Is this a number or
3           4               1 person is there      1 person is there
4           5                    It is 4x4 cm           It is 4x4 cm
CodePudding user response:
You can use \d{3,} to match runs of 3 or more digits. The leading \s* also removes any whitespace in front of each removed number:
df['text_without_number'] = df['text'].str.replace(r'\s*\d{3,}', '', regex=True)
output:
   customerId                            text    text_without_number
0           1  Hello you should call 46232348  Hello you should call
1           2                      What is 42             What is 42
2           3       Is this a number or 23213    Is this a number or
3           4               1 person is there      1 person is there
4           5                    It is 4x4 cm           It is 4x4 cm
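
If you want to check the pattern outside pandas first, here is a minimal sketch using only the standard re module on the same sample strings. The STRICT variant with \b word boundaries is not part of the answer above; it is an assumption about what you might want if digits glued to letters (the made-up sample "A12345") should be left alone.

```python
import re

# Pattern from the answer: optional leading whitespace, then 3 or more digits.
PATTERN = re.compile(r"\s*\d{3,}")

# Hypothetical stricter variant (an assumption, not from the answer):
# only remove digit runs that stand alone as a whole token.
STRICT = re.compile(r"\s*\b\d{3,}\b")

texts = [
    "Hello you should call 46232348",
    "What is 42",
    "Is this a number or 23213",
    "1 person is there",
    "It is 4x4 cm",
]

# Apply the answer's pattern to every sample string.
cleaned = [PATTERN.sub("", t) for t in texts]
for before, after in zip(texts, cleaned):
    print(repr(before), "->", repr(after))

# Made-up sample where the two patterns differ.
sample = "order A12345 shipped"
loose = PATTERN.sub("", sample)    # strips the digits attached to "A"
strict = STRICT.sub("", sample)    # leaves the string unchanged
print(repr(loose), "|", repr(strict))
```

Either pattern can be passed to df['text'].str.replace(..., regex=True) in the same way as in the answer.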