I have a dataframe:
>temp
Age  Rank  PhoneNumber  State  City
10   1     99-22344-1   Ga     abc
15   12    No           Ma     xyz
For the PhoneNumber column, I want to strip characters such as - from phone numbers, and where the value is No or any other non-numeric word, I want it to be blank. How can I do this?
My attempt handles special characters but not words like 'No':
temp['PhoneNumber'] = temp['PhoneNumber'].str.replace(r'[^\d]+', '')
Desired output:
>temp
Age  Rank  PhoneNumber  State  City
10   1     99223441     Ga     abc
15   12                 Ma     xyz
CodePudding user response:
This does the job.
import re

import pandas as pd

data = [
    [10, 1, '99-223344-1', 'GA', 'Abc'],
    [15, 12, 'No', 'MA', 'Xyz'],
]
# Pass a flat list of names; wrapping it in another list creates MultiIndex columns.
df = pd.DataFrame(data, columns='Age Rank PhoneNumber State City'.split())
print(df)

def valphone(p):
    # Keep the value only if it consists entirely of digits and dashes.
    if re.match(r'[0-9-]+$', p):
        return p
    return ''

df['PhoneNumber'] = df['PhoneNumber'].apply(valphone)
print(df)

Output:

   Age  Rank  PhoneNumber  State  City
0   10     1  99-223344-1     GA   Abc
1   15    12           No     MA   Xyz
   Age  Rank  PhoneNumber  State  City
0   10     1  99-223344-1     GA   Abc
1   15    12                  MA   Xyz

One thing to watch for: columns must be a flat list of names. If you wrap it in an extra pair of brackets, as in columns=['Age Rank PhoneNumber State City'.split()], pandas builds MultiIndex columns. df['PhoneNumber'] then returns a DataFrame instead of a Series, and apply no longer passes the function one value at a time.
In that case you would need df['PhoneNumber'].apply(valphone, axis=1) and a p = p['PhoneNumber'] line at the top of the function. If df['PhoneNumber'] returns a Series for you, as it does with the code above, you need neither.
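The Series-versus-DataFrame difference discussed above can be reproduced in isolation. The sketch below builds the same frame twice, once with a nested column list and once with a flat one, and checks what df['PhoneNumber'] returns in each case. The variable names nested and flat are illustrative only.

```python
import pandas as pd

data = [
    [10, 1, "99-223344-1", "GA", "Abc"],
    [15, 12, "No", "MA", "Xyz"],
]
names = "Age Rank PhoneNumber State City".split()

# A list wrapped in another list is read as the levels of a MultiIndex.
nested = pd.DataFrame(data, columns=[names])
# A flat list gives ordinary, single-level column labels.
flat = pd.DataFrame(data, columns=names)

print(type(nested.columns).__name__)         # kind of column index, nested case
print(type(nested["PhoneNumber"]).__name__)  # what a column lookup returns, nested case
print(type(flat["PhoneNumber"]).__name__)    # what a column lookup returns, flat case
```

With the flat version, df['PhoneNumber'].apply(func) hands func one cell value at a time, so no axis argument or in-function column lookup is needed.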
Followup
OK, assuming the presence of a "phone number validation" module, as is mentioned in the comments, this becomes:
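import phonenumbers
...
def valphone(p):
    p = p['PhoneNumber']  # May not be required
    try:
        # "US" is an assumed default region; parse() raises for non-numbers such as "No".
        n = phonenumbers.parse(p, "US")
    except phonenumbers.NumberParseException:
        return ''
    if phonenumbers.is_possible_number(n):
        return p
    return ''
...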
CodePudding user response:
temp['PhoneNumber'] = temp['PhoneNumber'].str.findall(r'\d').str.join('')
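To see what this one-liner does, here is a small self-contained sketch on the sample data from the question. It also shows a str.replace variant that should give the same result; the names digits and digits_alt are illustrative, not part of the original answer.

```python
import pandas as pd

temp = pd.DataFrame({
    "Age": [10, 15],
    "Rank": [1, 12],
    "PhoneNumber": ["99-22344-1", "No"],
    "State": ["Ga", "Ma"],
    "City": ["abc", "xyz"],
})

# findall collects every digit into a list; join glues the list back into one string.
# A value with no digits (such as "No") yields an empty list, hence an empty string.
digits = temp["PhoneNumber"].str.findall(r"\d").str.join("")

# Alternative: delete every run of non-digit characters.
# regex=True must be passed explicitly in pandas 2.0 and later.
digits_alt = temp["PhoneNumber"].str.replace(r"\D+", "", regex=True)

temp["PhoneNumber"] = digits
print(temp)
```

Both variants keep only the digits, so any value that contains digits mixed with text (for example "ext 12") would keep those digits rather than become blank.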