Phone number cleaning-CodePudding

I have a list of phone numbers in pandas like this:

Phone Number
923*********
0923********
03**********
0923********

I want to clean the phone numbers based on two rules

If the length of string is 11, number should start with '03'
If the length of string is 12, number should start with '923'

I want to discard all other numbers.

So far I have tried creating two seperate columns by following code:

before_cust['digits'] = before_cust['Information'].str.len()
before_cust['starting'] = before_cust['Information'].astype(str).str[:3]
before_cust.loc[((before_cust['digits'] == 11) & before_cust[before_cust['starting'].str.contains("03")==True]) | ((before_cust['digits'] == 12) & (before_cust[before_cust['starting'].str.contains("923")==True]))]

However this code doesn't work. Is there a more efficient way to do this?

CodePudding user response：

Create 2 boolean masks for each condition then filter out your dataframe:

# If the length of string is 11, number should start with '03'
m1 = df['Information'].str.len().eq(11) & df['Information'].str.startswith('03')

# If the length of string is 12, number should start with '923'
m2 = df['Information'].str.len().eq(12) & df['Information'].str.startswith('923')

out = df.loc[m1|m2]
print(out)

# Output:
    Information
0  923*********

Note: I think it doesn't work because you use str.contains rather than str.startswith.

CodePudding user response：

Assuming you want to get rid of all the rows that do not satisfy your condition(As you haven't included any other information regarding the dataframe), I'd go with this approach :

func = lambda num :(len(num) == 11 and num.startswith("03")) or (len(num) == 12 and num.startswith("923"))
df = df[df["Information"].apply(func)].reset_index(drop = True)

The lambda function simply returns the boolean that becomes True if your desired condition is met, else False.
Then simply apply this filter to your dataframe and get rid of all the other columns!