I have a list of phone numbers in pandas like this:
Phone Number |
---|
923********* |
0923******** |
03********** |
0923******** |
I want to clean the phone numbers based on two rules
- If the length of string is 11, number should start with '03'
- If the length of string is 12, number should start with '923'
I want to discard all other numbers.
So far I have tried creating two seperate columns by following code:
before_cust['digits'] = before_cust['Information'].str.len()
before_cust['starting'] = before_cust['Information'].astype(str).str[:3]
before_cust.loc[((before_cust['digits'] == 11) & before_cust[before_cust['starting'].str.contains("03")==True]) | ((before_cust['digits'] == 12) & (before_cust[before_cust['starting'].str.contains("923")==True]))]
However this code doesn't work. Is there a more efficient way to do this?
CodePudding user response:
Create 2 boolean masks for each condition then filter out your dataframe:
# If the length of string is 11, number should start with '03'
m1 = df['Information'].str.len().eq(11) & df['Information'].str.startswith('03')
# If the length of string is 12, number should start with '923'
m2 = df['Information'].str.len().eq(12) & df['Information'].str.startswith('923')
out = df.loc[m1|m2]
print(out)
# Output:
Information
0 923*********
Note: I think it doesn't work because you use str.contains
rather than str.startswith
.
CodePudding user response:
Assuming you want to get rid of all the rows that do not satisfy your condition(As you haven't included any other information regarding the dataframe), I'd go with this approach :
func = lambda num :(len(num) == 11 and num.startswith("03")) or (len(num) == 12 and num.startswith("923"))
df = df[df["Information"].apply(func)].reset_index(drop = True)
- The lambda function simply returns the boolean that becomes True if your desired condition is met, else False.
- Then simply apply this filter to your dataframe and get rid of all the other columns!