I have a table, approx. 1K rows and two columns. The first row is the emp_id and the second is tel_num. The tel_num column is not formatted all the same...some examples are (555) 555-9876, 18763334455, 505-999-888x222, some have no values...and so on. The goal is to format them all the same 10 digits without the leading 1s or any extensions.
The table looks like the following
emp_id | tel_num |
---|---|
Jon Doe | 18763334455 |
Cal Foe | 505-999-8888x222 |
Ho Moe | nan |
GI joe | 676.909.4321 |
trying to make this...
Column A | tel_format |
---|---|
Jon Doe | (876) 333-4455 |
Cal Foe | (505) 999-8888 |
Ho Moe | nan |
GI Joe | (679) 909-4321 |
I tried this format... I'm using databricks. The current process i tried is somewhat like this...
def formatphone(ph_var):
...some process
return formatted_ph
df = df.withColumn('tel_format', formatphone(df.tel_num))
I can't get it to work.
CodePudding user response:
You can use the following function, assuming that all possible formats are shown in your sample data.
To use this function in withColumn()
, you need to create a UDF from it.
@F.udf(returnType=F.StringType())
def format_telephone_number(phone_number):
if phone_number is None:
return None
if phone_number=='nan':
return None
if phone_number[0]== ' ':
return '(' phone_number[2:5] ') ' phone_number[5:8] '-' phone_number[8:12]
if '-' in phone_number:
return '(' phone_number[0:3] ') ' phone_number[4:7] '-' phone_number[8:12]
if '.' in phone_number:
return '(' phone_number[0:3] ') ' phone_number[4:7] '-' phone_number[8:12]
else:
return None