Pyspark Dataframe Phone Number Format-CodePudding

I have a table, approx. 1K rows and two columns. The first row is the emp_id and the second is tel_num. The tel_num column is not formatted all the same...some examples are (555) 555-9876, 18763334455, 505-999-888x222, some have no values...and so on. The goal is to format them all the same 10 digits without the leading 1s or any extensions.

The table looks like the following

emp_id	tel_num
Jon Doe	18763334455
Cal Foe	505-999-8888x222
Ho Moe	nan
GI joe	676.909.4321

trying to make this...

Column A	tel_format
Jon Doe	(876) 333-4455
Cal Foe	(505) 999-8888
Ho Moe	nan
GI Joe	(679) 909-4321

I tried this format... I'm using databricks. The current process i tried is somewhat like this...

def formatphone(ph_var):
    ...some process
    return formatted_ph

df = df.withColumn('tel_format', formatphone(df.tel_num))

I can't get it to work.

CodePudding user response：

You can use the following function, assuming that all possible formats are shown in your sample data. To use this function in withColumn(), you need to create a UDF from it.

@F.udf(returnType=F.StringType())
def format_telephone_number(phone_number):
    if phone_number is None:
        return None
    if phone_number=='nan':
        return None
    if phone_number[0]== ' ':
        return '('    phone_number[2:5]   ') '   phone_number[5:8]   '-'   phone_number[8:12]
    if '-' in phone_number:
        return '('    phone_number[0:3]   ') '   phone_number[4:7]   '-'   phone_number[8:12]
    if '.' in phone_number:
        return '('    phone_number[0:3]   ') '   phone_number[4:7]   '-'   phone_number[8:12]
    else:
        return None