Home > Software design >  Pyspark Dataframe Phone Number Format
Pyspark Dataframe Phone Number Format

Time:01-12

I have a table, approx. 1K rows and two columns. The first row is the emp_id and the second is tel_num. The tel_num column is not formatted all the same...some examples are (555) 555-9876, 18763334455, 505-999-888x222, some have no values...and so on. The goal is to format them all the same 10 digits without the leading 1s or any extensions.

The table looks like the following

emp_id tel_num
Jon Doe 18763334455
Cal Foe 505-999-8888x222
Ho Moe nan
GI joe 676.909.4321

trying to make this...

Column A tel_format
Jon Doe (876) 333-4455
Cal Foe (505) 999-8888
Ho Moe nan
GI Joe (679) 909-4321

I tried this format... I'm using databricks. The current process i tried is somewhat like this...

def formatphone(ph_var):
    ...some process
    return formatted_ph

df = df.withColumn('tel_format', formatphone(df.tel_num))

I can't get it to work.

CodePudding user response:

You can use the following function, assuming that all possible formats are shown in your sample data. To use this function in withColumn(), you need to create a UDF from it.

@F.udf(returnType=F.StringType())
def format_telephone_number(phone_number):
    if phone_number is None:
        return None
    if phone_number=='nan':
        return None
    if phone_number[0]== ' ':
        return '('    phone_number[2:5]   ') '   phone_number[5:8]   '-'   phone_number[8:12]
    if '-' in phone_number:
        return '('    phone_number[0:3]   ') '   phone_number[4:7]   '-'   phone_number[8:12]
    if '.' in phone_number:
        return '('    phone_number[0:3]   ') '   phone_number[4:7]   '-'   phone_number[8:12]
    else:
        return None

  
  • Related