Extract and create a new email column dataframe pyspark-CodePudding

I have a df with a column with emails and more information that I don't want. Here are some examples:

                                       Email_Col
"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>
"YourBoss" <[email protected]>
"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>

I need to clean the column or create a new one with the emails. Here the expected output, an array column:

                           New_Email_Col
[[email protected], Stark, [email protected]]
[[email protected]]
[[email protected] [email protected], [email protected]]

My code:

import re

def extract(col):
    for row in col:
        all_matches = re.findall(r'\w .\w @\w .\w ', row)
    return all_matches

extract_udf = udf(lambda col: extract(col), ArrayType(StringType()))

df = df.withColumn(('emails'), extract_udf(col('to')))

My error:

PythonException: 'TypeError: expected string or bytes-like object', from , line 4. Full traceback below

CodePudding user response：

Please refrain from udf - they are slow and nowadays not needed in the vast majority of cases. The following does the trick:

F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)")

regexp_extract_all is available from Spark 3.1

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('''"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>''',),
     ('''"YourBoss" <[email protected]>''',),
     ('''"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>''',)],
    ['Email_Col'])

df = df.withColumn('Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))

df.show(truncate=0)
#  -------------------------------------------------------------------- 
# |Email_Col                                                           |
#  -------------------------------------------------------------------- 
# |[[email protected], [email protected]]                    |
# |[[email protected]]                                 |
# |[[email protected], [email protected], [email protected]]|
#  --------------------------------------------------------------------

To add a separate new column:

df = df.withColumn('New_Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))