I have a df with a column with emails and more information that I don't want. Here are some examples:
Email_Col
"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>
"YourBoss" <[email protected]>
"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>
I need to clean the column or create a new one with the emails. Here the expected output, an array column:
New_Email_Col
[[email protected], Stark, [email protected]]
[[email protected]]
[[email protected] [email protected], [email protected]]
My code:
import re
def extract(col):
for row in col:
all_matches = re.findall(r'\w .\w @\w .\w ', row)
return all_matches
extract_udf = udf(lambda col: extract(col), ArrayType(StringType()))
df = df.withColumn(('emails'), extract_udf(col('to')))
My error:
PythonException: 'TypeError: expected string or bytes-like object', from , line 4. Full traceback below
CodePudding user response:
Please refrain from udf
- they are slow and nowadays not needed in the vast majority of cases. The following does the trick:
F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)")
regexp_extract_all
is available from Spark 3.1
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('''"Snow, John" <[email protected]>, "Stark, Arya" <[email protected]>''',),
('''"YourBoss" <[email protected]>''',),
('''"test1 <[email protected]>", "test2 <[email protected]>", "test3" <[email protected]>''',)],
['Email_Col'])
df = df.withColumn('Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))
df.show(truncate=0)
# --------------------------------------------------------------------
# |Email_Col |
# --------------------------------------------------------------------
# |[[email protected], [email protected]] |
# |[[email protected]] |
# |[[email protected], [email protected], [email protected]]|
# --------------------------------------------------------------------
To add a separate new column:
df = df.withColumn('New_Email_Col', F.expr("regexp_extract_all(Email_Col, '(?<=<).*?(?=>)', 0)"))