Remove period then email extension after '@' into new column to extract first and last nam-CodePudding

I have a list of emails that are in the format [email protected]. I would like to create a new column with only the first and last name extracted from the email address.

I am using PySpark. This is an example of the desired output:

data = [{"Email": "[email protected]", "Role": "manager"},
{"Email": "[email protected]", "Role": "vp"}]

df = spark.createDataFrame(data)

type(df)

# original data set
 ------------------ ------- 
|Email             |Role   |
 ------------------ ------- 
|[email protected]|manager|
|[email protected]|vp     |
 ------------------|------- 

# what I want the output to look like
 ------------------ ------- -------- 
|Email             |Role   |Name    |
 ------------------ ------- -------- 
|[email protected]|manager|john doe|
|[email protected]|vp     |jane doe|
 ------------------|-------|--------

How can I remove the period, replace it with a space, then drop everything after the @ into a new column to get the names like the example above?

CodePudding user response：

You can use Python's .split method for strings and a loop to add a "Name" field to each record in your list.

for d in data:
    d["Name] = " ".join(e["Email"].split("@")[0].split("."))

In the above loop we split the "Email" field at the "@" character, creating a list of two elements, of which we take the first one, and then split that on the character ".", which gives us the first and last name. Then we join them with a space (" ") in between.

CodePudding user response：

You can use regex_extract and regex_replace.

from pyspark.sql import functions as F
df = df.withColumn('Name', F.regexp_extract(
        F.regexp_replace('Email', '\.', ' '), 
        '(.*)@', 
        1)
     )

First, regexp_replace('Email', '\.', ' ') will replace . to space in Email column.

Then, regexp_extract(..., '(.*)@', 1) will extract the 1st capture group.

Regex explanation

(.*) => .* is any characters with any length. Wrap with () to make a capture group.
@ => match @ mark. 

(.*)@ => 1st Capture group will capture any characters before @.

CodePudding user response：

It will replace the . and @... with a space which we'll have to trim from the end.

from pyspark.sql import functions as F

df.withColumn('Name', F.trim(F.regexp_replace('Email', '\.|@.*', ' '))).show()
#  ------------------ ------- -------- 
# |             Email|   Role|    Name|
#  ------------------ ------- -------- 
# |[email protected]|manager|john doe|
# |[email protected]|     vp|jane doe|
#  ------------------ ------- --------