How to extract a string between two characters in a column of a Spark DataFrame


I am dealing with a Spark DataFrame df which has two columns, tstamp and c_1. The data type of c_1 is string, and I want to add a new column by extracting the string between two characters in that field.

For example: original dataframe df

tstamp               c_1
2022-06-15 10:00:00  xxx&cd7=H10S10P10&cd21=GA&cd3=6...
2022-06-15 10:10:01  xz&cd7=H11S11P11&cd21=CA&cd3=5...

I want to add a new column (to the same or another DataFrame) called cd_7, whose value is the string between 'cd7=' and '&cd21', like below:

tstamp               c_1                                  cd_7
2022-06-15 10:00:00  xxx&cd7=H10S10P10&cd21=GA&cd3=6...   H10S10P10
2022-06-15 10:10:01  xz&cd7=H11S11P11&cd21=CA&cd3=5...    H11S11P11
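For reference, a minimal sketch that recreates the input DataFrame above (SparkSession setup assumed; the c_1 values ending in '...' are kept truncated as in the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows copied from the example above (c_1 values truncated at '...')
df = spark.createDataFrame(
    [("2022-06-15 10:00:00", "xxx&cd7=H10S10P10&cd21=GA&cd3=6..."),
     ("2022-06-15 10:10:01", "xz&cd7=H11S11P11&cd21=CA&cd3=5...")],
    ["tstamp", "c_1"],
)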

How could I write it using PySpark? Thanks!

CodePudding user response:

Use a regex to extract the text that follows cd7= and stops at the next &:

from pyspark.sql.functions import regexp_extract

# non-greedy match: everything after 'cd7=' up to the next '&'
df.withColumn('x', regexp_extract('c_1', '(?<=cd7=).*?(?=&)', 0)).show()

+-------------------+--------------------+---------+
|             tstamp|                 c_1|        x|
+-------------------+--------------------+---------+
|2022-06-15 10:00:00|xxx&cd7=H10S10P10...|H10S10P10|
|2022-06-15 10:10:01|xz&cd7=H11S11P11&...|H11S11P11|
+-------------------+--------------------+---------+
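A variant of the same idea uses a capture group instead of lookarounds and targets cd7= specifically; this sketch writes to the cd_7 column name asked for in the question:

from pyspark.sql.functions import regexp_extract

# group 1 captures everything after 'cd7=' up to (not including) the next '&'
df.withColumn('cd_7', regexp_extract('c_1', 'cd7=([^&]*)', 1)).show()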

CodePudding user response:

An alternative way to get the answer is to convert this to a pandas DataFrame and do the manipulation there, but it is not ideal if the data is large.

pdf = df.toPandas()
pdf['cd_7'] = pdf['c_1'].apply(lambda st: st[st.find("cd7=") + 4:st.find("&cd21")])
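If converting to pandas is not an option, here is a sketch of the same find-and-slice idea done natively in Spark, assuming the marker strings 'cd7=' and '&cd21' from the question:

from pyspark.sql import functions as F

# 1-based position just after 'cd7=' and position of the following '&cd21'
start = F.locate('cd7=', F.col('c_1')) + 4
end = F.locate('&cd21', F.col('c_1'))

df = df.withColumn('cd_7', F.col('c_1').substr(start, end - start))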
