I am working with a Spark dataframe df that has two columns, tstamp and c_1. The data type of c_1 is string, and I want to add a new column by extracting the string between two markers in that field.
For example, the original dataframe df:
tstamp | c_1 |
---|---|
2022-06-15 10:00:00 | xxx&cd7=H10S10P10&cd21=GA&cd3=6... |
2022-06-15 10:10:01 | xz&cd7=H11S11P11&cd21=CA&cd3=5... |
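For reference, a minimal snippet that builds this sample dataframe (a sketch assuming an existing SparkSession named spark; the c_1 values are abbreviated versions of the real data):

# build the two-row example dataframe shown above
df = spark.createDataFrame(
    [('2022-06-15 10:00:00', 'xxx&cd7=H10S10P10&cd21=GA&cd3=6'),
     ('2022-06-15 10:10:01', 'xz&cd7=H11S11P11&cd21=CA&cd3=5')],
    ['tstamp', 'c_1'])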
We want to add a new column (in the same or another dataframe) called cd_7, whose value is the string between 'cd7=' and '&cd21', like below:
tstamp | c_1 | cd_7 |
---|---|---|
2022-06-15 10:00:00 | xxx&cd7=H10S10P10&cd21=GA&cd3=6... | H10S10P10 |
2022-06-15 10:10:01 | xz&cd7=H11S11P11&cd21=CA&cd3=5... | H11S11P11 |
How could I write this using PySpark? Thanks!
CodePudding user response:
Use a regex to extract everything between the special characters = and &:

from pyspark.sql.functions import regexp_extract

# non-greedy: take everything after the first '=' up to the next '&'
df.withColumn('x', regexp_extract('c_1', r'(?<=[=]).*?(?=[&])', 0)).show()
+-------------------+--------------------+---------+
|             tstamp|                 c_1|        x|
+-------------------+--------------------+---------+
|2022-06-15 10:00:00|xxx&cd7=H10S10P10...|H10S10P10|
|2022-06-15 10:10:01|xz&cd7=H11S11P11&...|H11S11P11|
+-------------------+--------------------+---------+
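If c_1 can contain other '=' / '&' pairs before cd7, a pattern anchored on the key itself is safer. A sketch using a capture group (the column name cd_7 comes from the question):

from pyspark.sql.functions import regexp_extract

# capture everything after 'cd7=' up to, but not including, the next '&'
df.withColumn('cd_7', regexp_extract('c_1', r'cd7=([^&]*)', 1)).show()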
CodePudding user response:
An alternative way to get the answer is to convert this to a pandas dataframe and do the manipulation there, but it is not ideal if the data is large.
# st.find("cd7=") returns the index of 'cd7=', so add 4 to skip past the key itself
df['cd_7'] = df['c_1'].apply(lambda st: st[st.find("cd7=") + 4:st.find("&cd21")])
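For completeness, a rough sketch of the full round trip, starting from the original Spark dataframe df and assuming the data fits in driver memory (pdf and df2 are illustrative names, not from the original answer):

pdf = df.toPandas()   # collect the Spark dataframe to the driver as pandas
pdf['cd_7'] = pdf['c_1'].apply(lambda st: st[st.find("cd7=") + 4:st.find("&cd21")])
df2 = spark.createDataFrame(pdf)   # back to a Spark dataframe if needed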