I am working with a Spark dataframe df that has two columns, tstamp and c_1. The data type of c_1 is string, and I want to add a new column by extracting the string between two markers in that field.
For example, the original dataframe df:
tstamp | c_1 |
---|---|
2022-06-15 10:00:00 | xxx&cd7=H10S10P10&cd21=GA&cd3=6... |
2022-06-15 10:10:01 | xz&cd7=H11S11P11&cd21=CA&cd3=5... |
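For reference, a minimal snippet that builds this sample dataframe (a sketch assuming an existing SparkSession named spark; the c_1 values are abbreviated versions of the real data):

# build the two-row example dataframe shown above
df = spark.createDataFrame(
    [('2022-06-15 10:00:00', 'xxx&cd7=H10S10P10&cd21=GA&cd3=6'),
     ('2022-06-15 10:10:01', 'xz&cd7=H11S11P11&cd21=CA&cd3=5')],
    ['tstamp', 'c_1'])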
We want to add a new column (in the same or another dataframe) called cd_7, whose value is the string between 'cd7=' and '&cd21', like below:
tstamp | c_1 | cd_7 |
---|---|---|
2022-06-15 10:00:00 | xxx&cd7=H10S10P10&cd21=GA&cd3=6... | H10S10P10 |
2022-06-15 10:10:01 | xz&cd7=H11S11P11&cd21=CA&cd3=5... | H11S11P11 |
How could I write this using PySpark? Thanks!
CodePudding user response:
Use a regex to extract everything between the special characters = and &:

from pyspark.sql.functions import regexp_extract

# non-greedy: take everything after the first '=' up to the next '&'
df.withColumn('x', regexp_extract('c_1', r'(?<=[=]).*?(?=[&])', 0)).show()
+-------------------+--------------------+---------+
|             tstamp|                 c_1|        x|
+-------------------+--------------------+---------+
|2022-06-15 10:00:00|xxx&cd7=H10S10P10...|H10S10P10|
|2022-06-15 10:10:01|xz&cd7=H11S11P11&...|H11S11P11|
+-------------------+--------------------+---------+
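If c_1 can contain other '=' / '&' pairs before cd7, a pattern anchored on the key itself is safer. A sketch using a capture group (the column name cd_7 comes from the question):

from pyspark.sql.functions import regexp_extract

# capture everything after 'cd7=' up to, but not including, the next '&'
df.withColumn('cd_7', regexp_extract('c_1', r'cd7=([^&]*)', 1)).show()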
CodePudding user response:
An alternative way to get the answer is to convert this to a pandas dataframe and do the manipulation there, but it is not ideal if the data is large.
# st.find("cd7=") returns the index of 'cd7=', so add 4 to skip past the key itself
df['cd_7'] = df['c_1'].apply(lambda st: st[st.find("cd7=") + 4:st.find("&cd21")])
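For completeness, a rough sketch of the full round trip, starting from the original Spark dataframe df and assuming the data fits in driver memory (pdf and df2 are illustrative names, not from the original answer):

pdf = df.toPandas()   # collect the Spark dataframe to the driver as pandas
pdf['cd_7'] = pdf['c_1'].apply(lambda st: st[st.find("cd7=") + 4:st.find("&cd21")])
df2 = spark.createDataFrame(pdf)   # back to a Spark dataframe if needed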