There will be one or more pairs of values in the column1 attribute, separated by semi-colons. For each pair of values, take the one to the right of the hyphen and generate a record for that CompanyCode. For example, if column1 = GT01-5636;GT01-7212, records would be generated for CompanyCode 5636 and 7212. All attributes other than CompanyCode are duplicated.
--------------------
| unparsed_data|
--------------------
|GT01-5636;GT02-7212
|Gx01-5626;GY01-1112;GL01-4336;GT09-0012
I want my data in this form:
--------------------
|CompanyCode
--------------------
|5636 |
|7212 |
|5626 |
|1112 |
|4336 |
|0012 |
CodePudding user response:
You can use the regexp_extract_all
function to extract all eligible strings, and then use the explode
function to expand them.
val df1 = df.selectExpr("explode(regexp_extract_all(unparsed_data, '(.? )-(\\\\d )', 2)) as CompanyCode")
df1.show()
CodePudding user response:
A non-regex way using split
and explode
. I'm writing it in pyspark, but it'd be fairly similar in scala.
data_sdf. \
withColumn('data_split', func.split('unparsed_data', ';')). \
selectExpr('explode(data_split) as split_data'). \
withColumn('company_code', func.split('split_data', '-')[1]). \
show()
# ---------- ------------
# |split_data|company_code|
# ---------- ------------
# | GT01-5636| 5636|
# | GT02-7212| 7212|
# | Gx01-5626| 5626|
# | GY01-1112| 1112|
# | GL01-4336| 4336|
# | GT09-0012| 0012|
# ---------- ------------