How to use split function in scala with an iteration on a column?-CodePudding

There will be one or more pairs of values in the column1 attribute, separated by semi-colons. For each pair of values, take the one to the right of the hyphen and generate a record for that CompanyCode. For example, if column1 = GT01-5636;GT01-7212, records would be generated for CompanyCode 5636 and 7212. All attributes other than CompanyCode are duplicated.

 --------------------
|       unparsed_data|
 --------------------
|GT01-5636;GT02-7212
|Gx01-5626;GY01-1112;GL01-4336;GT09-0012

I want my data in this form:

 --------------------
|CompanyCode
 --------------------
|5636                |
|7212                |
|5626                |
|1112                |
|4336                |
|0012                |

CodePudding user response：

You can use the regexp_extract_all function to extract all eligible strings, and then use the explode function to expand them.

val df1 = df.selectExpr("explode(regexp_extract_all(unparsed_data, '(.? )-(\\\\d )', 2)) as CompanyCode")
df1.show()

CodePudding user response：

A non-regex way using split and explode. I'm writing it in pyspark, but it'd be fairly similar in scala.

data_sdf. \
    withColumn('data_split', func.split('unparsed_data', ';')). \
    selectExpr('explode(data_split) as split_data'). \
    withColumn('company_code', func.split('split_data', '-')[1]). \
    show()

#  ---------- ------------ 
# |split_data|company_code|
#  ---------- ------------ 
# | GT01-5636|        5636|
# | GT02-7212|        7212|
# | Gx01-5626|        5626|
# | GY01-1112|        1112|
# | GL01-4336|        4336|
# | GT09-0012|        0012|
#  ---------- ------------