Split pyspark dataframe column and limit the splits-CodePudding

I have the below spark dataframe.

Column_1
Physics=99;name=Xxxx;age=15
Physics=97;chemistry=85;name=yyyy;age=14
Physics=97;chemistry=85;maths=65;name=zzzz;age=14

I have to split the above dataframe column into multiple columns like below.

column_1                         name   age
Physics=99                       Xxxx         15
Physics=97;chemistry=85          yyyy           14
Physics=97;chemistry=85;maths=65 zzzz      14

I tried split using delimiter ; and limit. But it splits the subjects too into different columns. Name and age is clubbed together into single column. I require all subjects in one column and only name and age in separate columns.

Is it possible to achieve this in Pyspark.

CodePudding user response：

You can use replace trick to split the columns.

df = spark.createDataFrame([('Physics=99;name=Xxxx;age=15'),('Physics=97;chemistry=85;name=yyyy;age=14'),('Physics=97;chemistry=85;maths=65;name=zzzz;age=14')], 'string').toDF('c1')

df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
  .withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
  .withColumn('c1', f.split('c1', ',')) \
  .select(
    f.col('c1')[0].alias('stat'),
    f.col('c1')[1].alias('name'),
    f.col('c1')[2].alias('age')) \
  .show(truncate=False)

 -------------------------------- --------- ------ 
|stat                            |name     |age   |
 -------------------------------- --------- ------ 
|Physics=99                      |name=Xxxx|age=15|
|Physics=97;chemistry=85         |name=yyyy|age=14|
|Physics=97;chemistry=85;maths=65|name=zzzz|age=14|
 -------------------------------- --------- ------

CodePudding user response：

You can do like this to extract the names with a regex:

import pyspark.sql.functions as F
df = spark.createDataFrame([("Physics=99;name=Xxxx;age=15",), ("Physics=97;chemistry=85;name=yyyy;age=14",),("Physics=97;chemistry=85;maths=65;name=zzzz;age=14",)], ["Column1"])
new_df = df.withColumn("name", F.regexp_extract('Column1', r'name=(\w )', 1).alias('name'))
new_df.show()

Output:

 -------------------- ---- 
|             Column1|name|
 -------------------- ---- 
|Physics=99;name=X...|Xxxx|
|Physics=97;chemis...|yyyy|
|Physics=97;chemis...|zzzz|
 -------------------- ----