I have the below spark dataframe.
Column_1
Physics=99;name=Xxxx;age=15
Physics=97;chemistry=85;name=yyyy;age=14
Physics=97;chemistry=85;maths=65;name=zzzz;age=14
I have to split the above dataframe column into multiple columns like below.
column_1 name age
Physics=99 Xxxx 15
Physics=97;chemistry=85 yyyy 14
Physics=97;chemistry=85;maths=65 zzzz 14
I tried split using delimiter ; and limit. But it splits the subjects too into different columns. Name and age is clubbed together into single column. I require all subjects in one column and only name and age in separate columns.
Is it possible to achieve this in Pyspark.
CodePudding user response:
You can use replace trick to split the columns.
df = spark.createDataFrame([('Physics=99;name=Xxxx;age=15'),('Physics=97;chemistry=85;name=yyyy;age=14'),('Physics=97;chemistry=85;maths=65;name=zzzz;age=14')], 'string').toDF('c1')
df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
.withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
.withColumn('c1', f.split('c1', ',')) \
.select(
f.col('c1')[0].alias('stat'),
f.col('c1')[1].alias('name'),
f.col('c1')[2].alias('age')) \
.show(truncate=False)
-------------------------------- --------- ------
|stat |name |age |
-------------------------------- --------- ------
|Physics=99 |name=Xxxx|age=15|
|Physics=97;chemistry=85 |name=yyyy|age=14|
|Physics=97;chemistry=85;maths=65|name=zzzz|age=14|
-------------------------------- --------- ------
CodePudding user response:
You can do like this to extract the names with a regex:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Physics=99;name=Xxxx;age=15",), ("Physics=97;chemistry=85;name=yyyy;age=14",),("Physics=97;chemistry=85;maths=65;name=zzzz;age=14",)], ["Column1"])
new_df = df.withColumn("name", F.regexp_extract('Column1', r'name=(\w )', 1).alias('name'))
new_df.show()
Output:
-------------------- ----
| Column1|name|
-------------------- ----
|Physics=99;name=X...|Xxxx|
|Physics=97;chemis...|yyyy|
|Physics=97;chemis...|zzzz|
-------------------- ----