I am looking for an efficient way to explode the rows of the PySpark DataFrame df_input into columns. I don't understand the format '@{name...}' and don't know where to start decoding it. Thanks for any help!
df_input = sqlContext.createDataFrame(
    [
        (1, '@{name= Hans; age= 45}'),
        (2, '@{name= Jeff; age= 15}'),
        (3, '@{name= Elona; age= 23}')
    ],
    ('id', 'firstCol')
)
expected result:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| Hans| 45|
|  2| Jeff| 15|
|  3|Elona| 23|
+---+-----+---+
CodePudding user response:
from pyspark.sql.functions import regexp_extract

df_input.select(
    df_input.id,            # id
    regexp_extract(         # use a regex
        df_input.firstCol,  # on firstCol
        r'\s(.*);',         # find a space, then capture a (group of text) up to the ';'
        1                   # use capture group 1 as the text
    ).alias("name"),
    regexp_extract(
        df_input.firstCol,
        r'\s.*\s(.*)}',     # find the second space, then capture a (group of text) up to the '}'
        1                   # use capture group 1 as the text
    ).alias("age")
).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| Hans| 45|
|  2| Jeff| 15|
|  3|Elona| 23|
+---+-----+---+
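To sanity-check the two patterns outside of Spark, the same regexes can be tried with Python's re module (no Spark session needed); the matching semantics for these patterns are the same as in regexp_extract:

```python
import re

rows = ['@{name= Hans; age= 45}', '@{name= Jeff; age= 15}', '@{name= Elona; age= 23}']

name_pat = re.compile(r'\s(.*);')    # first space, greedy capture up to the ';'
age_pat = re.compile(r'\s.*\s(.*)}') # second space, greedy capture up to the '}'

for s in rows:
    name = name_pat.search(s).group(1)
    age = age_pat.search(s).group(1)
    print(name, age)
```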
CodePudding user response:
Convert the string into a map type using the str_to_map function, explode it, then pivot the keys:
from pyspark.sql import functions as F

df = df_input.selectExpr(
    "id",
    "explode(str_to_map(regexp_replace(firstCol, '[@{}]', ''), ';', '='))"
).groupby("id").pivot("key").agg(F.first("value"))

df.show()
# +---+----+------+
# |id | age|name  |
# +---+----+------+
# |1  | 45 | Hans |
# |2  | 15 | Jeff |
# |3  | 23 | Elona|
# +---+----+------+
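As a plain-Python illustration of what the str_to_map step does (hypothetical helper, not part of the answer above): strip the '@{', '}' wrapper, split on ';' for the pairs and '=' for key/value. Note that str_to_map itself keeps the stray spaces around keys and values (visible in the output above as ' Hans '); the .strip() here is an extra cleanup step:

```python
import re

def parse(first_col: str) -> dict:
    # strip '@', '{' and '}', mirroring regexp_replace(firstCol, '[@{}]', '')
    body = re.sub(r'[@{}]', '', first_col)
    # split on ';' (pair delimiter), then '=' (key/value delimiter), like str_to_map;
    # .strip() removes the surrounding spaces that str_to_map would keep
    return {k.strip(): v.strip() for k, v in
            (pair.split('=', 1) for pair in body.split(';'))}

print(parse('@{name= Hans; age= 45}'))  # → {'name': 'Hans', 'age': '45'}
```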