PySpark Exploding array<map<string,string>>


I am trying to explode a column of type array<map<string,string>>, using PySpark in Databricks. The data looks like:

root
 |-- zipcode: string (nullable = true)
 |-- employment_status: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)


+-------+---------------------------------------------------------------------------------------------------+
|zipcode|employment_status                                                                                  |
+-------+---------------------------------------------------------------------------------------------------+
|95678  |[[key -> Data, values -> [{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]]]|
|95679  |[[key -> Data, values -> [{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}]]]         |
|95680  |[[key -> Data, values -> [{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}]]]        |
|95681  |[[key -> Data, values -> [{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}]]]    |
|95682  |[[key -> Data, values -> [{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}]]] |
+-------+---------------------------------------------------------------------------------------------------+

I have no trouble exploding this and making use of the values with

from pyspark.sql.functions import explode

sdf2 = sdf.select(sdf.zipcode, explode(sdf.employment_status))   # one row per map in the array
sdf3 = sdf2.select(sdf2.zipcode, explode(sdf2.col))              # one row per key/value pair in the map
sdf4 = sdf3.filter(sdf3.value != "Data").select(sdf3.zipcode, sdf3.value)
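For intuition, the two explodes do roughly the following flattening; this is a plain-Python sketch of the same steps (not Spark, and the sample row is hand-copied from the table above):

```python
# One source row: an array of maps, each map holding string keys/values.
row = {
    "zipcode": "95678",
    "employment_status": [
        {"key": "Data",
         "values": "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"},
    ],
}

flattened = []
for m in row["employment_status"]:   # explode(employment_status): one row per map
    for k, v in m.items():           # explode(col): one row per key/value pair
        flattened.append((row["zipcode"], k, v))

# Mirror sdf3.filter(sdf3.value != "Data"): drop the row whose value is "Data"
kept = [(z, v) for (z, k, v) in flattened if v != "Data"]
```

After the filter only the `values` string survives per zipcode, which is exactly the shape of `sdf4` below.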

which results in the following:

sdf4:pyspark.sql.dataframe.DataFrame
zipcode:string
value:string

+-------+------------------------------------------------------------------------+
|zipcode|value                                                                   |
+-------+------------------------------------------------------------------------+
|95678  |[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]|
|95679  |[{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}]         |
|95680  |[{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}]        |
|95681  |[{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}]    |
|95682  |[{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}] |
+-------+------------------------------------------------------------------------+

I have a solution with F.regexp_extract and F.collect_list, but it doesn't feel quite right. The result should be the following:

+-------+-------------+-------------+------------+
|zipcode|full_employed|part_employed|non_employed|
+-------+-------------+-------------+------------+
|  95678|        13348|         8918|        9972|
|  95679|            0|           29|           0|
|  95680|           43|            0|          71|
|  95681|          327|          265|         691|
|  95682|         8534|         6436|        8748|
+-------+-------------+-------------+------------+

Or "Full-time", "Part-time" and "No Earnings" as column names; I guess that doesn't matter.

Any ideas are highly appreciated! Thanks!

CodePudding user response:

So, something like this?

from pyspark.sql import functions as F

# The '+' quantifiers matter: each group captures everything after "y="
# up to (but not including) the closing brace.
pattern = r'y=([^}]+).*y=([^}]+).*y=([^}]+)'

(sdf4
    .withColumn('y1', F.regexp_extract('value', pattern, 1).cast('int'))
    .withColumn('y2', F.regexp_extract('value', pattern, 2).cast('int'))
    .withColumn('y3', F.regexp_extract('value', pattern, 3).cast('int'))

    .select('zipcode', F
        .expr('stack(1, y1, y2, y3)')
        .alias('full_employed','part_employed','non_employed')
    )
    .show()
)

# Output
# +-------+-------------+-------------+------------+
# |zipcode|full_employed|part_employed|non_employed|
# +-------+-------------+-------------+------------+
# |  95678|        13348|         8918|        9972|
# |  95679|            0|           29|           0|
# |  95680|           43|            0|          71|
# |  95681|          327|          265|         278|
# |  95682|         8534|         6436|        8748|
# +-------+-------------+-------------+------------+
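If you want to sanity-check the pattern outside Spark first, the same regex can be exercised with Python's re module (assuming the string layout shown in sdf4 above; the sample value is copied from the 95678 row):

```python
import re

# Same pattern as in the Spark code: three capture groups, each grabbing
# the characters after "y=" up to the closing brace. The greedy ".*"
# between them backtracks so the groups land on the three y-values in order.
pattern = r"y=([^}]+).*y=([^}]+).*y=([^}]+)"
value = "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"

match = re.search(pattern, value)
full_t, part_t, none_t = (int(g) for g in match.groups())
print(full_t, part_t, none_t)  # 13348 8918 9972
```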