I am trying to explode a column in the format array<map<string,string>>, using PySpark in Databricks. The data looks like:
root
|-- zipcode: string (nullable = true)
|-- employment_status: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
+-------+----------------------------------------------------------------------------------------------------+
|zipcode|employment_status                                                                                   |
+-------+----------------------------------------------------------------------------------------------------+
|95678  |[[key -> Data, values -> [{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]]]|
|95679  |[[key -> Data, values -> [{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}]]]         |
|95680  |[[key -> Data, values -> [{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}]]]        |
|95681  |[[key -> Data, values -> [{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}]]]    |
|95682  |[[key -> Data, values -> [{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}]]] |
+-------+----------------------------------------------------------------------------------------------------+
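For reproducibility, a minimal sketch that builds a single row with this schema (note the inner list of {x=..., y=...} structs is a plain string inside the map, which is what makes this awkward):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One sample row; the entry under "values" is a plain string, not a nested array.
sdf = spark.createDataFrame(
    [("95678", [{"key": "Data",
                 "values": "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"}])],
    "zipcode string, employment_status array<map<string,string>>",
)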
I have no trouble exploding this and making use of the values with:
from pyspark.sql.functions import explode

sdf2 = sdf.select(sdf.zipcode, explode(sdf.employment_status))  # array -> rows; each map lands in a column named `col`
sdf3 = sdf2.select(sdf2.zipcode, explode(sdf2.col))             # map -> `key`/`value` string columns
sdf4 = sdf3.filter(sdf3.value != "Data").select(sdf3.zipcode, sdf3.value)
which results in the following (note that value is still a plain string, not an array):
+-------+--------------------------------------------------------------------------+
|zipcode|value                                                                     |
+-------+--------------------------------------------------------------------------+
|95678  |[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}] |
|95679  |[{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}]          |
|95680  |[{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}]         |
|95681  |[{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}]     |
|95682  |[{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}]  |
+-------+--------------------------------------------------------------------------+
I have a solution with F.regexp_extract and F.collect_list, but it doesn't feel quite right. The result should be the following:
+-------+-------------+-------------+------------+
|zipcode|full_employed|part_employed|non_employed|
+-------+-------------+-------------+------------+
|  95678|        13348|         8918|        9972|
|  95679|            0|           29|           0|
|  95680|           43|            0|          71|
|  95681|          327|          265|         278|
|  95682|         8534|         6436|        8748|
+-------+-------------+-------------+------------+
Or "Full-time", "Part-time" and "No earnings" as col- names, guess that doesn't matter.
Any ideas are highly appreciated! Thanks!
CodePudding user response:
so something like this?
from pyspark.sql import functions as F

# Capture each y=<number> up to its closing brace; the regex needs the `+` quantifiers.
pattern = 'y=([^}]+).*y=([^}]+).*y=([^}]+)'
(sdf4
    .withColumn('y1', F.regexp_extract('value', pattern, 1).cast('int'))
    .withColumn('y2', F.regexp_extract('value', pattern, 2).cast('int'))
    .withColumn('y3', F.regexp_extract('value', pattern, 3).cast('int'))
    .select('zipcode',
            # stack(1, ...) folds the three columns back into a single row,
            # and the multi-column alias names the outputs.
            F.expr('stack(1, y1, y2, y3)')
             .alias('full_employed', 'part_employed', 'non_employed'))
    .show()
)
# Output
# +-------+-------------+-------------+------------+
# |zipcode|full_employed|part_employed|non_employed|
# +-------+-------------+-------------+------------+
# |  95678|        13348|         8918|        9972|
# |  95679|            0|           29|           0|
# |  95680|           43|            0|          71|
# |  95681|          327|          265|         278|
# |  95682|         8534|         6436|        8748|
# +-------+-------------+-------------+------------+
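As an aside, the three regexp_extract calls can be collapsed into one pass with the SQL function regexp_extract_all (available since Spark 3.1); a sketch against the same sdf4 as above:

from pyspark.sql import functions as F

# Pull every y=<number> out of the string in one pass; the result is an array of strings.
ys = F.expr("regexp_extract_all(value, 'y=([^}]+)', 1)")

(sdf4
    .select('zipcode',
            ys.getItem(0).cast('int').alias('full_employed'),
            ys.getItem(1).cast('int').alias('part_employed'),
            ys.getItem(2).cast('int').alias('non_employed'))
    .show()
)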