How to expand nested JSON into Spark dataframe on AWS glue


I am working with the following marketing JSON file:

{
    "request_id": "xx",
    "timeseries_stats": [
        {
            "timeseries_stat": {
                "id": "xx",
                "timeseries": [
                    {
                        "start_time": "xx",
                        "end_time": "xx",
                        "stats": {
                            "impressions": xx,
                            "swipes": xx,
                            "view_completion": xx,
                            "spend": xx
                        }
                    },
                    {
                        "start_time": "xx",
                        "end_time": "xx",
                        "stats": {
                            "impressions": xx,
                            "swipes": xx,
                            "view_completion": xx,
                            "spend": xx
                        }
                    }
                ]
            }
        }
    ]
}

I can parse this very easily using pandas and obtain the desired dataframe in the format

start_time   end_time     impressions   swipes   view_completion    spend
    xx          xx             xx         xx            xx            xx
    xx          xx             xx         xx            xx            xx

but I need to do it in Spark on AWS Glue.
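For reference, the pandas side is roughly the following (a minimal sketch, assuming JSON_resp holds the parsed response dict):

import pandas as pd

# flatten the nested stats dicts into dot-separated columns, then strip the prefix
ts = JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries']
pdf = pd.json_normalize(ts)
pdf.columns = [c.replace('stats.', '') for c in pdf.columns]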

After creating an initial Spark dataframe (df) using

rdd = sc.parallelize(JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries'])
df = rdd.toDF()
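For context, printing the schema at this point shows the underlying problem: when Spark infers a schema from Python dicts, a nested dict such as stats is inferred as a MapType rather than a StructType, which is what trips up both attempts below.

df.printSchema()
# stats appears as map<string,...>, not struct<...>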

I tried expanding the stats key as follows:

df_expanded = df.select("start_time","end_time","stats.*")

Error:

AnalysisException: 'Can only star expand struct data types. 
Attribute: `ArrayBuffer(stats)`;'

I also tried:

from pyspark.sql.functions import explode
df_expanded = df.select("start_time","end_time").withColumn("stats", explode(df.stats))

Error:

AnalysisException: 'The number of aliases supplied in the AS clause does not match the 
number of columns output by the UDTF expected 2 aliases but got stats ;'

I'm pretty new to Spark; any help would be much appreciated for either of the two approaches!

It's a problem very similar to the one in:

parse array of dictionaries from JSON with Spark

except that I need to flatten the additional stats key.

CodePudding user response:

When you explode a map column, it produces two columns (key and value), so .withColumn, which adds only a single column, does not work. Use explode inside a select statement instead:

from pyspark.sql import functions as f

df.select('start_time', 'end_time', f.explode('stats')) \
  .groupBy('start_time', 'end_time').pivot('key').agg(f.first('value')).show()

+----------+--------+-----------+-----+------+---------------+
|start_time|end_time|impressions|spend|swipes|view_completion|
+----------+--------+-----------+-----+------+---------------+
|        yy|      yy|         yy|   yy|    yy|             yy|
|        xx|      xx|         xx|   xx|    xx|             xx|
+----------+--------+-----------+-----+------+---------------+
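Alternatively, since the stat keys are fixed in this payload, a sketch that avoids the explode/groupBy/pivot round trip is to read the known keys straight out of the map with getItem:

# assumes the same df and the four known stat keys from the sample payload
df.select(
    'start_time',
    'end_time',
    f.col('stats').getItem('impressions').alias('impressions'),
    f.col('stats').getItem('swipes').alias('swipes'),
    f.col('stats').getItem('view_completion').alias('view_completion'),
    f.col('stats').getItem('spend').alias('spend'),
).show()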