Home > Software engineering >  input to function explode should be array or map type, not struct
input to function explode should be array or map type, not struct

Time:01-01

I have the following data.

data = [
    [
        "2022-12-12",
        ["IND", "u1", [["auction_1", [[1,20], [2,12]]], ["auction_2", [[1,5], [2,7]]]]],
    ],
    [
        "2022-12-12",
        ["USA", "u2", [["auction_1", [[1,8], [2,12]]], ["auction_2", [[1,11], [2,4]]]]],
    ],
]

I have the following schema

actionSchema = T.StructType([
    T.StructField("amountSpent", T.LongType()),
    T.StructField("timeSpent", T.LongType())
])

actionsSchema = T.StructType(
    [T.StructField("action1", actionSchema), T.StructField("action2", actionSchema)]
)

userSchema = T.ArrayType(
    T.StructType(
        [
            T.StructField("refferalId", T.StringType()),
            T.StructField("actions", actionsSchema),
        ]
    )
)

dataSchema = T.StructType(
    [
        T.StructField("country", T.StringType()),
        T.StructField("userId", T.StringType()),
        T.StructField("users", userSchema),
    ]
)

schema = T.StructType(
    [T.StructField("date", T.StringType()), T.StructField("data", dataSchema)]
)
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()

# it has the following schema
root
 |-- date: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- userId: string (nullable = true)
 |    |-- users: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- refferalId: string (nullable = true)
 |    |    |    |-- actions: struct (nullable = true)
 |    |    |    |    |-- action1: struct (nullable = true)
 |    |    |    |    |    |-- amountSpent: long (nullable = true)
 |    |    |    |    |    |-- timeSpent: long (nullable = true)
 |    |    |    |    |-- action2: struct (nullable = true)
 |    |    |    |    |    |-- amountSpent: long (nullable = true)
 |    |    |    |    |    |-- timeSpent: long (nullable = true)

I would like to have to data in the below format for further analysis.

date, country, userId, refferalId, action, amountSpent, timeSpent
2022-12-31, IND, 123, 123213,      action1, 5,          56
display(df.select(F.explode("data")))
# cannot resolve 'explode(data)' due to data type mismatch: input to function explode should be an array or map type

Any help would be really appreciated.

If we can not explode any StructType how can I achieve the above data format?

I went through these questions also, but didn’t get much help -> Error while exploding a struct column in Spark

CodePudding user response:

You must explode data.users :

df.select('date', 'data.country', 'data.userId', F.explode('data.users').alias('info'))

For the actions you need a query like below (after exploding data.users):

.select('date', 'country', 'userId', 'info.refferalId', F.explode('actions').alias('actionInfo'))

But because you defined actions as struct it can't be exploded. If you change its schema to list the code will works fine

CodePudding user response:

This is basically a task where you need to transform a bunch of your data to get it to the form you want. You'll need a combination of the pyspark.sql.functions to get where you want.

If we start from your df:

output = df.select("date", "data.country", "data.userId", explode(col("data.users")).alias("users")) \
           .select("date", "country", "userId", "users.*") \
           .withColumn("actions", explode(array(
               struct("actions.action1.*", lit("action1").alias("action")),
               struct("actions.action2.*", lit("action2").alias("action"))
               )
           )) \
           .select("date", "country", "userId", "refferalId", "actions.*")

output.printSchema()
root
 |-- date: string (nullable = true)
 |-- country: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- refferalId: string (nullable = true)
 |-- amountSpent: long (nullable = true)
 |-- timeSpent: long (nullable = true)
 |-- action: string (nullable = false)

output.show()
 ---------- ------- ------ ---------- ----------- --------- ------- 
|      date|country|userId|refferalId|amountSpent|timeSpent| action|
 ---------- ------- ------ ---------- ----------- --------- ------- 
|2022-12-12|    IND|    u1| auction_1|          1|       20|action1|
|2022-12-12|    IND|    u1| auction_1|          2|       12|action2|
|2022-12-12|    IND|    u1| auction_2|          1|        5|action1|
|2022-12-12|    IND|    u1| auction_2|          2|        7|action2|
|2022-12-12|    USA|    u2| auction_1|          1|        8|action1|
|2022-12-12|    USA|    u2| auction_1|          2|       12|action2|
|2022-12-12|    USA|    u2| auction_2|          1|       11|action1|
|2022-12-12|    USA|    u2| auction_2|          2|        4|action2|
 ---------- ------- ------ ---------- ----------- --------- ------- 

The operations, transformation per transformation:

  • The first, select statement unwraps the data struct and explodes the data.users array
  • Second, select statement unwraps the users struct
  • Third, withColumn statement is a bit more complicated. At this point we have 2 structs (action1 and action2) that have the same schema. What we're doing here is:
    • adding a literal column, action with either value action1 or action2 to our actions columns
    • putting those 2 similar columns in an array using the array function
    • exploding that array
  • Fourth, select statement is to unwrap the actions struct that we created

Hope this helps!

CodePudding user response:

The problem is that you cannot explode structs. You can only explode arrays or maps. The first step you need to take is to explode data.users (and not just data). You can do it this way:

users = df\
    .withColumn("s", F.explode("data.users"))\
    .select("date", "data.country", "data.userId", "s.*")
users.show()
 ---------- ------- ------ ---------- ------------------ 
|      date|country|userId|refferalId|           actions|
 ---------- ------- ------ ---------- ------------------ 
|2022-12-12|    IND|    u1| auction_1|{{1, 20}, {2, 12}}|
|2022-12-12|    IND|    u1| auction_2|  {{1, 5}, {2, 7}}|
|2022-12-12|    USA|    u2| auction_1| {{1, 8}, {2, 12}}|
|2022-12-12|    USA|    u2| auction_2| {{1, 11}, {2, 4}}|
 ---------- ------- ------ ---------- ------------------ 

From there, you want to explode the actions but as before, you cannot explode structs. To overcome that, you can convert it into an array of structs.

users\
    .withColumn("actions", F.array(
        [ F.struct(
            F.lit(f"action{i}").alias("action"),
            F.col("actions")[f"action{i}"].alias("meta")
        ) for i in [1, 2] ]
    ))\
    .withColumn("action", F.explode("actions"))\
    .select("date", "country", "userId", "refferalId", "action.action", "action.meta.*")\
    .show()
 ---------- ------- ------ ---------- ------- ----------- --------- 
|      date|country|userId|refferalId| action|amountSpent|timeSpent|
 ---------- ------- ------ ---------- ------- ----------- --------- 
|2022-12-12|    IND|    u1| auction_1|action1|          1|       20|
|2022-12-12|    IND|    u1| auction_1|action2|          2|       12|
|2022-12-12|    IND|    u1| auction_2|action1|          1|        5|
|2022-12-12|    IND|    u1| auction_2|action2|          2|        7|
|2022-12-12|    USA|    u2| auction_1|action1|          1|        8|
|2022-12-12|    USA|    u2| auction_1|action2|          2|       12|
|2022-12-12|    USA|    u2| auction_2|action1|          1|       11|
|2022-12-12|    USA|    u2| auction_2|action2|          2|        4|
 ---------- ------- ------ ---------- ------- ----------- --------- 
  • Related