I have the following data.
data = [
[
"2022-12-12",
["IND", "u1", [["auction_1", [[1,20], [2,12]]], ["auction_2", [[1,5], [2,7]]]]],
],
[
"2022-12-12",
["USA", "u2", [["auction_1", [[1,8], [2,12]]], ["auction_2", [[1,11], [2,4]]]]],
],
]
I have the following schema
actionSchema = T.StructType([
T.StructField("amountSpent", T.LongType()),
T.StructField("timeSpent", T.LongType())
])
actionsSchema = T.StructType(
[T.StructField("action1", actionSchema), T.StructField("action2", actionSchema)]
)
userSchema = T.ArrayType(
T.StructType(
[
T.StructField("refferalId", T.StringType()),
T.StructField("actions", actionsSchema),
]
)
)
dataSchema = T.StructType(
[
T.StructField("country", T.StringType()),
T.StructField("userId", T.StringType()),
T.StructField("users", userSchema),
]
)
schema = T.StructType(
[T.StructField("date", T.StringType()), T.StructField("data", dataSchema)]
)
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
# it has the following schema
root
|-- date: string (nullable = true)
|-- data: struct (nullable = true)
| |-- country: string (nullable = true)
| |-- userId: string (nullable = true)
| |-- users: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- refferalId: string (nullable = true)
| | | |-- actions: struct (nullable = true)
| | | | |-- action1: struct (nullable = true)
| | | | | |-- amountSpent: long (nullable = true)
| | | | | |-- timeSpent: long (nullable = true)
| | | | |-- action2: struct (nullable = true)
| | | | | |-- amountSpent: long (nullable = true)
| | | | | |-- timeSpent: long (nullable = true)
I would like to have to data in the below format for further analysis.
date, country, userId, refferalId, action, amountSpent, timeSpent
2022-12-31, IND, 123, 123213, action1, 5, 56
display(df.select(F.explode("data")))
# cannot resolve 'explode(data)' due to data type mismatch: input to function explode should be an array or map type
Any help would be really appreciated.
If we can not explode any StructType
how can I achieve the above data format?
I went through these questions also, but didn’t get much help -> Error while exploding a struct column in Spark
CodePudding user response:
You must explode data.users :
df.select('date', 'data.country', 'data.userId', F.explode('data.users').alias('info'))
For the actions you need a query like below (after exploding data.users):
.select('date', 'country', 'userId', 'info.refferalId', F.explode('actions').alias('actionInfo'))
But because you defined actions as struct it can't be exploded. If you change its schema to list the code will works fine
CodePudding user response:
This is basically a task where you need to transform a bunch of your data to get it to the form you want. You'll need a combination of the pyspark.sql.functions
to get where you want.
If we start from your df
:
output = df.select("date", "data.country", "data.userId", explode(col("data.users")).alias("users")) \
.select("date", "country", "userId", "users.*") \
.withColumn("actions", explode(array(
struct("actions.action1.*", lit("action1").alias("action")),
struct("actions.action2.*", lit("action2").alias("action"))
)
)) \
.select("date", "country", "userId", "refferalId", "actions.*")
output.printSchema()
root
|-- date: string (nullable = true)
|-- country: string (nullable = true)
|-- userId: string (nullable = true)
|-- refferalId: string (nullable = true)
|-- amountSpent: long (nullable = true)
|-- timeSpent: long (nullable = true)
|-- action: string (nullable = false)
output.show()
---------- ------- ------ ---------- ----------- --------- -------
| date|country|userId|refferalId|amountSpent|timeSpent| action|
---------- ------- ------ ---------- ----------- --------- -------
|2022-12-12| IND| u1| auction_1| 1| 20|action1|
|2022-12-12| IND| u1| auction_1| 2| 12|action2|
|2022-12-12| IND| u1| auction_2| 1| 5|action1|
|2022-12-12| IND| u1| auction_2| 2| 7|action2|
|2022-12-12| USA| u2| auction_1| 1| 8|action1|
|2022-12-12| USA| u2| auction_1| 2| 12|action2|
|2022-12-12| USA| u2| auction_2| 1| 11|action1|
|2022-12-12| USA| u2| auction_2| 2| 4|action2|
---------- ------- ------ ---------- ----------- --------- -------
The operations, transformation per transformation:
- The first,
select
statement unwraps thedata
struct and explodes thedata.users
array - Second,
select
statement unwraps theusers
struct - Third,
withColumn
statement is a bit more complicated. At this point we have 2 structs (action1
andaction2
) that have the same schema. What we're doing here is:- adding a literal column,
action
with either valueaction1
oraction2
to ouractions
columns - putting those 2 similar columns in an array using the
array
function - exploding that array
- adding a literal column,
- Fourth, select statement is to unwrap the
actions
struct that we created
Hope this helps!
CodePudding user response:
The problem is that you cannot explode structs. You can only explode arrays or maps. The first step you need to take is to explode data.users
(and not just data). You can do it this way:
users = df\
.withColumn("s", F.explode("data.users"))\
.select("date", "data.country", "data.userId", "s.*")
users.show()
---------- ------- ------ ---------- ------------------
| date|country|userId|refferalId| actions|
---------- ------- ------ ---------- ------------------
|2022-12-12| IND| u1| auction_1|{{1, 20}, {2, 12}}|
|2022-12-12| IND| u1| auction_2| {{1, 5}, {2, 7}}|
|2022-12-12| USA| u2| auction_1| {{1, 8}, {2, 12}}|
|2022-12-12| USA| u2| auction_2| {{1, 11}, {2, 4}}|
---------- ------- ------ ---------- ------------------
From there, you want to explode the actions but as before, you cannot explode structs. To overcome that, you can convert it into an array of structs.
users\
.withColumn("actions", F.array(
[ F.struct(
F.lit(f"action{i}").alias("action"),
F.col("actions")[f"action{i}"].alias("meta")
) for i in [1, 2] ]
))\
.withColumn("action", F.explode("actions"))\
.select("date", "country", "userId", "refferalId", "action.action", "action.meta.*")\
.show()
---------- ------- ------ ---------- ------- ----------- ---------
| date|country|userId|refferalId| action|amountSpent|timeSpent|
---------- ------- ------ ---------- ------- ----------- ---------
|2022-12-12| IND| u1| auction_1|action1| 1| 20|
|2022-12-12| IND| u1| auction_1|action2| 2| 12|
|2022-12-12| IND| u1| auction_2|action1| 1| 5|
|2022-12-12| IND| u1| auction_2|action2| 2| 7|
|2022-12-12| USA| u2| auction_1|action1| 1| 8|
|2022-12-12| USA| u2| auction_1|action2| 2| 12|
|2022-12-12| USA| u2| auction_2|action1| 1| 11|
|2022-12-12| USA| u2| auction_2|action2| 2| 4|
---------- ------- ------ ---------- ------- ----------- ---------