i've got a databricks dataframe with a load of questionnaire results in them, the questionnaires on the df vary in length and do not always the same questions.
How can I get the questions and answers out of the string field Responses so i have a 3 column list of
"CustomerID, Questions, Answers"
So from this
CustomerID Responses
1 [{"question1":"answer 1"},{"question 2":"answer2"}]
2 [{"question1":"answer 1a"},{"question 2":"answer2b"}]
3 [{"question1":"answer 1b"},{"question 3":"answer3"}]
to resolve to
CustomerID Questions Answers
1 question1 answer1
1 question2 answer2
2 question1 answer1a
2 question2 answer2b
3 question1 answer1b
3 question3 answer3
CodePudding user response:
from pyspark.sql.functions import explode
df2 = df.withColumn("firstExplode", explode(df.responses))
final_df = df2.select(df2.customerId, explode(df.firstExplode))
CodePudding user response:
Since your data is String-based, not JSON-based, you'd have to parse it first with a schema, then you will be able to explode it
from pyspark.sql import functions as F
from pyspark.sql import types as T
(df
.withColumn('Responses', F.from_json('Responses', T.ArrayType(T.MapType(T.StringType(), T.StringType()))))
.withColumn('Response', F.explode('Responses'))
.withColumn('Question', F.map_keys('Response')[0])
.withColumn('Answer', F.map_values('Response')[0])
.drop('Responses', 'Response')
.show(10, False)
)
# Output
# ---------- ---------- ---------
# |CustomerID|Question |Answer |
# ---------- ---------- ---------
# |1 |question1 |answer 1 |
# |1 |question 2|answer2 |
# |2 |question1 |answer 1a|
# |2 |question 2|answer2b |
# |3 |question1 |answer 1b|
# |3 |question 3|answer3 |
# ---------- ---------- ---------