Turn pyspark databricks data frame with string in an array shape into standard columns-CodePudding

i've got a databricks dataframe with a load of questionnaire results in them, the questionnaires on the df vary in length and do not always the same questions.

How can I get the questions and answers out of the string field Responses so i have a 3 column list of "CustomerID, Questions, Answers"

So from this

CustomerID     Responses
1              [{"question1":"answer 1"},{"question 2":"answer2"}]
2              [{"question1":"answer 1a"},{"question 2":"answer2b"}]
3              [{"question1":"answer 1b"},{"question 3":"answer3"}]

to resolve to

CustomerID   Questions  Answers
1            question1  answer1
1            question2  answer2
2            question1  answer1a
2            question2  answer2b 
3            question1  answer1b 
3            question3  answer3

CodePudding user response：

from pyspark.sql.functions import explode

df2 = df.withColumn("firstExplode", explode(df.responses))
final_df = df2.select(df2.customerId, explode(df.firstExplode))

CodePudding user response：

Since your data is String-based, not JSON-based, you'd have to parse it first with a schema, then you will be able to explode it

from pyspark.sql import functions as F
from pyspark.sql import types as T

(df
    .withColumn('Responses', F.from_json('Responses', T.ArrayType(T.MapType(T.StringType(), T.StringType()))))
    .withColumn('Response', F.explode('Responses'))
    .withColumn('Question', F.map_keys('Response')[0])
    .withColumn('Answer', F.map_values('Response')[0])
    .drop('Responses', 'Response')
    .show(10, False)
)

# Output
#  ---------- ---------- --------- 
# |CustomerID|Question  |Answer   |
#  ---------- ---------- --------- 
# |1         |question1 |answer 1 |
# |1         |question 2|answer2  |
# |2         |question1 |answer 1a|
# |2         |question 2|answer2b |
# |3         |question1 |answer 1b|
# |3         |question 3|answer3  |
#  ---------- ---------- ---------