Using spark dataframe transpose-CodePudding

I am trying to figure out how to solve this use case using spark dataframe.

In the below google sheet, I have the source data where the questions from the survey answered by the people will be stored. Also the question columns will be more than 1000 columns approx, and is more dynamic and not fixed.

There is a metadata table, which explains about the question, its description and the choices it can contain.

Output table should be the like the one I had mentioned in the sheet. Any suggestions or ideas on how this can be achieved ?

https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0

CodePudding user response：

Let's assume your main table is called df:

 --------- ----------- ----------- ------ ------ ------ 
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
 --------- ----------- ----------- ------ ------ ------ 
|xyz      |xyz        |john       |1     |2     |1     |
|abc      |abc        |foo        |3     |1     |1     |
|def      |def        |bar        |2     |2     |2     |
 --------- ----------- ----------- ------ ------ ------

and the mapping table is called df2:

 ----------- ------------- ------------------- --------- ----------- 
|question_id|question_name|question_text      |choice_id|choice_desc|
 ----------- ------------- ------------------- --------- ----------- 
|Q1D102     |Gender       |What is your gender|1        |Male       |
|Q1D102     |Gender       |What is your gender|2        |Female     |
|Q1D102     |Gender       |What is your gender|3        |Diverse    |
|Q1D103     |Age          |What is your age   |1        |20 - 50    |
|Q1D103     |Age          |What is your age   |2        |50 >       |
|Q1D105     |work_status  |Do you work        |1        |Yes        |
|Q1D105     |work_status  |Do you work        |2        |No         |
 ----------- ------------- ------------------- --------- -----------

We can construct a dynamic unpivot expression as below:

val columns = df.columns.filter(c => c.startsWith("Q1D"))

val data = columns.map(c => s"'$c', $c").mkString(",")

val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"

With 3 questions, we get the following expression (Q1D102, Q1D103 and Q1D105): stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)

Finally, we use the constructed variable:

df = df
  .selectExpr("survey_id", "response_id", "person_name", finalExpr)
  .join(df2, Seq("question_id", "choice_id"), "left")

You get this result:

 ----------- --------- --------- ----------- ----------- ------------- ------------------- ----------- 
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text      |choice_desc|
 ----------- --------- --------- ----------- ----------- ------------- ------------------- ----------- 
|Q1D102     |1        |xyz      |xyz        |john       |Gender       |What is your gender|Male       |
|Q1D102     |2        |def      |def        |bar        |Gender       |What is your gender|Female     |
|Q1D102     |3        |abc      |abc        |foo        |Gender       |What is your gender|Diverse    |
|Q1D103     |1        |abc      |abc        |foo        |Age          |What is your age   |20 - 50    |
|Q1D103     |2        |xyz      |xyz        |john       |Age          |What is your age   |50 >       |
|Q1D103     |2        |def      |def        |bar        |Age          |What is your age   |50 >       |
|Q1D105     |1        |xyz      |xyz        |john       |work_status  |Do you work        |Yes        |
|Q1D105     |1        |abc      |abc        |foo        |work_status  |Do you work        |Yes        |
|Q1D105     |2        |def      |def        |bar        |work_status  |Do you work        |No         |
 ----------- --------- --------- ----------- ----------- ------------- ------------------- -----------

Which I think is what you need (just unordered), good luck!