PySpark: get all dataframe columns defined as values into another column


I'm new to PySpark and can't figure out what's wrong with my code. I have two dataframes:

df1 =
+---+--------------+
| id|No_of_Question|
+---+--------------+
|  1|            Q1|
|  2|            Q4|
|  3|           Q23|
|...|           ...|
+---+--------------+

df2 =
+---+---+---+---+---+--------+---+---+---+---+
| Q1| Q2| Q3| Q4| Q5|  ...   |Q22|Q23|Q24|Q25|
+---+---+---+---+---+--------+---+---+---+---+
|  1|  0|  1|  0|  0|  ...   |  1|  1|  1|  1|
+---+---+---+---+---+--------+---+---+---+---+

I'd like to create a new dataframe keeping only the columns of df2 whose names appear as values in df1.No_of_Question.

Expected result

df2 =
+---+---+---+
| Q1| Q4|Q23|
+---+---+---+
|  1|  0|  1|
+---+---+---+

I've already tried

df2 = df2.select(*F.collect_list(df1.No_of_Question)) #Error: Column is not iterable

or

df2 = df2.select(F.collect_list(df1.No_of_Question)) #Error: Resolved attribute(s) No_of_Question#1791 missing from Q1, Q2...

or

df2 = df2.select(*df1.No_of_Question)

or

df2= df2.select([col for col in df2.columns if col in df1.No_of_Question])

But none of these solutions worked. Could you help me please?

CodePudding user response:

F.collect_list is an aggregate function: it returns a Column expression, not a Python list, so it cannot be iterated ("Column is not iterable"), and a column of df1 cannot be referenced inside df2.select() (hence the "Resolved attribute(s) ... missing" error). Instead, collect the values of No_of_Question into a Python list on the driver, then pass that list to df2.select().

Try this:

from pyspark.sql import functions as F

# Collect the question names from df1 on the driver and turn each
# one into a Column reference for df2 (the alias is implicit)
questions = [
    F.col(r.No_of_Question)
    for r in df1.select("No_of_Question").collect()
]

df2 = df2.select(*questions)
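
As a quick end-to-end check, here is a minimal sketch reproducing the sample data from the question (the toy dataframes below are assumptions built from the tables above, for illustration only):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy versions of the question's dataframes
df1 = spark.createDataFrame(
    [(1, "Q1"), (2, "Q4"), (3, "Q23")],
    ["id", "No_of_Question"],
)
df2 = spark.createDataFrame(
    [(1, 0, 1, 0, 0, 1, 1, 1, 1)],
    ["Q1", "Q2", "Q3", "Q4", "Q5", "Q22", "Q23", "Q24", "Q25"],
)

# Build the column list from df1 and project df2 onto it
questions = [F.col(r.No_of_Question) for r in df1.select("No_of_Question").collect()]
df2.select(*questions).show()
# +---+---+---+
# | Q1| Q4|Q23|
# +---+---+---+
# |  1|  0|  1|
# +---+---+---+

Note that select() also accepts plain strings, so collecting the names as strings ([r.No_of_Question for r in ...]) and passing them with df2.select(*names) works just as well.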