I'm new to PySpark and don't know what's wrong with my code. I have two dataframes:
df1 =
+---+--------------+
| id|No_of_Question|
+---+--------------+
|  1|            Q1|
|  2|            Q4|
|  3|           Q23|
|...|           ...|
+---+--------------+
df2 =
+---+---+---+---+---+---+---+---+---+---+
| Q1| Q2| Q3| Q4| Q5|...|Q22|Q23|Q24|Q25|
+---+---+---+---+---+---+---+---+---+---+
|  1|  0|  1|  0|  0|...|  1|  1|  1|  1|
+---+---+---+---+---+---+---+---+---+---+
I'd like to create a new dataframe that keeps only those columns of df2 whose names appear in df1.No_of_Question.
Expected result
df2 =
+---+---+---+
| Q1| Q4|Q24|
+---+---+---+
|  1|  0|  1|
+---+---+---+
I've already tried
df2 = df2.select(*F.collect_list(df1.No_of_Question)) #Error: Column is not iterable
or
df2 = df2.select(F.collect_list(df1.No_of_Question)) #Error: Resolved attribute(s) No_of_Question#1791 missing from Q1, Q2...
or
df2 = df2.select(*df1.No_of_Question)
or
df2= df2.select([col for col in df2.columns if col in df1.No_of_Question])
But none of these attempts worked. Could you help me, please?
CodePudding user response:
You can collect the values of No_of_Question into a plain Python list on the driver, then pass that list to df2.select(). That is why your attempts failed: select() needs column names or Column expressions, not a Column from a different dataframe and not an aggregate like F.collect_list.
Try this:
from pyspark.sql import functions as F

questions = [
    F.col(r.No_of_Question).alias(r.No_of_Question)
    for r in df1.select("No_of_Question").collect()
]

df2 = df2.select(*questions)