I have a dataframe with the following schema in PySpark:
   user_id  datadate  page_1.A  page_1.B  page_1.C  page_2.A  page_2.B  page_2.C  page_3.A  page_3.B  page_3.C
0      111  20220203       NaN       NaN       NaN       NaN       NaN       NaN       1.0       1.0       2.0
1      222  20220203         5         5         5       5.0       5.0       5.0       NaN       NaN       NaN
2      333  20220203         3         3         3       3.0       3.0       4.0       NaN       NaN       NaN
So it contains columns like user_id and datadate, plus a few columns for each page (there are 3 pages), which are the result of 2 joins. In this example I have page_1, page_2, page_3, and each has 3 columns: A, B, C. Additionally, for each page's columns, each row is either all null or all populated, as in my example. I don't care which page the values come from; I just want to get, for each row, the [A, B, C] values that are not null.
An example of the wanted result table:
   user_id  datadate  A  B  C
0      111  20220203  1  1  2
1      222  20220203  5  5  5
2      333  20220203  3  3  3
so the logic will be something like:
df[A] = page_1.A or page_2.A or page_3.A, whichever is not null
df[B] = page_1.B or page_2.B or page_3.B, whichever is not null
df[C] = page_1.C or page_2.C or page_3.C, whichever is not null
for all of the rows. And of course, I would like to do it in an efficient way. Thanks a lot.
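For reference, a minimal reproduction of the input dataframe (this is a sketch and assumes an active SparkSession named spark; None stands in for the nulls):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column names keep the literal dots produced by the joins.
cols = ['user_id', 'datadate',
        'page_1.A', 'page_1.B', 'page_1.C',
        'page_2.A', 'page_2.B', 'page_2.C',
        'page_3.A', 'page_3.B', 'page_3.C']
data = [
    (111, '20220203', None, None, None, None, None, None, 1.0, 1.0, 2.0),
    (222, '20220203', 5.0,  5.0,  5.0,  5.0,  5.0,  5.0,  None, None, None),
    (333, '20220203', 3.0,  3.0,  3.0,  3.0,  3.0,  4.0,  None, None, None),
]
df = spark.createDataFrame(data, cols)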
CodePudding user response:
You can use the SQL function greatest
to take the greatest value across a list of columns. Since greatest skips null values, it will pick up the value from whichever page is populated.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html
from pyspark.sql import functions as F

# Backticks escape the literal dots in the column names; greatest() ignores nulls.
(df.withColumn('A', F.greatest(F.col('`page_1.A`'), F.col('`page_2.A`'), F.col('`page_3.A`')))
   .withColumn('B', F.greatest(F.col('`page_1.B`'), F.col('`page_2.B`'), F.col('`page_3.B`')))
   .withColumn('C', F.greatest(F.col('`page_1.C`'), F.col('`page_2.C`'), F.col('`page_3.C`')))
   .select('user_id', 'datadate', 'A', 'B', 'C'))
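Alternatively, since the requirement is really "whichever value is not null" rather than the greatest one, F.coalesce (which returns the first non-null value among its arguments) expresses the same logic more directly. A minimal sketch, assuming the dotted column names from the question:

from pyspark.sql import functions as F

pages = ['page_1', 'page_2', 'page_3']
result = df.select(
    'user_id',
    'datadate',
    # coalesce picks the first non-null value among the three page columns
    *[F.coalesce(*[F.col(f'`{p}.{c}`') for p in pages]).alias(c) for c in ['A', 'B', 'C']]
)
result.show()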