Pivot fixed amount of rows to fixed amount of columns in PySpark


Imagine I have the following table:

unique_id  column_A  column_B
123        12345     ABCDEFG
123        23456     BCDEFGH
123        34567     CDEFGHI
234        12345     ABCDEFG

The number of rows per unique ID is at most 3. The result I want to achieve is the following:

unique_id  column_A_1  column_A_2  column_A_3  column_B_1  column_B_2  column_B_3
123        12345       23456       34567       ABCDEFG     BCDEFGH     CDEFGHI
234        12345                               ABCDEFG

CodePudding user response:

You can assign a row_number to each record and pivot on that.

Here's an example that retains 2 values per id, using your input dataframe.
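
Assuming a minimal setup like the following (imports, plus a data_sdf built from your sample data, with column_A / column_B shortened to col_a / col_b to match the snippet below):

from pyspark.sql import SparkSession, Window as wd
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# sample data from the question; column_A / column_B shortened to col_a / col_b
data_sdf = spark.createDataFrame(
    [(123, 12345, 'ABCDEFG'),
     (123, 23456, 'BCDEFGH'),
     (123, 34567, 'CDEFGHI'),
     (234, 12345, 'ABCDEFG')],
    ['unique_id', 'col_a', 'col_b']
)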

pivoted_sdf = data_sdf. \
    withColumn('rn', 
               func.row_number().over(wd.partitionBy('unique_id').orderBy(func.rand()))
               ). \
    filter(func.col('rn') <= 2). \
    groupBy('unique_id'). \
    pivot('rn', values=['1', '2']). \
    agg(func.first('col_a').alias('col_a'), 
        func.first('col_b').alias('col_b')
        )

# +---------+-------+-------+-------+-------+
# |unique_id|1_col_a|1_col_b|2_col_a|2_col_b|
# +---------+-------+-------+-------+-------+
# |      234|  12345|ABCDEFG|   null|   null|
# |      123|  34567|CDEFGHI|  23456|BCDEFGH|
# +---------+-------+-------+-------+-------+

Notice the column names: Spark added the row number as a prefix to the aggregation alias. You can rename the columns to move it to a suffix.

def renameTheCol(column):
    col_split = column.split('_')
    col_split_rearr = col_split[1:] + [col_split[0]]
    new_column = '_'.join(col_split_rearr)

    return new_column
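
For example, renameTheCol('1_col_a') returns 'col_a_1'.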

pivoted_sdf. \
    select('unique_id', 
           *[func.col(k).alias(renameTheCol(k)) for k in pivoted_sdf.columns if k != 'unique_id']
           ). \
    show()

# +---------+-------+-------+-------+-------+
# |unique_id|col_a_1|col_b_1|col_a_2|col_b_2|
# +---------+-------+-------+-------+-------+
# |      234|  12345|ABCDEFG|   null|   null|
# |      123|  23456|BCDEFGH|  34567|CDEFGHI|
# +---------+-------+-------+-------+-------+
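
Since the question allows up to 3 rows per unique ID, the same pattern extends to three row numbers. Here's a sketch under the setup assumed above (pivoted_3_sdf is just an illustrative name), ordering by col_a instead of rand() so the result is deterministic:

# number rows 1..3 within each unique_id (ordered by col_a), then pivot on the row number
pivoted_3_sdf = data_sdf. \
    withColumn('rn',
               func.row_number().over(wd.partitionBy('unique_id').orderBy('col_a'))
               ). \
    groupBy('unique_id'). \
    pivot('rn', values=['1', '2', '3']). \
    agg(func.first('col_a').alias('col_a'),
        func.first('col_b').alias('col_b')
        )

# rename '1_col_a'-style names to 'col_a_1' using the helper above
pivoted_3_sdf. \
    select('unique_id',
           *[func.col(k).alias(renameTheCol(k)) for k in pivoted_3_sdf.columns if k != 'unique_id']
           ). \
    show()

This yields col_a_1 through col_b_3, with nulls wherever a unique_id has fewer than three rows.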