use a specfic coumn values as a checker to change other column values in pyspark/pandas-CodePudding

If I have below table

|a      | id    | year|m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1    | 2001  | 0    | 0     | 0   | ... |   0  |
|"hello"| 1   | 2015  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2002  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2015  | 0    | 0     | 0   | ... |   0  |

How to I change the dataframe so it checks the year column in each row and changes the above example m2001 and m2015 to 1 and as id is 1 in both, the new table will look like below

|a      | id     |m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1    |  0   | 1     | 0     | ...  |   1  |
|"hello"| 2    |  0   | 0     | 1     | ...  |   1  |

CodePudding user response：

new = df.select('a','id','year',*[when((size(F.array_distinct(F.array(F.lit(col('year').astype('string')), lit(x[1:])))))==1,1).otherwise(0).alias(x) for x in df.columns if x not in ['a','id','year']])

new.groupBy('a','id').agg(*[max(x).alias(x) for x in new.columns if x not in ['a','id','year']] ).show()

How it works

Collapse the columns into rows and pair them with the year columns value

 df.select('a','id','year',*[F.array(F.lit(col('year').astype('string')), lit(x[1:])).alias(x) for x in df.columns if x not in ['a','id','year']])

Find the distinct elements in the arrays in each column

df.select('a','id','year',*[F.array_distinct(F.array(F.lit(col('year').astype('string')), lit(x[1:]))).alias(x) for x in df.columns if x not in ['a','id','year']])

Find the size of each array in the individual columns

df.select('a','id','year',*[size(F.array_distinct(F.array(F.lit(col('year').astype('string')), lit(x[1:])))).alias(x) for x in df.columns if x not in ['a','id','year']])

Finally, where the size is not 1, means the column value and year didnt agree, so make it 0, else 1

Finally, groupby add find max value in each column