If I have below table
|a | id | year|m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1 | 2001 | 0 | 0 | 0 | ... | 0 |
|"hello"| 1 | 2015 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2002 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2015 | 0 | 0 | 0 | ... | 0 |
How to I change the dataframe so it checks the year column in each row and changes the above example m2001 and m2015 to 1 and as id is 1 in both, the new table will look like below
|a | id |m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1 | 0 | 1 | 0 | ... | 1 |
|"hello"| 2 | 0 | 0 | 1 | ... | 1 |
CodePudding user response:
new = df.select('a','id','year',*[when((size(F.array_distinct(F.array(F.lit(col('year').astype('string')), lit(x[1:])))))==1,1).otherwise(0).alias(x) for x in df.columns if x not in ['a','id','year']])
new.groupBy('a','id').agg(*[max(x).alias(x) for x in new.columns if x not in ['a','id','year']] ).show()
How it works
Collapse the columns into rows and pair them with the year columns value
df.select('a','id','year',*[F.array(F.lit(col('year').astype('string')), lit(x[1:])).alias(x) for x in df.columns if x not in ['a','id','year']])
Find the distinct elements in the arrays in each column
df.select('a','id','year',*[F.array_distinct(F.array(F.lit(col('year').astype('string')), lit(x[1:]))).alias(x) for x in df.columns if x not in ['a','id','year']])
Find the size of each array in the individual columns
df.select('a','id','year',*[size(F.array_distinct(F.array(F.lit(col('year').astype('string')), lit(x[1:])))).alias(x) for x in df.columns if x not in ['a','id','year']])
Finally, where the size is not 1, means the column value and year didnt agree, so make it 0, else 1
Finally, groupby add find max value in each column