I am new to PySpark. I am trying to convert a couple of columns with double datatype to binary, and then count the number of non-zero bits in each binary number (i.e. get the sum of its binary digits).
My sample data looks as follows:
bit_1 bit_2 bit_3 bit_4 bit_5 bit_6
    0     2     8     0     0     0
   11     0    16    64     0     0
   10     0     0     0   256   144
   12    15    15     0     0     0
   20     0    17     0     0     0
  250    12     0     0     0     0
  300    72    84    64     0     0
  320   100   120   140   220   240
So far I have tried the following:
test_df = df.withColumn('bit_sum', sum(map(int,"{0:b}".format(F.col('bit_1')))))
The above code throws an error.
I also tried:
df_2 = (df
.withColumn('bit_1_bi', F.lpad(F.bin(F.col('bit_1')),12,'0'))
.withColumn('bit_2_bi', F.lpad(F.bin(F.col('bit_2')),12,'0'))
.withColumn('bit_3_bi', F.lpad(F.bin(F.col('bit_3')),12,'0'))
.withColumn('bit_4_bi', F.lpad(F.bin(F.col('bit_4')),12,'0'))
.withColumn('bit_5_bi', F.lpad(F.bin(F.col('bit_5')),12,'0'))
.withColumn('bit_6_bi', F.lpad(F.bin(F.col('bit_6')),12,'0'))
)
CodePudding user response:
Use bin
to convert the column values into their binary string representation, then replace the 0s
with an empty string; the length of the resulting string is the number of 1s:
df.select(*[F.length(F.regexp_replace(F.bin(c), '0', '')).alias(c) for c in df.columns])
+-----+-----+-----+-----+-----+-----+
|bit_1|bit_2|bit_3|bit_4|bit_5|bit_6|
+-----+-----+-----+-----+-----+-----+
|    0|    1|    1|    0|    0|    0|
|    3|    0|    1|    1|    0|    0|
|    2|    0|    0|    0|    1|    2|
|    2|    4|    4|    0|    0|    0|
|    2|    0|    2|    0|    0|    0|
|    6|    2|    0|    0|    0|    0|
|    4|    2|    3|    1|    0|    0|
|    2|    3|    4|    3|    5|    4|
+-----+-----+-----+-----+-----+-----+