Counting total rows, rows with null value, rows with zero values, and their ratios on PySpark


I have a table structure like this:

unique_id | group | value_1 | value_2 | value_3
abc_xxx       1       200      null      100
def_xxx       1       0        3         40
ghi_xxx       2       300      1         2

that I need to extract the following information from:

  • Total number of rows per group
  • Number of rows per group that contain at least one null value
  • Number of rows per group that contain at least one zero value

I can do the first one with a simple groupBy and count:

df.groupBy('group').count()

I'm not sure how to approach the other two, which I need in order to compute the null and zero rates from the total rows per group.

CodePudding user response:

from pyspark.sql.functions import col, count, isnull, lit, sum, when  # note: shadows the built-in sum

data = [
    ('abc_xxx', 1, 200, None, 100),
    ('def_xxx', 1, 0,    3,   40 ),
    ('ghi_xxx', 2, 300,  1,    2 ),
]

df = spark.createDataFrame(data, ['unique_id', 'group', 'value_1', 'value_2', 'value_3'])

# Flag each row that contains at least one null / at least one zero across the value columns
df = df\
    .withColumn('contains_null', when(isnull(col('value_1')) | isnull(col('value_2')) | isnull(col('value_3')), lit(1)).otherwise(lit(0)))\
    .withColumn('contains_zero', when((col('value_1')==0) | (col('value_2')==0) | (col('value_3')==0), lit(1)).otherwise(lit(0)))

# Summing the 1/0 flags per group gives the row counts
df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'),
         sum('contains_null').alias('null_value_rows'),
         sum('contains_zero').alias('zero_value_rows'))\
    .show()

+-----+----------+---------------+---------------+
|group|total_rows|null_value_rows|zero_value_rows|
+-----+----------+---------------+---------------+
|    1|         2|              1|              1|
|    2|         1|              0|              0|
+-----+----------+---------------+---------------+

# Alternative: count individual values instead of rows (count() skips nulls)
# total_count = (count('value_1') + count('value_2') + count('value_3'))
# null_count = sum(when(isnull(col('value_1')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_2')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_3')), lit(1)).otherwise(lit(0)))
# zero_count = sum(when(col('value_1')==0, lit(1)).otherwise(lit(0)) + when(col('value_2')==0, lit(1)).otherwise(lit(0)) + when(col('value_3')==0, lit(1)).otherwise(lit(0)))

# df.groupBy('group')\
#     .agg(total_count.alias('total_numbers'), null_count.alias('null_values'), zero_count.alias('zero_values')).show()

# +-----+-------------+-----------+-----------+
# |group|total_numbers|null_values|zero_values|
# +-----+-------------+-----------+-----------+
# |    1|            5|          1|          1|
# |    2|            3|          0|          0|
# +-----+-------------+-----------+-----------+