Home > OS >  Pyspark question making count result into a dataframe
Pyspark question making count result into a dataframe

Time:11-08

I have a pyspark function that looks like this. \

spark.sql("select count(*) from student_table where student_id is NULL") \
spark.sql("select count(*) from student_table where student_scores is NULL") \
spark.sql("select count(*) from student_table where student_health is NULL")

I get a result that looks like \

 ------- |count(1)|\n------- |    0| ------- \n|count(1)|\n------- |    100| ------- \n|count(1)|\n------- |  24145|

What I want to do is to make the result into a dataframe for each column by using pandas or pyspark function. The result should have each null value result for each column.
For example,

enter image description here

Thanks in advance if someone can help me out.

CodePudding user response:

You could use union between the 3 queries but actually you can get all null counts for each column using one query:

spark.sql("""
    SELECT  SUM(INT(student_id IS NULL))     AS student_id_nb_null,
            SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
            SUM(INT(student_health IS NULL)) AS student_health_nb_null,
    FROM    student_table 
""").show()

# ------------------ ---------------------- ---------------------- 
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
# ------------------ ---------------------- ---------------------- 
#|                 0|                   100|                 24145|
# ------------------ ---------------------- ---------------------- 

Or by using DataFrame API with:

import pyspark.sql.functions as F    

df.agg(
    F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
    F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
    F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)

CodePudding user response:

Use union all and add all your queries in one spark.sql.

Example:

spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()
  • Related