I have a pyspark function that looks like this. \
spark.sql("select count(*) from student_table where student_id is NULL") \
spark.sql("select count(*) from student_table where student_scores is NULL") \
spark.sql("select count(*) from student_table where student_health is NULL")
I get a result that looks like \
------- |count(1)|\n------- | 0| ------- \n|count(1)|\n------- | 100| ------- \n|count(1)|\n------- | 24145|
What I want to do is to make the result into a dataframe for each column by using pandas or pyspark function.
The result should have each null value result for each column.
For example,
Thanks in advance if someone can help me out.
CodePudding user response:
You could use union between the 3 queries but actually you can get all null counts for each column using one query:
spark.sql("""
SELECT SUM(INT(student_id IS NULL)) AS student_id_nb_null,
SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
SUM(INT(student_health IS NULL)) AS student_health_nb_null,
FROM student_table
""").show()
# ------------------ ---------------------- ----------------------
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
# ------------------ ---------------------- ----------------------
#| 0| 100| 24145|
# ------------------ ---------------------- ----------------------
Or by using DataFrame API with:
import pyspark.sql.functions as F
df.agg(
F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)
CodePudding user response:
Use union all
and add all your queries in one spark.sql
.
Example:
spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()