I have a table (df) with multiple columns: col1, col2, col3, and so on.
col1 | col2 | col3 | .... | coln |
---|---|---|---|---|
1 | abc | 1 | qwe | |
1 | xyz | | | |
2 | 3 | | | |
3 | abc | 6 | qwe | |
I want my final table (df) to have the following columns:
attribute_name: the name of each column from the previous table
count: the total row count of the table
distinct_count: the distinct count of each column from the previous table
null_count: the count of null values in each column from the previous table
The final table should look like this:
attribute_name | count | distinct_count | null_count |
---|---|---|---|
col1 | 4 | 3 | 0 |
col2 | 4 | 2 | 1 |
col3 | 4 | 3 | 1 |
coln | 4 | 1 | 2 |
Could someone help me with how I can implement this in PySpark?
CodePudding user response:
I didn't test it or check whether it is correct, but something like this should work:
from functools import reduce

attr_df_list = []
for column_name in df.columns:
    attr_df_list.append(
        df.selectExpr(
            # quote the name so it becomes a string literal, not a column reference
            f"'{column_name}' AS attribute_name",
            "COUNT(*) AS count",
            f"COUNT(DISTINCT {column_name}) AS distinct_count",
            f"COUNT_IF({column_name} IS NULL) AS null_count"
        )
    )
result_df = reduce(lambda df1, df2: df1.union(df2), attr_df_list)
CodePudding user response:
Here's a solution:
df = spark.createDataFrame([("apple",1,1),("mango",2,2),("apple",None,3),("mango",None,4)], ["col1","col2","col3"])
df.show()
# Out:
# +-----+----+----+
# | col1|col2|col3|
# +-----+----+----+
# |apple|   1|   1|
# |mango|   2|   2|
# |apple|null|   3|
# |mango|null|   4|
# +-----+----+----+
from pyspark.sql.functions import col
data = [
    (
        c,
        df.filter(col(c).isNotNull()).count(),  # non-null count
        df.select(c).distinct().count(),        # distinct count (null counts as a value)
        df.filter(col(c).isNull()).count()      # null count
    )
    for c in df.columns
]
cols = ["attribute_name", "count", "distinct_count", "null_count"]
spark.createDataFrame(data, cols).show()
# Out:
# +--------------+-----+--------------+----------+
# |attribute_name|count|distinct_count|null_count|
# +--------------+-----+--------------+----------+
# |          col1|    4|             2|         0|
# |          col2|    2|             3|         2|
# |          col3|    4|             4|         0|
# +--------------+-----+--------------+----------+
The idea is to loop through the columns of the original dataframe and, for each column, create a new row with the aggregated data.