I have a table (df) with multiple columns: col1, col2, col3, and so on.
col1 | col2 | col3 | .... | coln |
---|---|---|---|---|
1 | abc | 1 | qwe | |
1 | xyz | | | |
2 | 3 | | | |
3 | abc | 6 | qwe | |
I want my final table (df) to have the following columns:
attribute_name: the name of each column from the previous table
count: the total row count of the table
distinct_count: the distinct count of each column from the previous table
null_count: the count of null values in each column from the previous table
The final table should look like this:
attribute_name | count | distinct_count | null_count |
---|---|---|---|
col1 | 4 | 3 | 0 |
col2 | 4 | 2 | 1 |
col3 | 4 | 3 | 1 |
coln | 4 | 1 | 2 |
Could someone help me with how I can implement this in PySpark?
CodePudding user response:
I didn't test it or check whether it is correct, but something like this should work:
from functools import reduce

attr_df_list = []
for column_name in df.columns:
    attr_df_list.append(
        df.selectExpr(
            # quote the name so it becomes a string literal, not a column reference
            f"'{column_name}' AS attribute_name",
            "COUNT(*) AS count",
            f"COUNT(DISTINCT {column_name}) AS distinct_count",
            f"COUNT_IF({column_name} IS NULL) AS null_count"
        )
    )
result_df = reduce(lambda df1, df2: df1.union(df2), attr_df_list)
CodePudding user response:
Here's a solution:
df = spark.createDataFrame([("apple",1,1),("mango",2,2),("apple",None,3),("mango",None,4)], ["col1","col2","col3"])
df.show()
# Out:
# +-----+----+----+
# | col1|col2|col3|
# +-----+----+----+
# |apple|   1|   1|
# |mango|   2|   2|
# |apple|null|   3|
# |mango|null|   4|
# +-----+----+----+
from pyspark.sql.functions import col
data = [
    (
        c,
        df.filter(col(c).isNotNull()).count(),  # non-null count
        df.select(c).distinct().count(),        # distinct count (null counts as a value)
        df.filter(col(c).isNull()).count()      # null count
    )
    for c in df.columns
]
cols = ["attribute_name", "count", "distinct_count", "null_count"]
spark.createDataFrame(data, cols).show()
# Out:
# +--------------+-----+--------------+----------+
# |attribute_name|count|distinct_count|null_count|
# +--------------+-----+--------------+----------+
# |          col1|    4|             2|         0|
# |          col2|    2|             3|         2|
# |          col3|    4|             4|         0|
# +--------------+-----+--------------+----------+
The idea is to loop through the columns of the original dataframe and, for each column, create a new row with the aggregated data.