Home > Net >  Choose the column having more data
Choose the column having more data

Time:10-03

I have to select a column out of two columns which has more data or values in it using PySpark and keep it in my DataFrame.

For example, we have two columns A and B:

enter image description here

In example, the column B has more values so I will keep it in my DF for transformations. Similarly, I would take A, if A had more values. I think we can do it using if else conditions, but I'm not able to get the correct logic.

CodePudding user response:

You could first aggregate the columns (count the values in each). This way you will get just 1 row which you could extract as dictionary using .head().asDict(). Then use Python's max(your_dict, key=your_dict.get) to get the dict's key having the max value (i.e. the name of the column which has maximum number of values). Then just select this column.

Example input:

from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 7), (2, 4), (3, 7), (None, 8), (None, 4)], ['A', 'B'])
df.show()
#  ---- --- 
# |   A|  B|
#  ---- --- 
# |   1|  7|
# |   2|  4|
# |   3|  7|
# |null|  8|
# |null|  4|
#  ---- --- 

Script:

val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
max_key = max(val_cnt, key=val_cnt.get)

df.select(max_key).show()
#  --- 
# |  B|
#  --- 
# |  7|
# |  4|
# |  7|
# |  8|
# |  4|
#  --- 

CodePudding user response:

  1. Create a data frame with the data

     df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
    
  2. Define a condition to check for that

     from pyspark.sql.functions import *
     import pyspark.sql.functions as fx
    
     condition = fx.when((fx.col('A').isNotNull() & (fx.col('A')>fx.col('B'))),fx.col('A')).otherwise(fx.col('B'))
     df_1 = df.withColumn('max_value_among_A_and_B',condition)
    
  3. Print the dataframe

    df_1.show()
    

Please check the below screenshot for details

enter image description here

  • Related