How to list the distinct values of a PySpark DataFrame column where another column is null

I have a PySpark DataFrame:

rowNum  Vehicle  Production
1       1234     5678
2       null     1254
3       null     4567
4       null     4567

I want to collect all the distinct values of Production into a list, keeping only the rows where Vehicle is null. How can I achieve this with a PySpark DataFrame?

Expected result:
production list = ['1254', '4567']

User response:

I would do something like this:

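# Using Spark 3.3.0
from pyspark.sql import SparkSession

# Reuses the active session if one exists (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()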

# Dataset as per the question
data = [
    [1, '1234', 5678],
    [2, 'Null', 1254],
    [3, 'Null', 4567],
    [4, 'Null', 4567],
]

cols = ['rowNum', 'Vehicle', 'Production']

# Creating Dataframe
df = spark.createDataFrame(data, cols)


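# Filter on Vehicle first, then select Production and drop duplicates,
# and collect the remaining values into a Python list.
# The variable is not called `list`, so the built-in is not shadowed.
production_list = [
    row.Production
    for row in df.where("Vehicle == 'Null'").select('Production').distinct().collect()
]

production_list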

The output I get is the following: