I have a PySpark DataFrame:
rowNum Vehicle Production
1 1234 5678
2 null 1254
3 null 4567
4 null 4567
I want to collect all the distinct values of Production into a list, for the rows where Vehicle is null.
Expected result:
production_list = ['1254', '4567']
How can I achieve this with a PySpark DataFrame?
CodePudding user response:
I would do something like this:
# Using Spark 3.3.0
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Dataset as per the question (None represents a null Vehicle)
data = [
[1, '1234', 5678]
, [2, None, 1254]
, [3, None, 4567]
, [4, None, 4567]
]
cols = ['rowNum', 'Vehicle', 'Production']
# Creating the DataFrame
df = spark.createDataFrame(data, cols)
# Filter on Vehicle first (the column is no longer available after select),
# then collect the distinct Production values into a Python list
production_list = [p.Production for p in df.where("Vehicle IS NULL").select('Production').distinct().collect()]
production_list
The output I get is the following: