I was reading the book "Spark: The Definitive Guide" and, while working through a code example, I couldn't fully understand the logic. Below is the code from the book.
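from pyspark.sql.functions import expr, locate
simpleColors = ["black", "white", "green", "blue", "red"]
def color_locator(column, color_string):
    # locate() returns the 1-based position of the match, or 0 if there is none,
    # so casting to boolean yields True when the color appears in the column
    return locate(color_string.upper(), column).cast("boolean").alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*"))
df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3, False)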
I don't understand the line selectedColumns.append(expr("*")) in the code. What does it accomplish? The book says we need it to make sure selectedColumns is a Column type, which completely went over my head. In the next statement we use df.select(*selectedColumns). Why do we need the * expression in the first place? Please help me resolve the confusion.
CodePudding user response:
This code just adds a column expression that selects all existing columns of the Dataframe to the list of new columns that we created as selectedColumns. It's a standard pattern when you want to add multiple columns to a Dataframe without resorting to a loop with .withColumn (which is less efficient than .select). The expr("*") has the same effect as [selectedColumns.append(col(nm)) for nm in df.columns].
In this specific case, the result of select will be the new columns is_black, is_red, ..., followed by all columns of the Dataframe.
CodePudding user response:
Let me try to break it down so that you can understand what is happening:
selectedColumns = [color_locator(df.Description, c) for c in simpleColors ]
In this line we iterate over the colors in simpleColors and build a list called selectedColumns. At this point, selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red". Notice that it doesn't contain the Description column.
The next line,
selectedColumns.append(expr("*"))
adds every column of the original dataframe to the selectedColumns list (it is shorthand for adding every column explicitly).
At this point selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red", "*"
df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3,False)
In this line, *selectedColumns unpacks the list so that each element is passed to select as a separate argument (a variable number of arguments). You can read more about it here: https://www.geeksforgeeks.org/args-kwargs-python/
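To make the unpacking concrete, here is a minimal plain-Python sketch. The select function below is a hypothetical stand-in written only for this illustration; it is not PySpark's DataFrame.select and does not model PySpark's own argument handling. Strings stand in for Column objects.

```python
def select(*cols):
    """Hypothetical stand-in (not PySpark): returns the positional arguments it received."""
    return list(cols)

# Strings used in place of Column objects, purely for illustration
selected_columns = ["is_black", "is_white", "is_green", "is_blue", "is_red"]
selected_columns.append("*")

# The leading * spreads the list into six separate arguments,
# i.e. select("is_black", "is_white", "is_green", "is_blue", "is_red", "*")
unpacked = select(*selected_columns)

# Without the *, the function receives one argument: the list itself
as_one = select(selected_columns)

print(len(unpacked))  # 6
print(len(as_one))    # 1
```

Note that the * in select(*selectedColumns) is Python syntax for unpacking a list, while the "*" inside expr("*") is the SQL wildcard meaning "all columns"; the two are unrelated.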
To summarize, we are selecting the columns is_black, is_white, is_green, is_blue, is_red and * (every original column) from the original dataframe (df).