Regular expressions in Pyspark

Time:01-14

I was reading the book "Spark: The Definitive Guide" and, while working through a code example, I couldn't completely understand the logic. Below is the code from the book.

from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "green", "blue", "red"]
def color_locator(column, color_string):
    # True when the upper-cased color name occurs anywhere in the column
    return locate(color_string.upper(), column).cast("boolean").alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*"))  # also keep every original column
df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3, False)

I don't understand the line selectedColumns.append(expr("*")) in the code. What does it accomplish? The book says we need it because selectedColumns has to be of Column type, which went completely over my head. In the next statement we call df.select(*selectedColumns). Why do we need the * there in the first place? Please help me resolve the confusion.

CodePudding user response:

This code just adds a column expression that selects all existing columns of the DataFrame to the list of new columns that we created as selectedColumns. It's a standard pattern when you want to add multiple columns to a DataFrame without resorting to a loop with .withColumn (which is less efficient than a single .select). expr("*") has the same effect as [selectedColumns.append(col(nm)) for nm in df.columns].

In this specific case the result of select will be the new columns is_black, is_white, ..., is_red, followed by all the original columns of the DataFrame.

CodePudding user response:

Let me try to break it down so that you can understand what is happening:

selectedColumns = [color_locator(df.Description, c) for c in simpleColors ]

In this line we iterate over the colors in simpleColors and build a list of Column expressions called selectedColumns. At this point, selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red". Notice that it does not contain the Description column.

The next line,

selectedColumns.append(expr("*"))

basically adds every column in the original dataframe to this list of selectedColumns (it is shorthand for adding every column explicitly).

At this point selectedColumns contains the columns "is_black", "is_white", "is_green", "is_blue", "is_red", "*".

df.select(*selectedColumns).where(expr("is_white OR is_red")).select("Description").show(3,False)

In this line, *selectedColumns unpacks the list so that each element is passed to select as a separate positional argument. You can read more about it here: https://www.geeksforgeeks.org/args-kwargs-python/
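The unpacking itself is plain Python and has nothing to do with Spark. A small sketch, where show_args is a hypothetical stand-in for a function that takes a variable number of arguments (it is not part of PySpark):

```python
def show_args(*cols):
    # Hypothetical helper: Python collects all positional arguments into a tuple
    return cols

selected_columns = ["is_black", "is_white", "*"]

# With the star, each list element becomes its own argument,
# exactly as if we had written show_args("is_black", "is_white", "*")
unpacked = show_args(*selected_columns)

# Without the star, the whole list is passed as one single argument
as_one_argument = show_args(selected_columns)

print(len(unpacked))
print(len(as_one_argument))
```

Note that this * is Python call syntax and is unrelated to the "*" inside expr("*"), which is the SQL wildcard for all columns. PySpark's select also accepts a single list of columns, so the unpacking is a matter of style here.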

To summarize, we are selecting the columns is_black, is_white, is_green, is_blue, is_red and * (every original column) from the original dataframe (df).
