How to use SparkSQL Function inside DataFrame Where/Filter condition?


I'm using PySpark. For example, I have a simple DataFrame "df" with one column, "Col1", whose values contain lots of blank spaces, as below:

Col1
"  - "
"abc "
"  xy"

I want to keep all the rows whose value is not "-" after trimming. In SQL it's simple:

select Col1 from df where trim(Col1) <> '-'

Result

abc  
  xy

I don't want to use SQL statements here, so I tried both

df.where(trim(df.Col1) <> "-").show() 
and
df.filter(df.Col1.trim() <> "-").show()

But neither succeeded; the errors said that trim isn't an attribute of the DataFrame or the Column. When I read the documentation, the trim function is listed on the Functions page, not the DataFrame page. I don't want to use withColumn, since I want to keep my original column unchanged. Converting everything into a SQL query isn't good either, since I still have lots of DataFrame operations to chain after this.

So how do I do this simple condition checking with Spark DataFrame?

CodePudding user response:

This is the correct syntax (trim comes from the pyspark.sql.functions module, not from the DataFrame or Column):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Build the sample DataFrame with a single string column
data = [("  - ",), ("abc ",), (" xy",)]
df = spark.createDataFrame(data, ["value"])

# trim() is a function from pyspark.sql.functions applied to a Column
df.where(f.trim(f.col("value")) != "-").show()

output:

+-----+
|value|
+-----+
| abc |
|   xy|
+-----+

The <> operator is SQL syntax; its PySpark equivalent in this scenario is !=.
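
For completeness, here is a small sketch of two equivalent ways to write the same condition, assuming the same df with a "value" column as above: where()/filter() also accept a SQL expression string, and a Column comparison can be negated with ~.

from pyspark.sql import functions as f

# Pass the condition as a SQL expression string
# (where/filter accept either a Column or a string)
df.where("trim(value) <> '-'").show()

# Or negate an equality test on a Column with ~
df.where(~(f.trim(f.col("value")) == "-")).show()

Both print the same two rows as the output above.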
