I'm using PySpark. For example, I have a simple DataFrame "df" with one column, "Col1", which contains values padded with blank spaces, as below:
Col1
" - "
"abc "
" xy"
I want to keep all the rows whose value is not "-" after trimming. In SQL it's simple:
select Col1 from df where trim(Col1) <> "-"
Result
abc
xy
I don't want to use SQL statements here, so I tried both
df.where(trim(df.Col1) <> "-").show()
and
df.filter(df.Col1.trim() <> "-").show()
But both failed with an error saying that trim isn't an attribute of a DataFrame. Reading the documentation, I see that trim is listed on the functions page, not the DataFrame page. I don't want to use withColumn, since I want to keep the original format of my column, and converting everything to a SQL query isn't good either, since I still have lots of DataFrame functions to use after this.
So how do I do this simple condition check with a Spark DataFrame?
CodePudding user response:
This is the correct syntax:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
data = [(" - ",), ("abc ",), (" xy",)]
df = spark.sparkContext.parallelize(data).toDF(["value"])
# trim() is applied only inside the comparison, so the stored values keep their spacing
df.where(f.trim(f.col("value")) != "-").show()
output:
+-----+
|value|
+-----+
| abc |
|   xy|
+-----+
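Note that trim only affects the comparison, so the surviving rows keep their original whitespace and no withColumn is needed. A quick check (collect just makes the raw strings visible; the expected rows follow from the sample data above):

# whitespace in the kept rows is untouched by the filter
print(df.where(f.trim(f.col("value")) != "-").collect())
# expected: [Row(value='abc '), Row(value=' xy')]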
The <> operator is SQL syntax; the Python equivalent in this scenario is !=.
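As a side note, where/filter also accept a SQL expression string, so if you prefer the SQL operator you can keep <> without converting the whole query to SQL:

# same filter written as a SQL expression string; <> is valid in this form
df.where("trim(value) <> '-'").show()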