I have a dataset and I want to filtered base on a column.
val test = Seq(
("1", "r2_test"),
("2", "some_other_value"),
("3", "hs_2_card"),
("4", "vsx_np_v2"),
("5", "r2_test"),
("2", "some_other_value2")
).toDF("id", "my_column")
I want to create a function to filter my dataframe based on the elements of this list using contains on "my_column"(if contains part of the string, the filter must be applied)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def filteredElements(df: DataFrame): DataFrame = {
val elements = List("r2", "hs", "np")
df.filter($"my_column".contains(elements))
}
But like this, won't work for a list, just for a single element. How can I do to adapt to use my list without have to do multiple filters?
Below how the expected output must be when apply the function
val output = test.transform(filteredElements)
expected =
("1", "r2_test"), // contains "rs"
("3", "hs_2_card"), // contains "hs"
("4", "vsx_np_v2"), // contains "np"
("5", "r2_test"), // contains "r2"
CodePudding user response:
One way to solve this would be to use a UDF
. I think there should be some way to solve this with spark sql functions that I'm not aware of. Anyway, you can define a udf to tell weather a String contains any of the values in your elements List or not:
import org.apache.sql.functions._
val elements = List("r2", "hs", "np")
val isContainedInList = udf { (value: String) =>
elements.exists(e => value.indexOf(e) != -1)
}
You can use this udf in select, filter, basically anywhere you want:
def filteredElements(df: DataFrame): DataFrame = {
df.filter(isContainedInList($"my_column"))
}
And the result is as expected:
--- ---------
| id|my_column|
--- ---------
| 1| r2_test|
| 3|hs_2_card|
| 4|vsx_np_v2|
| 5| r2_test|
--- ---------
CodePudding user response:
You can do it in one line without udf ( better for performance and simpler ):
df.filter(row => elements.exists(row.getAs[String]("my_column").contains)).show()