Let's say I have a dataframe that looks like the following:
| Column A | Column B |
|---|---|
| 123 | Mark 123 |
| 456 | Mark 456 |
| 789 | 789 |
How do I create a DataFrame in Spark Scala with the following condition: if all Column B values contain "Mark", a DataFrame with Column A is created; otherwise, an empty DataFrame is created?
CodePudding user response:
You can simply count the rows whose column B contains "mark" and compare that with the total row count to check whether all rows have it. I think this should do the trick:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lower
import spark.implicits._

val dfWithMark = Seq(
  ("123", "Mark 123"),
  ("456", "Mark 456"),
  ("789", "Mark 789")
).toDF("column A", "column B")

def buildMarkDf(df: DataFrame): DataFrame = {
  val count = df.count()
  // case-insensitive match: lowercase column B before checking for "mark"
  val markCount = df.where(lower($"column B") contains "mark").count()
  if (count == markCount)
    df.select("column A")
  else
    spark.emptyDataFrame
}

buildMarkDf(dfWithMark).show(false)
```
With the statement `df.where(lower($"column B") contains "mark").count()`, if all rows contain "Mark", the count will be the same as `df.count()` on the original DataFrame.
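As a side note (an alternative sketch, not part of the answer above), both counts can also be computed in a single pass with one aggregation, assuming the same `dfWithMark` and `spark` session:

```scala
import org.apache.spark.sql.functions.{count, lower, sum, when}

// count all rows and the rows containing "mark" in one aggregation
val counts = dfWithMark.agg(
  count("*").as("total"),
  sum(when(lower($"column B") contains "mark", 1).otherwise(0)).as("marked")
).first()

val allMarked = counts.getLong(0) == counts.getLong(1)
val result = if (allMarked) dfWithMark.select("column A") else spark.emptyDataFrame
```

This scans the data once instead of twice, which can matter for large inputs.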
CodePudding user response:
Let us assume that your dataset is called `main`; we can filter and count the `colB` values that do not start with "Mark":
```scala
import org.apache.spark.sql.functions.col

// number of rows whose colB does NOT start with "Mark"
val nonMarkCount = main.filter(!col("colB").startsWith("Mark")).count()
```
Then we can do a simple if statement:

```scala
// select colA only when every row starts with "Mark"
val newDataset = if (nonMarkCount == 0) main.select("colA") else sparkSession.emptyDataFrame
```
This should do what you are looking for!
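For completeness, here is a minimal, self-contained sketch of this approach; the SparkSession setup and the `main` dataset with the `colA`/`colB` names are assumptions based on the snippets above. Note that `startsWith("Mark")` is case-sensitive; use `lower(col("colB")).contains("mark")` if you need the case-insensitive check from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val sparkSession = SparkSession.builder().master("local[*]").getOrCreate()
import sparkSession.implicits._

// sample data mirroring the question: the third row has no "Mark"
val main = Seq(
  ("123", "Mark 123"),
  ("456", "Mark 456"),
  ("789", "789")
).toDF("colA", "colB")

val nonMarkCount = main.filter(!col("colB").startsWith("Mark")).count()
val newDataset = if (nonMarkCount == 0) main.select("colA") else sparkSession.emptyDataFrame

newDataset.show() // prints an empty result here, since "789" does not start with "Mark"
```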