Let's say I have a dataframe that looks like the following:
| Column A | Column B |
|---|---|
| 123 | Mark 123 |
| 456 | Mark 456 |
| 789 | 789 |
How do I create a DataFrame in Spark Scala with the following condition: if all Column B values contain "Mark", a DataFrame with Column A is created; otherwise, an empty DataFrame is created?
CodePudding user response:
You can simply count the rows whose column B contains "mark" and compare that with the total row count to check whether all rows have it. I think this should do the trick:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lower
import spark.implicits._

val dfWithMark = Seq(
  ("123", "Mark 123"),
  ("456", "Mark 456"),
  ("789", "Mark 789")
).toDF("column A", "column B")

def buildMarkDf(df: DataFrame): DataFrame = {
  val count = df.count()
  // case-insensitive match: lowercase column B before checking for "mark"
  val markCount = df.where(lower($"column B") contains "mark").count()
  if (count == markCount)
    df.select("column A")
  else
    spark.emptyDataFrame
}

buildMarkDf(dfWithMark).show(false)
```
With the statement `df.where(lower($"column B") contains "mark").count()`, if all rows contain "Mark", the count will be the same as `df.count()` on the original DataFrame.
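As a side note (an alternative sketch, not part of the answer above), both counts can also be computed in a single pass with one aggregation, assuming the same `dfWithMark` and `spark` session:

```scala
import org.apache.spark.sql.functions.{count, lower, sum, when}

// count all rows and the rows containing "mark" in one aggregation
val counts = dfWithMark.agg(
  count("*").as("total"),
  sum(when(lower($"column B") contains "mark", 1).otherwise(0)).as("marked")
).first()

val allMarked = counts.getLong(0) == counts.getLong(1)
val result = if (allMarked) dfWithMark.select("column A") else spark.emptyDataFrame
```

This scans the data once instead of twice, which can matter for large inputs.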
CodePudding user response:
Let us assume that your dataset is called `main`; we can filter and count the `colB` values that do not start with "Mark":
```scala
import org.apache.spark.sql.functions.col

// number of rows whose colB does NOT start with "Mark"
val nonMarkCount = main.filter(!col("colB").startsWith("Mark")).count()
```
Then we can do a simple if statement:

```scala
// select colA only when every row starts with "Mark"
val newDataset = if (nonMarkCount == 0) main.select("colA") else sparkSession.emptyDataFrame
```
This should do what you are looking for!
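For completeness, here is a minimal, self-contained sketch of this approach; the SparkSession setup and the `main` dataset with the `colA`/`colB` names are assumptions based on the snippets above. Note that `startsWith("Mark")` is case-sensitive; use `lower(col("colB")).contains("mark")` if you need the case-insensitive check from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val sparkSession = SparkSession.builder().master("local[*]").getOrCreate()
import sparkSession.implicits._

// sample data mirroring the question: the third row has no "Mark"
val main = Seq(
  ("123", "Mark 123"),
  ("456", "Mark 456"),
  ("789", "789")
).toDF("colA", "colB")

val nonMarkCount = main.filter(!col("colB").startsWith("Mark")).count()
val newDataset = if (nonMarkCount == 0) main.select("colA") else sparkSession.emptyDataFrame

newDataset.show() // prints an empty result here, since "789" does not start with "Mark"
```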