How to create a dataframe only if all columns contain a certain value


Let's say I have a dataframe that looks like the following

Column A    Column B
123         Mark 123
456         Mark 456
789         789

How do I create a dataframe in Spark Scala with the following condition: if all column B values contain "Mark", a dataframe with column A is created; otherwise, an empty dataframe is created.

CodePudding user response:

I think you can simply count the rows whose column B contains "mark" and compare that with the total row count, in order to evaluate whether all rows have it. This should do the trick:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lower
import spark.implicits._

val dfWithMark = Seq(
  ("123", "Mark 123"),
  ("456", "Mark 456"),
  ("789", "Mark 789")
).toDF("column A", "column B")

def buildMarkDf(df: DataFrame): DataFrame = {
  // Total number of rows in the dataframe
  val count = df.count()
  // Number of rows whose column B contains "mark" (case-insensitive)
  val markCount = df.where(lower($"column B") contains "mark").count()

  // Keep only column A when every row matched; otherwise return an
  // empty dataframe
  if (count == markCount)
    df.select("column A")
  else
    spark.emptyDataFrame
}

buildMarkDf(dfWithMark).show(false)

With the statement df.where(lower($"column B") contains "mark").count(), if every row of column B contains "mark", that count will be the same as the original df.count().
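As a variation (a sketch not in the original answer, assuming the same spark session and column names), the check can be done in a single aggregation instead of two separate count jobs: take the min of a per-row boolean flag, which is true only when every row contains "mark".

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, lit, lower, min}
import spark.implicits._

def buildMarkDfSinglePass(df: DataFrame): DataFrame = {
  // min over a boolean column is true only if the flag is true for
  // every row; coalesce covers the empty-dataframe case, where min
  // returns null
  val allMark = df
    .agg(coalesce(min(lower($"column B") contains "mark"), lit(true)))
    .first()
    .getBoolean(0)

  if (allMark) df.select("column A") else spark.emptyDataFrame
}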

CodePudding user response:

Assuming that your dataset is called main, we can filter and count the colB values that do not start with "Mark":

import org.apache.spark.sql.functions.col

// Number of rows whose colB does not start with "Mark"
val nonMarkCount = main.filter(!col("colB").startsWith("Mark")).count()

Then, we can do a simple if statement:

val newDataset = if (nonMarkCount == 0) main.select("colA") else sparkSession.emptyDataFrame

This should do what you are looking for!
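For instance, here is a minimal run using the sample data from the question (colA and colB are assumed column names, and the SparkSession is assumed to be named spark): the value "789" does not start with "Mark", so nonMarkCount is 1 and newDataset ends up empty.

import org.apache.spark.sql.functions.col
import spark.implicits._

// Sample data mirroring the question (colA/colB are assumed names)
val main = Seq(
  ("123", "Mark 123"),
  ("456", "Mark 456"),
  ("789", "789")
).toDF("colA", "colB")

val nonMarkCount = main.filter(!col("colB").startsWith("Mark")).count()
// nonMarkCount == 1 because of the "789" row, so we fall through to
// the empty dataframe
val newDataset =
  if (nonMarkCount == 0) main.select("colA") else spark.emptyDataFrame

newDataset.show()  // prints an empty result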
