Mock Spark DataFrameReader .option

I want to parameterize the header flag and the separator when I read a CSV with Spark. I've written this:

DataFrameReader dataFrameReader = spark.read();

dataFrameReader = "csv".equalsIgnoreCase(params.getReadFileType())
    ? dataFrameReader
        .option("sep", params.getDelimiter())
        .option("header", params.isHeader())
    : dataFrameReader;

I'm new to Groovy and I can't get dataFrameReader.option mocked correctly.

DataFrameReader dfReaderLoader = Mock(DataFrameReader)
DataFrameReader dfReaderOptionString = Mock(DataFrameReader)
DataFrameReader dfReaderOptionBoolean = Mock(DataFrameReader)

SparkSession sparkSession = Mock(SparkSession)
sparkSession.read() >> dfReaderLoader
dfReaderLoader.option(_ as String, _ as String) >> dfReaderOptionString
dfReaderOptionString.option(_ as String, _ as Boolean) >>  dfReaderOptionBoolean

It gives me a NullPointerException:

java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.DataFrameReader.option(String, boolean)" because the return value of "org.apache.spark.sql.DataFrameReader.option(String, String)" is null

CodePudding user response:

If you don't really care about the intermediate invocations of a builder pattern, i.e. an object that returns itself, I'd suggest using a Stub, which returns itself whenever a method's return type matches its own type. For Mocks you can achieve the same with the declaration _ >> _.

given:
ThingBuilder builder = Mock() {
  _ >> _
}

when:
Thing thing = builder
  .id("id-42")
  .name("spock")
  .weight(100)
  .build()

then:
1 * builder.build() >> new Thing(id: 'id-1337') // <-- only assert the last call you actually care about
thing.id == 'id-1337'

Try it in the Groovy Web Console.
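To see why the naive Mock setup breaks: in a fluent builder, every intermediate call must return a non-null object for the next call in the chain to succeed, and a Mock returns null for any call that no stub matches. A minimal plain-Java sketch of such a builder (the ThingBuilder/Thing names are reused from the Spock example above purely for illustration; this is not Spark's API):

```java
// A minimal self-returning builder: each setter returns `this`,
// so calls can be chained. A Mock that returns null for any one
// intermediate call breaks the chain with a NullPointerException.
class ThingBuilder {
    private String id;
    private String name;
    private int weight;

    ThingBuilder id(String id)      { this.id = id; return this; }
    ThingBuilder name(String name)  { this.name = name; return this; }
    ThingBuilder weight(int weight) { this.weight = weight; return this; }

    // Simplified build() that just joins the fields.
    String build() { return id + "/" + name + "/" + weight; }
}

public class BuilderDemo {
    public static void main(String[] args) {
        String thing = new ThingBuilder()
            .id("id-42")
            .name("spock")
            .weight(100)
            .build();
        System.out.println(thing); // id-42/spock/100
    }
}
```

Spock's _ >> _ default response makes the Mock behave like this real builder for all intermediate calls, so only the final build() needs an explicit interaction.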

That being said, the error would probably go away if you simply removed the `as String` cast from the second argument of the `option` stub (a bare `_` matches any value), or changed it to `as Boolean` as the error message suggests.

CodePudding user response:

I do not know what your problem is, but my guess is that you create the mocks and then do not inject them into your class under test. If you do inject them, both your own version and Leonard's suggested improvement with a default response work:

Class under test and a helper class:

class UnderTest {
  SparkSession spark
  Parameters params

  DataFrameReader produce() {
    DataFrameReader dataFrameReader = spark.read()

    dataFrameReader = "csv".equalsIgnoreCase(params.getReadFileType()) ?
      dataFrameReader
        .option("sep", params.getDelimiter())
        .option("header", params.isHeader())
      : dataFrameReader
  }
}
class Parameters {
  String readFileType
  String delimiter
  boolean header
}

Spock specification:

package de.scrum_master.stackoverflow.q74923254

import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.SparkSession
import org.spockframework.mock.MockUtil
import spock.lang.Specification

class DataFrameReaderTest extends Specification {
  def 'read #readFileType data'() {
    given:
    DataFrameReader dfReaderLoader = Mock(DataFrameReader)
    DataFrameReader dfReaderOptionString = Mock(DataFrameReader)
    DataFrameReader dfReaderOptionBoolean = Mock(DataFrameReader)

    SparkSession sparkSession = Mock(SparkSession)
    sparkSession.read() >> dfReaderLoader
    dfReaderLoader.option(_ as String, _ as String) >> dfReaderOptionString
    dfReaderOptionString.option(_ as String, _ as Boolean) >> dfReaderOptionBoolean

    def underTest = new UnderTest(spark: sparkSession, params: parameters)

    expect:
    underTest.produce().toString().contains(returnedMockName)

    where:
    readFileType | parameters                                                               | returnedMockName
    'CSV'        | new Parameters(readFileType: readFileType, delimiter: ';', header: true) | 'dfReaderOptionBoolean'
    'XLS'        | new Parameters(readFileType: readFileType)                               | 'dfReaderLoader'
  }

  def 'read #readFileType data (improved)'() {
    given:
    SparkSession sparkSession = Mock() {
      read() >> Mock(DataFrameReader) {
        _ >> _
      }
    }

    def parameters = new Parameters(readFileType: readFileType, delimiter: ';', header: true)
    def underTest = new UnderTest(spark: sparkSession, params: parameters)

    expect:
    new MockUtil().isMock(underTest.produce())

    where:
    readFileType << ['CSV', 'XLS']
  }
}

Try it in the Groovy Web Console.

The result should look similar to this in your IDE:

DataFrameReaderTest ✔
├─ read #readFileType data ✔
│  ├─ read CSV data ✔
│  └─ read XLS data ✔
└─ read #readFileType data (improved) ✔
   ├─ read CSV data (improved) ✔
   └─ read XLS data (improved) ✔