Save and load JSON and Scala objects with Spark


I am encountering an issue reading and writing files with Spark to a "remote" file system (such as Hadoop's HDFS).

Contents

  1. What did I do locally?
  2. What do I want to do on 'remote'?

1. What did I do locally?

So far, I have worked with Spark locally, reading and writing files on my own machine, as follows:

Spark-Session Initialization:

  import org.apache.spark.sql.SparkSession
  import scala.util.{Failure, Success, Try}

  val spark: SparkSession = Try(
    SparkSession.builder()
      .master("local[*]")
      .appName("app")
      .getOrCreate()) match {
    case Success(session) => session
    case Failure(exception) => throw new Exception(s"Failed initializing Spark, due to: ${exception.getMessage}")
  }

Save/Write locally, and then Load/Read it:

(JSON File)

  import java.io.PrintWriter
  import scala.io.Source

  val content = """{"a": 10, "b": [], "c": [{"x": "1", "z": {}}, {"x": "2", "z": {}}]}"""  // dummy JSON as a string
  val fileName = "full_path/sample.json"

  // ... verify directory exists and create it if not ...

  // write sample.json with the content above:
  new PrintWriter(fileName) {
    write(content)
    close()
  }

  // Read & Operate on it:
  val jsonAsBufferedSource = Source.fromFile(fileName)

(Scala's Case-Class)

  import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

  case class Dummy(string: String, i: Int) extends Serializable
  val content = Dummy("42 is the best number", 42)       // Dummy instance
  val fileName = "full_path/sample.dummy"                // 'dummy' is the serialized saved-object name.
  
  // ... verify directory exists and create it if not ...

  // Write it:
  val output = new ObjectOutputStream(new FileOutputStream(fileName))
  output.writeObject(content)
  output.close()

  // Read:
  val input = new ObjectInputStream(new FileInputStream(fileName))
  val dummyObject = input.readObject.asInstanceOf[Dummy]
  input.close()

  // Operate:
  dummyObject.i   // 42


2. What do I want to do on 'remote'?

I want to be able to read/write to HDFS, S3, or any other "remote" file system out there with Spark, just as I did locally.

My main questions are:

  • Spark Configuration: what needs to change, and how? [master, etc.]
  • Working With Spark:
    • How can I save and load serializable objects, as I did locally?
    • How can I save a JSON string, and load it as a BufferedSource?

Generally speaking, I would like to be able to work locally or remotely through the same "internal interfaces" of my application.

Thank you for reading!

EDIT

I would like my app to save/read files on my computer's local disk when testing and debugging, and to save/read against a remote file system when in production.
Is this possible using the same Spark methods? With what Spark configuration?

Oren

CodePudding user response:

Not sure I understand the question. Spark works with file://, hdfs://, and s3a:// prefixes all the same; it is Source.fromFile and PrintWriter that are the problem.

You'll need to rewrite those functions to use proper Spark methods, since Spark is meant to run on a cluster, not isolated to one machine (referred to as the driver).

  // read all JSON files in a folder
  val df = spark.read.json("file:///path/to/full_path/")

  // write the DataFrame to an HDFS folder
  df.write.format("json").save("hdfs://namenode.fqdn:port/hdfs/path/")
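
Reading the data back works the same way; only the URI scheme changes. A small sketch (the namenode and bucket names below are just placeholders):

  // read the JSON back from HDFS, or from S3 via an s3a:// path (placeholder locations)
  val dfFromHdfs = spark.read.json("hdfs://namenode.fqdn:port/hdfs/path/")
  val dfFromS3 = spark.read.json("s3a://some-bucket/some/path/")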

Sure, you could serialize a class, write it to a file "locally" (which would be "remotely" when deploy-mode=cluster), and then upload that, but that doesn't seem to be what you're doing here. Rather than doing that, you would parallelize a Seq of the objects themselves, for example by turning them into a Dataset, as sketched below.
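
As a rough sketch of that idea (the HDFS path is a placeholder, and it reuses the Dummy case class from the question), you could write the instances as JSON and read them back instead of Java-serializing them:

  // sketch: write case-class instances as JSON and read them back (placeholder path)
  import spark.implicits._

  val ds = Seq(Dummy("42 is the best number", 42)).toDS()
  ds.write.mode("overwrite").json("hdfs://namenode.fqdn:port/hdfs/dummies/")

  // pass the Dataset's schema so `i` comes back as an Int instead of an inferred Long
  val loaded = spark.read.schema(ds.schema).json("hdfs://namenode.fqdn:port/hdfs/dummies/").as[Dummy]
  loaded.collect().head.i   // 42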

Use json4s rather than ObjectOutputStream to get JSON from case-classes.
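
For example, a minimal json4s sketch (assuming the json4s-jackson dependency is available), again with the Dummy class from the question:

  import org.json4s.DefaultFormats
  import org.json4s.jackson.Serialization.{read, write}

  implicit val formats: DefaultFormats.type = DefaultFormats

  val json = write(Dummy("42 is the best number", 42))   // {"string":"42 is the best number","i":42}
  val back = read[Dummy](json)                           // Dummy("42 is the best number", 42)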
