Developing a multi-file Scala package in Spark EMR notebook


I'm basically looking for a way to do Spark-based Scala development in EMR. I'd have a couple of project files on the Hadoop cluster:

// mypackage.scala
package mypackage

// <Spark-dependent Scala code>

// subpackage.scala
package mypackage

// Scala 2 doesn't allow top-level defs, so they live in a package object
package object subpackage {
  def myfunc(x: String) = {
    ...
  }
  // <more Spark-dependent Scala code>
}

I want to be able to edit these scripts on the fly and then import the changes into my EMR notebook.

// EMR_notebook.ipynb
import mypackage.subpackage.myfunc
val output = myfunc("foo")

I understand that

  1. You generally have to compile Scala code with sbt before you can use it (a rough build.sbt sketch follows below), and
  2. The best way to import modified Scala code into an EMR notebook is via a jar file, i.e.
%%configure -f
{ 
    "jars": ["s3://path_to_myproject_jarfile.jar"]
}
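
For what it's worth, a minimal build.sbt for a layout like this might look roughly as follows; the project name, Scala version, and Spark version below are placeholders that would need to match whatever the EMR release ships with. Marking the Spark dependencies as "provided" keeps them out of the jar, since the cluster already supplies them.

// build.sbt (illustrative; pin versions to your EMR release)
name := "myproject"
version := "0.1.0"
scalaVersion := "2.12.15"

// Spark is already on the cluster, so keep it out of the packaged jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided"
)

With that in place, sbt package produces something like target/scala-2.12/myproject_2.12-0.1.0.jar, which is the file to copy to S3 (e.g. with aws s3 cp) and point %%configure at.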

But this means that to debug my package, I'd have to modify mypackage.scala and subpackage.scala, compile with sbt, upload the jar to S3, and restart the Spark kernel so the jar gets re-imported; only then could I re-run my code and see the effect of any changes. So I'm hoping there's a more efficient way to handle this situation.

Apologies for any ambiguity/scala illiteracy. Thanks!

CodePudding user response:

Yes, you are correct; this is the only way I know of too. But that is precisely why spark-shell exists: if you use something like Databricks, or log in to the EMR master node and open spark-shell, you can run your bits there to check them before going through the full jar/notebook cycle.
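
To make that concrete, here's a rough sketch of the iteration loop on the EMR master node; the file path is made up, and if I recall correctly the -raw flag is what lets :paste compile a file that contains a package declaration.

// On the EMR master node, start the REPL:
//   $ spark-shell

// Compile the edited source file straight into the running session
scala> :paste -raw /home/hadoop/myproject/src/main/scala/subpackage.scala

// Then exercise it exactly as you would in the notebook
scala> import mypackage.subpackage.myfunc
scala> val output = myfunc("foo")

Once an edit looks right in spark-shell, you only need to do the sbt/S3/notebook round trip once, instead of on every change.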
