We are running a Spark job against our Kubernetes cluster and trying to log the model to MLflow. We are on Spark 3.2.1 and MLflow 1.26.1, and we use the following jars to communicate with S3: hadoop-aws-3.2.2.jar and aws-java-sdk-bundle-1.11.375.jar.
Our spark-submit job is configured with the following parameters:
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.fast.upload=true \
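For completeness, the same S3A settings can also be applied when the SparkSession is built in code. This is only a sketch of that alternative; the app name is a placeholder and the values simply mirror the spark-submit flags above:

# Sketch: the S3A flags from the spark-submit command, applied on the SparkSession builder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mlflow-s3-logging")  # placeholder name
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .getOrCreate()
)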
When we try to save our Spark model with mlflow.spark.log_model(), we get the following exception:
22/06/24 13:27:21 ERROR Instrumentation: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.ml.util.FileSystemOverwrite.handleOverwrite(ReadWrite.scala:673)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:167)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.super$save(Pipeline.scala:344)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$4(Pipeline.scala:344)
at org.apache.spark.ml.MLEvents.withSaveInstanceEvent(events.scala:174)
at org.apache.spark.ml.MLEvents.withSaveInstanceEvent$(events.scala:169)
at org.apache.spark.ml.util.Instrumentation.withSaveInstanceEvent(Instrumentation.scala:42)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3(Pipeline.scala:344)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.$anonfun$save$3$adapted(Pipeline.scala:344)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.save(Pipeline.scala:344)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Unknown Source)
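For reference, the logging call looks roughly like the following; the pipeline and data below are toy placeholders, not our actual job:

# Sketch of the failing logging step, with a toy pipeline standing in for the real model.
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 0.0), (2.0, 1.0, 1.0)], ["f1", "f2", "label"])
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

with mlflow.start_run():
    # This is the call that raises the UnsupportedFileSystemException above.
    mlflow.spark.log_model(model, artifact_path="model")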
We tried to start our MLflow server with --default-artifact-root set to s3a://..., but when we run our Spark job and call mlflow.get_artifact_uri() (which is also used to construct the upload URI in mlflow.spark.log_model()), the returned URI uses the plain s3:// scheme, which probably causes the exception mentioned above.
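To illustrate, checking the artifact URI inside an active run looks like this; the printed bucket and IDs are placeholders for what we actually see:

import mlflow

with mlflow.start_run():
    # In our case this prints an s3:// URI, e.g. s3://<bucket>/<experiment_id>/<run_id>/artifacts
    print(mlflow.get_artifact_uri())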
Since Hadoop dropped support for the s3:// filesystem, does anyone know how to log Spark models to S3 using MLflow?
Cheers
CodePudding user response:
In addition to the spark.hadoop.fs.s3a.impl config parameter, you can try also setting spark.hadoop.fs.s3.impl to org.apache.hadoop.fs.s3a.S3AFileSystem, so that URIs with the plain s3:// scheme are routed to the S3A filesystem as well.
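For the spark-submit setup from the question, that would be one extra flag along these lines:

--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \

or, if the session is built in code, an extra .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") on the SparkSession builder.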