Local data cannot be read in a Dataproc cluster, when using SparkNLP


I am trying to build a Dataproc cluster with Spark NLP installed on it, and then quickly test it by reading some CoNLL 2003 data. First, I used this codelab as inspiration to build my own, smaller cluster (the project name has been edited for privacy):

gcloud dataproc clusters create s17-sparknlp-experiments \
     --enable-component-gateway \
     --region us-west1 \
     --metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.5' \
     --zone us-west1-a \
     --single-node \
     --master-machine-type n1-standard-4 \
     --master-boot-disk-size 35 \
     --image-version 1.5-debian10 \
     --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
     --optional-components JUPYTER,ANACONDA \
     --project my-project

I started the cluster, opened JupyterLab, and downloaded these CoNLL 2003 files into an original directory located at the filesystem root. If this was done correctly, running the following command:

cd / && head -n 5 original/eng.train

should produce the following output:

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

This means the files should be readable by the following Python code, which runs in a single-cell Jupyter notebook:

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.training import CoNLL
import sparknlp

spark = sparknlp.start()
print("Spark NLP version: ", sparknlp.version())  # 2.4.4
print("Apache Spark version: ", spark.version)    # 2.4.8

# Other info of possible interest:
# Python 3.6.13 :: Anaconda, Inc.
# openjdk version "1.8.0_312"
# OpenJDK Runtime Environment (Temurin)(build 1.8.0_312-b07)
# OpenJDK 64-Bit Server VM (Temurin)(build 25.312-b07, mixed mode)

training_data = CoNLL().readDataset(spark, 'original/eng.train')  # The exact same path used before...
training_data.show()

Instead, the following error gets triggered:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-2b145ab3b733> in <module>
----> 1 training_data = CoNLL().readDataset(spark, 'original/eng.train')
      2 training_data.show()

/opt/conda/anaconda/lib/python3.6/site-packages/sparknlp/training.py in readDataset(self, spark, path, read_as)
     32         jSession = spark._jsparkSession
     33 
---> 34         jdf = self._java_obj.readDataset(jSession, path, read_as)
     35         return DataFrame(jdf, spark._wrapped)
     36 

/opt/conda/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/conda/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o87.readDataset.
: java.io.FileNotFoundException: file or folder: original/eng.train not found
    at com.johnsnowlabs.nlp.util.io.ResourceHelper$SourceStream.<init>(ResourceHelper.scala:44)
    at com.johnsnowlabs.nlp.util.io.ResourceHelper$.parseLines(ResourceHelper.scala:215)
    at com.johnsnowlabs.nlp.training.CoNLL.readDocs(CoNLL.scala:31)
    at com.johnsnowlabs.nlp.training.CoNLL.readDataset(CoNLL.scala:198)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

QUESTION: What could possibly be going wrong here?

CodePudding user response:

I think the problem is related to the fact that, as you can see in the library source code (1, 2), CoNLL().readDataset() reads the information from HDFS.

You downloaded the required files and uncompressed them on your cluster master node's local file system, but you need to make that content accessible through HDFS.

Please try copying the files to HDFS and then repeat the test.
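
If it helps, here is a minimal sketch of that copy step, done from the same notebook through the Hadoop FileSystem API that the running Spark session already exposes. The local path assumes the files sit in /original on the master node, and the HDFS destination /user/root/eng.train is only an illustrative choice:

# Copy the CoNLL file from the master node's local disk into HDFS,
# then read it from there with Spark NLP.
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.copyFromLocalFile(
    hadoop.fs.Path('file:///original/eng.train'),  # local path on the master node
    hadoop.fs.Path('/user/root/eng.train'))        # assumed HDFS destination

training_data = CoNLL().readDataset(spark, '/user/root/eng.train')
training_data.show()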

CodePudding user response:

@jccampanero pointed me in the right direction, though with some tweaks. Specifically, you must store the files you want to import in a Google Cloud Storage bucket, and then use that file's URI in readDataset:

training_data = CoNLL().readDataset(spark, 'gs://my-bucket/subfolders/eng.train')
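To get the files into the bucket from the master node, one option is the google-cloud-storage client that was already installed through PIP_PACKAGES when the cluster was created. A rough sketch, where the bucket and object names are just placeholders matching the URI above:

# Upload the local CoNLL file to a GCS bucket, then read it back
# through its gs:// URI with Spark NLP.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')               # placeholder bucket name
blob = bucket.blob('subfolders/eng.train')        # placeholder object path
blob.upload_from_filename('/original/eng.train')  # local path on the master node

training_data = CoNLL().readDataset(spark, 'gs://my-bucket/subfolders/eng.train')
training_data.show()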

This is not the only valid option; other ways to achieve the same result can be found here.
