I am trying to build a Dataproc cluster with Spark NLP installed on it, and then quickly test it by reading some CoNLL 2003 data. First, I used this codelab as inspiration to build my own, smaller cluster (the project
name has been edited for safety purposes):
gcloud dataproc clusters create s17-sparknlp-experiments \
--enable-component-gateway \
--region us-west1 \
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.5' \
--zone us-west1-a \
--single-node \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 35 \
--image-version 1.5-debian10 \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--optional-components JUPYTER,ANACONDA \
--project my-project
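As a side note: since --enable-component-gateway is set, the JupyterLab link can be copied from the Cloud Console, or retrieved from the command line with something like the following (the format expression is an assumption and may vary across gcloud versions):
gcloud dataproc clusters describe s17-sparknlp-experiments \
--region us-west1 \
--format 'value(config.endpointConfig.httpPorts)'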
I opened JupyterLab on the newly created cluster, then downloaded these CoNLL 2003 files into the original directory at the file system root (which is where ~/original resolves in this setup).
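The download amounts to something like this (the mirror URL is a placeholder; use the actual links behind "these CoNLL 2003 files" above):
mkdir -p /original && cd /original
wget https://<conll-2003-mirror>/eng.train
If done correctly, when you run these commands: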
cd / && head -n 5 original/eng.train
The following result should be obtained:
-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
This means these files should be readable by the following Python code, which lives in a single-cell Jupyter Notebook:
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.training import CoNLL
import sparknlp
spark = sparknlp.start()
print("Spark NLP version: ", sparknlp.version()) # 2.4.4
print("Apache Spark version: ", spark.version) # 2.4.8
# Other info of possible interest:
# Python 3.6.13 :: Anaconda, Inc.
# openjdk version "1.8.0_312"
# OpenJDK Runtime Environment (Temurin)(build 1.8.0_312-b07)
# OpenJDK 64-Bit Server VM (Temurin)(build 25.312-b07, mixed mode)
training_data = CoNLL().readDataset(spark, 'original/eng.train') # The exact same path used before...
training_data.show()
Instead, the following error gets triggered:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-2b145ab3b733> in <module>
----> 1 training_data = CoNLL().readDataset(spark, 'original/eng.train')
2 training_data.show()
/opt/conda/anaconda/lib/python3.6/site-packages/sparknlp/training.py in readDataset(self, spark, path, read_as)
32 jSession = spark._jsparkSession
33
---> 34 jdf = self._java_obj.readDataset(jSession, path, read_as)
35 return DataFrame(jdf, spark._wrapped)
36
/opt/conda/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/conda/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o87.readDataset.
: java.io.FileNotFoundException: file or folder: original/eng.train not found
at com.johnsnowlabs.nlp.util.io.ResourceHelper$SourceStream.<init>(ResourceHelper.scala:44)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$.parseLines(ResourceHelper.scala:215)
at com.johnsnowlabs.nlp.training.CoNLL.readDocs(CoNLL.scala:31)
at com.johnsnowlabs.nlp.training.CoNLL.readDataset(CoNLL.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
QUESTION: What could possibly be going wrong here?
CodePudding user response:
I think the problem is that, as you can see in the library source code (1, 2), CoNLL().readDataset()
reads the information from HDFS.
You downloaded the required files and uncompressed them on your cluster master node's file system, but you need to make that content accessible through HDFS.
Please try copying it to HDFS and then repeat the test.
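A minimal sketch of that copy, assuming the files sit under /original on the master node and that the notebook runs as root (adjust the paths and the user to your setup):
hdfs dfs -mkdir -p /user/root/original
hdfs dfs -put /original/eng.train /user/root/original/eng.train
After the copy, the relative path original/eng.train should resolve against that HDFS home directory (/user/root) instead of the local disk.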
CodePudding user response:
@jccampanero pointed me in the right direction, though some tweaks were needed. Specifically, you must store the files you want to import in some Google Cloud Storage bucket, and then use that file's URI in readDataset:
training_data = CoNLL().readDataset(spark, 'gs://my-bucket/subfolders/eng.train')
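For completeness, uploading the file to the bucket can be done with gsutil (the bucket name and path are the same illustrative ones used in the snippet above):
gsutil cp /original/eng.train gs://my-bucket/subfolders/eng.train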
This is not the only valid option; other ways to achieve the same task can be found here.