I am using Google Colab and I cannot seem to use GraphFrames.
This is what I do:
!pip install pyspark
Which gives:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
Downloading pyspark-3.3.0.tar.gz (281.3 MB)
|████████████████████████████████| 281.3 MB 56 kB/s
Collecting py4j==0.10.9.5
Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
|████████████████████████████████| 199 kB 54.0 MB/s
Building wheels for collected packages: pyspark
Building wheel for pyspark (setup.py) ... done
Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=d063a038f20ea3d5245b07282802a899dc3d54ddd31e786d592f39cac89d4651
Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0
Then
!pip install graphframes
Which gives:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting graphframes
Downloading graphframes-0.6-py2.py3-none-any.whl (18 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from graphframes) (1.21.6)
Collecting nose
Downloading nose-1.3.7-py3-none-any.whl (154 kB)
|████████████████████████████████| 154 kB 9.6 MB/s
Installing collected packages: nose, graphframes
Successfully installed graphframes-0.6 nose-1.3.7
Then I run the following code:
vertices = spark.createDataFrame([('1', 'Carter', 'Derrick', 50),
('2', 'May', 'Derrick', 26)],
['id', 'name', 'firstname', 'age'])
edges = spark.createDataFrame([('1', '2', 'friend'),
('2', '1', 'friend')],
['src', 'dst', 'type'])
g = GraphFrame(vertices, edges)
Which gives:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-231e24e084b3> in <module>
5 ('2', '1', 'friend')],
6 ['src', 'dst', 'type'])
----> 7 g = GraphFrame(vertices, edges)
4 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o130.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Attempt 1:
I found this SO answer and tried to instantiate Spark the way it suggests:
spark = SparkSession.builder.master("local[*]").config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.4-s_2.11").getOrCreate()
But I get a new error:
/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py:149: UserWarning: DataFrame.sql_ctx is an internal property, and will be removed in future releases. Use DataFrame.sparkSession instead.
"DataFrame.sql_ctx is an internal property, and will be removed "
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-5-231e24e084b3> in <module>
5 ('2', '1', 'friend')],
6 ['src', 'dst', 'type'])
----> 7 g = GraphFrame(vertices, edges)
3 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o71.createGraph.
: java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps scala.Predef$.refArrayOps(java.lang.Object[])'
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Here is the example colab:
https://colab.research.google.com/drive/1FLhlUQDwmRmlgSKPKItBlTSsFLrQt7_F#scrollTo=RTW-LuBVF1zv
CodePudding user response:
Solution:
The NoSuchMethodError in Attempt 1 is a Scala binary mismatch: graphframes:0.7.0-spark2.4-s_2.11 was built for Scala 2.11, while PySpark 3.3.0 ships with Scala 2.12. Make sure that the jar version in the curl command and in the Spark session initialisation are the same, and that the suffix matches your Spark and Scala versions.
!pip install pyspark
!pip install graphframes
I want to use the latest version at the time of writing, 0.8.2-spark3.2-s_2.12, so:
!curl -L -o "/usr/local/lib/python3.7/dist-packages/pyspark/jars/graphframes-0.8.2-spark3.2-s_2.12.jar" https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar
(Note that the target path must use your runtime's Python version — the pip logs above show python3.7 — and that the old Bintray spark-packages host was shut down; its Maven repository now lives at repos.spark-packages.org.)
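Before building the session, a quick sanity check can confirm the jar actually landed where PySpark will look for it. This is a sketch; the jars-directory path is an assumption for this Colab runtime and should match your Python version:

```python
from pathlib import Path

def find_graphframes_jars(jars_dir):
    """Return the names of any graphframes jars found under jars_dir."""
    return sorted(p.name for p in Path(jars_dir).glob("graphframes-*.jar"))

# Assumed Colab location; adjust python3.7 to your runtime's version.
# find_graphframes_jars("/usr/local/lib/python3.7/dist-packages/pyspark/jars")
```

If the list comes back empty, the curl download failed and the ClassNotFoundException will reappear.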
and then
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.master("local[*]").config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12").getOrCreate()
See colab.
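One way to keep the downloaded jar and the spark.jars.packages coordinate from drifting apart is to derive both strings from a single version constant. A minimal sketch, assuming the 0.8.2-spark3.2-s_2.12 release chosen above (the base URL is an assumption and may differ for your mirror):

```python
# Single source of truth for the GraphFrames release string.
GF_VERSION = "0.8.2-spark3.2-s_2.12"

# Jar filename, Maven coordinate, and download URL all derive from it.
jar_name = "graphframes-{}.jar".format(GF_VERSION)
maven_coordinate = "graphframes:graphframes:{}".format(GF_VERSION)
jar_url = "https://repos.spark-packages.org/graphframes/graphframes/{}/{}".format(
    GF_VERSION, jar_name)
```

Pass maven_coordinate to .config("spark.jars.packages", ...) and use jar_url in the curl step, so a version bump touches exactly one line.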