I need to create a proof of concept using pyspark, and I was wondering if there is a way to install and use it via pip without having to install and configure Spark itself. I've read a few answers suggesting that newer versions of pyspark can run in standalone mode without needing a full Spark installation, but when I try that I get the following error:
Traceback (most recent call last):
  File "C:\Users\320181940\PycharmProjects\meetup\main.py", line 8, in <module>
    sc = SparkContext("local", "meetup_etl")
  File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\context.py", line 144, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\context.py", line 331, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\java_gateway.py", line 101, in launch_gateway
    proc = Popen(command, **popen_kwargs)
  File "C:\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Python310\lib\subprocess.py", line 1435, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
I installed pyspark 3.1.3 using pip, and I'm trying to run this on Windows 10. Any help would be much appreciated.
CodePudding user response:
You need to install Java and set JAVA_HOME in your environment variables. The FileNotFoundError comes from pyspark's launch_gateway trying to start a JVM process and not finding the Java executable.
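As a quick sanity check before creating a SparkContext, you can verify from Python that Java is actually visible (java_available is a hypothetical helper for illustration, not part of pyspark):

```python
import os
import shutil

def java_available(env=os.environ):
    """Return True when JAVA_HOME is set and the java executable is on PATH."""
    return bool(env.get("JAVA_HOME")) and shutil.which("java") is not None

if not java_available():
    print("Install a JDK and set JAVA_HOME before launching pyspark")
```

If this prints the warning, pyspark will fail with the same WinError 2 shown in the question.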
CodePudding user response:
Start a Python interpreter, create a Spark session, and run your code. Here's an example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [["I'm ready!"],
     ["If I could put into words how much I love waking up at 6 am on Mondays I would."]]
).toDF("text")
df.show()
Also make sure to set up HADOOP_HOME as specified in this gist.
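On Windows, pyspark expects HADOOP_HOME to point at a directory containing bin\winutils.exe. A minimal sketch of setting it from Python before building the session (the C:\hadoop path is an assumption; use wherever you placed winutils):

```python
import os

# Assumed location: C:\hadoop\bin\winutils.exe -- adjust to your setup
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")
# Make winutils.exe reachable by the JVM that pyspark launches
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")
```

Setting these in code only affects the current process; setting them as system environment variables works for every run.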