Can I run pyspark locally without installing spark on windows 10?-CodePudding

I need to create a proof of concept using pyspark and I was wondering if there is a way to install it and use it via pip without having to install and configure spark itself. I've read a few answers suggesting that the newer versions of pyspark allow you to run it in standalone mode without without needing the full spark but when I try that I get the following error:

Traceback (most recent call last):
  File "C:\Users\320181940\PycharmProjects\meetup\main.py", line 8, in <module>
    sc = SparkContext("local", "meetup_etl")
  File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\context.py", line 144, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\context.py", line 331, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\java_gateway.py", line 101, in launch_gateway
    proc = Popen(command, **popen_kwargs)
  File "C:\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Python310\lib\subprocess.py", line 1435, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

I installed pyspark 3.1.3 using pip, and I'm trying to run this on Windows 10. Any help would be much appreciated.

CodePudding user response：

You need to install java and add JAVA_HOME to your environment variables path

CodePudding user response：

Start a python interpreter, create a spark session and run your code, here's an example:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
        [["I'm ready!"], ["If I could put into words how much I love waking up at 6 am on Mondays I would."]]).toDF(
        "text")
df.show()

Also make sure to set up HADOOP_HOME like it's specified in this gist