Running these seems to work:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
Then the following line does not work and produces this error:
!tar xf spark-3.3.0-bin-hadoop3.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
When I run !pwd, I am in the same folder where my spark-3.3.0-bin-hadoop3.tgz is located.
------------EDIT------------
For everyone else having this same error, forget the whole thing. There is a much easier way with 5 lines of code.
Run these instead and PySpark should be set up automatically in Google Colab:
!pip install pyspark
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark
# Import a Spark function from the library
from pyspark.sql.functions import col
CodePudding user response:
I think you should remove the -q flag from the wget command and see what's happening.
The thing is, I could only reproduce your problem with the following actions:
- Suppose I accidentally tried to download Spark from the following link (redirector link):
!wget -q https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
The command above downloaded a file, but it is actually an HTML page that was merely saved under the filename spark-3.3.0-bin-hadoop3.tgz.
- Suppose also that later I discovered my mistake, and decided to download from the proper link. I removed the -q flag to show what's happening:
!wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
2022-09-12 12:03:09 (224 MB/s) - ‘spark-3.3.0-bin-hadoop3.tgz.1’ saved [299321244/299321244]
Since the file spark-3.3.0-bin-hadoop3.tgz already exists, wget saves the new download under a different filename (spark-3.3.0-bin-hadoop3.tgz.1).
- So, when I try to unpack the file, basically I'm trying to unpack the first downloaded file, i.e. the wrong one:
!tar xf spark-3.3.0-bin-hadoop3.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
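To confirm which of the downloaded files is the real archive before running tar, one option is to look at the first two bytes: a gzip file starts with 0x1f 0x8b, while an HTML page does not. Below is a minimal sketch of that check in Python. The helper names is_gzip and find_gzip_candidates are made up for illustration, and the sketch assumes the duplicates follow the .1, .2, ... naming shown in the wget output above.

```python
import glob
import os

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def find_gzip_candidates(base_name, directory="."):
    """Hypothetical helper: list `base_name` and any files sharing that
    prefix (e.g. `base_name.1`) that really are gzip files."""
    pattern = glob.escape(os.path.join(directory, base_name)) + "*"
    return sorted(
        p for p in glob.glob(pattern) if os.path.isfile(p) and is_gzip(p)
    )


# Print every candidate in the current folder that is a real gzip archive,
# together with its size in bytes.
for path in find_gzip_candidates("spark-3.3.0-bin-hadoop3.tgz"):
    print(path, os.path.getsize(path))
```

In the scenario above this would be expected to list only spark-3.3.0-bin-hadoop3.tgz.1, which is the file to pass to tar (or to rename after deleting the HTML file).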