Spark's .tgz File cannot be extracted on Google Colab?

Time:09-13

Running these seems to work:

# Install Java (OpenJDK 8, headless)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download Spark 3.3.0 (prebuilt for Hadoop 3)
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

Then the following line does not work and produces this error:

!tar xf spark-3.3.0-bin-hadoop3.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

When I run !pwd, I am in the same folder as spark-3.3.0-bin-hadoop3.tgz.

------------EDIT------------

For everyone else having this same error, forget the whole thing. There is a much easier way with 5 lines of code.

Run these instead and PySpark should be set up automatically in Google Colab:

!pip install pyspark

# Import SparkSession 
from pyspark.sql import SparkSession
# Create a Spark Session 
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information 
spark

# Import a Spark function from the library
from pyspark.sql.functions import col

CodePudding user response:

I think you should remove the -q flag from the wget command and see what's happening.

I could only reproduce your problem with the following steps:

  1. Suppose I accidentally tried to download Spark from the following link (redirector link):
!wget -q https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

The command above downloaded a file, but it is an HTML page that merely carries the filename spark-3.3.0-bin-hadoop3.tgz.

  2. Suppose also that I later discovered my mistake and decided to download from the proper link. I removed the -q flag to show what happens:
!wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
2022-09-12 12:03:09 (224 MB/s) - ‘spark-3.3.0-bin-hadoop3.tgz.1’ saved [299321244/299321244]

Since a file named spark-3.3.0-bin-hadoop3.tgz already exists, wget saves the new download under a different name (spark-3.3.0-bin-hadoop3.tgz.1).

  3. So when I try to unpack the file, I am actually unpacking the first downloaded file, i.e. the wrong one:
!tar xf spark-3.3.0-bin-hadoop3.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
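
One way to catch this situation before tar fails is to look at the first bytes of the file. A gzip stream always starts with the two bytes 0x1f 0x8b, while an HTML page saved under a .tgz name does not. The following is only a minimal sketch of that check; the helper names is_gzip and extract_if_valid are made up for illustration and are not part of Spark, wget, or Colab:

```python
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def extract_if_valid(path, dest="."):
    """Extract a .tgz archive into `dest`.

    Hypothetical helper: raises ValueError with the first bytes of the file
    when it is not gzip data (for example an HTML page saved as .tgz).
    """
    if not is_gzip(path):
        with open(path, "rb") as f:
            head = f.read(60)
        raise ValueError(f"{path} is not a gzip archive; it starts with {head!r}")
    with tarfile.open(path, "r:gz") as archive:
        archive.extractall(dest)
```

In a Colab cell, calling extract_if_valid("spark-3.3.0-bin-hadoop3.tgz") would then either unpack the archive or show the beginning of the file. If that beginning looks like HTML, the stale file is the one to delete before downloading again, or the copy that wget saved with the .1 suffix is the one to extract.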