PySpark3 read file from https url

Is there a way in PySpark to read a .tsv.gz from a URL?

from pyspark.sql import SparkSession

def create_spark_session():
    return SparkSession.builder.appName("wikipediaClickstream").getOrCreate()

spark = create_spark_session()
url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
# df = spark.read.csv(url, sep="\t") # doesn't work
df = spark.read.option("sep", "\t").csv(url) # doesn't work either
df.show(10)

I get the following error:

Py4JJavaError: An error occurred while calling o65.csv.
: java.lang.UnsupportedOperationException
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/var/folders/sn/4dk4tbz9735crf4npgcnlt8r0000gn/T/ipykernel_1443/4137722240.py in <module>
      1 url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
      2 # df = spark.read.csv(url, sep="\t")
----> 3 df = spark.read.option("sep", "\t").csv(url)
      4 df.show(10)

spark.version is 3.1.2

CodePudding user response：

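Your problem is likely that .csv() is not expecting a URL. Spark resolves paths through Hadoop filesystems (file://, hdfs://, s3a://, ...) rather than downloading over HTTP. You'll need to first:

download the file, and
optionally unzip it (.gz is a gzip-compressed file; Spark can read .gz files directly, so this step is not required)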

Looks like you already know how to handle tab-separated files (as hinted by the .tsv extension).
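
If you do want an uncompressed copy on disk, the unzip step can be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: the helper name gunzip and both paths are hypothetical, and it assumes the .gz file has already been downloaded.

```python
import gzip
import shutil


def gunzip(src, dest):
    """Decompress the gzip file at src into dest, streaming in chunks.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        # copyfileobj streams the data, so the whole file is never held in memory
        shutil.copyfileobj(compressed, out)
    return dest


# Example call (paths are assumptions):
# gunzip("/tmp/wikipediadata/clickstream-jawiki-2017-11.tsv.gz",
#        "/tmp/wikipediadata/clickstream-jawiki-2017-11.tsv")
```

The resulting .tsv can then be passed to spark.read.option("sep", "\t").csv(...) like any other local file.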

CodePudding user response：

You need to download the file to a local location and read it from there using Spark. If you are running on a cluster (e.g. with HDFS), you need to put the file at an HDFS location instead, so that every executor can reach it.

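import os
import wget

url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
local_path = '/tmp/wikipediadata/clickstream-jawiki-2017-11.tsv.gz'
# Make sure the target directory exists before downloading into it
os.makedirs(os.path.dirname(local_path), exist_ok=True)
wget.download(url, local_path)

# Spark decompresses .gz files on its own, based on the file extension
df = spark.read.option("sep", "\t").csv('file://' + local_path)
df.show(10)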
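
If you would rather not install the third-party wget package, the same download-then-read approach can be sketched with the standard library. This is a minimal sketch under a few assumptions: the helper names download and read_tsv are hypothetical, the path is on a POSIX filesystem (so that 'file://' plus an absolute path is a valid URI), and Spark runs in local mode or the path is visible to every executor.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream url to dest, creating parent directories; return dest as an absolute path.

    Hypothetical helper, not part of PySpark.
    """
    dest_path = Path(dest).resolve()
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Stream in chunks so large dumps are not loaded into memory at once
        shutil.copyfileobj(response, out)
    return str(dest_path)


def read_tsv(spark, local_path):
    """Read a tab-separated file (gzip-compressed or not) from the local filesystem."""
    return spark.read.option("sep", "\t").csv("file://" + local_path)


# Example usage, assuming `spark` is the SparkSession from the question:
# url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
# local_path = download(url, "/tmp/wikipediadata/clickstream-jawiki-2017-11.tsv.gz")
# df = read_tsv(spark, local_path)
# df.show(10)
```

Keeping the download separate from the read makes it easy to swap in a copy to HDFS in between when running on a cluster.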