How to read the partition columns of a Spark stream


I have a Spark streaming job where I stream data, partition it by one or more columns, and store it in a GCS bucket. Below is sample code where I partition by team and write to the bucket.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


temp = spark.createDataFrame([
    (0, "team1", 100),
    (1, "team2", 200),
    (2, "team3", 300),
    (3, "team4", 400)
], ["id", "team", "count"])

temp.writeStream.format('parquet').outputMode('append') \
    .option('path', 'gs://').partitionBy('team').start()

When reading the stream back from the GCS bucket, I do not get the "team" column back:

df = spark.readStream.option('basePath', 'gs://').json('path')

CodePudding user response:

First, you don't have a streaming DataFrame in the code shown; createDataFrame returns a static DataFrame, so calling writeStream on it will fail. Use the batch write and read functions directly.
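
For example, a minimal batch-write sketch, assuming the same temp DataFrame from the question (gs://your-bucket/output is a hypothetical path; substitute your own bucket):

# Batch write, partitioned by team; each team value becomes a
# team=<value> subdirectory under the output path.
temp.write \
    .mode('append') \
    .partitionBy('team') \
    .parquet('gs://your-bucket/output')  # hypothetical bucket path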

You only need to read the base path; the partitions will be discovered automatically. The partition column lives in the team=<value> directory names, not inside the data files, so it is restored as a column only when Spark performs that discovery.

Also, don't use json() to read Parquet data; match the reader to the format you wrote.
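
Putting both points together, a sketch of the read side, assuming the same hypothetical bucket path as above:

# Read the base path with the matching Parquet reader; Spark discovers
# the team=<value> directories and restores "team" as a regular column.
df = spark.read.parquet('gs://your-bucket/output')
df.printSchema()  # id, count, and the partition column team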
