I have a Spark Structured Streaming job that streams data, partitions it by one or more columns, and stores it in a GCS bucket. Below is sample code where I partition by "team" and write to the GCS bucket.
from pyspark import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
temp = spark.createDataFrame([
    (0, "team1", 100),
    (1, "team2", 200),
    (2, "team3", 300),
    (3, "team4", 400)
], ["id", "team", "count"])

temp.writeStream.format('parquet').outputMode('append').option('path', 'gs://').partitionBy('team').start()
However, when I read the stream back from the GCS bucket, I do not get the "team" column:
df = spark.readStream.option('basePath','gs://').json('path')
CodePudding user response:
First, you don't have a streaming DataFrame in the code shown: createDataFrame produces a static (batch) DataFrame, so calling writeStream on it raises an AnalysisException. Just use the batch write and read functions directly.
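For example, a minimal batch version of the write could look like this (gs://your-bucket/output is a placeholder path, not from the original post):

# Batch write: partition the data by "team" and write Parquet files to GCS
temp.write.mode('append').partitionBy('team').parquet('gs://your-bucket/output')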
You only need to read the base path; the team=... partition directories will be discovered automatically and "team" will come back as a column.
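A batch read of the same placeholder path would then restore the partition column:

# Batch read: Spark scans the team=... subdirectories under the base path
# and adds "team" back as a regular column
df = spark.read.parquet('gs://your-bucket/output')
df.printSchema()  # id, count, and the discovered partition column "team"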
Also, don't use json() to read Parquet data; use parquet() instead.
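If your real job does read from a streaming source, note that a file-based readStream needs an explicit schema and the Parquet file sink needs a checkpoint location. A minimal sketch, with assumed input, output, and checkpoint paths:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

# Streaming file sources require the schema to be supplied up front
schema = StructType([
    StructField('id', IntegerType()),
    StructField('team', StringType()),
    StructField('count', LongType()),
])

stream_df = spark.readStream.schema(schema).json('gs://your-bucket/input')

query = (stream_df.writeStream
         .format('parquet')
         .outputMode('append')
         .option('path', 'gs://your-bucket/output')
         .option('checkpointLocation', 'gs://your-bucket/checkpoints')  # required by file sinks
         .partitionBy('team')
         .start())

Reading the output back with spark.read.parquet('gs://your-bucket/output') will again recover the "team" partition column.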