We are currently working on real-time data feeds carrying JSON data.
While reading the examples at https://sparkbyexamples.com/spark/spark-streaming-with-kafka/,
it looks like we need a schema for the Kafka JSON messages.
Is there any way to process the data without defining a schema?
CodePudding user response:
Try the code below after starting ZooKeeper, the Kafka server, and the other required services.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()  # use "earliest" to read the topic from the beginning
print("Printing schema of df:")
df.printSchema()
# Deserialize the binary Kafka value column into a string
transaction_detail_df1 = df.selectExpr("CAST(value AS STRING)")

trans_detail_write_stream = transaction_detail_df1 \
    .writeStream \
    .trigger(processingTime='2 seconds') \
    .option("truncate", "false") \
    .format("console") \
    .start()

trans_detail_write_stream.awaitTermination()
Just change the basic configuration (bootstrap servers and topic name) and you should be able to see the output.
CodePudding user response:
You can use the get_json_object Spark SQL function to extract fields from JSON string data without defining any additional schema.
To deserialize the binary key/value, simply use the cast function, as the example above shows.