I have a Parquet file made with PySpark that contains a time column with values from 0 to x seconds. I want to stream each row to a Kafka topic when its time comes.
For example, when the stream starts I want to send all rows with time 0; each subsequent row is sent when the time elapsed since the start of the stream equals the value in that row's time cell, and so on until the last row is streamed.
I can think of a way to do this in PySpark, but it is inefficient enough that multiple rows can end up being sent at once even though their times are not equal.
Is there any other framework or library that can achieve this task well?
CodePudding user response:
No Kafka framework that I've personally seen has such a feature, out of the box.
You'll need to iterate over the dataframe and store the rows somewhere else, such as an in-memory priority queue, which keeps the data sorted by time. Then you start some timer loop that periodically polls that structure and produces any rows whose time has arrived; a sketch of this is below.
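Here is a minimal sketch of that approach, assuming pandas, the standard-library `heapq` and `time` modules, and confluent-kafka are available; the file path `events.parquet`, the topic name `events`, and the `time` column name are all placeholders.

```python
import heapq
import time

import pandas as pd
from confluent_kafka import Producer

df = pd.read_parquet("events.parquet")  # hypothetical file path

# Build a min-heap keyed on the time column so rows pop in time order.
# The row index is included as a tiebreaker for rows with equal times.
heap = []
for i, row in df.iterrows():
    heapq.heappush(heap, (row["time"], i, row.to_json()))

producer = Producer({"bootstrap.servers": "localhost:9092"})
start = time.monotonic()

while heap:
    send_at, _, payload = heap[0]
    elapsed = time.monotonic() - start
    if elapsed < send_at:
        # Periodically poll the queue until the head row is due.
        time.sleep(min(send_at - elapsed, 0.1))
        continue
    heapq.heappop(heap)
    producer.produce("events", value=payload)

producer.flush()
```

Rows sharing the same time value all fall due on the same pass through the loop, which matches the "send all rows with time 0 at once" requirement.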
Alternatively, store the data into a database, then run a periodic query like `time BETWEEN last_poll AND NOW()`, where `last_poll` comes from another database field, similar to Kafka offset storage. You might be able to do the same with Apache Drill (or Hive/Presto) on Parquet files, assuming Pandas wouldn't work.
Related question - Python Pandas - Simulate Streaming to Kafka