I am new to Apache Flink(and stackoverflow), and I wanted to know the best practice for dealing with the following scenario:
I am currently consuming real-time message using a KafkaSource from someone else's application. Some of these messages will need to undergo a transformation if the keys in these messages exist in a local database that I have created and have access to. This transformed message then needs to be sent to a KafkaSink one by one.
In order to check if a message needs to be transformed, I need to see if the key for that specific message exists in my local database (I have to query my local database for each message to check for its key).
What is an efficient way to do this?
I have 2 ideas:
Open a connection to the local database and perform a query to see if the record exists in my local database for that message. Repeat this for each message in the stream.
Extend the flink RichSinkFunction and open a connection through that and use the invoke method to perform the query. Use this RichSink to repeat this for each message in the stream.
Performance Concern: I only want to open a connection to the local database once. I think Method #1 would open and close a connection per message while Method #2 would open and close a connection only once.
More generally, is it appropriate to create a RichSink to just run some queries in your local database for read purposes? I am not going to be using this RichSink to actually write any data to the local database.
Thanks!
CodePudding user response:
The preferred approach to access external systems from Flink is to use an AsyncFunction
: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/asyncio/
That is, if your database can handle the load and be fast enough to keep up with the stream throughput. If not, you'll want to implement some kind of CDC stream from your database and store its contents locally as Flink state. Then, have a ConnectedStream
so they both can share state in a CoMap
or CoFlatMap
operator.
CodePudding user response:
ConnectedStream
and AsyncFunction
is the preferred way of approaching this kind of problems.
In case you don't have access to all Flink abstractions (like if you have some existing framework on top of Flink) but you can instantiate FlatMapFunction
you can resort to RichFlatMapFunction
- you'd maintain just a few connection to database this way if you use open
method to instantiate it.