Let's say we are running an online shop and have a Product
table in our database. This represents products that exist:
| productId | brandId (FK) | name | description    | capacity | pricePerNight |
|-----------|--------------|------|----------------|----------|---------------|
| 1         | 2            | Tent | This is a tent | 4        | 20            |
Now let's say I'm making a public API which allows external companies' clients to query available products in my store based on a number of parameters. This public API gets thousands of calls a week.
I want to track how many times a product (e.g. the tent) has been returned by this public API and I want to see how this number changes over time.
My solution:
Implement a table (e.g. `View`) and insert a new row (asynchronously) every time a product is returned:
| productId (FK) | date       |
|----------------|------------|
| 1              | 24/03/2021 |
| 1              | 26/03/2021 |
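Roughly, the insert I have in mind would look something like this (a sketch in Python using mysql-connector-python; the connection details are placeholders):

```python
# Sketch of the per-request insert into the View table described above.
# Connection details are placeholders, not a real configuration.
import datetime

import mysql.connector  # pip install mysql-connector-python


def record_product_view(product_id: int) -> None:
    """Insert one row into View for a product returned by the public API."""
    conn = mysql.connector.connect(
        host="localhost",   # placeholder
        user="shop",        # placeholder
        password="secret",  # placeholder
        database="shop",    # placeholder
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO `View` (productId, date) VALUES (%s, %s)",
            (product_id, datetime.date.today()),
        )
        conn.commit()
    finally:
        conn.close()
```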
The problem I can see here is that the table would quickly become very large (potentially 100,000s of new rows every week). I'm unsure how this would impact overall DB performance (MySQL), but it doesn't seem scalable.
Can you suggest a better solution? Thank you!
CodePudding user response:
Since the purpose of this is to feed offline analytics, it's probably best to have your API publish a stream of events, each containing a product ID and a timestamp. A consumer of that stream can then track aggregations etc. in a datastore separate from your main service's MySQL (Cassandra, DynamoDB, Cosmos, Elasticsearch, some time-series DB, ...).
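To make that concrete, the consumer side can be very simple. A rough sketch using kafka-python, a hypothetical `product-views` topic, and an in-memory counter standing in for whatever datastore you pick:

```python
# Sketch of a stream consumer aggregating view counts per product per day.
# Assumes kafka-python and a hypothetical "product-views" topic carrying JSON
# events like {"productId": 1, "ts": "2021-03-24T10:15:00Z"}.
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "product-views",
    bootstrap_servers="localhost:9092",  # placeholder
    group_id="view-stats",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Key: (productId, day), value: number of times the product was returned that day.
daily_counts: Counter = Counter()

for message in consumer:
    event = message.value
    day = event["ts"][:10]  # e.g. "2021-03-24"
    daily_counts[(event["productId"], day)] += 1
    # In practice you'd periodically flush these counts to Cassandra,
    # Elasticsearch, a time-series DB, etc. rather than keep them in memory.
```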
As to what "publishing a stream" entails, it could be as simple as another service with an HTTP/gRPC/whatever endpoint. Publishing a message to a Kafka/Pulsar topic or some MQ would also work. Even a log message which gets picked up by Filebeat and grokked by Logstash could be viable.
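For instance, publishing to a Kafka topic might look something like this (a sketch with kafka-python; the topic name, broker address, and event shape are all assumptions):

```python
# Sketch of publishing one "product returned" event to a Kafka topic.
# Topic name, broker address, and event shape are assumptions, not a spec.
import datetime
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)


def publish_product_view(product_id: int) -> None:
    event = {
        "productId": product_id,
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
    }
    producer.send("product-views", value=event)  # async; batched by the client
```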
One important thing to note: this is likely a place where you should consider ignoring failures (i.e. "fire and forget" publishing). Assuming the users querying this API are important for your business, does it really make sense to fail their request because you couldn't record things for later analytics? For the same reason, it's almost certainly best to do this asynchronously, impacting the hot path of serving the API response only as long as needed to schedule a task to publish the event(s). You can treat the inevitable dropped events as basically a form of random sampling: chances are, nobody cares that the tent actually came up in 1000 requests while the statistics service reports 997 (and if a business person pushes back, ask them whether 100% accuracy from the statistics service with a 0.3% failure rate on the public API is preferable to a 0% failure rate on the public API with 99.7% accuracy from the statistics service...).
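In code, that "fire and forget" attitude can be as simple as scheduling the publish on a background executor and swallowing any errors. A sketch, reusing the hypothetical `publish_product_view` from the previous snippet:

```python
# Sketch of fire-and-forget publishing from the API hot path: schedule the
# publish on a background thread pool and never let a failure there bubble
# up into the user's response. Reuses the hypothetical publish_product_view.
import logging
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=2)
log = logging.getLogger("view-events")


def _publish_safely(product_id: int) -> None:
    try:
        publish_product_view(product_id)
    except Exception:
        # Dropping the event is acceptable; it just becomes part of the
        # "random sampling" described above.
        log.warning("dropped view event for product %s", product_id, exc_info=True)


def handle_search_request(matching_product_ids: list[int]) -> list[int]:
    # ... build the actual API response from MySQL as usual ...
    for pid in matching_product_ids:
        _executor.submit(_publish_safely, pid)  # returns immediately
    return matching_product_ids
```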