Highload data update architecture


I'm developing a parcel tracking system and thinking about how to improve its performance.

Right now we have one table in Postgres named parcels containing things like id, last known position, etc.

Every day about 300,000 new parcels are added to this table. The parcel data is taken from an external API. We need to track all parcel positions as accurately as possible and reduce the time between API calls for a specific parcel.

Given these requirements, what would you suggest for the project architecture?

Right now the only solution I can think of is the producer-consumer pattern: one process selects all records from the parcels table in an infinite loop and then distributes data-fetching tasks with something like Celery.
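
A rough sketch of what I mean (the queue name, connection strings and the task body are placeholders; assumes Celery with a Redis broker and psycopg2 for Postgres):

```python
import time

import psycopg2
from celery import Celery

app = Celery("tracker", broker="redis://localhost:6379/0")


@app.task
def fetch_parcel_position(parcel_id):
    # Here: call the external tracking API for this parcel
    # and UPDATE the corresponding row in the parcels table.
    pass


def produce_forever():
    """Producer: endlessly re-enqueue a fetch task for every known parcel."""
    conn = psycopg2.connect("dbname=tracker")
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT id FROM parcels")
            for (parcel_id,) in cur:
                fetch_parcel_position.delay(parcel_id)
        time.sleep(60)  # naive pacing; nothing here bounds the queue size
```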

Major downsides of this solution are:

  • possible deadlocks, as fetching data for the same parcel can be executed at the same time on different machines.
  • the need to control the queue size

CodePudding user response:

This is a very broad topic, but I can give you a few pointers. Once you reach the limits of vertical scaling (scaling by picking more powerful machines) you have to scale horizontally (scaling by adding more machines to the same task). So to design scalable architectures you have to learn about distributed systems. Here are some topics to look into:

  • Infrastructure and processes for hosting distributed systems, such as Kubernetes, containers, and CI/CD.
  • Scalable forms of persistence, for example different kinds of distributed NoSQL such as key-value stores, wide-column stores, in-memory databases, and newer scalable SQL stores.
  • Scalable forms of data flow and processing, for example event-driven architectures using distributed message/event queues.

For your specific problem with parcels I would recommend considering a key-value store for your position data. Those can scale to billions of insertions and retrievals per day (when querying by key).
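
As a minimal sketch of that access pattern, assuming Redis as the key-value store (the key naming and payload shape are made up; any store with get/set by key works the same way):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def update_position(parcel_id: str, lat: float, lon: float, ts: str) -> None:
    # One write per position update, keyed by parcel id.
    r.set(f"parcel:{parcel_id}:pos", json.dumps({"lat": lat, "lon": lon, "ts": ts}))


def last_position(parcel_id: str):
    # One read by key -- this is the access pattern that scales well.
    raw = r.get(f"parcel:{parcel_id}:pos")
    return json.loads(raw) if raw else None
```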

It also sounds like your data is somewhat temporary and could be kept in in-memory hot storage while the parcel has not yet been delivered (and archived afterwards). A distributed in-memory DB can scale even further in terms of insertions and queries.
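
A sketch of that lifecycle, again assuming Redis; the TTL value, key names and the archive hook are assumptions:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

HOT_TTL = 30 * 24 * 3600  # assumes parcels rarely stay in transit for more than 30 days


def track(parcel_id: str, position: dict) -> None:
    # Refresh the value and its TTL on every position update.
    r.set(f"parcel:{parcel_id}", json.dumps(position), ex=HOT_TTL)


def mark_delivered(parcel_id: str, archive) -> None:
    # On delivery, copy the record from hot storage to the archive
    # (e.g. a Postgres table or object storage) and drop the hot copy.
    raw = r.get(f"parcel:{parcel_id}")
    if raw is not None:
        archive(parcel_id, json.loads(raw))
    r.delete(f"parcel:{parcel_id}")
```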

You probably also want to decouple data extraction (through your API) from processing and persistence. For that you could consider introducing stream-processing systems.
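
For example, the extraction side could just publish raw API results to a topic and let separate consumers persist them at their own pace. Here is a sketch assuming Kafka via kafka-python (topic name, brokers and message shape are made up; other message queues work similarly):

```python
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def on_api_result(parcel_id: str, position: dict) -> None:
    # Extraction side: publish the raw API result and move on.
    producer.send("parcel-positions", {"parcel_id": parcel_id, **position})


def run_processor() -> None:
    # Processing side: consume at its own pace, then enrich and persist.
    consumer = KafkaConsumer(
        "parcel-positions",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        group_id="position-writers",
    )
    for message in consumer:
        event = message.value
        # Write the event to the key-value store / archive here.
        print(event["parcel_id"])
```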
