Tip data structure in elasticsearch

Time:04-29

I need to create a product catalog covering different providers and different shop zones.

A provider's product can have a different price in each shop zone.

In summary: Provider - Zones - Products

I have about 20 providers, and each of them can have about 50 or 60 zones at most. Each zone can have about 20,000 products at most. The products are the same in every zone but can have different prices.

I have doubts about how to store the information. I need to update the prices daily for each provider and zone, but not for all zones at the same time. I also need to search for products or categories and show only the prices for the relevant stores. The most common queries are: list the products of a category with the prices of the chosen zone, or show the details of a product with the price of the chosen zone.

I am thinking of different scenarios to store the data.

Scenario A - Index Provider X

Create an index for each provider, with one document per product and the prices of each zone in a nested object.

"id" : 53457,
"categories": [5563,5686],
"description": "bla bla bla",
....,
"zones": [ {"id": 259, "price": 4.55}, {"id": 260, "price": 4.45}]

Advantages:

  • Few indices.
  • No redundant information stored
  • Easier information maintenance.

Disadvantages:

  • Updating and searching prices is more complex and may perform worse.

Scenario B - Index for Provider zone X

Create an index for every zone.

"id" : 53457,
"categories": [5563,5686],
"description": "bla bla bla",
....,
"price": 4.55

Advantages:

  • Simple way to update prices and obtain the product catalog of a store.

Disadvantages:

  • Redundant information in each index.
  • Complex information maintenance.
  • Many indices.

Could someone recommend which scenario to choose, or suggest an alternative?

CodePudding user response:

In the NoSQL world in general, and Elasticsearch in particular, "redundant" is not necessarily, if ever, deemed a disadvantage, as denormalization is key. So that would go in favor of option B, but it's not the whole story and pragmatism should prevail, because you know... it depends.

Also, whether you have few or many indices is not necessarily a problem either if they are designed correctly; it always depends on the use case and how much effort is invested in the data architecture design. With option A, you'd have 20 indices containing 1.2M documents each, and with option B, you'd have ~1K indices with 20K documents each. I'm not sure about your average document size and cluster architecture, but it seems option B would be slightly less efficient given the queries you'll likely be running.
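The figures above can be reproduced with some back-of-the-envelope arithmetic. The sketch below uses the upper bounds from the question (20 providers, 60 zones, 20,000 products); the function names are only illustrative. For option A it assumes the zone prices are mapped as nested objects, in which case Elasticsearch stores each nested entry as its own hidden Lucene document, which is where the 1.2M per index comes from.

```python
# Upper bounds taken from the question.
PROVIDERS = 20
ZONES_PER_PROVIDER = 60       # "about 50 or 60 zones max"
PRODUCTS_PER_ZONE = 20_000    # "about 20.000 products max"


def option_a_sizing():
    """One index per provider; one document per product, prices nested per zone."""
    indices = PROVIDERS
    product_docs_per_index = PRODUCTS_PER_ZONE
    # Assumption: "zones" is a nested field, so every zone price becomes
    # a separate hidden Lucene document inside the provider's index.
    nested_docs_per_index = PRODUCTS_PER_ZONE * ZONES_PER_PROVIDER
    return indices, product_docs_per_index, nested_docs_per_index


def option_b_sizing():
    """One index per provider zone; one flat document per product."""
    indices = PROVIDERS * ZONES_PER_PROVIDER
    docs_per_index = PRODUCTS_PER_ZONE
    total_docs = indices * docs_per_index
    return indices, docs_per_index, total_docs


if __name__ == "__main__":
    print("Option A:", option_a_sizing())
    print("Option B:", option_b_sizing())
```

With these bounds, option B comes out at 1,200 indices and 24 million documents in total, which is roughly the "~1K indices" and "25M documents" mentioned in this answer.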

Your queries will need to run on all indices all the time, so the fewer indices the better, unless you have a huge cluster with ample resources; for only 25M documents, I don't think that's the case. So given the information you've shared above, I'd go with option A first.
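As a rough sketch of what option A could look like, the following builds an index mapping with a nested "zones" field and the request body for the most common query from the question (products of a category with the price of the chosen zone). The index name "provider-x", the field types, and the helper name are assumptions for illustration; the bodies are plain dictionaries that would be sent to the create-index and search APIs with whatever client you use.

```python
import json

# Hypothetical index name for one provider (not from the question).
INDEX_NAME = "provider-x"

# Assumed mapping: field types are guesses based on the sample document.
MAPPING = {
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "categories": {"type": "integer"},
            "description": {"type": "text"},
            "zones": {
                "type": "nested",  # keeps each zone id paired with its own price
                "properties": {
                    "id": {"type": "integer"},
                    "price": {"type": "float"},
                },
            },
        }
    }
}


def products_in_category_for_zone(category_id, zone_id, size=20):
    """Search body: products in a category that have a price in the given zone."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"categories": category_id}},
                    {
                        "nested": {
                            "path": "zones",
                            "query": {"term": {"zones.id": zone_id}},
                            # inner_hits returns the matching zone entry,
                            # so the price of the chosen zone is easy to read.
                            "inner_hits": {"size": 1},
                        }
                    },
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(products_in_category_for_zone(5563, 259), indent=2))
```

Looking up a single product with the price of a zone would follow the same pattern, replacing the category filter with a term filter on "id".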

Also keep in mind that your first priority is to make it easy for your users to find products, not for you to update your documents, so faster searches are more important than faster indexing, especially if you're only updating your documents once or twice a day.
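For the daily price refresh of a single zone under option A, one possible approach is a bulk request of scripted updates that rewrite only the matching entry in "zones". The sketch below just builds the newline-delimited body for the bulk API; it assumes that the document _id equals the product id and that the index is named "provider-x", neither of which comes from the question. Keep in mind that an update still reindexes the whole product document, including all of its nested zone entries.

```python
import json

# Painless script: set the price of the zone entry whose id matches.
PRICE_SCRIPT = (
    "for (zone in ctx._source.zones) {"
    " if (zone.id == params.zone_id) { zone.price = params.price; }"
    " }"
)


def build_bulk_price_update(index_name, zone_id, new_prices):
    """Build an NDJSON bulk body updating one zone's price for many products.

    new_prices maps product id -> new price for the given zone.
    Assumption: each document's _id is the product id as a string.
    """
    lines = []
    for product_id, price in new_prices.items():
        # Action line: which document to update.
        lines.append(json.dumps(
            {"update": {"_index": index_name, "_id": str(product_id)}}
        ))
        # Payload line: the script and its parameters.
        lines.append(json.dumps({
            "script": {
                "source": PRICE_SCRIPT,
                "lang": "painless",
                "params": {"zone_id": zone_id, "price": price},
            }
        }))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_price_update("provider-x", 259, {53457: 4.60}), end="")
```

Since only some zones change on a given day, the same function can be called once per updated zone, batching the products of that zone into one or a few bulk requests.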
