searching Mysql table with Elasticsearch-CodePudding

Lets say I have the following "expenses" MySQL Table:

id	amount	vendor	tag
1	100	google	foo
2	450	GitHub	bar
3	22	GitLab	fizz
4	75	AWS	buzz

I'm building an API that should return expenses based on partial "vendor" or "tag" filters, so vendor="Git" should return records 2&3, and tag="zz" should return records 3&4. I was thinking of utilizing elasticsearch capabilities, but I'm not sure the correct way..
most articles I read suggest replicating the table records (using logstash pipe or other methods) to elastic index.

So my API doesn't even query the DB and return an array of documents directly from ES?

Is this considered good practice? replicating the whole table to elastic? What about table relations... What If I want to filter by nested table relation?...

CodePudding user response：

So my API doesn't even query the DB and return an array of documents directly from ES?

Yes, As you are doing query to elasticsearch, you will get result only from Elasticsearch. Another way is, just get id from Elasticsearch and use id to retrive documeents from MySQL, but this might impact response time.

Is this considered good practice? replicating the whole table to elastic? What about table relations... What If I want to filter by nested table relation?...

It is not about good practice or bad practice, it is all about what type of functionality and use case you want to implement and based on that technology stack can be used and data can be duplicated. There is lots of company using Elasticsearch as secondary data source where they have duplicated data just because there usecase is best fit with Elasticsearh or other NoSQL db.

Elasticsearch is NoSQL DB and it is not mantain any relationship between data. Hence, you need to denormalize your data before indexing to the Elasticsearch. You can read this article for more about denormalizetion and why it is required.

ElasticSearch provide Nested and Join data type for parent child relationship but both have some limitation and performance impact.

Below is what they have mentioned for join field type:

The join field shouldn’t be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance. It can also trigger global ordinals to be built.

Below is what they have mentioned for nested field type:

When ingesting key-value pairs with a large, arbitrary set of keys, you might consider modeling each key-value pair as its own nested document with key and value fields. Instead, consider using the flattened data type, which maps an entire object as a single field and allows for simple searches over its contents. Nested documents and queries are typically expensive, so using the flattened data type for this use case is a better option.

most articles I read suggest replicating the table records (using logstash pipe or other methods) to elastic index.

Yes, You can use logstash or any language client like java, python etc, to sync data from DB to Elasticsearch. You can check this SO answer for more information on this.

Your Search Requirements

If you go ahead with Elasticsearch then you can use N-Gram Tokenizer or Regex Query and achieve your search requirements.