Match multiple tokens with proximity search between them

I have a large corpus of texts (100k) and a set of n-grams to search for. For example:

Query: get all texts containing the tokens ['united', 'airlines']

I want to retrieve only texts that contain both tokens ('united' and 'airlines'), and I also want the distance between them, in either order (united -> airlines or airlines -> united), to be at most K positions. Let's say K = 2.

My current query is:

  query = {
      "size": limit,
      "query": {
          "query_string": {"query": query,
                           "phrase_slop":2,
                           "default_operator":"AND"}
      }
  }

But this does not seem to be the right approach, because I am getting results with more than 2 positions (tokens) between the two tokens.

Any ideas?

CodePudding user response：

I have found the answer to my question:

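With query_string queries in Elasticsearch, we can use proximity search by wrapping the words in double quotes and appending ~k, where k is the maximum number of positions the words of the phrase may be apart (the slop). Note that this is not an edit distance; ~k means fuzzy edit distance only when it follows a single term. As far as I can tell, the original query failed because phrase_slop only applies to quoted phrases: without quotes, united airlines is parsed as two separate terms combined with AND, so no distance limit is enforced.

Applying proximity search to the query from the question: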

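  query = {
      "size": limit,
      "query": {
          # The phrase must be quoted inside the query string, so the inner
          # double quotes are escaped. The explicit ~2 sets the slop for this
          # phrase; phrase_slop is only the default for phrases without ~k.
          "query_string": {"query": "\"united airlines\"~2",
                           "phrase_slop": 2,
                           "default_operator": "AND"}
      }
  }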
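If the tokens come from a list rather than a hard-coded string, the quoted phrase can be assembled programmatically. The following is only a minimal sketch: the helper name `proximity_query` and its parameters are hypothetical (not part of any Elasticsearch client), and it assumes the tokens should be searched in the index's default fields, as in the query above.

```python
import json


def proximity_query(tokens, k, limit=10):
    """Build a query_string body that matches `tokens` as a phrase with slop `k`.

    Hypothetical helper for illustration; not part of any Elasticsearch client.
    """
    # Escape backslashes and double quotes so a token cannot terminate
    # the quoted phrase early.
    escaped = [t.replace("\\", "\\\\").replace('"', '\\"') for t in tokens]
    # Quoted phrase followed by ~k, e.g. "united airlines"~2
    phrase = '"{}"~{}'.format(" ".join(escaped), k)
    return {
        "size": limit,
        "query": {"query_string": {"query": phrase}},
    }


body = proximity_query(["united", "airlines"], k=2)
print(json.dumps(body, indent=2))
```

The resulting dict can be passed as the request body of a search call in the same way as the hand-written query.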

More information can be found in the documentation
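
To double-check the hits that come back, the distance can also be verified locally. This is a rough sketch under stated assumptions: the function name `tokens_within` is hypothetical, the tokenizer is a naive lowercase split on non-word characters (the index analyzer may tokenize differently), and "distance" here means the number of other tokens between the two words, in either order. As far as I understand Lucene's slop, it charges extra for reversed order (swapping two adjacent words already needs a slop of 2), so this check is more permissive than `~k` for reversed matches.

```python
import re


def tokens_within(text, first, second, k):
    """Return True if `first` and `second` both occur in `text` with at most
    `k` other tokens between some pair of occurrences, in either order."""
    # Naive tokenizer (an assumption): lowercase, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    # abs(i - j) - 1 is the number of tokens strictly between the two positions.
    return any(abs(i - j) - 1 <= k for i in first_pos for j in second_pos)


print(tokens_within("United States based airlines", "united", "airlines", 2))
print(tokens_within("United is one of the largest airlines", "united", "airlines", 2))
```

The first call has two tokens between the words and the second has five, so only the first text satisfies K = 2 under this definition.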