I have a large corpus of texts (100k) and a set of n-grams. For example:
Query: get all texts containing the tokens ['united', 'airlines'].
I want to retrieve only texts that fully match both tokens ('united', 'airlines'), and I also want the distance between the tokens (united -> airlines or airlines -> united) to be at most K positions, say K = 2.
My current query is:
query = {
    "size": limit,
    "query": {
        "query_string": {
            "query": query,
            "phrase_slop": 2,
            "default_operator": "AND"
        }
    }
}
However, this does not seem to be the right approach, because I am getting results with more than 2 positions (tokens) between the two words.
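To see which returned texts actually break the constraint, a small client-side check can help. The following is only a sketch: the helper name within_k is hypothetical, it assumes simple whitespace tokenization, and it measures distance as the difference in token index, which is an assumption and not necessarily how Elasticsearch counts slop internally.

```python
def within_k(tokens, a, b, k):
    """Return True if some occurrence of `a` and some occurrence of `b`
    are at most `k` token positions apart, in either order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    # Compare every pair of occurrences; order of a and b does not matter.
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)


# Hypothetical sample texts, tokenized on whitespace (an assumption).
near = "airlines like united".split()
far = "united is one of the largest airlines".split()

print(within_k(near, "united", "airlines", 2))
print(within_k(far, "united", "airlines", 2))
```

Running this over the hits returned by the query makes it easy to list the texts where the two tokens are further apart than expected.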
Any idea?
CodePudding user response:
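I found the answer to my own question:
In Elasticsearch query_string queries, a proximity search is written by appending ~k to a quoted phrase, where k is the maximum edit distance of the words in the phrase, that is, how far apart or out of order they may be. As far as I can tell, phrase_slop only affects quoted phrases, so with unquoted terms and default_operator AND the query only requires both terms to be present, at any distance.
For the query in the main question, adding the proximity search gives: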
query = {
    "size": limit,
    "query": {
        "query_string": {
            # The phrase is quoted inside the query string and ~2 follows the closing quote
            "query": "\"united airlines\"~2",
            "phrase_slop": 2,
            "default_operator": "AND"
        }
    }
}
More information can be found in the Elasticsearch documentation on query string syntax (proximity searches).
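For more than one n-gram, the query body can be built programmatically. This is a minimal sketch, not an official recipe: build_proximity_query is a hypothetical helper, and it assumes the tokens contain no double quotes or other reserved query_string characters (those would need escaping first). It only builds the request body; sending it to the cluster is left out.

```python
import json


def build_proximity_query(tokens, k, limit=10):
    """Build a query_string body that matches the tokens as a phrase
    with proximity ~k (hypothetical helper, assumes plain tokens)."""
    # Join the tokens into one phrase, quote it, and append ~k.
    phrase = " ".join(tokens)
    return {
        "size": limit,
        "query": {
            "query_string": {
                "query": f'"{phrase}"~{k}',
                "default_operator": "AND",
            }
        },
    }


body = build_proximity_query(["united", "airlines"], k=2, limit=10)
print(json.dumps(body, indent=2))
```

The explicit ~k on the phrase makes phrase_slop unnecessary here, so the sketch leaves it out.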