I have indexed all Wikipedia pages on Elasticsearch, and now I would like to search through them according to a list of keywords that I have created. The documents on Elasticsearch have only three fields: id for the page id, title for the page title and content for the page content (already clean of Wikipedia markup).
My goal is to reproduce the MediaWiki query API as closely as possible, with parameters action=query and list=search. For instance, given the keywords "non riemannian metric spaces", a call to the API with those parameters gives a list of the most relevant pages for those keywords.
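For reference, here is a minimal sketch of how such a call could be assembled in Python. The English Wikipedia endpoint, the helper name build_search_url and the srlimit and format parameters are assumptions added for illustration; only action=query and list=search come from the description above, and srsearch is the parameter that list=search uses for the search text.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint; swap in another wiki if needed.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(keywords: str, limit: int = 10) -> str:
    """Build a MediaWiki full-text search URL (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keywords,  # the search text for list=search
        "srlimit": limit,      # assumption: number of hits to return
        "format": "json",      # assumption: ask for a JSON response
    }
    return f"{API_URL}?{urlencode(params)}"


url = build_search_url("non riemannian metric spaces")
print(url)
```

The URL can then be fetched with any HTTP client; the hits should appear under query.search in the JSON response.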
So far I have been using rather simple Elasticsearch search queries, for instance:
POST _search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": {
            "query": "non riemannian metric spaces"
          }
        }
      },
      "should": {
        "match": {
          "title": {
            "query": "non riemannian metric spaces",
            "boost": x
          }
        }
      }
    }
  }
}
for several values of boost, like 1, 2 or 0.5. This already gives some decent results, in the sense that the pages I obtain are relevant to the keywords, but sometimes they are not quite the same as the ones I get with the MediaWiki API.
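To make it easier to compare several boost values, the query body above can be generated programmatically. This is only a sketch: the function name build_query and the dictionary bodies are hypothetical, and sending the body to the cluster (for example with an Elasticsearch client) is left out.

```python
import json


def build_query(keywords: str, title_boost: float = 1.0) -> dict:
    """Return the bool query shown above with a configurable title boost."""
    return {
        "query": {
            "bool": {
                # The content field must match the keywords.
                "must": {"match": {"content": {"query": keywords}}},
                # A title match only adds to the score, weighted by the boost.
                "should": {
                    "match": {
                        "title": {"query": keywords, "boost": title_boost}
                    }
                },
            }
        }
    }


# One request body per boost value mentioned above.
bodies = {b: build_query("non riemannian metric spaces", b) for b in (0.5, 1, 2)}
print(json.dumps(bodies[2], indent=2))
```

Each body can be sent as the payload of POST _search, and the ranked ids compared against the list returned by the MediaWiki API.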
I would be glad to hear some suggestions on how to fine-tune the Elasticsearch query to mimic the MediaWiki API behavior more accurately. Alternatively, since the MediaWiki API itself is built on Elasticsearch and CirrusSearch, I would like to know whether the actual Elasticsearch query behind the endpoint above, with those specific parameters, is openly available.
Thank you in advance!
UPDATE (after Robis Koopmans' answer): Seeing the actual query with cirrusDumpQuery has indeed been very useful. I do however have some follow-up questions concerning the query:
- The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I do not know what .plain refers to. Does it have to do with Elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific MediaWiki mapping they use? In any case, I would appreciate some more information about this.
- At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related to the match_all clause?
- What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
- To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a Wikipedia page. What is that?
- And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
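For concreteness, the kind of multi_match clause mentioned in the first question can be written as follows. This is a sketch only: the helper name title_multi_match is hypothetical, the field list is the one quoted above, and whether a title.plain field exists in my own index depends entirely on its mapping.

```python
def title_multi_match(keywords: str,
                      fields=("title.plain^1", "title^3")) -> dict:
    """Build a multi_match clause over boosted fields (hypothetical helper).

    The "^n" suffix is the per-field boost; "title.plain" is only
    searchable if the index mapping actually defines such a field.
    """
    return {"multi_match": {"query": keywords, "fields": list(fields)}}


clause = title_multi_match("non riemannian metric spaces")
print(clause)
```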
Please feel free to answer whichever question you know the answer to, and again thanks a lot!
Answer:
You can add cirrusDumpQuery to your query.
example:
More information: https://www.mediawiki.org/wiki/Extension:CirrusSearch#API
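A rough sketch of what this could look like in Python follows. The English Wikipedia endpoint, the helper name build_dump_query_url and the use of an empty value for the flag are assumptions for illustration; the search parameters are the ones from the question.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_dump_query_url(keywords: str) -> str:
    """Build a search URL that carries the cirrusDumpQuery flag."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keywords,
        # Assumption: the flag is passed with an empty value; with it,
        # CirrusSearch returns the query it would run, not the results.
        "cirrusDumpQuery": "",
    }
    return f"{API_URL}?{urlencode(params)}"


dump_url = build_dump_query_url("non riemannian metric spaces")
print(dump_url)
```

Opening the printed URL in a browser should show the generated Elasticsearch query as JSON.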
Answer:
You can't make Elasticsearch queries to Wikipedia directly, but CirrusSearch can generate many types of queries beyond full-text search. It's not clear from your question exactly what type of query you are looking for, but it might be worth looking at the sorting options if you prefer to weight results by text similarity only, and not by things like page views.
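As a sketch of the sorting idea, the search API takes an srsort parameter. To my understanding the value just_match ranks on text match alone, but please treat that value, the endpoint and the helper name build_sorted_search_url as assumptions and check the API help page of the wiki you are querying.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_sorted_search_url(keywords: str, sort: str = "just_match") -> str:
    """Build a search URL with an explicit sort order (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keywords,
        # Assumption: "just_match" sorts by text match only; the default
        # is "relevance", which also uses other ranking signals.
        "srsort": sort,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


sorted_url = build_sorted_search_url("non riemannian metric spaces")
print(sorted_url)
```

Results obtained this way may be closer to what a plain match query on your own index returns, since signals such as popularity would no longer enter the ranking.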