Elasticsearch analyzers not working on queries


So I am running Elasticsearch and Kibana locally on ports 9200 and 5601 respectively. I am attempting to process a JSONL file into Elasticsearch documents and apply an analyzer to some of the fields.

This is the body:

body = {
  "mappings": {
    "testdoc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "stop"
        },
        "content": {
          "type": "text",
          "analyzer": "stop"
        }
      }
    }
  }
}

I then create a new index (and I am deleting the index between tests, so I know it's not a stale index with an old mapping):

es.indices.create("testindex", body=body)
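
For completeness, the delete-and-recreate step can look something like this (a minimal sketch; the ignore=[400, 404] option is an assumption about how the "index not found" error is suppressed on the first run):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Drop any previous copy so the new mapping is applied to a fresh index;
# ignore=[400, 404] keeps the call from failing when the index does not exist yet.
es.indices.delete(index="testindex", ignore=[400, 404])
es.indices.create(index="testindex", body=body)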

I then parse my JSONL file into documents and upload them to Elasticsearch using:

helpers.bulk(es, documents, index="testindex", doc_type="testdoc")
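
For context, documents is built from the JSONL file along these lines (a sketch; the docs.jsonl file name is a placeholder, and each line is assumed to be one JSON object with the title/content fields from the mapping):

import json

# One source dict per JSONL line; helpers.bulk indexes each of them,
# using the index/doc_type arguments above as defaults for the whole batch.
with open("docs.jsonl") as f:
    documents = [json.loads(line) for line in f if line.strip()]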

Finally I query like this

q = {"query": { "match-all": {}}}
print(es.search(index="testindex", body="query")

My result for a sample sentence like "The quick brown fox" is unchanged, when I'd expect it to come back as "quick brown fox".

When I run the same query in Kibana, I also see it not working:

GET /testindex/_search
{
  "query": {
    "match-all": {}
  }
}

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 10,
    "max_score" : 4.128039,
    "hits" : [
      {
        "_index" : "testindex",
        "_type" : "textdocument",
        "_id" : "6bfkb4EBWF89_POuykkO",
        "_score" : 4.128039,
        "_source" : {
          "title" : "The fastest fox",
          "body" : "The fastest fox is also the brownest fox. They jump over lazy dogs."
        }
      }
    ]
  }
}

Now I do this query:

POST /testindex/_analyze
{
  "field": "title",
  "text": "The quick brown fox"
}

I get this response:

{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    }
  ]
}

Which is what I would expect.

But conversely if I do

POST testindex/testdoc
{
  "title":"The fastest fox",
  "content":"Test the a an migglybiggly",
  "published":"2015-09-15T10:17:53Z"
  
}

and then search for 'migglybiggly', the content field of the returned document has not dropped its stop words.

I am really at a loss as to what I am doing wrong here. I'm fairly new to Elasticsearch and this is really dragging me down.

Thanks in advance!

Edit:

If I run

GET /testindex/_mapping

I see

{
  "textindex" : {
    "mappings" : {
      "testdoc" : {
        "properties" : {
          "title" : {
            "type" : "text",
            "analyzer" : "stop"
          },
          "content" : {
            "type" : "text",
            "analyzer" : "stop"
          }
        }
      }
    }
  }
}

So, to me, it looks like the mapping is getting uploaded correctly, so I don't think it's that?

CodePudding user response:

This is expected behavior: when you execute a query, what you get back in the response is your original content (_source), not the analyzed field.

The analyzer controls how Elasticsearch indexes a field into the inverted index; it does not change your actual content. The same analyzer is applied at query time as well, so when you send a query it will use the stop analyzer, remove the stopwords, and search for your query in the inverted index.
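
You can see the query-time half of that with a quick check (a sketch against the mapping above): a match query for a stop word alone returns no hits, because "the" is stripped both when the documents were indexed and when the query text is analyzed.

GET /testindex/_search
{
  "query": {
    "match": {
      "title": "the"
    }
  }
}

A query like "the fastest" would still match, since the surviving token fastest does exist in the inverted index.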

The POST /testindex/_analyze API shows how your original content is analyzed / tokenized and stored in the inverted index. It does not change your original document.

So when you run a match_all query, it just fetches all the documents from Elasticsearch and returns their _source, which holds the original document content.
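
If you want to see what actually went into the inverted index for a stored document (as opposed to its _source), the _termvectors API can show the indexed terms; here is a sketch using the document id from the response above (on an index with mapping types, the type appears in the path):

GET /testindex/testdoc/6bfkb4EBWF89_POuykkO/_termvectors
{
  "fields": ["title"]
}

For a title of "The fastest fox" indexed with the stop analyzer, the listed terms should be only fastest and fox, while _source still holds the full original text.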

You can use a match query to match on a specific field instead of match_all, as match_all just returns all the documents in the index (10 by default).

{
  "query": {
    "match": {
      "title": "The quick brown fox"
    }
  }
}

Here, you can try queries like quick brown fox or The quick, etc.
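
The same query through the Python client would be something like this (a sketch, reusing the es client from the question):

q = {"query": {"match": {"title": "quick brown fox"}}}
print(es.search(index="testindex", body=q))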

Hope this clears up your understanding.
