Elasticsearch - count word occurrences in all texts from index

I need to count occurrences of word X across all texts in index Y, which has a single field, "content". Note that I need the count for a specific word: how many times it occurs in total across all documents, not how many documents contain it. From what I've found, ES is not well optimized for this (since the field is a text type), but this is for university homework, so I have little choice.

So far I've tried (taken from here):

{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
        "params": {
          "phrase": "ustawa"
        }
      }
    }
  }
}

The scripting approach returns:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
          "if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
          "       ^---- HERE"
        ],
        "script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
        "lang": "painless",
        "position": {
          "offset": 22,
          "start": 15,
          "end": 104
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "bills",
        "node": "MXtcD7-zT-mhDyxMeRTMLw",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:88)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:40)",
            "if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) ",
            "       ^---- HERE"
          ],
          "script": "int count = 0; if(doc['content.keyword'].size() > 0 && doc['content'].value.indexOf(params.phrase)!=-1) count++; return count;",
          "lang": "painless",
          "position": {
            "offset": 22,
            "start": 15,
            "end": 104
          },
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "No field found for [content.keyword] in mapping with types []"
          }
        }
      }
    ]
  },
  "status": 400
}

I used content.keyword above because, with plain content, ES complained that the text type is not optimized for such searches.

I also tried using text statistics (from here), but I couldn't get this to work: it only counted documents containing the word, which is not what I'm looking for.

As a last approach, I tried a search with an aggregation (from here), but it also returned only the count of documents, not words:

{
  "query": {
    "query_string": {
      "fields": ["content"],
      "query": "ustawa"
    }
  },  
  "aggs": {
    "my-terms": {
      "terms": {
        "field": "content.keyword"
      }
    }
  }
}

How can I achieve this? I'm using Python, if it matters.

EDIT: Mapping for the index I'm using:

  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }

CodePudding user response：

Elasticsearch 7.11 introduced runtime_mappings. With this feature you can build a new field at runtime and then count the word across all your documents with a regular "sum" aggregation.

For example:

PUT test/_doc/1
{
  "field" : "test test test ss"

}
PUT test/_doc/2
{
  "field" : "test test test ss"

}
GET test/_search
{
  "size": 0, 
  "runtime_mappings": {
    "phrase_count": {
      "type": "long",
      "script": """
         String tmp = doc['field.keyword'].value;
         Matcher m = /(test)/.matcher(tmp);
         int count = 0;
         while (m.find()){
           count++;
         }
         emit(count);
          """
    }
  },
  "query": {
    "match_all": {}
  }, 
  "aggs": {
    "word_count": {
      "sum": {
        "field": "phrase_count"
      }
    }
  }
}

Here "test" in the Matcher is the word you are looking for and want to count.
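
Since the question mentions Python, the same request body can be built as a plain dict for an arbitrary word. This is only a sketch: the helper names build_word_count_query and read_word_count are made up for illustration, and the keyword_field default assumes the dynamically mapped "field.keyword" sub-field from the example exists. The mapping shown in the question has no content.keyword sub-field, so the script would fail there unless one is added.

```python
import re


def build_word_count_query(word, keyword_field="field.keyword"):
    """Build a search body that sums regex matches of `word` over all documents.

    Hypothetical helper: it mirrors the runtime_mappings request shown above.
    """
    # Accept only plain word characters, so the word can be placed inside a
    # Painless /regex/ literal without any escaping.
    if not re.fullmatch(r"\w+", word):
        raise ValueError("word must consist of letters, digits or underscores")
    if not re.fullmatch(r"[\w.]+", keyword_field):
        raise ValueError("unexpected characters in field name")

    # Same Painless script as in the example, with the field and word filled in.
    script = (
        "String tmp = doc['" + keyword_field + "'].value; "
        "Matcher m = /(" + word + ")/.matcher(tmp); "
        "int count = 0; "
        "while (m.find()) { count++; } "
        "emit(count);"
    )
    return {
        "size": 0,
        "runtime_mappings": {
            "phrase_count": {"type": "long", "script": script}
        },
        "query": {"match_all": {}},
        "aggs": {"word_count": {"sum": {"field": "phrase_count"}}},
    }


def read_word_count(response):
    """Pull the summed count out of a search response dict."""
    return int(response["aggregations"]["word_count"]["value"])
```

The returned dict can be sent as the search body with whatever client you use, and read_word_count picks the sum out of the "aggregations" section of the response.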

CodePudding user response：

There is a built-in API in Elasticsearch to retrieve such information, since document and term frequencies are very relevant for BM25 scoring in Elasticsearch. See the Term vectors API and its term statistics option. You are looking for the "total term frequency" value there.

If you only want to return the term statistics for specific terms, and not for all terms in existing documents, you can send an "artificial document" to the API that contains only the terms you are looking for.
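
A rough Python sketch of such an artificial-document request, assuming the index from the question with its single "content" field. The helper names are hypothetical. The body keys (doc, fields, term_statistics, field_statistics, positions, offsets) are the ones the Term vectors API accepts, for example on GET /bills/_termvectors. Two caveats, as far as I understand the API: the terms in the response are the analyzed tokens, so my_analyzer may change the word, and the statistics are gathered from a single shard, so on a multi-shard index the value is not the index-wide total.

```python
def build_termvectors_request(word, field="content"):
    """Body for a term vectors request on an artificial document.

    The artificial document contains only the word of interest, so the
    response should list statistics just for the term(s) it analyzes to.
    """
    return {
        "doc": {field: word},
        "fields": [field],
        "term_statistics": True,    # asks for doc_freq and ttf per term
        "field_statistics": False,  # not needed for a single-word count
        "positions": False,
        "offsets": False,
    }


def total_term_frequencies(response, field="content"):
    """Map each returned term to its total term frequency (ttf).

    A term without a "ttf" entry is reported as 0 here; that is a choice of
    this sketch, not documented API behaviour.
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats.get("ttf", 0) for term, stats in terms.items()}
```

The dict from build_termvectors_request is the request body, and total_term_frequencies reads the "ttf" values out of the parsed JSON response.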