get distinct values from "text field" without remapping


I'm querying ~350 TB of documents.

Re-indexing is not an option.

Performance, within reason, is not a concern.

My documents have a field s3_filename mapped as {"type": "text"}. It doesn't have any sub-fields.

Setting fielddata: true for s3_filename gives the wrong results: it returns the unique "words" (tokens) from the value, not the unique values.

e.g. s3_filename has values similar to ud20220711/long-file-name-20220711.json.gz

I want to aggregate on the whole value (ud20220711/long-file-name-20220711.json.gz), but instead I get aggregates of the individual words:

The terms aggregation returns buckets for parts of the s3_filename, e.g. [long, file, name, 20220711, json, gz], but no bucket for the full value ud20220711/long-file-name-20220711.json.gz.

I've tried simple aggs and composite aggs, but nothing works...

Edit: In a discussion on elastic.co (https://discuss.elastic.co/t/get-distinct-values-from-text-field-without-remapping/309694/5), leandrojmp showed me that I could create "runtime fields" either in the index mapping or in the query itself. I went with a runtime field in the query itself.

The following gave me exactly what I needed; however, it took about 18 times longer to run than other queries against the same data set (74 minutes, which is perfectly acceptable compared to roughly 15 days to re-index 350 TB).

{
  "runtime_mappings": {
    "s3_filename": {
      "type": "keyword"
    }
  },
  "aggs": {
    "filenames": {
      "terms": {
        "field": "s3_filename",
        "size": 10000
      }
    }
  }
}
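For context, a runtime field defined without a script looks up the value in _source for a field of the same name, which is why redefining s3_filename as a keyword runtime field shadows the analyzed text mapping. A minimal sketch of sending this body as a search request (the index name my-index-000001 is a placeholder; "size": 0 just suppresses the document hits):

GET my-index-000001/_search
{
  "size": 0,
  "runtime_mappings": {
    "s3_filename": {
      "type": "keyword"
    }
  },
  "aggs": {
    "filenames": {
      "terms": {
        "field": "s3_filename",
        "size": 10000
      }
    }
  }
}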

CodePudding user response:

The crux of the problem is that your field is analyzed by ES into individual tokens, not stored as the single value you want. That is why the aggregations don't work.

You said reindexing is not an option, which rules out rebuilding the index. But have you considered updating the mapping of this field to use multi-fields with a keyword sub-field?

Solution 1:

PUT /my-index-000001/_mapping
{
  "properties": {
    "s3_filename": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    }
  }
}

You can then aggregate on s3_filename.raw. Note that the new sub-field is only populated for documents indexed after the mapping update, so existing documents would need to be reindexed or run through _update_by_query before it becomes useful (see the Elasticsearch docs on multi-fields and updating mappings).
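Once the sub-field is populated, a terms aggregation on it might look something like this (index name and size are placeholders):

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "filenames": {
      "terms": {
        "field": "s3_filename.raw",
        "size": 10000
      }
    }
  }
}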

Solution 2:

Construct a keyword field at runtime and aggregate on it.

PUT my-index-000001/_mapping
{
  "runtime": {
    "s3_keyword": {
      "type": "keyword",
      "script": {
        "source": "emit(params._source['s3_filename'])"
      }
    }
  }
}

You can then aggregate on this field, as shown below. This will be much slower, since the value is computed from _source at query time.
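For example (again, index name and size are placeholders):

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "filenames": {
      "terms": {
        "field": "s3_keyword",
        "size": 10000
      }
    }
  }
}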

Observations:

Setting fielddata to true requires the whole field to be loaded into heap memory, and it will never be very performant.

Reindexing the data may still be the best option if none of the above is possible; 350 TB is not that big a deal if the reindex is run with multiple slices/threads. Finally, have you considered reindexing only the s3_filename field into another index? You can use the _reindex API with a source filter to copy only selected fields from the existing index into a new one, as sketched below. That would not be very expensive or time-consuming, and it would give you the aggregation results you need.
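A rough sketch of such a selective reindex (index names are placeholders; the destination index should map s3_filename as keyword, or with a keyword sub-field, before running it):

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "my-index-000001",
    "_source": ["s3_filename"]
  },
  "dest": {
    "index": "s3-filenames"
  }
}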
