Elasticsearch sort by values in array

Time:11-15

Each of my records in Elasticsearch has an array of objects that looks like this:

{
  "counts_by_year": [
    {
      "year": 2022,
      "works_count": 22523,
      "cited_by_count": 18054
    },
    {
      "year": 2021,
      "works_count": 32059,
      "cited_by_count": 24817
    },
    {
      "year": 2020,
      "works_count": 27210,
      "cited_by_count": 30238
    },
    {
      "year": 2019,
      "works_count": 22592,
      "cited_by_count": 33631
    }
  ]
}

What I want to do is sort my records by the average of works_count over the entries where year is 2022 or 2021. Is this a case where I could use script-based sorting? Or should I copy those values into a separate field and sort on that?

Edit - the mapping is:

{
  "mappings": {
    "_doc": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        .
        .
        .
        "counts_by_year": {
          "properties": {
            "cited_by_count": {
              "type": "integer"
            },
            "works_count": {
              "type": "integer"
            },
            "year": {
              "type": "integer"
            }
          }
        },
        .
        .
        .
      }
    }
  }
}

Answer:

TL;DR

It depends. Most likely yes, unless counts_by_year is mapped as nested.

Solution

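Something along these lines should do the trick. Note that with a non-nested mapping the script sees all works_count values of the document without their year, so it averages over every year rather than only 2021 and 2022: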

GET /_search
{
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": "doc['counts_by_year.works_count'].stream().mapToLong(x -> x).average().orElse(0);"
      }
    }
  }
}
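As a rough illustration of what the script above computes on the sample document, here is a minimal Python sketch. It assumes that, for a non-nested object array, Elasticsearch flattens each sub-field into its own list of values, so the pairing between year and works_count is not visible through doc[...]. The variable names are hypothetical and only serve the illustration.

```python
# Sample document from the question.
doc = {
    "counts_by_year": [
        {"year": 2022, "works_count": 22523, "cited_by_count": 18054},
        {"year": 2021, "works_count": 32059, "cited_by_count": 24817},
        {"year": 2020, "works_count": 27210, "cited_by_count": 30238},
        {"year": 2019, "works_count": 22592, "cited_by_count": 33631},
    ]
}

# Assumption: roughly what doc['counts_by_year.works_count'] exposes for a
# non-nested mapping, i.e. the values alone with no link back to their year.
flattened_works_count = sorted(e["works_count"] for e in doc["counts_by_year"])

# What the doc-values script averages: every year in the document.
average_all_years = sum(flattened_works_count) / len(flattened_works_count)

# What the question asks for: only the 2021 and 2022 entries.
wanted = [e["works_count"] for e in doc["counts_by_year"] if e["year"] in (2021, 2022)]
average_2021_2022 = sum(wanted) / len(wanted)

print(average_all_years)   # average over all four years
print(average_2021_2022)   # average over 2021 and 2022 only
```

The two numbers differ, which is why filtering by year needs either the nested approach or a precomputed field.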

Solution (nested fields)

PUT 74404793-2
{
  "mappings": {
      "properties": {
        "counts_by_year": {
          "type": "nested", 
          "properties": {
            "cited_by_count": {
              "type": "long"
            },
            "works_count": {
              "type": "long"
            },
            "year": {
              "type": "long"
            }
          }
        }
      }
    }
}

POST /74404793-2/_doc/
{
  "counts_by_year": [
    {
      "year": 2022,
      "works_count": 22523,
      "cited_by_count": 18054
    },
    {
      "year": 2021,
      "works_count": 32059,
      "cited_by_count": 24817
    },
    {
      "year": 2020,
      "works_count": 27210,
      "cited_by_count": 30238
    },
    {
      "year": 2019,
      "works_count": 22592,
      "cited_by_count": 33631
    }
  ]
}

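For a nested field, the script below reads the array from _source instead of doc values. This can severely hurt performance if your documents are large. The filter keeps only entries with year > 2020, which is 2021 and 2022 in the sample data.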

GET 74404793-2/_search
{
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": """
        params._source['counts_by_year']
        .stream()
        .filter(x -> x['year'] > 2020)
        .mapToLong(x -> x['works_count'])
        .average().orElse(0);"""
      }
    }
  }
}
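The other option raised in the question, copying the value into a separate field, avoids running a script at search time. A minimal Python sketch of that idea follows, assuming the documents pass through your own code before being indexed. The function name add_recent_average and the field name works_count_avg_recent are hypothetical, not part of Elasticsearch.

```python
def add_recent_average(doc, years=(2021, 2022), field="works_count_avg_recent"):
    """Return a copy of doc with the average works_count for `years` stored in `field`.

    `field` is a hypothetical name; pick whatever fits your mapping.
    """
    values = [
        entry["works_count"]
        for entry in doc.get("counts_by_year", [])
        if entry["year"] in years
    ]
    enriched = dict(doc)  # shallow copy so the caller's document is not modified
    # Fall back to 0 when no entry matches, mirroring orElse(0) in the script.
    enriched[field] = sum(values) / len(values) if values else 0
    return enriched


sample = {
    "counts_by_year": [
        {"year": 2022, "works_count": 22523, "cited_by_count": 18054},
        {"year": 2021, "works_count": 32059, "cited_by_count": 24817},
        {"year": 2020, "works_count": 27210, "cited_by_count": 30238},
        {"year": 2019, "works_count": 22592, "cited_by_count": 33631},
    ]
}

enriched = add_recent_average(sample)
print(enriched["works_count_avg_recent"])
```

Once the enriched document is indexed, a plain field sort on works_count_avg_recent replaces the script. The trade-off is that the value must be recomputed and the document reindexed whenever the set of years changes.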