Is it possible to set a new field value while analyzing a document being indexed in Elasticsearch?

For example:

When indexing a document into Elasticsearch:
I want to analyze a field named description with the uax_url_email tokenizer/analyzer;
if description contains any URLs, put them into another array field named urls;
finish indexing the document.

Then I can check whether the urls field is empty to know whether description contains any URL.

Is this possible? Or does an analyzer only contribute to the inverted index and not to other fields?

CodePudding user response:

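An analyzer only produces the tokens that go into the inverted index; it cannot change _source or populate other fields. To extract the URLs into a separate field, you can use an ingest pipeline with a script processor and a Painless script, which runs before the document is indexed. The example uses a field named content; replace it with description for your case. I hope this helps.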

POST _ingest/pipeline/_simulate?verbose
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "description": "Extract 'tags' from 'env' field",
          "lang": "painless",
          "source": """
            
            // Match http, https, and ftp URLs in the 'content' field.
            def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])/.matcher(ctx["content"]);
            ArrayList urls = new ArrayList();
            while(m.find())
            {
              urls.add(m.group());
            }
            ctx['urls'] = urls;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
      }
    }
  ]
}

The pipeline above produces a result like this:

{
  "docs": [
    {
      "processor_results": [
        {
          "processor_type": "script",
          "status": "success",
          "description": "Extract 'tags' from 'env' field",
          "doc": {
            "_index": "_index",
            "_id": "_id",
            "_source": {
              "urls": [
                "https://apple.com",
                "https://google.com"
              ],
              "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
            },
            "_ingest": {
              "pipeline": "_simulate_pipeline",
              "timestamp": "2022-07-13T12:45:00.3655307Z"
            }
          }
        }
      ]
    }
  ]
}
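
The _simulate call only previews the result; nothing is stored. To apply the same logic when documents are actually indexed, the pipeline has to be saved under a name and referenced in the index request. The following is a minimal sketch of that, not a tested setup: the pipeline name extract-urls, the index name my-index, and the sample document are placeholders, and the script reads from description (the field named in the question) instead of content.

```
# Store the pipeline under a name (extract-urls is a placeholder).
PUT _ingest/pipeline/extract-urls
{
  "description": "Collect URLs found in 'description' into 'urls'",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Skip documents where 'description' is missing or not a string.
          if (ctx['description'] instanceof String) {
            def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])/.matcher(ctx['description']);
            ArrayList urls = new ArrayList();
            while (m.find()) {
              urls.add(m.group());
            }
            ctx['urls'] = urls;
          }
        """
      }
    }
  ]
}

# Index a document through the pipeline (my-index is a placeholder).
PUT my-index/_doc/1?pipeline=extract-urls
{
  "description": "See https://apple.com for details"
}

# Find documents whose 'urls' array is not empty.
GET my-index/_search
{
  "query": {
    "exists": { "field": "urls" }
  }
}
```

The exists query treats an empty array the same as a missing field, so it returns only documents in which at least one URL was found. If every document written to the index should pass through the pipeline, the index.default_pipeline index setting can be used instead of the pipeline query parameter. Painless regular expressions are controlled by the script.painless.regex.enabled cluster setting, so on clusters where regexes are disabled the script will be rejected.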