Elasticsearch prevent indexing of Markdown hyperlinks

I am building a search over Markdown file content using Elasticsearch. Currently, the entire content of each MD file is indexed in Elasticsearch. The problem is that the search results show raw link syntax such as [Mylink](https://link-url-here.org) and [Mylink2](another_page.md).

I would like to prevent hyperlinks and references to other pages from being indexed. When someone searches for "Mylink", the result should contain only the link text, without the URL. It would be great if someone could help me with the right solution for this.

CodePudding user response：

You need to render the Markdown in your indexing application, strip the HTML tags from the rendered output, and save the resulting plain text alongside the Markdown source.
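As a rough illustration of the tag-stripping half of this idea, here is a minimal Python sketch using only the standard library. It assumes the Markdown has already been rendered to HTML by whatever Markdown library your application uses (that step is not shown); the field name content_text and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of rendered HTML with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def build_document(markdown_source, rendered_html):
    # "content" keeps the Markdown source as-is; "content_text" (hypothetical
    # field name) holds the tag-free text you would actually search on.
    return {
        "content": markdown_source,
        "content_text": html_to_text(rendered_html),
    }
```

Because the link target lives in the href attribute rather than in a text node, it disappears on its own once the tags are dropped, so no link-specific regex is needed with this approach.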

CodePudding user response：

I think there are two main solutions to this problem. First, clean the data in your own code before indexing it into Elasticsearch. Second, have Elasticsearch clean the data for you. The first solution is the easier one, but if you need to do this inside Elasticsearch, you need to create an ingest pipeline.

Then you can use the script processor to clean the data with a Painless script that finds the matches for your regex and removes them.
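For the first option (cleaning in your own code before indexing), a minimal Python sketch might look like the following. The regex and the function name strip_links are illustrative assumptions, not something from the question; the pattern only covers simple inline links.

```python
import re

# Matches inline links of the form [text](target); group 1 is the link text.
# Deliberately simple: it does not handle nested brackets, parentheses inside
# the URL, images (![alt](src)), or reference-style links.
INLINE_LINK = re.compile(r"\[([^\[\]]+)\]\(([^()]+)\)")


def strip_links(markdown):
    """Replace every [text](target) with just its text."""
    return INLINE_LINK.sub(r"\1", markdown)
```

You would call strip_links on the file content and index the returned string, either in place of the original or in a separate field next to it.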

CodePudding user response：

You could use an ingest pipeline with a script processor to extract the link text:

1. Set up the pipeline

PUT _ingest/pipeline/clean_links
{
  "description": "...",
  "processors": [
    {
      "script": {
        "source": """
          if (ctx["content"] == null) {
            // nothing to do here
            return
          }
          
          def content = ctx["content"];
          
          Pattern pattern = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/;
          Matcher matcher = pattern.matcher(content);
          def purged_content = matcher.replaceAll("$1");
          
          ctx["purged_content"] = purged_content;
        """
      }
    }
  ]
}

The regex can be tested here and is inspired by this.

2. Include the pipeline when ingesting the docs

POST my-index/_doc?pipeline=clean_links
{
  "content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
}

POST my-index/_doc?pipeline=clean_links
{
  "content": "[Mylink2](another_page.md)"
}

The python docs are here.

3. Verify

GET my-index/_search?filter_path=hits.hits._source

should yield

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "purged_content" : "Mylink anotherLink",
          "content" : "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
        }
      },
      {
        "_source" : {
          "purged_content" : "Mylink2",
          "content" : "[Mylink2](another_page.md)"
        }
      }
    ]
  }
}

You could instead overwrite the original content field if you want to discard the links from your _source entirely.

Alternatively, you could go a step further in the other direction and store the text-link pairs in a nested field of the form:

{
  "content": "...",
  "links": [
    {
      "text": "Mylink",
      "href": "https://link-url-here.org"
    },
    ...
  ]
}

so that when you later decide to make them searchable, you'll be able to do so with precision.
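If you build that structure on the client side rather than in the pipeline, a minimal Python sketch could look like this. The regex and the function name extract_links are assumptions for illustration, and only simple inline links are covered; the output follows the content/links shape shown above, with content left as the original Markdown.

```python
import re

# Group 1 is the link text, group 2 the target. Reference-style links,
# images, and URLs containing parentheses are not handled.
INLINE_LINK = re.compile(r"\[([^\[\]]+)\]\(([^()]+)\)")


def extract_links(markdown):
    """Build a document with the original content plus its text/href pairs."""
    links = [
        {"text": match.group(1), "href": match.group(2)}
        for match in INLINE_LINK.finditer(markdown)
    ]
    return {"content": markdown, "links": links}
```

To query text and href together per link, the links field would need to be mapped as nested in the index before any documents are ingested.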

Shameless plug: you can find other hands-on ingestion guides in my Elasticsearch Handbook.