In the last question that I asked I want to remove the HTML tags in my search results, After that I thought I could highlite the results with a common query, But in the highlighting field I got other html contents that you removed with script. Would you please help me to highlight the results without html tags that I saved in my db?
My mapping and settings:
{
"settings": {
"analysis": {
"filter": {
"my_pattern_replace_filter": {
"type": "pattern_replace",
"pattern": "\n",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
],
"char_filter": [
"html_strip"
]
},
"parsed_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"filter": [
"my_pattern_replace_filter"
]
}
}
}
},
"mappings": {
"properties": {
"html": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"raw": {
"type": "text",
"fielddata": true,
"analyzer": "parsed_analyzer"
}
}
}
}
}
}
Search Query:
POST idx_test/_search
{
"script_fields": {
"raw": {
"script": "doc['html.raw']"
}
},
"query": {
"match": {
"html": "more"
}
},"highlight": {
"fields": {
"*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
}
}
Result:
"hits": [
{
"_index": "idx_test2",
"_type": "_doc",
"_id": "GijDsYMBjgX3UBaguGxc",
"_score": 0.2876821,
"fields": {
"raw": [
"Test More test"
]
},
"highlight": {
"html": [
"<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"
]
}
}
]
Result that I want to get:
"hits": [
{
"_index": "idx_test2",
"_type": "_doc",
"_id": "GijDsYMBjgX3UBaguGxc",
"_score": 0.2876821,
"fields": {
"raw": [
"Test <strong>More</strong> test"
]
}
]
CodePudding user response:
I thought of another solution. You could index two fields, the original html and the html_extract which has only the text. You would have to use a processor to just index the text coming from the message and highligths would work.
Mapping
PUT idx_html_strip
{
"mappings": {
"properties": {
"html": {
"type": "text"
},
"html_extract": {
"type": "text"
}
}
}
}
Processor Pipeline
PUT /_ingest/pipeline/pipe_html_strip
{
"description": "_description",
"processors": [
{
"html_strip": {
"field": "html",
"target_field": "html_extract"
}
},
{
"script": {
"lang": "painless",
"source": "ctx['html_raw'] = ctx['html_raw'].replace('\n',' ').trim()"
}
}
]
}
Index Data
Note the use ?pipeline=pipe_html_strip
POST idx_html_strip/_doc?pipeline=pipe_html_strip
{
"html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"""
}
Query
GET idx_html_strip/_search?filter_path=hits.hits._source,hits.hits.highlight
{
"query": {
"multi_match": {
"query": "More",
"fields": ["html", "html_extract"]
}
},"highlight": {
"fields": {
"*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
}
}
Results
{
"hits": {
"hits": [
{
"_source": {
"html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>""",
"html_extract": "Test More test"
},
"highlight": {
"html": [
"""<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong><strong>More</strong></strong> test</span></body>"""
],
"html_extract": [
"Test <strong>More</strong> test"
]
}
}
]
}
}