How to aggregate matched terms in a query

I want to search a nested list of objects using wildcard terms and get back the matching entries, each with its uuid, grouped by the wildcard that matched.

I have the following mapping in my index:

"mappings": {
    "properties": {
        "uuid": {
            "type": "keyword"
        },
        "urls": {
            "type": "nested",
            "properties": {
                "url": {
                    "type": "keyword"
                },
                "is_visited": {
                    "type": "boolean"
                }
            }
        }           
    }
}

and a lot of documents like these:

{
    "uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
    "urls": [
        {
            "is_visited": true,
            "url": "https://www.google.com"
        },
        {
            "is_visited": false,
            "url": "https://www.facebook.com"
        },
        {
            "is_visited": true,
            "url": "https://www.twitter.com"
        }
    ]
},
{
    "uuid":"4a1c695d-756b-4d9d-b3a0-cf524d955884",
    "urls": [
        {
            "is_visited": true,
            "url": "https://www.stackoverflow.com"
        },
        {
            "is_visited": false,
            "url": "https://www.facebook.com"
        },
        {
            "is_visited": false,
            "url": "https://drive.google.com"
        },
        {
            "is_visited": false,
            "url": "https://maps.google.com"
        }
    ]
}
...

I wish to search via wildcard "*google.com OR *twitter.com" and obtain something like this:

"hits": [
    "*google.com": [
        {
            "uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
            "_source": {
                "is_visited": false,
                "url": "https://drive.google.com"
            }
        },
        {
            "uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
            "_source": {
                "is_visited": false,
                "url": "https://maps.google.com"
            }
        },
        {
            "uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
            "_source": {
                "is_visited": true,
                "url": "https://www.google.com"
            }
        }
    ]
    "*twitter.com": [
        {
            "uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
            "_source": {
                "is_visited": true,
                "url": "https://www.twitter.com"
            },  
        },
    ]
]

This is my (Python) search query:

body = {
  #"_source": False,
  "size": 100,
  "query": {
        "nested": {
            "path": "urls",
            "query":{
                "query_string":{
                    "query": f"urls.url:{urlToSearch}",
                }
            }
            ,"inner_hits": {
                "size":100 # returns top 100 results
            }
        }
    }
}

but it returns a hit for each matched term instead of aggregating them into lists like the ones I would like to get.

EDIT: These are my settings and mapping:

{
    "settings": {
        "analysis": {
            "char_filter": {
                "my_filter": {
                    "type": "mapping",
                    "mappings": [
                        "- => _",
                    ]
                },
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "my_filter"
                    ],
                    "filter": [
                        "lowercase",
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "uuid": {
                "type": "keyword"
            },
            "urls": {
                "type": "nested",
                "properties": {
                    "url": {
                        "type": "keyword"
                    },
                    "is_visited": {
                        "type": "boolean"
                    }
                }
            }           
        }
    }
}

Answer:

Elasticsearch will not produce the output you want from the query as you have set it up. This scenario calls for an aggregation. My suggestion is to apply the nested query and then run an aggregation on the results.

A point to watch with wildcard queries:

Avoid beginning patterns with * or ?. This can increase the iterations needed to find matching terms and slow search performance.

{
  "size": 0,
  "query": {
    "nested": {
      "path": "urls",
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "urls.url": {
                  "value": "*google.com"
                }
              }
            },
            {
              "wildcard": {
                "urls.url": {
                  "value": "*twitter.com"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "agg_providers": {
      "nested": {
        "path": "urls"
      },
      "aggs": {
        "google.com": {
          "terms": {
            "field": "urls.url",
            "include": ".*google\\.com",
            "size": 10
          }
        },
        "twitter.com": {
          "terms": {
            "field": "urls.url",
            "include": ".*twitter\\.com",
            "size": 10
          }
        }
      }
    }
  }
}
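
Since the question builds its query in Python, the same request can be generated from a list of wildcard patterns instead of being written out by hand. The following is only a sketch: build_body and wildcard_to_regex are hypothetical helper names, not part of any Elasticsearch client, and naming each aggregation after its pattern assumes the patterns do not contain "[", "]" or ">", which are not allowed in aggregation names. The helper also escapes literal dots in the "include" expression, because "include" takes a regular expression rather than a wildcard.

```python
from typing import Any, Dict, List

# Characters that act as operators in Lucene regular expressions.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def wildcard_to_regex(pattern: str) -> str:
    """Translate a wildcard pattern (* and ?) into a Lucene regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")        # any run of characters
        elif ch == "?":
            parts.append(".")         # exactly one character
        elif ch in LUCENE_RESERVED:
            parts.append("\\" + ch)   # match the character literally
        else:
            parts.append(ch)
    return "".join(parts)


def build_body(patterns: List[str], path: str = "urls",
               field: str = "urls.url", bucket_size: int = 10) -> Dict[str, Any]:
    """Build a nested wildcard query plus one terms aggregation per pattern."""
    should = [{"wildcard": {field: {"value": p}}} for p in patterns]
    aggs = {
        p: {"terms": {"field": field,
                      "include": wildcard_to_regex(p),
                      "size": bucket_size}}
        for p in patterns
    }
    return {
        "size": 0,
        "query": {"nested": {"path": path,
                             "query": {"bool": {"should": should}}}},
        "aggs": {"agg_providers": {"nested": {"path": path}, "aggs": aggs}},
    }


body = build_body(["*google.com", "*twitter.com"])
```

For the two patterns from the question this yields a request of the same shape as the one above, with the aggregations keyed by "*google.com" and "*twitter.com".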

Results of the request above:

"aggregations": {
    "agg_providers": {
      "doc_count": 7,
      "twitter.com": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "https://www.twitter.com",
            "doc_count": 1
          }
        ]
      },
      "google.com": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "https://drive.google.com",
            "doc_count": 1
          },
          {
            "key": "https://maps.google.com",
            "doc_count": 1
          },
          {
            "key": "https://www.google.com",
            "doc_count": 1
          }
        ]
      }
    }
  }
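
Note that these buckets contain only the URL and a count. If the uuid and is_visited of each match are needed as well, as in the desired output of the question, one option is to keep the question's nested query with inner_hits and group the inner hits on the client. This is a rough sketch under a few assumptions: group_inner_hits is a hypothetical helper, _source is enabled for the parent hits, the inner hits use the default name (the nested path, "urls"), and fnmatchcase from the standard library only approximates Elasticsearch wildcard matching (it additionally treats "[...]" as a character set). The sample response is hand-written in the shape of a search response, not real output.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def group_inner_hits(response, patterns, path="urls", field="url"):
    """Group nested inner hits by the wildcard pattern their URL matches."""
    grouped = defaultdict(list)
    for hit in response["hits"]["hits"]:
        uuid = hit["_source"]["uuid"]
        # Inner hits are keyed by the nested path unless a name was given.
        for nested in hit["inner_hits"][path]["hits"]["hits"]:
            source = nested["_source"]
            for pattern in patterns:
                if fnmatchcase(source[field], pattern):
                    grouped[pattern].append({"uuid": uuid, "_source": source})
    return dict(grouped)


# Hand-written stand-in for a search response (illustrative only).
sample_response = {
    "hits": {"hits": [
        {
            "_source": {"uuid": "afa9ac03-0723-4d66-ae18-08a51e2973bd"},
            "inner_hits": {"urls": {"hits": {"hits": [
                {"_source": {"is_visited": True, "url": "https://www.google.com"}},
                {"_source": {"is_visited": True, "url": "https://www.twitter.com"}},
            ]}}},
        },
        {
            "_source": {"uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884"},
            "inner_hits": {"urls": {"hits": {"hits": [
                {"_source": {"is_visited": False, "url": "https://drive.google.com"}},
                {"_source": {"is_visited": False, "url": "https://maps.google.com"}},
            ]}}},
        },
    ]}
}

grouped = group_inner_hits(sample_response, ["*google.com", "*twitter.com"])
```

Here grouped is a dict keyed by pattern, and each value is a list of {"uuid": ..., "_source": ...} entries. The trade-off is that the grouping happens in the client, so the "size" values of the query and of inner_hits limit how many matches can be seen.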