Using multi-match query across parent and nested documents possible?

Time:02-19

Assuming the model:

{
  "group" : "fans",
  "name": "Anne",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

With user being of the "nested" type, I would like to use a multi-match query to select documents (and inner hits) whose search terms match across both the parent and the inner hits.

Use case 1 - all match on parent

Searching fans Anne should give me the above document along with all inner hits because it matches completely on the parent level.

Use case 2 - all match on inner hit

Searching John Smith should give me the above document, but only with the first inner hit because it did not match on parent level nor did it match on the second inner hit.

Use case 3 - partial match on parent and inner hit

Searching fans Smith should give me the above document, but only with the first inner hit because the combined result matches in combination with the parent and first inner hit fields. It should NOT return the second inner hit since Smith is missing from both its own and parent fields.


Use cases 1 and 2 are easily solved with a bool query that combines a multi-match query on the parent level with another multi-match query inside a nested query (Java code follows):

boolQuery()
    .should(multiMatchQuery(searchTerm).operator(AND).type(CROSS_FIELDS))
    .should(nestedQuery("user", multiMatchQuery(searchTerm).operator(AND).type(CROSS_FIELDS), NONE))

It's the third use case that I am stuck on. The above query only works for the parent level or nested level separately, but not in combination. I have tried to add "include_in_parent" to the nested type to have it indexed together with the parent, but then it matches on searches like John Alice which I don't want.
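To illustrate why "include_in_parent" produces the unwanted John Alice match, here is a minimal plain-Java sketch. It does not use Elasticsearch at all, and the class and method names are hypothetical. It only models "every search term must appear somewhere", assuming whitespace tokenization and case-insensitive comparison, and shows that pooling the values of all user objects into the parent loses the boundary between John Smith and Alice White.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this is not how Elasticsearch is implemented.
class FlattenedMatchSketch {

    // Returns true when every whitespace-separated term occurs among the values.
    static boolean allTermsPresent(String searchTerm, List<String> values) {
        List<String> lower = new ArrayList<>();
        for (String value : values) {
            lower.add(value.toLowerCase(Locale.ROOT));
        }
        for (String term : searchTerm.trim().split("\\s+")) {
            if (!lower.contains(term.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // Nested semantics: a single user object must contain all terms.
    static boolean matchesNested(String searchTerm, List<List<String>> users) {
        for (List<String> user : users) {
            if (allTermsPresent(searchTerm, user)) {
                return true;
            }
        }
        return false;
    }

    // Flattened semantics: the values of all user objects are pooled first.
    static boolean matchesFlattened(String searchTerm, List<List<String>> users) {
        List<String> pooled = new ArrayList<>();
        for (List<String> user : users) {
            pooled.addAll(user);
        }
        return allTermsPresent(searchTerm, pooled);
    }

    public static void main(String[] args) {
        List<List<String>> users = Arrays.asList(
                Arrays.asList("John", "Smith"),
                Arrays.asList("Alice", "White"));
        System.out.println(matchesNested("John Alice", users));    // false
        System.out.println(matchesFlattened("John Alice", users)); // true
    }
}
```

In the flattened view John and Alice are simply two values of the same multi-valued field, so an AND query across them succeeds even though no single user has both.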

CodePudding user response:

You can't handle nested and non-nested fields inside a single multi-match query, due to the nature of nested documents.

So I think the only solution is to change your model and duplicate the group and name fields inside each nested document. Your request logic would then join a multi-match query on the parent with a nested query on user that searches the group/name/first/last fields.

I know that you certainly don't want to change the model, but when working with Elasticsearch you have to adapt the model to match the search features you want to provide. Not the other way around ;)
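The denormalization suggested in this answer can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions (whitespace tokenization, case-insensitive exact term comparison, no Elasticsearch API). The class, the helper names and the "return the index of each matching nested document" convention are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only; names are hypothetical.
class DenormalizedUsersSketch {

    static Map<String, String> user(String first, String last) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("first", first);
        user.put("last", last);
        return user;
    }

    // Copies the parent's group and name into every nested user document.
    static List<Map<String, String>> denormalize(String group, String name,
                                                 List<Map<String, String>> users) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> user : users) {
            Map<String, String> copy = new LinkedHashMap<>(user);
            copy.put("group", group);
            copy.put("name", name);
            result.add(copy);
        }
        return result;
    }

    // Returns the indexes of the nested documents that contain every search term.
    static List<Integer> matchingUsers(String searchTerm, List<Map<String, String>> users) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < users.size(); i++) {
            List<String> values = new ArrayList<>();
            for (String value : users.get(i).values()) {
                values.add(value.toLowerCase(Locale.ROOT));
            }
            boolean all = true;
            for (String term : searchTerm.trim().split("\\s+")) {
                all &= values.contains(term.toLowerCase(Locale.ROOT));
            }
            if (all) {
                hits.add(i);
            }
        }
        return hits;
    }

    // The document from the question, with group and name copied into each user.
    static List<Map<String, String>> sampleUsers() {
        return denormalize("fans", "Anne",
                Arrays.asList(user("John", "Smith"), user("Alice", "White")));
    }

    public static void main(String[] args) {
        List<Map<String, String>> users = sampleUsers();
        System.out.println(matchingUsers("fans Anne", users));  // [0, 1]
        System.out.println(matchingUsers("John Smith", users)); // [0]
        System.out.println(matchingUsers("fans Smith", users)); // [0]
        System.out.println(matchingUsers("John Alice", users)); // []
    }
}
```

Because every nested document now carries the parent fields, a search evaluated per nested document can cover all three use cases, at the cost of storing group and name once per user.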

CodePudding user response:

I made a last-ditch effort to get around this manually and, unexpectedly, found a way. It's quite involved, but this is how I currently solve it. (I explain how it works below the code snippet.)

private static final String FULL_MATCH_ON_PARENT = "full-match-on-parent";
private static final String PARTIAL_MATCH_ON_PARENT = "partial-match-on-parent";
private static final String PARTIAL_MATCH_ON_NESTED = "partial-match-on-nested";
private final RestHighLevelClient client;
private final ObjectMapper objectMapper;

public List<ParentObject> search(String searchTerm) throws IOException {
    SearchSourceBuilder searchSourceBuilder = searchSource().size(20).query(queryForMatchingParents(searchTerm));
    SearchRequest request = new SearchRequest().indices("my-index").source(searchSourceBuilder);
    SearchResponse search = client.search(request, DEFAULT);
    return Stream.of(search.getHits().getHits())
            .map(hit -> {
                ParentObject parent = readJson(hit.getSourceAsString(), ParentObject.class);
                if (!Arrays.asList(hit.getMatchedQueries()).contains(FULL_MATCH_ON_PARENT)) {
                    List<String> nestedToKeep = Arrays.stream(hit.getMatchedQueries())
                            .filter(queryName -> queryName.startsWith(PARTIAL_MATCH_ON_PARENT))
                            .map(partialMatchOnParentQueryName -> partialMatchOnParentQueryName.replace("parent", "nested"))
                            .flatMap(partialMatchOnNestedQueryName -> Arrays.stream(hit.getInnerHits().get(partialMatchOnNestedQueryName).getHits()))
                            .map(innerHit -> readJson(innerHit.getSourceAsString(), NestedObject.class).getId())
                            .distinct()
                            .collect(toList());
                    parent.getNestedObjects().removeAll(parent.getNestedObjects().stream()
                            .filter(nested -> !nestedToKeep.contains(nested.getId()))
                            .collect(toList()));
                }
                return parent;
            })
            .filter(Objects::nonNull)
            .collect(toList());
}

private QueryBuilder queryForMatchingParents(String searchTerm) {
    BoolQueryBuilder superAggregateQuery = boolQuery();
    MultiMatchQueryBuilder matchParentQuery = multiMatchQuery(searchTerm)
            .operator(Operator.AND)
            .type(MultiMatchQueryBuilder.Type.CROSS_FIELDS).queryName(FULL_MATCH_ON_PARENT);
    superAggregateQuery.should(matchParentQuery);
    BoolQueryBuilder aggregateQuery = boolQuery();
    aggregateQuery.mustNot(matchParentQuery);
    BoolQueryBuilder matchNestedQuery = boolQuery();
    int counter = 1;
    for (Pair<String, String> searchTermPair : getParentNestedPairsOfSearchTerm(searchTerm)) {
        matchNestedQuery.should(queryPartialInnerHit(searchTermPair.getLeft(), searchTermPair.getRight(), counter));
        counter++;
    }
    aggregateQuery.must(matchNestedQuery);
    superAggregateQuery.should(aggregateQuery);
    return superAggregateQuery;
}

private QueryBuilder queryPartialInnerHit(String parentSearchTerm, String nestedSearchTerm, int counter) {
    BoolQueryBuilder splitBoolQuery = boolQuery();
    if (StringUtils.isNotEmpty(parentSearchTerm)) {
        splitBoolQuery.must(multiMatchQuery(parentSearchTerm)
                .operator(Operator.AND)
                .type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)).queryName(PARTIAL_MATCH_ON_PARENT + "-" + counter);
    } else {
        // this is necessary because we still need the queryname to trigger for empty string in parentSearchTerm
        splitBoolQuery.must(matchAllQuery()).queryName(PARTIAL_MATCH_ON_PARENT + "-" + counter);
    }
    splitBoolQuery.must(nestedQuery("nested",
            multiMatchQuery(nestedSearchTerm)
                    .operator(Operator.AND)
                    .fuzzyTranspositions(false)
                    .type(MultiMatchQueryBuilder.Type.CROSS_FIELDS),
            ScoreMode.Min).innerHit(new InnerHitBuilder(PARTIAL_MATCH_ON_NESTED + "-" + counter).setExplain(true)));
    return splitBoolQuery;
}

private List<Pair<String, String>> getParentNestedPairsOfSearchTerm(String searchTerm) {
    Set<String> words = new HashSet<>(Arrays.asList(searchTerm.split(" ")));
    Set<Set<String>> powerSet = powerSet(words);
    powerSet = powerSet.stream().filter(set -> set.size() < words.size()).collect(Collectors.toSet());

    return powerSet.stream()
            .map(set -> {
                ArrayList<String> truncatedWords = new ArrayList<>(words);
                truncatedWords.removeAll(set);
                String words1 = String.join(" ", set);
                String words2 = String.join(" ", truncatedWords);
                return new ImmutablePair<>(words1, words2);
            })
            .collect(Collectors.toList());
}

private <T> Set<Set<T>> powerSet(Set<T> originalSet) {
    Set<Set<T>> sets = new HashSet<>();
    if (originalSet.isEmpty()) {
        sets.add(new HashSet<>());
        return sets;
    }
    List<T> list = new ArrayList<>(originalSet);
    T head = list.get(0);
    Set<T> rest = new HashSet<>(list.subList(1, list.size()));
    for (Set<T> set : powerSet(rest)) {
        Set<T> newSet = new HashSet<>();
        newSet.add(head);
        newSet.addAll(set);
        sets.add(newSet);
        sets.add(set);
    }
    return sets;
}

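// Wraps the checked IOException in java.io.UncheckedIOException so that
// readJson can be called from inside the stream lambdas in search().
private <T> T readJson(String json, Class<T> objectClass) {
    try {
        return objectMapper.readValue(json, objectClass);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}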

For reasons explained in my question, Elasticsearch can't search the parent and the nested objects at the same time for a given query. So I build a power set of the search words to find every way of splitting them between the parent and the nested objects. For example, if I enter 3 search terms (one two three), there are 2^3 = 8 subsets to consider: search for one in the parent fields and two three in the nested fields, and so on.

I assign a named query to each of the partial matches (on parent and nested level) with the same number suffix 'counter' to identify which ones belong together. This becomes important when we get back the search response.

Before I talk about how we interpret the results, I should mention that there are only 2^3-1 combinations for partial matches. I consider the full match on parent a special case since in that case I do not need to filter any of the returned nested objects. Hence it has a different named query FULL_MATCH_ON_PARENT.
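As a small illustration of these splits, the sketch below enumerates the (parent terms, nested terms) pairs with a bitmask loop. The answer's code uses a recursive powerSet over a HashSet instead; the bitmask is swapped in here only because it keeps the output order predictable. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Illustration only: enumerates how the search words can be divided
// between the parent query and the nested query.
class SearchTermSplitSketch {

    // Each element is a two-element array: [termsForParent, termsForNested].
    static List<String[]> splits(String searchTerm) {
        // LinkedHashSet drops duplicate words while keeping their order.
        List<String> words = new ArrayList<>(
                new LinkedHashSet<>(Arrays.asList(searchTerm.trim().split("\\s+"))));
        int n = words.size();
        List<String[]> pairs = new ArrayList<>();
        // Bit i of mask set means word i is searched on the parent.
        // The all-ones mask (every word on the parent) is skipped because
        // the full match on the parent is handled as a separate case.
        for (int mask = 0; mask < (1 << n) - 1; mask++) {
            List<String> parent = new ArrayList<>();
            List<String> nested = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    parent.add(words.get(i));
                } else {
                    nested.add(words.get(i));
                }
            }
            pairs.add(new String[] {String.join(" ", parent), String.join(" ", nested)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : splits("fans Smith")) {
            System.out.println("parent=[" + pair[0] + "] nested=[" + pair[1] + "]");
        }
        // prints:
        // parent=[] nested=[fans Smith]
        // parent=[fans] nested=[Smith]
        // parent=[Smith] nested=[fans]
    }
}
```

For n distinct words this yields 2^n - 1 pairs, so the number of sub-queries grows exponentially with the length of the search phrase.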

In the response we extract, for each hit, the named queries for full or partial matches on the parent level. If the full match is absent (no FULL_MATCH_ON_PARENT among the matched queries), the nested objects are evaluated for discarding. Only the nested objects that have a matching partial parent hit should be kept, so we loop over the partial-match-on-parent-{number} named queries that are present and retrieve the inner hits with the same number, which contain the nested objects to keep. From there on the code should be self-explanatory.
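The filtering step can be sketched without any Elasticsearch classes. In the hypothetical helper below, matchedQueries stands in for hit.getMatchedQueries() and innerHitIdsByName stands in for the ids read from hit.getInnerHits(). It also assumes, as the answer's code does, that each nested object has an id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: decides which nested objects of one hit to keep.
class InnerHitFilterSketch {

    static final String FULL_MATCH_ON_PARENT = "full-match-on-parent";
    static final String PARTIAL_MATCH_ON_PARENT = "partial-match-on-parent";

    static List<String> idsToKeep(List<String> allNestedIds,
                                  List<String> matchedQueries,
                                  Map<String, List<String>> innerHitIdsByName) {
        // A full match on the parent keeps every nested object.
        if (matchedQueries.contains(FULL_MATCH_ON_PARENT)) {
            return new ArrayList<>(allNestedIds);
        }
        Set<String> keep = new LinkedHashSet<>();
        for (String queryName : matchedQueries) {
            if (queryName.startsWith(PARTIAL_MATCH_ON_PARENT)) {
                // partial-match-on-parent-N pairs with partial-match-on-nested-N.
                String innerHitName = queryName.replace("parent", "nested");
                keep.addAll(innerHitIdsByName.getOrDefault(
                        innerHitName, Collections.<String>emptyList()));
            }
        }
        // Preserve the original order of the nested objects.
        List<String> result = new ArrayList<>(allNestedIds);
        result.retainAll(keep);
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> innerHits = new HashMap<>();
        innerHits.put("partial-match-on-nested-2", Arrays.asList("u1"));
        List<String> all = Arrays.asList("u1", "u2");
        System.out.println(idsToKeep(all, Arrays.asList("partial-match-on-parent-2"), innerHits)); // [u1]
        System.out.println(idsToKeep(all, Arrays.asList(FULL_MATCH_ON_PARENT), innerHits));        // [u1, u2]
    }
}
```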

The extensive integration tests have all passed with this solution.
