Some documents not appear in atlas-search when query by few letters-CodePudding

I have a collection. The document structure is,

{
  model: {
    name: 'string name'
  }
}

I have enabled atlas search, Also created a search index for model.name field. Search works fine, But the only issue is couldn't get results for very minimal query letters.

Example:

I have a document,

{
  model: {
     name: "space1duplicate"
  }
}

If I query space, I couldn't get the result.

{
  index: 'search_index',
  compound: {
    must: [
      {
        text: {
          query: 'space',
          path: 'model.name'
        }
      }
    ]
  }
}

But If I query space1duplica, It returns the result.

CodePudding user response：

During indexing, full text search engine tokenizes the input by splitting up text into searchable chunks. Check out the relevant section in the documentation.

By default Atlas Search does not split words by digits, but if you need that, try to define a custom analyzer with the regex tokenizer and use it for your field:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": [
        {
          "analyzer": "digitSplitter",
          "type": "string"
        }
      ]
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "digitSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[0-9] ",
        "type": "regexSplit"
      }
    }
  ]
}

Also note that you can use multiple analyzers for string fields, if needed.

CodePudding user response：

Atlas search uses Lucene to do the job. Documentation on mongodb site is mostly focused on mongo specific syntax to pass the query to Lucene and might be a bit confusing if you are not familiar with its query language.

First of all, there are number of tokenizers and analizers available, each serve specific purpose. You really need include index definition when you ask quetions about atlas search.

Default tokeniser uses word separators to build the index, then removes endings to store stems, again depending on language, English by default.

So in order to find "space1duplicate" by beginning of the word you can use "autocomplete" analizer with nGram tokens. The index should be created as following:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "tokenization": "nGram",
        "type": "autocomplete"
      }
    }
  },
  "storedSource": {
    "include": [
      "name"
    ]
  }
}

Once it's indexed (you may need to wait a bit you you have larger dataset), you can find the document with following search:

{
  index: 'search_index',
  compound: {
    must: [
      {
        autocomplete: {
          query: 'spa',
          path: 'name'
        }
      }
    ]
  }
}