Partial string match in mongodb array using mongoose

Time:11-10

I've got a MongoDB collection, which looks like this:

// sites
// note that these urls all have paths, this is important.
// The path can only be longer, e.g. amazon.com/Coffee-Mug
[
  {
    name: "MySite",
    urls: ['google.com/search', 'amazon.com/Coffee', 'amazon.com/Mug']
  },
  {
    name: "OtherSite",
    urls: ['google.com/search', 'microsoft.com/en-us']
  }
]

What I'm trying to do is the following:

class Service {
  /**
   * @param findUrl A full URL, like "https://www.google.com/search?q=stackoverflow"
   * or "https://www.amazon.com/Coffee-Program-Ceramic-Makes-Programmers/dp/B07D2XJLLG/"
   */
  public async lookup(findUrl: string) {
    const trimmed = trim(findUrl); // remove variables and https, but NOT the path!
    // return the "Site" in which the base url is matched with the full url
    // see description below
  }
}
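The `trim` helper is not shown in the question. A minimal sketch of what it might do, assuming it should also drop a leading "www." and any trailing slash (the name `trimUrl` and these normalization rules are assumptions, not part of the original code):

```typescript
// Hypothetical stand-in for the question's `trim` helper.
function trimUrl(fullUrl: string): string {
  const parsed = new URL(fullUrl); // throws on an invalid URL
  const host = parsed.hostname.replace(/^www\./, ""); // assumed: drop a leading "www."
  const path = parsed.pathname.replace(/\/+$/, ""); // assumed: drop trailing slashes
  return host + path; // protocol, query string and fragment are left out
}
```

With this sketch, "https://www.google.com/search?q=stackoverflow" becomes "google.com/search", which has the same shape as the values stored in `urls`.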

For example, with these cases:

Case 1:

  • url = 'https://www.amazon.com/Coffee-Program-Ceramic-Makes-Programmers/dp/B07D2XJLLG/'
  • returned site(s): [MySite]

Case 2:

  • url = 'https://www.google.com/search?q=stackoverflow'
  • returned site(s): [MySite, OtherSite]

Case 3 (same as case 1 but with another value):

  • url = 'https://www.microsoft.com/en-us/surface'
  • returned site(s): [OtherSite]

Case 4 (when not to match):

  • url = 'https://microsoft.com/nl-nl' OR
  • url = 'https://microsoft.com'
  • returned site(s): []
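The four cases amount to a prefix rule: a site matches when one of its stored urls is the start of the trimmed input. A minimal in-memory sketch of that rule, assuming the input has already been trimmed to the "domain/path" form (`SiteDoc` and `matchSites` are hypothetical names used only for illustration):

```typescript
interface SiteDoc {
  name: string;
  urls: string[];
}

// Same data as the collection shown in the question.
const sites: SiteDoc[] = [
  { name: "MySite", urls: ["google.com/search", "amazon.com/Coffee", "amazon.com/Mug"] },
  { name: "OtherSite", urls: ["google.com/search", "microsoft.com/en-us"] },
];

// A site matches when at least one stored url is a prefix of the trimmed input.
function matchSites(trimmed: string, all: SiteDoc[]): string[] {
  return all
    .filter((site) => site.urls.some((stored) => trimmed.startsWith(stored)))
    .map((site) => site.name);
}
```

This is only a reference for the expected behaviour; it does not query MongoDB.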

I've tried to do something like this:

Site.find({ urls: { $in: [trimmed] } })

The above kind of works, but it only does exact matches. I want a url stored in MongoDB to match whenever the url passed to the function starts with it. How does one do this?

I've received the suggestions "check if field is substring of a string" and "text search on MongoDB", but both are too inaccurate. I can enter just the base domain without a path and it will still find a site, which is definitely not supposed to happen.

CodePudding user response:

One option is to use $reduce and $filter to iterate over both the input urls and each site's urls, matching them with $regexMatch. Then, since the requested output is an array of site names rather than an array of documents, we can use $facet to handle the case of no matches at all.

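db.collection.aggregate([
  // Keep sites where at least one stored url matches one of the input urls.
  {$match: {$expr: {
    $gt: [{
      $size: {
        $reduce: {
          input: urls_array, // array of trimmed input urls, e.g. ["google.com/search"]
          initialValue: [],
          in: {$concatArrays: [
            "$$value",
            {$filter: {
              input: "$urls",
              as: "url",
              // $$this is the current input url, $$url the stored url used as the pattern
              cond: {$regexMatch: {input: "$$this", regex: "$$url"}}
            }}
          ]}
        }
      }
    }, 0]
  }}},
  // Collect the matching names; $facet still emits one document when nothing matched.
  {$facet: {res: [{$group: {_id: 0, res: {$addToSet: "$name"}}}]}},
  // Fall back to an empty array when the facet result is empty.
  {$replaceRoot: {newRoot: {$mergeObjects: [{res: []}, {$first: "$res"}]}}}
])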

See how it works on the playground example

CodePudding user response:

I can't think of a straightforward way. You can use aggregation operators, but the query will be slow because it cannot use an index. If you really want to, here is one approach:

  • $expr allows aggregation operators in the query part
  • $map iterates over the urls array and returns an array of boolean values
  • $replaceAll replaces / with . in the stored URL string
  • $regexMatch matches the input string against the string generated above, which is used as a regular expression; it returns true if the pattern matches, otherwise false
  • $in checks whether the resulting array contains true

Site.find({
  $expr: {
    $in: [
      true,
      {
        $map: {
          input: "$urls",
          in: {
            $regexMatch: {
              input: trimmed, // your input variable
              regex: {
                $replaceAll: {
                  input: "$$this",
                  find: "/",
                  replacement: "."
                }
              },
              options: "i"
            }
          }
        }
      }
    ]
  }
})
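One caveat with the query above: the stored url is used as an unanchored regular expression, and "." (including every "/" that was replaced) matches any character, so the pattern can also match somewhere in the middle of the input. If a literal prefix comparison is wanted instead, a possible variation is to compare with $indexOfCP, which returns the position of the first occurrence of a substring. The sketch below is not part of the original answer; `buildPrefixFilter` and the `stored` variable name are hypothetical, and like the query above it cannot use an index:

```typescript
// Builds a filter object for Site.find(): a document matches when some
// stored url is a literal prefix of `trimmed`.
function buildPrefixFilter(trimmed: string) {
  return {
    $expr: {
      $in: [
        true,
        {
          $map: {
            input: "$urls",
            as: "stored",
            // $indexOfCP returns 0 only when `trimmed` starts with the stored url.
            // $literal keeps an input beginning with "$" from being read as a field path.
            in: { $eq: [{ $indexOfCP: [{ $literal: trimmed }, "$$stored"] }, 0] },
          },
        },
      ],
    },
  };
}
```

It would be used as `Site.find(buildPrefixFilter(trimmed))`. An empty string stored in `urls` would match every input, so this assumes the stored values are never empty.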
