Home > database >  Remove Duplicate character from string in Mongodb
Remove Duplicate character from string in Mongodb

Time:11-12

I want to remove duplicate characters from strings in MongoDB. Example: Input string: xxxyzzxcdv Output string: xyzcdv

CodePudding user response:

Let's assume we have the following collection and the records inside of it:

db.messages.insertMany([
    {
        "message": "heelllo theeere"
    },
    {
        "message": "may thhhee forrrce be wiithh yyouuu"
    },
    {
        "message": "execute orrrrdder 66"
    }
])

Due to uncertainty, I am dropping solutions for both manipulating while querying and updating the records (permanently).

If you want to remove them while running your aggregation query:

In addition to @Takis's solution, using $function pipeline operator can be another option if your MongoDB version is 4.4 or higher.

Further readings on $function operator

// Query via Aggregation Framework
db.messages.aggregate([
    {
        $match: {
            message: {
                $ne: null
            }
        }
    },
    {
        $addFields: {
            distinctChars: {
                $function: {
                    body: function (message) {
                        return message
                            .split("")
                            .filter(function (item, pos, self) {
                                return self.indexOf(item) == pos;
                            })
                            .join("");
                    },
                    args: ["$message"],
                    lang: "js"
                }
            }
        }
    },
])

If you want to remove them via an update operation:

// Update each document using cursor
db.messages.find({ message: { $ne: null } })
    .forEach(doc => {
        var distinctChars = doc.message
            .split("")
            .filter(function (item, pos, self) {
                return self.indexOf(item) == pos;
            })
            .join("");

        db.messages.updateOne({ _id: doc._id }, [{ $set: { distinctChars: distinctChars } }]);
    });

A quick reminder: Above script just shows an easy way to update the records to reach the goal without focusing on other details. It can be an expensive operation depending on your real world collection's size and configurations, sharding for instance. Consider to improve it with your own way.

Result

For both way, the result should be like the following:

[
  {
    "_id": {
      "$oid": "618d95ccdedc26d80875b75a"
    },
    "message": "heelllo theeere",
    "distinctChars": "helo tr"
  },
  {
    "_id": {
      "$oid": "618d95ccdedc26d80875b75b"
    },
    "message": "may thhhee forrrce be wiithh yyouuu",
    "distinctChars": "may theforcbwiu"
  },
  {
    "_id": {
      "$oid": "618d95ccdedc26d80875b75c"
    },
    "message": "execute orrrrdder 66",
    "distinctChars": "excut ord6"
  }
]

CodePudding user response:

Query

  • reduce on range(count string)
  • keep 2 values {"previous": [], "string": ""} (initial value of reduce)
  • get the cur-char {"$substrCP": ["$mystring", "$$this", 1]} this is the current index on the string, and i take the next char
  • if it is in previous kep "string" as it is, else concat to add the new character
heelo

reduce on (0 1 2 3 4) `{"$range": [0, {"$strLenCP": "$mystring"}]}`
we start from  `{"previous": [], "string": ""}`

- get 1 character start from index 0  
 `{"$substrCP": ["$mystring", "$$this", 1]}}`  = "h"
- if this character is on previous don't add it  
 `{"$in": ["$$cur_char", "$$value.previous"]}`
- else add it on previous and on the string the 2 concats in code

Repeat for `index($$this)`= 1
- get 1 character start from index 1  
 `{"$substrCP": ["$mystring", "$$this", 1]}}` = "e"
.....

Test code here

aggregate(
[{"$set": 
   {"mystring": 
     {"$getField": 
       {"field": "string",
        "input": 
         {"$reduce": 
           {"input": {"$range": [0, {"$strLenCP": "$mystring"}]},
            "initialValue": {"previous": [], "string": ""},
            "in": 
             {"$let": 
               {"vars": {"cur_char": {"$substrCP": ["$mystring", "$$this", 1]}},
                "in": 
                {"$cond": 
                  [{"$in": ["$$cur_char", "$$value.previous"]},
                   "$$value",
                   {"previous": 
                     {"$concatArrays": ["$$value.previous", ["$$cur_char"]]},
                   "string": 
                   {"$concat": ["$$value.string", "$$cur_char"]}}]}}}}}}}}}])

Edit

The second query removed only the duplicates we choose.

Query

aggregate(
[{"$set": 
   {"mystring": 
     {"$getField": 
       {"field": "string",
        "input": 
        {"$reduce": 
          {"input": {"$range": [0, {"$strLenCP": "$mystring"}]},
           "initialValue": {"previous": [], "string": ""},
           "in": 
           {"$let": 
             {"vars": {"cur_char": {"$substrCP": ["$mystring", "$$this", 1]}},
              "in": 
              {"$cond": 
                [{"$and": 
                   [{"$in": ["$$cur_char", "$$value.previous"]},
                    {"$in": ["$$cur_char", ["x"]]}]},
                 "$$value",
                 {"previous": 
                   {"$concatArrays": ["$$value.previous", ["$$cur_char"]]},
                  "string": 
                  {"$concat": ["$$value.string", "$$cur_char"]}}]}}}}}}}}}])
  • Related