Remove Duplicate character from string in Mongodb-CodePudding

I want to remove duplicate characters from strings in MongoDB. Example: Input string: xxxyzzxcdv Output string: xyzcdv

CodePudding user response：

Let's assume we have the following collection and the records inside of it:

db.messages.insertMany([
    {
        "message": "heelllo theeere"
    },
    {
        "message": "may thhhee forrrce be wiithh yyouuu"
    },
    {
        "message": "execute orrrrdder 66"
    }
])

Due to uncertainty, I am dropping solutions for both manipulating while querying and updating the records (permanently).

If you want to remove them while running your aggregation query:

In addition to @Takis's solution, using $function pipeline operator can be another option if your MongoDB version is 4.4 or higher.

If you want to remove them via an update operation:

// Update each document using cursor
db.messages.find({ message: { $ne: null } })
    .forEach(doc => {
        var distinctChars = doc.message
            .split("")
            .filter(function (item, pos, self) {
                return self.indexOf(item) == pos;
            })
            .join("");

        db.messages.updateOne({ _id: doc._id }, [{ $set: { distinctChars: distinctChars } }]);
    });

A quick reminder: Above script just shows an easy way to update the records to reach the goal without focusing on other details. It can be an expensive operation depending on your real world collection's size and configurations, sharding for instance. Consider to improve it with your own way.

Result

For both way, the result should be like the following:

[
  {
    "_id": {
      "$oid": "618d95ccdedc26d80875b75a"
    },
    "message": "heelllo theeere",
    "distinctChars": "helo tr"
  },
  {
    "_id": {
      "$oid": "618d95ccdedc26d80875b75b"
    },
    "message": "may thhhee forrrce be wiithh yyouuu",
    "distinctChars": "may theforcbwiu"
  },
  {
    "_id": {
      "$oid": "618d95ccdedc26d80875b75c"
    },
    "message": "execute orrrrdder 66",
    "distinctChars": "excut ord6"
  }
]

CodePudding user response：

Query

reduce on range(count string)
keep 2 values {"previous": [], "string": ""} (initial value of reduce)
get the cur-char {"$substrCP": ["$mystring", "$$this", 1]} this is the current index on the string, and i take the next char
if it is in previous kep "string" as it is, else concat to add the new character

heelo

reduce on (0 1 2 3 4) `{"$range": [0, {"$strLenCP": "$mystring"}]}`
we start from  `{"previous": [], "string": ""}`

- get 1 character start from index 0  
 `{"$substrCP": ["$mystring", "$$this", 1]}}`  = "h"
- if this character is on previous don't add it  
 `{"$in": ["$$cur_char", "$$value.previous"]}`
- else add it on previous and on the string the 2 concats in code

Repeat for `index($$this)`= 1
- get 1 character start from index 1  
 `{"$substrCP": ["$mystring", "$$this", 1]}}` = "e"
.....

Test code here

aggregate(
[{"$set": 
   {"mystring": 
     {"$getField": 
       {"field": "string",
        "input": 
         {"$reduce": 
           {"input": {"$range": [0, {"$strLenCP": "$mystring"}]},
            "initialValue": {"previous": [], "string": ""},
            "in": 
             {"$let": 
               {"vars": {"cur_char": {"$substrCP": ["$mystring", "$$this", 1]}},
                "in": 
                {"$cond": 
                  [{"$in": ["$$cur_char", "$$value.previous"]},
                   "$$value",
                   {"previous": 
                     {"$concatArrays": ["$$value.previous", ["$$cur_char"]]},
                   "string": 
                   {"$concat": ["$$value.string", "$$cur_char"]}}]}}}}}}}}}])

Edit

The second query removed only the duplicates we choose.

Query

aggregate(
[{"$set": 
   {"mystring": 
     {"$getField": 
       {"field": "string",
        "input": 
        {"$reduce": 
          {"input": {"$range": [0, {"$strLenCP": "$mystring"}]},
           "initialValue": {"previous": [], "string": ""},
           "in": 
           {"$let": 
             {"vars": {"cur_char": {"$substrCP": ["$mystring", "$$this", 1]}},
              "in": 
              {"$cond": 
                [{"$and": 
                   [{"$in": ["$$cur_char", "$$value.previous"]},
                    {"$in": ["$$cur_char", ["x"]]}]},
                 "$$value",
                 {"previous": 
                   {"$concatArrays": ["$$value.previous", ["$$cur_char"]]},
                  "string": 
                  {"$concat": ["$$value.string", "$$cur_char"]}}]}}}}}}}}}])