I want to remove duplicate characters from strings in MongoDB. Example: Input string: xxxyzzxcdv Output string: xyzcdv
CodePudding user response:
Let's assume we have the following collection and the records inside of it:
db.messages.insertMany([
{
"message": "heelllo theeere"
},
{
"message": "may thhhee forrrce be wiithh yyouuu"
},
{
"message": "execute orrrrdder 66"
}
])
Due to uncertainty, I am dropping solutions for both manipulating while querying and updating the records (permanently).
If you want to remove them while running your aggregation query:
In addition to @Takis's solution, using $function
pipeline operator can be another option if your MongoDB version is 4.4 or higher.
Further readings on $function operator
// Query via Aggregation Framework
db.messages.aggregate([
{
$match: {
message: {
$ne: null
}
}
},
{
$addFields: {
distinctChars: {
$function: {
body: function (message) {
return message
.split("")
.filter(function (item, pos, self) {
return self.indexOf(item) == pos;
})
.join("");
},
args: ["$message"],
lang: "js"
}
}
}
},
])
If you want to remove them via an update operation:
// Update each document using cursor
db.messages.find({ message: { $ne: null } })
.forEach(doc => {
var distinctChars = doc.message
.split("")
.filter(function (item, pos, self) {
return self.indexOf(item) == pos;
})
.join("");
db.messages.updateOne({ _id: doc._id }, [{ $set: { distinctChars: distinctChars } }]);
});
A quick reminder: Above script just shows an easy way to update the records to reach the goal without focusing on other details. It can be an expensive operation depending on your real world collection's size and configurations, sharding for instance. Consider to improve it with your own way.
Result
For both way, the result should be like the following:
[
{
"_id": {
"$oid": "618d95ccdedc26d80875b75a"
},
"message": "heelllo theeere",
"distinctChars": "helo tr"
},
{
"_id": {
"$oid": "618d95ccdedc26d80875b75b"
},
"message": "may thhhee forrrce be wiithh yyouuu",
"distinctChars": "may theforcbwiu"
},
{
"_id": {
"$oid": "618d95ccdedc26d80875b75c"
},
"message": "execute orrrrdder 66",
"distinctChars": "excut ord6"
}
]
CodePudding user response:
Query
- reduce on range(count string)
- keep 2 values
{"previous": [], "string": ""}
(initial value of reduce) - get the cur-char
{"$substrCP": ["$mystring", "$$this", 1]}
this is the current index on the string, and i take the next char - if it is in previous kep
"string"
as it is, elseconcat
to add the new character
heelo
reduce on (0 1 2 3 4) `{"$range": [0, {"$strLenCP": "$mystring"}]}`
we start from `{"previous": [], "string": ""}`
- get 1 character start from index 0
`{"$substrCP": ["$mystring", "$$this", 1]}}` = "h"
- if this character is on previous don't add it
`{"$in": ["$$cur_char", "$$value.previous"]}`
- else add it on previous and on the string the 2 concats in code
Repeat for `index($$this)`= 1
- get 1 character start from index 1
`{"$substrCP": ["$mystring", "$$this", 1]}}` = "e"
.....
aggregate(
[{"$set":
{"mystring":
{"$getField":
{"field": "string",
"input":
{"$reduce":
{"input": {"$range": [0, {"$strLenCP": "$mystring"}]},
"initialValue": {"previous": [], "string": ""},
"in":
{"$let":
{"vars": {"cur_char": {"$substrCP": ["$mystring", "$$this", 1]}},
"in":
{"$cond":
[{"$in": ["$$cur_char", "$$value.previous"]},
"$$value",
{"previous":
{"$concatArrays": ["$$value.previous", ["$$cur_char"]]},
"string":
{"$concat": ["$$value.string", "$$cur_char"]}}]}}}}}}}}}])
Edit
The second query removed only the duplicates we choose.
Query
aggregate(
[{"$set":
{"mystring":
{"$getField":
{"field": "string",
"input":
{"$reduce":
{"input": {"$range": [0, {"$strLenCP": "$mystring"}]},
"initialValue": {"previous": [], "string": ""},
"in":
{"$let":
{"vars": {"cur_char": {"$substrCP": ["$mystring", "$$this", 1]}},
"in":
{"$cond":
[{"$and":
[{"$in": ["$$cur_char", "$$value.previous"]},
{"$in": ["$$cur_char", ["x"]]}]},
"$$value",
{"previous":
{"$concatArrays": ["$$value.previous", ["$$cur_char"]]},
"string":
{"$concat": ["$$value.string", "$$cur_char"]}}]}}}}}}}}}])