I am writing a query to extract customer descriptions from MongoDB. Unfortunately, the descriptions are stored as HTML. Is there a way to either replace every HTML tag with " " or remove the tags entirely?
Below is a sample document:
{
    "_id" : ObjectId("61f72aefdc85500a8baa6bb8"),
    "CustomerPin" : "22010871",
    "CustomerName" : "TestLastName, TestFirstName",
    "Age" : 39.0,
    "Gender" : "Male",
    "Description" : "<p><span>This will be a test description</span><br/></p>"
}
The output should have the "p", "span", and "br" tags removed. Is there a function in MongoDB that removes them all at once, without repeating $project?
This is the expected output:
{
    "_id" : ObjectId("61f72aefdc85500a8baa6bb8"),
    "CustomerPin" : "22010871",
    "CustomerName" : "TestLastName, TestFirstName",
    "Age" : 39.0,
    "Gender" : "Male",
    "Description" : "This will be a test description"
}
Thanks!
CodePudding user response:
One way to do it is to strip all tags with a regex in a pre hook on the save method:
Description.replace(/(<([^>]+)>)/gi, "");
See hooks here
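To make the regex approach concrete, here is a minimal sketch in plain JavaScript. The helper name stripHtmlTags and its replacement parameter are made up for illustration; how you call it from a pre-save hook depends on the ODM you use, which the question does not say.

```javascript
// Hypothetical helper: strip anything that looks like an HTML tag.
function stripHtmlTags(html, replacement = "") {
  return html
    .replace(/<[^>]+>/g, replacement) // drop (or replace) every <...> tag
    .replace(/\s+/g, " ")             // collapse runs of whitespace
    .trim();
}

const description = "<p><span>This will be a test description</span><br/></p>";
console.log(stripHtmlTags(description));      // This will be a test description
console.log(stripHtmlTags(description, " ")); // This will be a test description
```

Note that a regex like this is only a rough filter: it does not decode entities such as &amp; and can be confused by a ">" inside an attribute value, so for untrusted or complex HTML a real parser is the safer choice.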
CodePudding user response:
If you use MongoDB 4.2 or later, you can use $regexFind with a regex that extracts the text content from the HTML. Below is an aggregation pipeline that does this, including the regex.
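db.getCollection("name_of_your_collection").aggregate([
    {
        // Find the first run of text that is not inside a tag.
        // No "g" flag: $regexFind returns only the first match.
        $set: {
            contentRegex: {
                $regexFind: { input: "$Description", regex: /([^<>]+)(?!([^<]+)?>)/i }
            }
        }
    },
    {
        // Fall back to the original Description if nothing matched.
        $set: {
            content: { $ifNull: ["$contentRegex.match", "$Description"] }
        }
    },
    {
        // Drop the temporary field.
        $unset: [ "contentRegex" ]
    }
])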
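If you want to check the pattern before running it against the collection, you can try it locally. This is only a sketch: it uses JavaScript's regex engine as a stand-in for the one on the server, and the helper name firstTextRun is made up for illustration.

```javascript
// Same pattern as in the pipeline: a run of non-tag characters
// that is not followed by the rest of a tag.
const textRun = /([^<>]+)(?!([^<]+)?>)/i;

// Hypothetical helper that mimics $regexFind plus the $ifNull fallback.
function firstTextRun(html) {
  const m = html.match(textRun);
  return m ? m[0] : html; // no match: keep the original value
}

console.log(firstTextRun("<p><span>This will be a test description</span><br/></p>"));
// This will be a test description
console.log(firstTextRun("<p>one</p><p>two</p>"));
// one
```

The second call shows the limitation: only the first text run is returned, so a description with several text fragments would be cut short. In that case $regexFindAll, which returns every match, is probably the better starting point.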