Remove HTML Tags MondoDB-CodePudding

I am creating a query to extract description of customers in mongodb. Unfortunately, the description is in HTML Format. Is there a way to replace all HTML tags and make it as " ". Either replace it with " " or remove HTML Tags.

Below is a sample document

{ 
        "_id" : ObjectId("61f72aefdc85500a8baa6bb8")
        "CustomerPin" : "22010871", 
        "CustomerName" : "TestLastName, TestFirstName", 
        "Age" : 39.0, 
        "Gender" : "Male", 
        "Description" : "<p><span>This will be a test description</span><br/></p>", 
}

The output should remove "p", "span", and "br". Is there a function in mongodb to remove them all at once without repeating $project

This is the expected output:

{ 
        "_id" : ObjectId("61f72aefdc85500a8baa6bb8")
        "CustomerPin" : "22010871", 
        "CustomerName" : "TestLastName, TestFirstName", 
        "Age" : 39.0, 
        "Gender" : "Male", 
        "Description" : "This will be a test description", 
}

Thanks!

CodePudding user response：

One way to do it is by removing all tags by regex in pre hook of save method

Description.replace(/(<([^>] )>)/gi, "");

See hooks here

CodePudding user response：

If you use Mongo 4.2 then you have to find the exact regex which will extract content from HTML. Below you can find an aggregate pipeline and the regex also.

db.getCollection("name_of_your_collection").aggregate({
    $set: {
        contentRegex: {
            $regexFind: { input: "$Description", regex: /([^<>] )(?!([^<] )?>)/gi }
        }
    }
},
    {
        $set: {
            content: { $ifNull: ["$contentRegex.match", "$Description"] }
        }
    },
    {
        $unset: [ "contentRegex" ]
    }
)