Search with regex but replace only a portion of the string with sed

I'm trying to match every URL of the form cwe.mitre.org.*.html (regex) and remove its .html extension, without changing any other type of URL.

Example:

https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html

Expectation:

https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

Is there a way to do this in sed or another tool?

I've tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that won't work. Is there a way for the <REPLACEMENT> to be a regular expression? The sed manual doesn't seem to suggest that it can.

EDIT: I was wrong about the sed manual. It does cover this; see the "5.7 Back-references and Subexpressions" section of https://www.gnu.org/software/sed/manual/sed.html.

CodePudding user response：

$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

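The \(...\) pair captures the part of the match you want to keep, and \1 in the replacement puts it back, so only the trailing .html is dropped. Search for "sed capture groups" for more.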
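If a line can contain more than one URL, the greedy .* above would run from the first cwe.mitre.org to the last .html on the line. A minimal sketch of one way around that, assuming URLs never contain whitespace (the sample input here is made up for illustration):

```shell
# Sample input (made-up test data): two cwe URLs on one line, plus an unrelated URL.
input='see https://cwe.mitre.org/data/definitions/377.html and https://cwe.mitre.org/data/definitions/79.html
http://google.com/404.html'

# -E switches to extended regex, so the group is written ( ... ) without backslashes.
# [^[:space:]]* keeps each match inside a single URL; the g flag handles every URL on the line.
result=$(printf '%s\n' "$input" | sed -E 's/(cwe\.mitre\.org[^[:space:]]*)\.html/\1/g')
printf '%s\n' "$result"
```

To edit a file in place, as in the question, the same expression should work with sed -Ei '...' file.txt on GNU sed.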

CodePudding user response：

A GNU AWK solution. Let the content of file.txt be

https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html

then

awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt

gives output

https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

Explanation: If the line matches the given regex, replace the .html at the end of the line ($) with an empty string. Every line, changed or not, is then printed.

(tested in GNU Awk 5.0.1)
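Note that this only strips .html when it is the last thing on the line. If the URL can appear in the middle of a line, one possible variation is to check each whitespace-separated field instead. This is a sketch under that assumption, with made-up sample input:

```shell
# Sample input (made-up test data): the cwe URL sits mid-line this time.
input='see https://cwe.mitre.org/data/definitions/377.html for details
http://google.com/404.html'

# Walk the whitespace-separated fields and strip .html only from cwe.mitre.org URLs.
# Caveat: assigning to a field makes awk rebuild the line with single spaces between fields.
result=$(printf '%s\n' "$input" | awk '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /cwe\.mitre\.org.*\.html$/)
            sub(/\.html$/, "", $i)
    print
}')
printf '%s\n' "$result"
```

Lines with no matching field are printed untouched, so their original spacing is preserved.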