I'm trying to find every URL that matches the regex cwe.mitre.org.*.html and remove its .html extension, without changing any other type of URL.
Example:
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
Expectation:
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Is there a way to do this in sed or another tool?
I've tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that won't work. Is there a way for the <REPLACEMENT> to be a regular expression? The sed manual doesn't seem to suggest that.
EDIT: I was wrong about the sed manual. It does cover this; see the "5.7 Back-references and Subexpressions" section of https://www.gnu.org/software/sed/manual/sed.html.
CodePudding user response:
$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
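For background, search for "sed capture groups": \( ... \) captures the part of the match to keep, and \1 in the replacement puts it back.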
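One caveat with the command above: .* is greedy, so if a line holds a cwe.mitre.org URL followed by another URL ending in .html, the match would run to the last .html on the line. A minimal sketch of a stricter variant follows, assuming URLs contain no whitespace; the function name strip_cwe_html is hypothetical and only there for illustration.

```shell
# Hypothetical helper: strip .html from cwe.mitre.org URLs only.
# Reads stdin, writes stdout.
strip_cwe_html() {
  # [^[:space:]]* keeps each match inside a single URL (assumes URLs
  # contain no whitespace); the g flag handles several URLs per line.
  sed -E 's#(cwe\.mitre\.org[^[:space:]]*)\.html#\1#g'
}

printf '%s\n' \
  'https://cwe.mitre.org/data/definitions/377.html http://google.com/404.html' \
  | strip_cwe_html
# prints: https://cwe.mitre.org/data/definitions/377 http://google.com/404.html
```

To edit a file in place, the same expression could be passed to sed -Ei as in the question.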
CodePudding user response:
GNU AWK solution: let the content of file.txt be
https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html
then
awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt
gives output
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html
Explanation: if the line matches the given regex, replace .html followed by end of line ($) with an empty string. Every line, changed or not, is then printed.
(tested in GNU Awk 5.0.1)
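The command above only strips .html when it is the last thing on the line. If the URL can sit in the middle of a line, one option is to check each whitespace-separated field instead. This is a rough sketch under that assumption, not part of the tested answer; the function name strip_cwe_html_fields is hypothetical.

```shell
# Hypothetical helper: strip .html from any whitespace-separated field
# that is a cwe.mitre.org URL. Reads stdin, writes stdout.
strip_cwe_html_fields() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /cwe\.mitre\.org.*\.html$/)
        sub(/\.html$/, "", $i)   # changing a field rebuilds the line with single spaces
    print
  }'
}

printf '%s\n' \
  'see https://cwe.mitre.org/data/definitions/377.html and http://google.com/404.html' \
  | strip_cwe_html_fields
# prints: see https://cwe.mitre.org/data/definitions/377 and http://google.com/404.html
```

Note that awk rejoins a modified line with single spaces, so runs of spaces or tabs on changed lines are collapsed; lines without a match are printed untouched.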