I have an HTML page that I downloaded on Unix with wget. I need to remove all of the text after the words "Check list", and then grep the remaining text for some data. I can't think of a way to remove everything after a keyword. If I do
s/Check list.*//g
it only removes the rest of that line; I want everything below it to be gone as well. How do I do this?
CodePudding user response:
Depending on which sed version you have, maybe
sed -z 's/Check list.*//'
The /g flag is useless, as you only want to replace everything once.
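To see the difference on a small input, here is a minimal sketch. The sample text is made up to stand in for the downloaded page, and the second command assumes a sed that supports -z (GNU sed does).

```shell
# Hypothetical sample standing in for the downloaded page.
sample=$(printf 'keep 1\nkeep 2 Check list gone\nbelow 1\nbelow 2\n')

# Default mode: sed works line by line, so .* stops at the end of the
# matching line and the lines below survive.
linewise=$(printf '%s\n' "$sample" | sed 's/Check list.*//')

# With -z the whole input is one "line", so .* runs to the end of the input.
whole=$(printf '%s\n' "$sample" | sed -z 's/Check list.*//')

printf '%s\n---\n%s\n' "$linewise" "$whole"
```

The first result still contains the two "below" lines; the second ends right before the keyword.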
If your sed does not have the -z option (which says to use the ASCII null character as the line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:
perl -0777 -pe 's/Check list.*//s'
Unlike sed -z, this explicitly says to slurp the entire file into memory (the argument to -0 is the octal character code of a terminator character, but 777 is not a valid terminator character at all, so it always reads the entire file as a single "line"), so this works even if there are spurious nulls in your file. The final s flag says to include newline in what the . metacharacter matches (otherwise s/.*// would still only substitute on the matching physical line).
I assume you are aware that removing everything after the keyword will violate the integrity of the HTML file; every start tag near the beginning of the document needs a matching closing tag (so if the file starts with <html><body> you should keep </body></html> just before the end of the file, for example).
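If you do need the result to stay roughly well-formed, one possible approach is to append the closing tags again after truncating. This is only a sketch: it assumes the page starts with <html><body> and nothing else is left open before the keyword, the sample page is made up, and it again relies on a sed with -z.

```shell
# Hypothetical page; the real one would come from wget.
page=$(printf '<html><body>\n<p>data</p>\nCheck list\n<p>junk</p>\n</body></html>\n')

# Truncate at the keyword, then re-append the closing tags
# (assumed to be the only ones still open at that point).
trimmed=$(
  printf '%s\n' "$page" | sed -z 's/Check list.*//'
  printf '%s\n' '</body></html>'
)

printf '%s\n' "$trimmed"
```

Tags opened on the same line as the keyword would still be left unclosed, so for plain grepping it may be simpler to skip this step.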
CodePudding user response:
With GNU awk you could set the RS variable so that the whole file is read as a single record, set the field separator to a regex with word boundaries around the keyword, and then print the very first field.
awk -v RS="^$" -v FS='\\<Check list\\>' '{print $1}' Input_file
CodePudding user response:
The other solutions you have so far require tools that POSIX does not mandate (GNU sed, GNU awk, or perl), so YMMV with their availability, and they will read the whole file into memory at once.
These will work in any awk in any shell on every Unix box and only read one line at a time into memory:
awk -F 'Check list' '{print $1} NF>1{exit}' file
or:
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file
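As a quick check, both commands can be run against a small sample. The sample text below is made up; the awk programs are the two shown above, reading from a pipe instead of a file.

```shell
# Hypothetical sample with the keyword in the middle of the second line.
sample=$(printf 'alpha\nbeta Check list gamma\ndelta\n')

# Variant 1: split each line on the keyword, print the first field,
# and stop as soon as a line actually contained the keyword (NF > 1).
by_field=$(printf '%s\n' "$sample" | awk -F 'Check list' '{print $1} NF>1{exit}')

# Variant 2: strip the keyword and the rest of the line, print, then stop.
by_sub=$(printf '%s\n' "$sample" | awk 'sub(/Check list.*/,""){f=1} {print} f{exit}')

printf '%s\n---\n%s\n' "$by_field" "$by_sub"
```

Both variants keep "alpha" and the part of the second line before the keyword, and never read "delta".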
With GNU awk for multi-char RS you could do:
awk -v RS='Check list' '{print; exit}' file
but that would still read all of the text before Check list into memory at once.
CodePudding user response:
You might use q to instruct GNU sed to quit, thus ending processing. Consider the following simple example. Let the content of file.txt be
123
456
789
and say you want to jettison everything from 5 onward, then you could do
sed '/5/{s/5.*//;q}' file.txt
which gives output
123
4
Explanation: for the line containing 5, substitute 5 and everything after it on that line with the empty string (i.e. delete it), then q (quit). Observe that lowercase q is used so that the altered line is printed before quitting.
(tested in GNU sed 4.7)
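Applied to the keyword from the question, the same idea might look like the sketch below. The sample input is invented, and the extra ; before the closing } is an addition here: GNU sed does not need it, but as far as I know some other sed implementations do.

```shell
# Hypothetical sample standing in for the downloaded page.
sample=$(printf 'intro\nitems Check list more\nfooter 1\nfooter 2\n')

# On the first line containing the keyword: delete from the keyword to the
# end of that line, then quit (q prints the altered line before exiting).
result=$(printf '%s\n' "$sample" | sed '/Check list/{s/Check list.*//;q;}')

printf '%s\n' "$result"
```

The result can then be piped into grep as usual, since nothing after the keyword is ever printed.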