How to truncate the rest of the text in a file after finding a specific text pattern, in Unix?

I have an HTML page which I downloaded on Unix using the wget command. I need to remove all of the text after the words "Check list", and then grep some data from what remains. I can't think of a way to remove the text after a keyword. If I do

s/Check list.*//g

It just removes the rest of that line; I want everything below it to be gone as well. How do I do this?

CodePudding user response：

Depending on which sed version you have (the -z option is a GNU extension), maybe

sed -z 's/Check list.*//'

The /g flag is useless as you only want to replace everything once.
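As a quick sanity check of the command above, here is a small sketch on a made-up file. The file names sample.txt and truncated.txt and their contents are only for illustration, and it assumes GNU sed is installed.

```shell
# Hypothetical sample file; the contents are made up for illustration.
printf 'keep this\nalso keep Check list drop this\ndrop this too\n' > sample.txt

# -z makes sed split input on NUL instead of newline, so a file without
# NUL bytes is handled as one record and .* can run across newlines.
sed -z 's/Check list.*//' sample.txt > truncated.txt

# Show what is left: the first line plus the text before the keyword.
cat truncated.txt
```

Note that the result ends right before the keyword, so it has no trailing newline.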

If your sed does not have the -z option (which says to use the ASCII null character as line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:

perl -0777 -pe 's/Check list.*//s'

Unlike sed -z, this explicitly tells Perl to slurp the entire file into memory. The argument to -0 is the octal character code of the record terminator, but 777 is not a valid character code at all, so Perl always reads the entire file as a single "line". This means it works even if there are spurious nulls in your file. The final s flag says to include newline in what . matches (otherwise s/.*// would still only substitute on the matching physical line).

I assume you are aware that removing everything after the keyword will violate the integrity of the HTML file; every start tag near the beginning of the document needs a matching closing tag (so if it starts with <html><body> you should keep </body></html> just before the end of the file, for example).
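If the truncated page does need to stay well-formed, one option is to put the closing tags back yourself after cutting. The following is only a sketch under some assumptions: the page is a made-up example, the only elements still open at the cut point are <html> and <body>, and the keyword sits on a line of its own. It also swaps in a slightly different technique: sed's range delete ('/Check list/,$d') drops the whole line containing the keyword, not just the text after it, so the half-open <h2> on that line does not survive.

```shell
# Hypothetical page; a real one would come from wget.
printf '<html><body>\n<p>wanted</p>\n<h2>Check list</h2>\n<p>unwanted</p>\n</body></html>\n' > page.html

{
  # Delete from the first line containing the keyword through the last line.
  sed '/Check list/,$d' page.html
  # Re-append the closing tags that the cut removed (assumed to be the
  # only ones needed for this sample page).
  printf '</body></html>\n'
} > page_cut.html

cat page_cut.html
```

If you only want to grep the remaining text, this extra step is probably unnecessary.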

CodePudding user response：

With GNU awk you could set the RS variable so that the whole file is read as a single record, set the field separator to a regex matching the keyword between word boundaries, and then print the very first field.

awk -v RS='^$' -v FS='\\<Check list\\>' '{print $1}' Input_file

CodePudding user response：

The other solutions you have so far require tools that POSIX does not mandate (GNU sed, GNU awk, or perl), so YMMV with their availability, and they will read the whole file into memory at once.

These will work in any awk in any shell on every Unix box and only read 1 line at a time into memory:

awk -F 'Check list' '{print $1} NF>1{exit}' file

or:

awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file

With GNU awk for multi-char RS you could do:

awk -v RS='Check list' '{print; exit}' file

but that would still read all of the text before Check list into memory at once.
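Since the goal in the question is to grep the remaining text, the second portable command above can simply be piped into grep. A minimal sketch, where the file names awk_sample.txt and prices.txt, the sample contents, and the grep pattern are all made up for illustration:

```shell
# Hypothetical sample file; the contents are made up for illustration.
printf 'price: 10\nprice: 20 Check list price: 30\nprice: 40\n' > awk_sample.txt

# Cut at the keyword (awk exits right after printing the shortened line),
# then grep only what is left.
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' awk_sample.txt |
  grep 'price' > prices.txt

cat prices.txt
```

Only the first two prices are found; "price: 30" and "price: 40" come after the keyword and never reach grep.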

CodePudding user response：

You might use q to instruct GNU sed to quit, thus ending processing. Consider the following simple example. Let the content of file.txt be

123
456
789

and say you want to jettison everything from 5 onwards, then you could do

sed '/5/{s/5.*//;q}' file.txt

which gives output

123
4

Explanation: for the line containing 5, substitute 5 and everything after it with the empty string (i.e. delete it), then quit with q. Observe that lowercase q is used so that the altered line is printed before quitting.

(tested in GNU sed 4.7)
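Applied to the keyword from the question, the same idea might look like the sketch below. The file names and sample contents are made up for illustration. The second command spells the script out with separate -e options, which should also be accepted by sed implementations that do not allow } directly after q on the same line.

```shell
# Hypothetical sample file; the contents are made up for illustration.
printf 'one\ntwo Check list three\nfour\n' > q_sample.txt

# On the first line containing the keyword: delete from the keyword to the
# end of that line, then quit. q prints the edited line before exiting,
# and the lines after it are never processed.
sed '/Check list/{s/Check list.*//;q}' q_sample.txt > q_out.txt

# Same script split across -e options for non-GNU sed.
sed -e '/Check list/{' -e 's/Check list.*//' -e 'q' -e '}' q_sample.txt > q_out_portable.txt

cat q_out.txt
```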