How to print titles of a website with sed

Time:11-17

I am currently writing a sed script that has to print the 30 titles of a website in a particular output format. I get the following error: "sed : file news.sed line 1: unknown options to 's'". Here is my code:

curl -sL news.ycombinator.com |
sed -nE '/\n/!s/><a[^>]*>[^<]*</\n&\n/g;/^/P;D' |
sed -E 's/><a href="([^"]*)" >([^<]*)</**\2**\n\1/'

Do you know how I can fix it? By the way, I can only use sed to solve this, not an HTML parser.

CodePudding user response:

This might work for you (GNU sed):

cat <<\! > news.sed
/\n/!s/class="title"><a[^>]*>[^<]*</\n&\n/g
/^class="title"/{
h
x
s/^class="title"><a href="([^"]*)"[^>]*>([^<]*)<.*/**\2**\n\1/p
x
}
D
!
curl -sL news.ycombinator.com | sed -Enf news.sed

This combines the 2 sed invocations into one sed script and applies it using the -f option.

N.B. This is GNU sed specific. It also uses a little-known idiom: a global substitution wraps each match in newlines inside the pattern space, and the D command then deletes up to and including the first newline. While text remains in the pattern space, D restarts the cycle without reading new input, so the pattern space is chomped one newline-delimited piece at a time, and whenever the start of the pattern space matches the second regexp the bracketed commands are applied. Those commands make a copy of the pattern space in the hold space, swap to the hold space, format the start of it to deliver 2 formatted lines, and swap back to the pattern space; D then chomps up to the next newline and the process repeats.
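The idiom described above can be tried in isolation. The following is only a minimal sketch, assuming GNU sed; the one-line input and the <b> tags are made up for illustration and are not the real page markup.

```shell
# Hypothetical input: one line holding two <b>...</b> elements (not the real page).
input='x<b>one</b>y<b>two</b>z'

out=$(printf '%s\n' "$input" | sed -nE '
# Pattern space has no newline yet: wrap every match in newlines.
/\n/!s/<b>[^<]*</\n&\n/g
# A wrapped match is now at the start of the pattern space.
/^<b>/{
# Copy to the hold space and swap, so the substitution edits a copy.
h
x
# Print only the captured text.
s/^<b>([^<]*)<.*/\1/p
# Swap back to the untouched pattern space.
x
}
# Delete up to the first newline; restart without reading input if text remains.
D
')

printf '%s\n' "$out"
# prints:
# one
# two
```

The h/x pair matters because the final .* in the substitution also consumes the rest of the pattern space, including the pieces that have not been processed yet.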

This is a very rough-and-ready solution and may not cope with all the HTML that the curl command can return.

CodePudding user response:

I have the following error " sed : file news.sed line 1: unknown options to 's'.

You have a carriage return character at the end of the third script line (at least), which, as it immediately follows the s/…/…/ command, is interpreted as an option to it. You can eliminate the CRs in the script file, e.g. with sed -i 's/\r//' news.sed.
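To see the effect in isolation, here is a small sketch, assuming GNU sed. The file name demo.sed and the one-line script in it are hypothetical stand-ins for news.sed.

```shell
# Work in a scratch directory; demo.sed is a hypothetical stand-in for news.sed.
tmp=$(mktemp -d)

# Write a one-line sed script with a CRLF line ending.
printf 's/a/b/\r\n' > "$tmp/demo.sed"

# The CR directly follows s/a/b/, so sed treats it as an option and fails
# with an error like the one in the question.
err=$(echo a | sed -f "$tmp/demo.sed" 2>&1) || true
printf '%s\n' "$err"

# Strip the carriage returns in place, then run the script again.
sed -i 's/\r//' "$tmp/demo.sed"
fixed=$(echo a | sed -f "$tmp/demo.sed")
printf '%s\n' "$fixed"    # prints: b

rm -r "$tmp"
```

Running file or od -c on the script is another way to check whether CRLF line endings are present before editing it.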
