Bash script that downloads an RSS feed and saves each entry as a separate html file


I'm trying to create a bash script that downloads an RSS feed and saves each entry as a separate html file. Here's what I've been able to create so far:

curl -L https://news.ycombinator.com//rss > hacke.txt

grep -oP '(?<=<description>).*?(?=</description>)' hacke.txt | sed 's/<description>/\n<description>/g' | grep '<description>' | sed 's/<description>//g' | sed 's/<\/description>//g' | while read description; do
  title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)')
  if [ ! -f "$title.html" ]; then
    echo "$description" > "$title.html"
  fi
done

Unfortunately, it doesn't work at all :( Please tell me where my mistakes are.

CodePudding user response:

Please tell me where my mistakes are.

Your single mistake is trying to parse XML with regular expressions. You can't reliably parse XML/HTML with regexes! Please use an XML/HTML parser like xidel instead.
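To make one concrete failure visible: the text captured between <description> and </description> contains no <title> element at all, so the inner grep in your loop always comes back empty. A minimal reproduction with a made-up description value:

```shell
#!/bin/bash
# The captured description holds only an escaped link, never a <title>,
# so $title ends up empty and every entry targets the same ".html" file.
description='&lt;a href="https://example.com/item?id=1"&gt;Comments&lt;/a&gt;'
title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)' || true)
echo "title='$title'"   # prints title=''
```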

The first <item> element node:

$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]' \
  --output-node-format=xml --output-node-indent
<item>
  <title>Show HN: I made an Ethernet transceiver from logic gates</title>
  <link>https://imihajlov.tk/blog/posts/eth-to-spi/</link>
  <pubDate>Sun, 18 Dec 2022 07:00:52 +0000</pubDate>
  <comments>https://news.ycombinator.com/item?id=34035628</comments>
  <description>&lt;a href=&quot;https://news.ycombinator.com/item?id=34035628&quot;&gt;Comments&lt;/a&gt;</description>
</item>

$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]/description'
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>

Note that while the output of the first command is XML, the output of the second command is plain text!

With the integrated EXPath File Module you could then save this text(!) to an HTML-file:

$ xidel -s "https://news.ycombinator.com/rss" -e '
  //item/file:write-text(
    replace(title,"[<>:&quot;/\\\|\?\*]","")||".html",   (: remove invalid characters :)
    description
  )
'

But you can also save it as proper HTML by parsing the <description>-element-node and using file:write() instead:

$ xidel -s "https://news.ycombinator.com/rss" -e '
  //item/file:write(
    replace(title,"[<>:&quot;/\\\|\?\*]","")||".html",
    parse-html(description),
    {"indent":true()}
  )
'

$ xidel -s "Show HN I made an Ethernet transceiver from logic gates.html" -e '$raw'
<html>
  <head/>
  <body>
    <a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
  </body>
</html>

CodePudding user response:

With xmlstarlet:

# download rss feed to file rss.xml
curl -s 'https://news.ycombinator.com/rss' -o rss.xml

# get number of items from rss.xml
num_of_items=$(xmlstarlet select --template --copy-of "count(//rss/channel/item)" -n rss.xml)

# loop over all items
for ((i=1; i<=$num_of_items; i++)); do

  # get title from this item
  title=$(xmlstarlet select --template --value-of "//rss/channel/item[$i]/title" -n rss.xml);

  # get description from this item and unescape it
  desc=$(xmlstarlet select --template --value-of "//rss/channel/item[$i]/description" -n rss.xml | xmlstarlet unescape);

  # check if new file exists
  if [ ! -f "$title.html" ]; then
    # write description to new file
    echo "$desc" > "$title.html";
  fi
done

I have not included a check for characters that are not allowed in filenames, such as /.
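For completeness, such a check could be sketched in pure bash like this (sanitize is a hypothetical helper; the character class mirrors the one the xidel answer strips):

```shell
#!/bin/bash
# Sketch: strip characters that are invalid in filenames before writing
# "$title.html". "/" (invalid on Unix) becomes "-"; the tr call drops the
# characters Windows additionally forbids.
sanitize() {
  local t=$1
  t=${t//\//-}                                # replace "/" with "-"
  t=$(printf '%s' "$t" | tr -d '<>:"\\|?*')   # drop other forbidden characters
  printf '%s\n' "$t"
}
sanitize 'Show HN: an a/b test'   # prints "Show HN an a-b test"
```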


See: xmlstarlet select --help

CodePudding user response:

I would first convert the RSS feed, which is in XML format, to something that bash can read accurately (for ex. key/value pairs with TSV escaping) with an XPath processor like Xidel (available here for Linux/Mac/Win):

curl -s -L https://news.ycombinator.com/rss |

xidel --xpath '
    //item ! string-join(
        * ! ( name() || "=" || replace(replace(replace(.,"\\","\\\\"),"\t","\\t"),"\n","\\n") ),
        codepoints-to-string(9)
    )
'

The result will be something like:

title=AVR-GCC Compiler Makes Questionable Code  link=https://www.bigmessowires.com/2022/12/16/avr-gcc-compiler-makes-questionable-code/ pubDate=Sat, 17 Dec 2022 17:12:02 +0000 comments=https://news.ycombinator.com/item?id=34029750  description=<a href="https://news.ycombinator.com/item?id=34029750">Comments</a>
title=Trains, trucks, and supply chains link=https://gadallon.substack.com/p/trains-trucks-and-supply-chains    pubDate=Fri, 16 Dec 2022 14:13:23 +0000 comments=https://news.ycombinator.com/item?id=34014633  description=<a href="https://news.ycombinator.com/item?id=34014633">Comments</a>
title=etc...
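The nested replace() calls in the XPath escape backslashes first, then tabs and newlines, so each item stays on a single line. The same escaping, sketched in pure bash (escape is a hypothetical helper, just to illustrate the order of operations):

```shell
#!/bin/bash
# Sketch of the escaping the XPath performs: backslash first (so the later
# escapes stay unambiguous), then TAB and LF, keeping one item per line.
escape() {
  local s=$1
  s=${s//\\/\\\\}      # \   -> \\
  s=${s//$'\t'/\\t}    # TAB -> \t
  s=${s//$'\n'/\\n}    # LF  -> \n
  printf '%s\n' "$s"
}
escape $'multi\nline\tvalue'   # prints "multi\nline\tvalue" on one line
```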

Then I would read & unescape & parse this stream with bash in a while loop:

#!/bin/bash

while IFS=$'\t' read -r -a array
do
    unset hash; declare -A hash    # reset the map so keys from the previous item don't leak

    for kvp in "${array[@]}"
    do
        printf -v kvp %b "$kvp"

        [[ $kvp =~ ^([^=]+)=(.*)$ ]]

        hash[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
    done

    [[ -n "${hash[title]:+X}" ]] && [[ -n "${hash[description]:+X}" ]] || exit 1

    filename=${hash[title]//\//}.html

    [[ -f "$filename" ]] && continue

    printf '%s\n' "${hash[description]}" > "$filename"
done < <(
    curl -s -L '...' |
    xidel --xpath '...'
)

remark: I know that the whole process might be hard to follow, but trust me, it takes care of a lot of "details" (escaping, special characters in titles, and so on) that might otherwise bite you.
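To make the read/unescape/match steps concrete, here is one iteration of the loop above run on a hand-written sample line instead of the curl | xidel stream (the values are made up):

```shell
#!/bin/bash
# Split the line on TAB, undo the \t/\n/\\ escaping with printf %b,
# then split each key=value pair into an associative array.
line=$'title=AVR-GCC\tdescription=a\\tb'
declare -A hash
IFS=$'\t' read -r -a array <<< "$line"
for kvp in "${array[@]}"; do
  printf -v kvp %b "$kvp"                  # \t/\n/\\ back to real characters
  [[ $kvp =~ ^([^=]+)=(.*)$ ]] && hash[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
done
echo "${hash[title]}"          # prints "AVR-GCC"
```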
