I'm trying to create a bash script that downloads an RSS feed and saves each entry as a separate html file. Here's what I've been able to create so far:
curl -L https://news.ycombinator.com//rss > hacke.txt
grep -oP '(?<=<description>).*?(?=</description>)' hacke.txt | sed 's/<description>/\n<description>/g' | grep '<description>' | sed 's/<description>//g' | sed 's/<\/description>//g' | while read description; do
title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)')
if [ ! -f "$title.html" ]; then
echo "$description" > "$title.html"
fi
done
Unfortunately, it doesn't work at all :( Please point out where my mistakes are.
CodePudding user response:
Please point out where my mistakes are.
Your single mistake is trying to parse XML with regular expressions. You can't parse XML/HTML with RegEx! Please use an XML/HTML-parser like xidel instead.
The first <item> element-node (not "variable" as you call them):
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]' \
--output-node-format=xml --output-node-indent
<item>
  <title>Show HN: I made an Ethernet transceiver from logic gates</title>
  <link>https://imihajlov.tk/blog/posts/eth-to-spi/</link>
  <pubDate>Sun, 18 Dec 2022 07:00:52 +0000</pubDate>
  <comments>https://news.ycombinator.com/item?id=34035628</comments>
  <description>&lt;a href="https://news.ycombinator.com/item?id=34035628"&gt;Comments&lt;/a&gt;</description>
</item>
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]/description'
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
Note that while the output of the first command is XML, the output for the second command is ordinary text!
With the integrated EXPath File Module you could then save this text(!) to an HTML-file:
$ xidel -s "https://news.ycombinator.com/rss" -e '
  //item/file:write-text(
    replace(title,"[<>:""/\\\|\?\*]","")||".html",  (: remove invalid characters :)
    description
  )
'
But you can also save it as proper HTML by parsing the <description> element-node and using file:write() instead:
$ xidel -s "https://news.ycombinator.com/rss" -e '
  //item/file:write(
    replace(title,"[<>:""/\\\|\?\*]","")||".html",
    parse-html(description),
    {"indent":true()}
  )
'
$ xidel -s "Show HN I made an Ethernet transceiver from logic gates.html" -e '$raw'
<html>
  <head/>
  <body>
    <a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
  </body>
</html>
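If you also want to keep the question's "don't overwrite existing files" behaviour, that guard is easy to keep in plain bash no matter which tool produces the content. A minimal sketch (the write_once name is just for illustration):

```shell
#!/bin/bash
# Write content ($2) to a file ($1) only if that file doesn't exist yet.
write_once() {
  local file=$1 content=$2
  [ -e "$file" ] && return 0            # already there: skip silently
  printf '%s\n' "$content" > "$file"
}

write_once demo.html '<a href="https://example.com">Comments</a>'
write_once demo.html 'this second call is a no-op'
```

After both calls, demo.html still holds only the payload of the first call.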
CodePudding user response:
With xmlstarlet:
# download rss feed to file rss.xml
curl -s 'https://news.ycombinator.com/rss' -o rss.xml

# get number of items from rss.xml
num_of_items=$(xmlstarlet select --template --copy-of "count(//rss/channel/item)" -n rss.xml)

# loop over all items
for ((i=1; i<=$num_of_items; i++)); do
    # get title from this item
    title=$(xmlstarlet select --template --value-of "//rss/channel/item[$i]/title" -n rss.xml)
    # get description from this item and unescape it
    desc=$(xmlstarlet select --template --value-of "//rss/channel/item[$i]/description" -n rss.xml | xmlstarlet unescape)
    # check if new file exists
    if [ ! -f "$title.html" ]; then
        # write description to new file
        echo "$desc" > "$title.html"
    fi
done
I have not included a test to see if, for example, the title contains a disallowed / character.
See: xmlstarlet select --help
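The missing title sanitization can be sketched in pure bash; which characters to strip is an assumption here (slashes, backslashes, colons), so extend the class as needed:

```shell
#!/bin/bash
# Strip characters that are problematic in filenames (here: slashes,
# backslashes and colons); everything else is kept as-is.
sanitize() {
  local s=$1
  s=${s//[\/\\:]/}
  printf '%s\n' "$s"
}

sanitize 'Show HN: I made an Ethernet transceiver from logic gates'
```

Used as `title=$(sanitize "$title")` before building `"$title.html"`.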
CodePudding user response:
I would first convert the RSS feed, which is in XML format, to something that bash can read accurately (for ex. key/value pairs with TSV escaping) with an XPath processor like Xidel (available here for Linux/Mac/Win):
curl -s -L https://news.ycombinator.com/rss |
xidel --xpath '
  //item ! string-join(
    * ! ( name() || "=" || replace(replace(replace(.,"\\","\\\\"),"\t","\\t"),"\n","\\n") ),
    codepoints-to-string(9)
  )
'
The result will be something like:
title=AVR-GCC Compiler Makes Questionable Code link=https://www.bigmessowires.com/2022/12/16/avr-gcc-compiler-makes-questionable-code/ pubDate=Sat, 17 Dec 2022 17:12:02 +0000 comments=https://news.ycombinator.com/item?id=34029750 description=<a href="https://news.ycombinator.com/item?id=34029750">Comments</a>
title=Trains, trucks, and supply chains link=https://gadallon.substack.com/p/trains-trucks-and-supply-chains pubDate=Fri, 16 Dec 2022 14:13:23 +0000 comments=https://news.ycombinator.com/item?id=34014633 description=<a href="https://news.ycombinator.com/item?id=34014633">Comments</a>
title=etc...
Then I would read & unescape & parse this stream with bash
in a while
loop:
#!/bin/bash
while IFS=$'\t' read -r -a array
do
    declare -A hash
    for kvp in "${array[@]}"
    do
        printf -v kvp %b "$kvp"
        [[ $kvp =~ ^([^=]+)=(.*)$ ]]
        hash[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
    done
    [[ -n "${hash[title]:+X}" ]] && [[ -n "${hash[description]:+X}" ]] || exit 1
    filename=${hash[title]//\//}.html
    [[ -f "$filename" ]] && continue
    printf '%s\n' "${hash[description]}" > "$filename"
done < <(
    curl -s -L '...' |
    xidel --xpath '...'
)
remark: I know that the whole process might be hard to understand, but trust me, it takes care of a lot of "details" that might otherwise trip you up.
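As a self-contained illustration of the read/unescape/parse steps (no curl or xidel needed; the sample line below is made up, in the same TSV key=value format the xidel command would emit):

```shell
#!/bin/bash
# One fake feed item in the tab-separated key=value format described above.
line=$'title=Hello World\tlink=https://example.com\tdescription=<a href="x">Comments</a>'

declare -A hash
IFS=$'\t' read -r -a array <<< "$line"       # split the line on tabs
for kvp in "${array[@]}"; do
    printf -v kvp %b "$kvp"                  # undo the \t / \n / \\ escaping
    [[ $kvp =~ ^([^=]+)=(.*)$ ]]             # split "key=value" at the first =
    hash[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
done

echo "${hash[title]}"
```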