What does this specific sed command do exactly? Using sed to parse full HTML pages in a bash script


subreddit=$(curl -sL "https://www.reddit.com/search/?q=${query}&type=sr"|tr "<" "\n"|
    sed -nE 's@.*>r/([^<]*)@\1@p'|gum filter)

I've been learning bash and making pretty good progress. One thing that still seems far too daunting is complex sed commands like this one. It's unfortunate, because I really want to use them for things like parsing HTML, but it quickly becomes a mess. This is a little snippet of a script that queries Reddit, pipes the result through sed, and returns just the names of the subreddits matching the search, one per line.

My main question is: what is this actually cutting/replacing, and what does the beginning part 's@.' mean?

What I tried:

I used curl to search for a subreddit name so I could see the raw output of that command. Then I piped it into sed using little snippets of the full command, trying to reconstruct the logic behind it. All I really figured out is that my knowledge of sed doesn't go much beyond basic replacements.

I'm trying to rewrite this script (for learning purposes only; the script works just fine). It lets you search Reddit and view the image posts in your terminal using kitty. Most of it is pretty readable, but the sed commands trip me up.

I'll attach the full script below in case anyone is interested, and I welcome any advice or explanations that could help me fully understand and reconstruct it.

I'm really curious about this. I'm also wondering whether it would be better to call a Python script from bash that returns the images using Beautiful Soup... or whether using htmlq would be a better idea?

Thanks!

#!/bin/sh

get_input() {
  [ -z "$*" ] && query=$(gum input --placeholder "Search for a subreddit") || query=$*
  query=$(printf "%s" "$query"|tr ' ' '+')
  subreddit=$(curl -sL "https://www.reddit.com/search/?q=${query}&type=sr"|tr "<" "\n"|
    sed -nE 's@.*>r/([^<]*)@\1@p'|gum filter)
  xml=$(curl -s "https://www.reddit.com/r/$subreddit.rss" -A "uwu"|tr "<|>" "\n")

  # pull the link, thumbnail URL and title out of each feed entry, join them
  # into one tab-separated line, unescape &amp; and skip gif posts
  post_href=$(printf "%s" "$xml"|sed -nE '/media:thumbnail/,/title/{p;n;p;}'|
    sed -nE 's_.*href="([^"]+)".*_\1_p;s_.*media:thumbnail[^>]+ url="([^"]+)".*_\1_p; /title/{n;p;}'|
    sed -e 'N;N;s/\n/\t/g' -e 's/&amp;/\&/g'|grep -vE '.*\.gif.*')
  [ -z "$post_href" ] && printf "No results found for \"%s\"\n" "$query" && exit 1
}

readc() {
  # read a single (possibly multibyte) character without waiting for Enter
  if [ -t 0 ]; then
    # save the terminal settings and switch to raw, unechoed input
    saved_tty_settings=$(stty -g)
    stty -echo -icanon min 1 time 0
  fi
  eval "$1="
  # read one byte at a time, appending to $1, until wc -m sees a complete character
  while
    c=$(dd bs=1 count=1 2> /dev/null; echo .)
    c=${c%.}
    [ -n "$c" ] &&
        eval "$1=\${$1}"'$c
    [ "$(($(printf %s "${'"$1"'}" | wc -m)))" -eq 0 ]'; do
    continue
  done
  # restore the saved terminal settings
  [ -t 0 ] && stty "$saved_tty_settings"
}

download_image() {
  downloadable_link=$(curl -s -A "uwu" "$1"|sed -nE 's@.*<a href="([^"]+)".*@\1@p')
  curl -s -A "uwu" "$downloadable_link" -o "$(basename "$downloadable_link")"
  [ -z "$downloadable_link" ] && printf "No image found\n" && exit 1
  tput clear && gum style \
      --foreground 212 --border-foreground 212 --border double \
        --align center --width 50 --margin "1 2" --padding "2 4" \
          'Your image has been downloaded!' "Image saved to $(basename "$downloadable_link")"
  # shellcheck disable=SC2034
  printf "Press Enter to continue..." && read -r useless
}

cleanup() {
  tput cnorm && exit 0
}

trap cleanup EXIT INT HUP
get_input "$@"

i=1 && tput clear
while true; do
  tput civis
  [ "$i" -lt 1 ] && i=$(printf "%s" "$post_href"|wc -l)
  [ "$i" -gt "$(printf "%s" "$post_href"|wc -l)" ] && i=1
  link=$(printf "%s" "$post_href"|sed -n "$i"p|cut -f1)
  post_link=$(printf "%s" "$post_href"|sed -n "$i"p|cut -f2)
  gum style \
    --foreground 212 --border-foreground 212 --border double \
    --align left --width 50 --margin "20 1" --padding "2 4" \
    'Press (j) to go to next' 'Press (k) to go to previous' 'Press (d) to download' \
    'Press (o) to open in browser' 'Press (s) to search for another subreddit' 'Press (q) to quit'
  kitty +kitten icat --scale-up --place 60x40@69x3 --transfer-mode file "$link"
  readc key
  # shellcheck disable=SC2154
  case "$key" in
    j) i=$((i+1)) && tput clear ;;
    k) i=$((i-1)) && tput clear ;;
    d) download_image "$post_link" ;;
    o) xdg-open "$post_link" || open "$post_link" ;;
    s) get_input ;;
    q) tput clear && exit 0 ;;
    *) ;;
  esac
done

"gum filter" is essentially a fuzzy finder like fzf and "gum style" draws pretty text and nice boxes that work kind of like css.

CodePudding user response:

What does this specific sed command do exactly?
sed -nE 's@.*>r/([^<]*)@\1@p'

It does two things:

  1. Select all lines that contain the literal string >r/.
  2. For those lines, print only the part after ...>r/.

Basically, it translates to ... (written inefficiently on purpose)

grep '>r/' |
sed 's/.*>r\///'
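
For example, feeding it a single made-up line of the kind that tr "<" "\n" produces:

printf 'a class="result">r/bash\n' | sed -nE 's@.*>r/([^<]*)@\1@p'
# prints: bash

Lines that do not contain >r/ match nothing and, thanks to -n, are not printed at all.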

what does the beginning part 's@.' mean?

You are looking at the (beginning of the) substitution command. Normally, it is written as s/search/replace/ but the delimiter / can be chosen (mostly) freely. s/…/…/ and s@…@…@ are equivalent.
Here, @ has the benefit of not having to escape the / in …>r/.
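
With the default delimiter, the same command would need an escaped slash:

sed -nE 's/.*>r\/([^<]*)/\1/p'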

The . belongs to the search pattern. The .* in the beginning selects everything from the start of the line, so that it can be deleted when doing the substitution. Here we delete the beginning of the line up to (and including) …>r/.
The \1 in the replacement pattern is a placeholder for the string that was matched by the group ([^<]*) (longest <-free substring directly after …>r/).
That part is unnecessarily complicated. Because the sed is preceded by tr "<" "\n", there is no point in dealing with the < inside sed. It could be simplified to

sed -n 's@.*>r/@@p'
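
On the sample line from above, it produces the same output:

printf 'a class="result">r/bash\n' | sed -n 's@.*>r/@@p'
# prints: bash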

Speaking about simplifications:

I really want to use them [sed commands] to do things like parse HTML

Advice: Don't. For one-off jobs where you know the exact formatting (!) of your HTML files, regexes are OK. But even then, they only make sense if you can write them faster than you could set up a proper tool.

I'm also wondering if it would just be better to call a Python script from bash that could return the images using beautiful soup... or maybe using "htmlq" would be a better idea?

You are right! In general, regexes are not powerful enough to parse HTML reliably.

Whether you use Python or bash is up to you. Personally, I find it easier to use a "proper" language for bigger projects, but then I use only that language. Writing half in Python and half in bash only increases complexity, in my opinion.

If you stick with bash, I'd recommend something a bit more mature and widespread than htmlq (first released in Sep 2021, currently at version 0.4). E.g. install libxml2 (which provides xmllint) and use an XPath expression with some post-processing:

curl -sL "https://www.reddit.com/search/?q=QUERY&type=sr" |
xmllint --html --xpath '//h6[@class]/text()' - 2>/dev/null |
sed 's@^r/@@'
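
(Note that this still depends on Reddit's current markup, i.e. the subreddit names living in <h6> elements; that can change at any time, which is one more reason scraping HTML is fragile.)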

But then again, parsing HTML isn't necessary in the first place, since Reddit has an API that can return JSON, which you can process with jq.

curl -sL "https://www.reddit.com/subreddits/search.json?q=QUERY" |
jq -r '.data.children | map(.data.display_name) | .[]'
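
For instance, the subreddit= line at the top of the script could be rewritten to use the JSON endpoint instead of scraping HTML (a sketch, reusing the -A "uwu" user agent from the original script):

subreddit=$(curl -sL -A "uwu" "https://www.reddit.com/subreddits/search.json?q=${query}" |
    jq -r '.data.children[].data.display_name' | gum filter)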