I'm trying to extract a list of names from a website using sed, but I'm not sure how to go about replacing the tab characters separating them. This code:
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#p" | html2text
gives me the names for September 12th, but they are separated by a tab character:
Åsa Åslög
If I change the sed script to replace tabs with a comma and a space, like this:
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#" -e 's/\t/, /p' | html2text
it works as expected:
Åsa, Åslög
However, if I try on a day that only has one name, such as September 13th:
curl -s "https://namnidag.se/?year=2022&month=9&day=13" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#" -e 's/\t/, /p' | html2text
I get no output; the first sed script without the tab replacement works fine in this case though. What am I doing wrong here? I'm using GNU sed 4.8, if that helps.
Thanks!
CodePudding user response:
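You need to remove the p from the tab substitution. With -n, sed prints only what a p tells it to, and the p flag on an s command fires only when that substitution actually matches. On a day with a single name there is no tab, so the second substitution fails and nothing is printed. Keep the p on the first substitution and do the tab replacement without it: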
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#p" | sed -e 's/\t/, /'
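If you would rather keep everything in one sed call, a minimal sketch is to restrict the commands to lines that contain the div and print unconditionally inside that group. The function name extract_names is made up for illustration, and the printf lines are sample input standing in for the curl output (the real page markup is assumed to match the pattern from the question). The g flag is an addition so that days with more than two names are handled too.

```shell
#!/bin/sh
# extract_names: hypothetical helper; reads HTML on stdin and prints the
# contents of each <div class='names'> element with tabs turned into ", ".
extract_names() {
    sed -nE "/<div class='names'>[^<]*<\/div>/{
        s#<div class='names'>([^<]*)</div>#\1#
        s/\t/, /g
        p
    }"
}

# Sample input (assumed markup): two tab-separated names, then a single name.
printf "<div class='names'>Åsa\tÅslög</div>\n" | extract_names
printf "<div class='names'>Åsa</div>\n" | extract_names
```

Because the p is a separate command inside the group, it runs whether or not the tab substitution matched. With the real page you would pipe the curl output into extract_names and then into html2text as before.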
CodePudding user response:
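# Save the page so ed can read it
curl -s "https://namnidag.se/?year=2022&month=9&day=12" > f1
# Build an ed script that appends line 71 of the page to f2, then quits
cat > ed1 <<EOF
71W f2
q
EOF
ed -s f1 < ed1
# Keep the last 20 bytes of that line, then drop its final 6 bytes
tail -c 20 f2 | head -c -6 > file
rm -v ./ed1
rm -v ./f2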
This will give you the names whether there is one or two of them; if there are two, you can separate them with cut.
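A rough sketch of the cut step, assuming the extracted line holds the names separated by a single tab (the variable names and the sample line are placeholders, not the real contents of file):

```shell
#!/bin/sh
# Sample line standing in for the contents of "file" (assumed: two names, one tab).
names=$(printf 'Åsa\tÅslög\n')

# cut splits on tabs by default, so -f1 and -f2 pick the two names.
first=$(printf '%s\n' "$names" | cut -f1)
second=$(printf '%s\n' "$names" | cut -f2)
echo "$first"
echo "$second"

# With a single name there is no tab, and plain cut would echo the whole line
# for any field; -s suppresses such lines, so this stays empty.
only=$(printf 'Åsa\n' | cut -s -f2)
echo "second name on a one-name day: '$only'"
```

The -s flag matters for one-name days, since otherwise cut -f2 would return the first name again.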