I would like to convert an ISO 8601 date in YYYY-MM-DD format
to something that looks like Fri, 03 Dec 2021 00:00:00 -0600
(RFC 822 / RFC 5322). I have not been able to find questions that address this specific issue.
I have an RSS XML file that looks like this:
bash-5.1$ cat feed.xml
<rss version="2.0">
<channel>
<title>Title string</title>
<link>https://domain/feed.xml</link>
<description>Description string here</description>
<language>en-us</language>
<item>
<title>title string here</title>
<link>link string with https style information</link>
<guid>link string with https style information</guid>
<pubDate>2021-12-03</pubDate>
</item>
<item>
<title>title string here</title>
<link>link string with https style information</link>
<guid>link string with https style information</guid>
<pubDate>2019-08-13</pubDate>
</item>
<item>
<title>title string here</title>
<link>link string with https style information</link>
<guid>link string with https style information</guid>
<pubDate>2018-11-23</pubDate>
</item>
</channel>
</rss>
The output I expect after parsing would be something like this:
...
<pubDate>Fri, 03 Dec 2021 00:00:00 -0600</pubDate>
...
<pubDate>Tue, 13 Aug 2019 00:00:00 -0500</pubDate>
...
<pubDate>Fri, 23 Nov 2018 00:00:00 -0600</pubDate>
...
This output can usually be achieved with the GNU date command like so:
bash-5.1$ date -d "2018-11-23" +"%a, %d %b %Y %T %z"
Fri, 23 Nov 2018 00:00:00 -0600
I am trying to accomplish this with awk because I learned that command substitution is possible, and I believe I am close, but not quite:
bash-5.1$ awk -F "[><]" -v date="$(date +"%a, %d %b %Y %T %z" -d "$3")" '/pubDate/ {print $3date}' feed.xml
2021-12-03Sun, 05 Dec 2021 00:00:00 -0600
2019-08-13Sun, 05 Dec 2021 00:00:00 -0600
2018-11-23Sun, 05 Dec 2021 00:00:00 -0600
The pattern match and the date command both seem to execute successfully, but the awk $3 field does not appear to be passed into the shell date command, so the current date is shown instead of the converted date.
How do I pass the $3 field from the awk pattern match into the date command so that it can convert the date based on the field value?
Any help is greatly appreciated!
User response:
As commented, it is generally recommended to use an XML parsing tool.
If the XML file is laid out as regularly as the provided example, bash or awk may work under limited conditions.
With bash:
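#!/bin/bash
# Rewrite each <pubDate>YYYY-MM-DD</pubDate> line; pass all other lines through unchanged.
while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ (.*<pubDate>)([0-9]{4}-[0-9]{2}-[0-9]{2})(</pubDate>.*) ]]; then
        # GNU date: -d parses the ISO date, +FORMAT selects the RFC 822 style output
        datestr=$(date -d "${BASH_REMATCH[2]}" +"%a, %d %b %Y %T %z")
        line="${BASH_REMATCH[1]}$datestr${BASH_REMATCH[3]}"
    fi
    printf '%s\n' "$line"
done < feed.xml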
The condition $line =~ (.*<pubDate>)([0-9]{4}-[0-9]{2}-[0-9]{2})(</pubDate>.*)
matches the <pubDate> line and assigns the parenthesized substrings to the elements of the bash array BASH_REMATCH (${BASH_REMATCH[1]}, ${BASH_REMATCH[2]} and ${BASH_REMATCH[3]}). Then line
is reconstructed using the reformatted date string.
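To make the capture groups concrete, here is a minimal sketch that prints the three pieces bash stores for one matching line. The function name show_rematch is hypothetical and only serves the illustration; the regex is the one from the loop above, kept in a variable to sidestep quoting issues.

```shell
#!/bin/bash
# Hypothetical helper: print each captured piece of a pubDate line in brackets.
show_rematch() {
    local re='(.*<pubDate>)([0-9]{4}-[0-9]{2}-[0-9]{2})(</pubDate>.*)'
    if [[ $1 =~ $re ]]; then
        # [1] = text up to and including <pubDate>, [2] = the date, [3] = the rest
        printf '[%s]\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    fi
}
```

Calling show_rematch '  <pubDate>2019-08-13</pubDate>' should print the leading text, the bare date and the closing tag on separate lines, which are exactly the parts the loop glues back together.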
If gawk, which supports the mktime() and strftime() functions, is available, you can also say:
gawk '
{
    # the three-argument form of match() is a gawk extension; a[] receives the capture groups
    if (match($0, /^(.*<pubDate>)([0-9]{4})-([0-9]{2})-([0-9]{2})(<\/pubDate>.*)/, a)) {
        ts = mktime(a[2] " " a[3] " " a[4] " 00 00 00")   # seconds since the epoch, local time
        datestr = strftime("%a, %d %b %Y %T %z", ts)
        $0 = a[1] datestr a[5]
    }
} 1' feed.xml
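As for why $3 never reaches date in the question's attempt: the shell expands $(date ...) before awk starts, so $3 there is the shell's (empty) third positional parameter, not the awk field. With an awk that lacks mktime() and strftime(), one option is to build the date command inside awk and read its output with getline. The sketch below assumes GNU date (for -d) and one pubDate element per line; convert_pubdates is a hypothetical name.

```shell
#!/bin/bash
# Sketch: convert <pubDate>YYYY-MM-DD</pubDate> lines by running date(1) from awk.
# Assumes GNU date (-d) and at most one pubDate element per line.
convert_pubdates() {
    awk '
    match($0, /<pubDate>[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]<\/pubDate>/) {
        iso = substr($0, RSTART + 9, 10)              # the YYYY-MM-DD part after <pubDate>
        cmd = "date -d " iso " +\"%a, %d %b %Y %T %z\""
        if ((cmd | getline rfc) > 0)                  # read the single line printed by date
            $0 = substr($0, 1, RSTART + 8) rfc substr($0, RSTART + 19)
        close(cmd)                                    # release the pipe before the next line
    }
    1                                                 # print every line, changed or not
    ' "$@"
}
```

It could be called as convert_pubdates feed.xml. The weekday and month names and the UTC offset depend on the current locale and time zone, and one date process is started per matching line, which is fine for a small feed.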
User response:
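Altering your awk code slightly. Note that the date is computed once by the shell, so every pubDate element receives the same value (here, the current date):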
$ awk -v date="$(date +"%a, %d %b %Y %T %z")" 'BEGIN {FS=OFS=">"}$1~/pubDate/{split($2,a,"<"); a[1]=date; $2=a[1]"<"a[2]}1' input_file
<rss version=2.0>
<channel>
<title>Title string</title>
<link>https://domain/feed.xml</link>
<description>Description string here</description>
<language>en-us</language>
<item>
<title>title string here</title>
<link>link string with https style information</link>
<guid>link string with https style information</guid>
<pubDate>Mon, 06 Dec 2021 00:00:00 +0000</pubDate>
</item>
<item>
<title>title string here</title>
<link>link string with https style information</link>
<guid>link string with https style information</guid>
<pubDate>Mon, 06 Dec 2021 00:00:00 +0000</pubDate>
</item>
<item>
<title>title string here</title>
<link>link string with https style information</link>
<guid>link string with https style information</guid>
<pubDate>Mon, 06 Dec 2021 00:00:00 +0000</pubDate>
</item>
</channel>
</rss>
sed can also be used if the variable is set:
$ date="$(date +"%a, %d %b %Y %T %z")"
$ sed "/pubDate/ s/....-..-../$date/" input_file
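The sed command above substitutes one fixed date for every match. If each date should be converted individually, one possible approach is to generate a small sed script with one substitution per distinct date in the file. This is only a sketch, assuming GNU date (for -d), bash, and one pubDate element per line; pubdates_via_sed is a hypothetical name.

```shell
#!/bin/bash
# Sketch: build one sed substitution per distinct YYYY-MM-DD date, then apply them all.
pubdates_via_sed() {
    local file=$1 d script=''
    # Collect the distinct ISO dates found inside pubDate elements.
    for d in $(sed -n 's/.*<pubDate>\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)<\/pubDate>.*/\1/p' "$file" | sort -u); do
        # "|" is used as the delimiter so the "/" in </pubDate> needs no escaping.
        script+="s|<pubDate>$d</pubDate>|<pubDate>$(date -d "$d" +'%a, %d %b %Y %T %z')</pubDate>|"$'\n'
    done
    sed "$script" "$file"
}
```

Running pubdates_via_sed input_file prints the converted feed to standard output; date is called once per distinct date rather than once per line.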