Home > Blockchain >  Passing field from awk pattern match into shell date command substitution
Passing field from awk pattern match into shell date command substitution

Time:12-06

I would like to convert a (ISO 8601) date from YYYY-MM-DD to something that looks like Mon, 03 Dec 2021 00:00:00 -0600 (RFC-822 / RFC 5322). I have not been able to find questions that address this specific issue.

I have an RSS XML file that looks like this:

bash-5.1$ cat feed.xml                                                                                                                                                                                              
<rss version="2.0">                                                                                                                                                                                 
    <channel>                                                                                                                                                               
        <title>Title string</title>                                                                                                                                                                                               
        <link>https://domain/feed.xml</link>                                                                                                                                                                                                
        <description>Description string here</description>                                                                                                                                                    
        <language>en-us</language>                                                                                                                                                                                                             
<item>                                                                                                                                          
<title>title string here</title>                                                                                                                                                                                 
<link>link string with https style information</link>                                                                                                                                                                 
<guid>link string with https style information</guid>                                                                                                                                                                 
<pubDate>2021-12-03</pubDate>                                                                                                                                                                                                               
</item>                                                                                                                                             
<item>                                                                                                                                          
<title>title string here</title>                                                                                                                                                                                 
<link>link string with https style information</link>                                                                                                                                                           
<guid>link string with https style information</guid>                                                                                                                                                           
<pubDate>2019-08-13</pubDate>                                                                                                                                                                                                               
</item>       
<item>                                                                                                                                          
<title>title string here</title>                                                                                                                                                                                 
<link>link string with https style information</link>                                                                                                                                                           
<guid>link string with https style information</guid>                                                                                                                                                           
<pubDate>2018-11-23</pubDate>                                                                                                                                                                                                              
</item>                                                                                                                                                                                                                                                                                
</channel>
</rss>

The output I expect after parsing would be something like this:

...
<pubDate>Fri, 03 Dec 2021 00:00:00 -0600</pubDate> 
...
<pubDate>Tue, 13 Aug 2019 00:00:00 -0500</pubDate> 
...
<pubDate>Fri, 23 Nov 2018 00:00:00 -0600</pubDate> 
...

This output can usually be achieved with the bash date command like so:

bash-5.1$ date -d "2018-11-23"  "%a, %d %b %Y %T %z"
Fri, 23 Nov 2018 00:00:00 -0600

I am trying to accomplish this with awk because I learned that command substitution is possible, and I believe I am close, but not quite:

bash-5.1$ awk -F "[><]" -v date="$(date  "%a, %d %b %Y %T %z" -d "$3")" '/pubDate/ {print $3date}' feed.xml 
2021-12-03Sun, 05 Dec 2021 00:00:00 -0600
2019-08-13Sun, 05 Dec 2021 00:00:00 -0600
2018-11-23Sun, 05 Dec 2021 00:00:00 -0600

It seems like the pattern match executes successfully as well as the date command, but the awk $3 field looks like is not being passed into the shell date command and so the current time is showing instead of the transformed time.

How do I pass the $3 field from the awk pattern match into the date command so that it can convert the date based on the field value?

Any help is greatly appreciated!

CodePudding user response:

As commented, it is recommended to use xml parsing tool speaking generally. If the xml file is well aligned as the provided example, bash or awk may work under limited conditions.
With bash:

#!/bin/bash

while IFS= read -r line; do
    if [[ $line =~ (.*<pubDate>)([0-9]{4}-[0-9]{2}-[0-9]{2})(</pubDate>.*) ]]; then
        datestr=$(date -d "${BASH_REMATCH[2]}"  "%a, %d %b %Y %T %z")
        line="${BASH_REMATCH[1]}$datestr${BASH_REMATCH[3]}"
    fi
    echo "$line"
done < feed.xml

The condition $line =~ (.*<pubDate>)([0-9]{4}-[0-9]{2}-[0-9]{2})(</pubDate>.*) matches the <pubdate> line assiging bash variable ${BASH_REMATCH[@] to the parenthesized substrings. Then line is reconstructed using the reformatted date string.

If gawk which supports mktime() and strftime() functions, you can also say with gawk:

awk '
{
    if (match($0, /^(.*<pubDate>)([0-9]{4})-([0-9]{2})-([0-9]{2})(<\/pubDate>.*)/, a) ) {
        ts = mktime(a[2] " " a[3] " " a[4] " 00 00 00")    # timestamp since the epoch
        datestr = strftime("%a, %d %b %Y %T %z", ts)
        $0 = a[1] datestr a[5]
    }
} 1' feed.xml

CodePudding user response:

Altering your awk code slightly.

$ awk -v date="$(date  "%a, %d %b %Y %T %z")" 'BEGIN {FS=OFS=">"}$1~/pubDate/{split($2,a,"<"); a[1]=date; $2=a[1]"<"a[2]}1' input_file
<rss version=2.0>                                                                                                                                                                            
    <channel>
        <title>Title string</title>                                                                                                                                                          
        <link>https://domain/feed.xml</link>                                                                                                                                                 
        <description>Description string here</description>                                                                                                                                   
        <language>en-us</language>                                                                                                                                                           
<item>
<title>title string here</title>                                                                                                                                                             
<link>link string with https style information</link>                                                                                                                                        
<guid>link string with https style information</guid>                                                                                                                                        
<pubDate>Mon, 06 Dec 2021 00:00:00  0000</pubDate>                                                                                                                                           
</item>
<item>
<title>title string here</title>                                                                                                                                                             
<link>link string with https style information</link>                                                                                                                                        
<guid>link string with https style information</guid>                                                                                                                                        
<pubDate>Mon, 06 Dec 2021 00:00:00  0000</pubDate>                                                                                                                                           
</item>
<item>
<title>title string here</title>                                                                                                                                                             
<link>link string with https style information</link>                                                                                                                                        
<guid>link string with https style information</guid>                                                                                                                                        
<pubDate>Mon, 06 Dec 2021 00:00:00  0000</pubDate>                                                                                                                                           
</item>                                                                                                                                                                                      
</channel>
</rss>

sed can also be used if the variable is set.

$ date="$(date  "%a, %d %b %Y %T %z")" 
$ sed "/pubDate/ s/....-..-../$date/" input_file
  • Related