Here is the file file.txt:
<td ><a href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-173459">7.11.0-0-2022-04-22-173459> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-173459"><i title="Inconsistency detected! Click for more details" ></i></a></td> <td >Accepted</td> <td title="2022-04-22T17:34:59Z">3 days ago</td>
<td ><a href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-135611">7.11.0-0-2022-04-22-135611> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-135611"><i title="Inconsistency detected! Click for more details" ></i></a></td> <td >Accepted</td> <td title="2022-04-22T13:56:11Z">3 days ago</td>
<td ><a href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-101342">7.11.0-0-2022-04-22-101342> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-101342"><i title="Inconsistency detected! Click for more details" ></i></a></td> <td >Accepted</td> <td title="2022-04-22T10:13:42Z">3 days ago</td>
<td ><a href="/releasestream/8.10.0-0.-/release/8.10.0-0-2022-04-19-164705">8.10.0-0-2022-04-19-164705></td> <td >Accepted</td> <td title="2022-04-19T16:47:05Z">6 days ago</td>
<td ><a href="/releasestream/9.11.0-0.nightly-ppc64le/release/9.11.0-0.-2021-08-11-224744">9.11.0-0-2021-08-11-224744></td> <td >Accepted</td> <td title="2021-08-11T22:47:44Z">8 months ago</td>
I want the output to be:
7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z
7.11.0-0-2022-04-22-135611 2022-04-22T13:56:11Z
7.11.0-0-2022-04-22-101342 2022-04-22T10:13:42Z
8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z
9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z
Is there an easy way to do that? I also want to sort the lines by their UTC timestamps. I tried:
cat file.txt |grep "<td" | grep "href=" | sed 's/<\/a//' |awk 'BEGIN{FS="href="}{print $2}' | awk 'BEGIN{FS=">"}{print $2} {print $5} {print $6}'
But it does not give the desired output.
CodePudding user response:
With your shown samples only, please try the following awk code. Experts always advise using tools that know how to parse HTML. If you don't have that option (you cannot install packages or use newer languages), you could use this one, but again, it is written and tested with the shown samples only.
Written and tested in GNU awk, using its FPAT variable. FPAT takes a regex that describes what a field looks like: only text matching the pattern becomes a field, and everything else on the line is ignored.
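awk -v FPAT='<a href="/[^"]*|<td title="[^"]*' '
NF>=2{
  # $1 is the first link (the release href); its last "/"-separated piece is the release name
  num1=split($1,arr1,"/")
  # $NF is the <td title="..."> field; the text after its last quote is the timestamp
  num2=split($NF,arr2,"\"")
  print arr1[num1],arr2[num2]
}
' Input_file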
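The script above relies on split() returning the number of pieces it produced, so the last array element can be addressed by that count. A minimal sketch of just that step, runnable in any POSIX awk; the variable names path and last are illustrative and not part of the answer:

```shell
# Illustrative input: the href path from the first sample line.
path='/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-173459'

last=$(printf '%s\n' "$path" | awk '{
  n = split($0, parts, "/")   # n is the number of pieces, stored in parts[1..n]
  print parts[n]              # the last piece is the release name
}')
printf '%s\n' "$last"
```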
CodePudding user response:
Using sed
$ sed 's|\([^/]*/\)\{4\}[^>]*>\([^>]*\).*title[^"]*"\([^"]*\).*|\2 \3|' input_file
7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z
7.11.0-0-2022-04-22-135611 2022-04-22T13:56:11Z
7.11.0-0-2022-04-22-101342 2022-04-22T10:13:42Z
8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z
9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z
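The question also asks for the lines to be sorted by UTC time. A minimal sketch, assuming every timestamp has the same ISO 8601 shape with a trailing Z as in the samples: such strings sort chronologically when compared as plain text, so sorting on the second field is enough. The helper name sort_by_utc is hypothetical.

```shell
# Hypothetical helper: reads "name timestamp" lines on stdin, prints oldest first.
sort_by_utc() {
  # LC_ALL=C forces byte-wise comparison, so the order does not depend on the locale.
  LC_ALL=C sort -k2,2
}

# Three of the output lines from above, deliberately out of order.
sorted=$(printf '%s\n' \
  '7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z' \
  '9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z' \
  '8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z' | sort_by_utc)
printf '%s\n' "$sorted"
```

In practice the extraction command would be piped straight into sort -k2,2; adding -r gives newest first.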
CodePudding user response:
Firstly, if your document is valid HTML and you are not forbidden from doing so, you should use an HTML parser rather than grep, sed, or awk. If this is not possible, you can use GNU AWK for this task in the following way. Let the content of file.txt be
<td ><a href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-173459">7.11.0-0-2022-04-22-173459> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-173459"><i title="Inconsistency detected! Click for more details" ></i></a></td> <td >Accepted</td> <td title="2022-04-22T17:34:59Z">3 days ago</td>
<td ><a href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-135611">7.11.0-0-2022-04-22-135611> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-135611"><i title="Inconsistency detected! Click for more details" ></i></a></td> <td >Accepted</td> <td title="2022-04-22T13:56:11Z">3 days ago</td>
<td ><a href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-101342">7.11.0-0-2022-04-22-101342> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-101342"><i title="Inconsistency detected! Click for more details" ></i></a></td> <td >Accepted</td> <td title="2022-04-22T10:13:42Z">3 days ago</td>
<td ><a href="/releasestream/8.10.0-0.-/release/8.10.0-0-2022-04-19-164705">8.10.0-0-2022-04-19-164705></td> <td >Accepted</td> <td title="2022-04-19T16:47:05Z">6 days ago</td>
<td ><a href="/releasestream/9.11.0-0.nightly-ppc64le/release/9.11.0-0.-2021-08-11-224744">9.11.0-0-2021-08-11-224744></td> <td >Accepted</td> <td title="2021-08-11T22:47:44Z">8 months ago</td>
then
awk 'match($0,/[0-9]\.[0-9]{2}\.[0-9]-[0-9]-[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}/){printf "%s ",substr($0,RSTART,RLENGTH)}match($0,/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z/){print substr($0,RSTART,RLENGTH)}' file.txt
output
7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z
7.11.0-0-2022-04-22-135611 2022-04-22T13:56:11Z
7.11.0-0-2022-04-22-101342 2022-04-22T10:13:42Z
8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z
9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z
Explanation: I first crafted two regular expressions that describe the values we are looking for: [0-9] is a digit, \. is a literal dot, and {2} is the number of repetitions. If the first value is found, I use printf to output it followed by a space (I do not use print, as this would add a trailing newline, which is unwanted). If the second value is found, I print it.
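To see what match and substr do in isolation, here is a small sketch on a single, shortened line. The sample line and the simplified regex (using + instead of {2}, so it also runs in awk implementations without interval expressions) are assumptions for illustration, not part of the command above.

```shell
# Illustrative sample line, shortened from the input above.
line='<td >Accepted</td> <td title="2022-04-19T16:47:05Z">6 days ago</td>'

# match() returns the 1-based position of the first match (0 if there is none)
# and sets RSTART and RLENGTH; substr() then cuts exactly that span out of the line.
result=$(printf '%s\n' "$line" | awk '
  match($0, /[0-9]+-[0-9]+-[0-9]+T[0-9:]+Z/) {
    print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
  }')
printf '%s\n' "$result"
```

The third value printed is the extracted timestamp; the first two show where in the line it was found and how long it is.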
Disclaimer: I assume the values you are looking for can be described by the regular expressions shown and are present in each line. Feel free to adjust them to your needs. If you want to know more about match or substr, read String Functions (The GNU Awk User's Guide).
(tested in gawk 4.2.1)