Want to parse HTML document on bash shell

Time:04-27

Here is the document, file.txt:

 <td ><a  href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-173459">7.11.0-0-2022-04-22-173459> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-173459"><i title="Inconsistency detected! Click for more details" ></i></a></td>                 <td >Accepted</td>                 <td title="2022-04-22T17:34:59Z">3 days ago</td>
 <td ><a  href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-135611">7.11.0-0-2022-04-22-135611> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-135611"><i title="Inconsistency detected! Click for more details" ></i></a></td>                 <td >Accepted</td>                 <td title="2022-04-22T13:56:11Z">3 days ago</td>
 <td ><a  href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-101342">7.11.0-0-2022-04-22-101342> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-101342"><i title="Inconsistency detected! Click for more details" ></i></a></td>                 <td >Accepted</td>                 <td title="2022-04-22T10:13:42Z">3 days ago</td>
 <td ><a  href="/releasestream/8.10.0-0.-/release/8.10.0-0-2022-04-19-164705">8.10.0-0-2022-04-19-164705></td>                 <td >Accepted</td>                 <td title="2022-04-19T16:47:05Z">6 days ago</td>
 <td ><a  href="/releasestream/9.11.0-0.nightly-ppc64le/release/9.11.0-0.-2021-08-11-224744">9.11.0-0-2021-08-11-224744></td>                 <td >Accepted</td>                 <td title="2021-08-11T22:47:44Z">8 months ago</td>

I want the output to be:

7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z
7.11.0-0-2022-04-22-135611 2022-04-22T13:56:11Z
7.11.0-0-2022-04-22-101342 2022-04-22T10:13:42Z
8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z
9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z

Is there an easy way to do that? I also want to sort the lines by their UTC timestamp. I tried:

cat file.txt |grep "<td" | grep "href=" | sed 's/<\/a//' |awk  'BEGIN{FS="href="}{print $2}'  | awk 'BEGIN{FS=">"}{print $2} {print $5} {print $6}'

But it does not give the desired output.

CodePudding user response:

With your shown samples only, please try the following awk code. Experts always advise using a tool that knows how to parse HTML. If you don't have that option (you cannot install packages or use newer languages), you could use this one, but again, it is written and tested with the shown samples only.

Written and tested in GNU awk. Its FPAT variable lets us give a regex that describes what a field looks like: only the parts of a line that match the pattern become fields, and everything else is ignored.

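awk -v FPAT='<a +href="/[^"]*/release/[^"]*|<td title="[^"]*' '
{
  # field 1: href of the release link (the one containing /release/)
  num1=split($1,arr1,"/")
  # field 2: the td title attribute, which carries the timestamp
  num2=split($2,arr2,"\"")
  # last path element of the href, then the timestamp
  print arr1[num1],arr2[num2]
}
' Input_file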
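
To see how FPAT builds fields from matches rather than from separators, here is a minimal sketch on a made-up line. The function name show_fpat_fields and the sample string are illustrative only and do not come from the question. It needs GNU awk, since FPAT is a gawk extension.

```shell
# Hypothetical helper: print the field count, then each FPAT-defined field.
show_fpat_fields() {
  # Only substrings that look like key="value" become fields; the rest is skipped.
  printf '%s\n' "$1" | gawk -v FPAT='[a-z]+="[^"]*"' '
    { printf "%d", NF; for (i = 1; i <= NF; i++) printf " [%s]", $i; print "" }'
}

show_fpat_fields '<td title="2022-04-22T17:34:59Z" id="x">3 days ago</td>'
# prints: 2 [title="2022-04-22T17:34:59Z"] [id="x"]
```

The text "3 days ago" and the tag names never show up as fields, because they do not match the pattern.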

CodePudding user response:

Using sed

$ sed 's|\([^/]*/\)\{4\}[^>]*>\([^>]*\).*title[^"]*"\([^"]*\).*|\2 \3|' input_file
7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z
7.11.0-0-2022-04-22-135611 2022-04-22T13:56:11Z
7.11.0-0-2022-04-22-101342 2022-04-22T10:13:42Z
8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z
9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z
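
The question also asks for the lines to be sorted by UTC timestamp. Timestamps in this ISO 8601 form, all ending in Z, compare correctly as plain strings, so one possible sketch is to pipe the extracted lines through sort, keyed on the second field. The function name sort_by_timestamp is illustrative, and the sample lines are copied from the expected output in the question.

```shell
# Hypothetical helper: read "name timestamp" lines on stdin, oldest first.
sort_by_timestamp() {
  # -k2,2 restricts the sort key to the second whitespace-separated field;
  # LC_ALL=C forces plain byte-wise comparison regardless of locale.
  LC_ALL=C sort -k2,2
}

printf '%s\n' \
  '8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z' \
  '7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z' \
  '9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z' | sort_by_timestamp
# prints the 2021-08-11 line first and the 2022-04-22 line last
```

In practice the sed command above would be piped into the same function, and adding -r to sort would list the newest release first.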

CodePudding user response:

First, if your document is valid HTML and nothing prevents it, you should use an HTML parser rather than grep, sed, or awk. If that is not possible, you can use GNU AWK for this task as follows. Let the content of file.txt be

 <td ><a  href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-173459">7.11.0-0-2022-04-22-173459> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-173459"><i title="Inconsistency detected! Click for more details" ></i></a></td>                 <td >Accepted</td>                 <td title="2022-04-22T17:34:59Z">3 days ago</td>
 <td ><a  href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-135611">7.11.0-0-2022-04-22-135611> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-135611"><i title="Inconsistency detected! Click for more details" ></i></a></td>                 <td >Accepted</td>                 <td title="2022-04-22T13:56:11Z">3 days ago</td>
 <td ><a  href="/releasestream/7.11.0-0.-/release/7.11.0-0-2022-04-22-101342">7.11.0-0-2022-04-22-101342> <a href="/releasestream/7.11.0-0.-/inconsistency/7.11.0-0-2022-04-22-101342"><i title="Inconsistency detected! Click for more details" ></i></a></td>                 <td >Accepted</td>                 <td title="2022-04-22T10:13:42Z">3 days ago</td>
 <td ><a  href="/releasestream/8.10.0-0.-/release/8.10.0-0-2022-04-19-164705">8.10.0-0-2022-04-19-164705></td>                 <td >Accepted</td>                 <td title="2022-04-19T16:47:05Z">6 days ago</td>
 <td ><a  href="/releasestream/9.11.0-0.nightly-ppc64le/release/9.11.0-0.-2021-08-11-224744">9.11.0-0-2021-08-11-224744></td>                 <td >Accepted</td>                 <td title="2021-08-11T22:47:44Z">8 months ago</td>

then

awk 'match($0,/[0-9]\.[0-9]{2}\.[0-9]-[0-9]-[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}/){printf "%s ",substr($0,RSTART,RLENGTH)}match($0,/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z/){print substr($0,RSTART,RLENGTH)}' file.txt

output

7.11.0-0-2022-04-22-173459 2022-04-22T17:34:59Z
7.11.0-0-2022-04-22-135611 2022-04-22T13:56:11Z
7.11.0-0-2022-04-22-101342 2022-04-22T10:13:42Z
8.10.0-0-2022-04-19-164705 2022-04-19T16:47:05Z
9.11.0-0-2021-08-11-224744 2021-08-11T22:47:44Z

Explanation: I first crafted two regular expressions that describe the values we are looking for: [0-9] is a digit, \. is a literal dot, and {2} is a repetition count. If the first value is found, I use printf to output it followed by a space (I do not use print, as that would append a newline, which is unwanted here). If the second value is found, I output it with print.
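
As a smaller illustration of the same mechanism, the sketch below shows what match leaves in RSTART and RLENGTH and how substr uses them. The function name locate_timestamp and the sample string are made up for this example, and the regex uses + instead of {n} only to keep the sketch short.

```shell
# Hypothetical helper: report where the first UTC timestamp sits in a string.
locate_timestamp() {
  printf '%s\n' "$1" | awk '
    # match() returns the start position (0 if none) and sets RSTART and RLENGTH.
    match($0, /[0-9]+-[0-9]+-[0-9]+T[0-9:]+Z/) {
      printf "start=%d length=%d ", RSTART, RLENGTH   # printf adds no newline
      print substr($0, RSTART, RLENGTH)               # print ends the line
    }'
}

locate_timestamp 'build 2022-04-22T17:34:59Z done'
# prints: start=7 length=20 2022-04-22T17:34:59Z
```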

Disclaimer: I assume the values you are looking for can be described by the regular expressions shown and are present on every line. Feel free to adjust them to your needs. If you want to know more about match or substr, read String Functions in The GNU Awk User's Guide.

(tested in gawk 4.2.1)
