RegEx to match all href elements in a string

Time:10-28

My string is this:

string1 <- "<a href=\"javascript:openurl('/Xplore/accessinfo.jsp')\" class=\"topUnderlineLinks\">                                            <A href=\"/iel5/4235/4079606/04079617.pdf?tp=&arnumber=4079617&isnumber=4079606\" class=\"bodyCopy\">PDF</A>(3141 KB)&nbsp;
                        <A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
                <td width=\"33%\" ><div align=\"right\"> <a href=\"/xplorehelp/Help_start.html#Help_searchresults.html\" class=\"subNavLinks\" target=\"blank\">Help</a>&nbsp;&nbsp;&nbsp;<a href=\"/xpl/contactus.jsp\" class=\"subNavLinks\">Contact Kimya ile ilgili çeþitli temel referans
<a href=\"http://search.epnet.com/login.asp?profile=web&amp;defaultdb=geh\"
<a href=\"http://iimpft.chadwyck.com/\" target=\"_parent\">International
<a href=\"standartlar.html#tse\" target=\"_parent\">NFPA Standartlarý</a>
<a href=\"http://www.gutenberg.org/\" target=\"_parent\">Project Gutenberg</a>
<a href=\"http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&amp;uicode=istanbultek\"
<a href=\"http://www.scitation.org\" target=\"_parent\">Scitation</a>
dergilerin listesini görmek için <a href=\"/online/aip.html\">bu yolu</a>
<a href=\"http://www3.interscience.wiley.com/journalfinder.html\"
               <td width=\"46%\"><a href=\"/xpl/periodicals.jsp\" class=\"dropDownNav\" accesskey=\"j\">Journals &amp; Magazines
               <td><a href=\"http://www.ieee.org/products/onlinepubs/resources/XploreTutorial.pdf\" class=\"dropDownNav\">IEEE Xplore Demo</a></td>" 

And I want to extract every href="..." or href='...' in this string.

CodePudding user response:

The best thing to do when using regex to parse HTML is... don't. A dedicated HTML parser is a faster and more reliable way to do it.

For example:

library(rvest)

# Parse the string as HTML, select every <a> element, and read its href attribute
read_html(string1) %>% 
  html_nodes("a") %>%
  html_attr("href")

#>  [1] "javascript:openurl('/Xplore/accessinfo.jsp')"                                       
#>  [2] "/iel5/4235/4079606/04079617.pdf?tp=&arnumber=4079617&isnumber=4079606"              
#>  [3] "/xpl/RecentCon.jsp?punumber=10417"                                                  
#>  [4] "/xplorehelp/Help_start.html#Help_searchresults.html"                                
#>  [5] "/xpl/contactus.jsp"                                                                 
#>  [6] "http://search.epnet.com/login.asp?profile=web&defaultdb=geh"                        
#>  [7] "standartlar.html#tse"                                                               
#>  [8] "http://www.gutenberg.org/"                                                          
#>  [9] "http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&uicode=istanbultek"
#> [10] "/online/aip.html"                                                                   
#> [11] "http://www3.interscience.wiley.com/journalfinder.html"                              
#> [12] "/xpl/periodicals.jsp"                                                               
#> [13] "http://www.ieee.org/products/onlinepubs/resources/XploreTutorial.pdf"

If you really need the opening href=" and the closing " around each value, you can add them easily with paste0() (or paste() with sep = "").
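As a minimal sketch of that last step: the vector hrefs below is a hypothetical stand-in holding two of the values from the output above, not the full result of the pipeline.

```r
# Two of the extracted values, copied from the output above (illustrative subset)
hrefs <- c("/xpl/contactus.jsp", "/online/aip.html")

# paste0() is vectorised, so each element gets the href="..." wrapper
wrapped <- paste0('href="', hrefs, '"')

writeLines(wrapped)
#> href="/xpl/contactus.jsp"
#> href="/online/aip.html"
```

In practice the same paste0() call could be appended to the end of the rvest pipeline instead of being applied to a hand-built vector.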

Created on 2021-10-27 by the reprex package (v2.0.0)

CodePudding user response:

/href=\\?(['"]).*?\1/

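We've used \1 to refer back to whichever quote, ' or ", the group (['"]) captured, so the match ends at the same kind of quote that opened it. The \\? allows an optional literal backslash before the opening quote.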
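A rough sketch of how this pattern might be applied in R with base functions. The variable html is a shortened, hypothetical sample built from two links in the question, and the optional backslash (\\?) is dropped on the assumption that the \" in the question are only R string escapes, not literal backslashes in the data.

```r
# Short sample with both quoting styles (links taken from the question's string)
html <- "<a href=\"/xpl/contactus.jsp\" class=\"subNavLinks\">Contact</a> <A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation</A>"

# (['\"]) captures the opening quote, .*? matches lazily,
# and \\1 requires the same quote character to close the value
pattern <- "href=(['\"]).*?\\1"

# perl = TRUE selects the PCRE engine, which supports lazy quantifiers and backreferences
matches <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]

writeLines(matches)
#> href="/xpl/contactus.jsp"
#> href='/xpl/RecentCon.jsp?punumber=10417'
```

Unlike a parser, the pattern does not check whether the surrounding tag is well formed, so it should also match href attributes inside tags that were cut off before their closing >, as several in the question's string appear to be.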
