My string is this:
string1 <- "<a href=\"javascript:openurl('/Xplore/accessinfo.jsp')\" class=\"topUnderlineLinks\"> <A href=\"/iel5/4235/4079606/04079617.pdf?tp=&arnumber=4079617&isnumber=4079606\" class=\"bodyCopy\">PDF</A>(3141 KB)
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
<td width=\"33%\" ><div align=\"right\"> <a href=\"/xplorehelp/Help_start.html#Help_searchresults.html\" class=\"subNavLinks\" target=\"blank\">Help</a> <a href=\"/xpl/contactus.jsp\" class=\"subNavLinks\">Contact Kimya ile ilgili çeþitli temel referans
<a href=\"http://search.epnet.com/login.asp?profile=web&defaultdb=geh\"
<a href=\"http://iimpft.chadwyck.com/\" target=\"_parent\">International
<a href=\"standartlar.html#tse\" target=\"_parent\">NFPA Standartlarý</a>
<a href=\"http://www.gutenberg.org/\" target=\"_parent\">Project Gutenberg</a>
<a href=\"http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&uicode=istanbultek\"
<a href=\"http://www.scitation.org\" target=\"_parent\">Scitation</a>
dergilerin listesini görmek için <a href=\"/online/aip.html\">bu yolu</a>
<a href=\"http://www3.interscience.wiley.com/journalfinder.html\"
<td width=\"46%\"><a href=\"/xpl/periodicals.jsp\" class=\"dropDownNav\" accesskey=\"j\">Journals & Magazines
<td><a href=\"http://www.ieee.org/products/onlinepubs/resources/XploreTutorial.pdf\" class=\"dropDownNav\">IEEE Xplore Demo</a></td>"
And I want to extract every href="..." or href='...' from this string.
CodePudding user response:
The best thing to do when using regex to parse HTML is... don't. There are better, faster, and more reliable ways to do it.
For example:
library(rvest)

# Parse the HTML fragment, select every <a> element, and read its href attribute
read_html(string1) %>%
  html_nodes("a") %>%
  html_attr("href")
#> [1] "javascript:openurl('/Xplore/accessinfo.jsp')"
#> [2] "/iel5/4235/4079606/04079617.pdf?tp=&arnumber=4079617&isnumber=4079606"
#> [3] "/xpl/RecentCon.jsp?punumber=10417"
#> [4] "/xplorehelp/Help_start.html#Help_searchresults.html"
#> [5] "/xpl/contactus.jsp"
#> [6] "http://search.epnet.com/login.asp?profile=web&defaultdb=geh"
#> [7] "standartlar.html#tse"
#> [8] "http://www.gutenberg.org/"
#> [9] "http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&uicode=istanbultek"
#> [10] "/online/aip.html"
#> [11] "http://www3.interscience.wiley.com/journalfinder.html"
#> [12] "/xpl/periodicals.jsp"
#> [13] "http://www.ieee.org/products/onlinepubs/resources/XploreTutorial.pdf"
If you really need the opening href=" and the closing ", you can add these easily with paste.
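As a minimal sketch of that last step: the vector hrefs below is a hypothetical stand-in for whatever html_attr("href") returned, and paste0 (paste with no separator) is used so that no spaces are inserted between the pieces.

```r
# Hypothetical stand-in for the character vector returned by html_attr("href")
hrefs <- c("/xpl/contactus.jsp", "http://www.gutenberg.org/")

# Wrap each value in href="..." ; paste0 concatenates without a separator
wrapped <- paste0('href="', hrefs, '"')

# Use cat() rather than print() to see the text without R's escape backslashes
cat(wrapped, sep = "\n")
```

Single quotes around the R string literals avoid having to escape the double quotes being added.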
Created on 2021-10-27 by the reprex package (v2.0.0)
CodePudding user response:
/href=\\?(['"]).*?\1/
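We've used \1 to refer back to whichever quote character, ' or ", was matched by the capture group, so an attribute opened with one kind of quote must be closed by the same kind. The \\? allows for an optional literal backslash before the opening quote.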
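If you want to try this pattern from R, a rough sketch could look like the following. It assumes the string has already been parsed by R, so the \" escapes are plain quotes and the optional-backslash part of the pattern is dropped. The html value is a shortened, hypothetical sample rather than the full string from the question, and perl = TRUE is requested for the lazy quantifier and the backreference.

```r
# Shortened hypothetical sample containing one double-quoted and one single-quoted href
html <- "<a href=\"/xpl/contactus.jsp\" class=\"subNavLinks\">Contact</a> <A href='/xpl/RecentCon.jsp?punumber=10417'>Congress</A>"

# Capture the opening quote, match lazily, then require the same quote via \1
pattern <- "href=(['\"]).*?\\1"

# gregexpr finds every match; regmatches pulls out the matched substrings
matches <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]

# Optionally strip the href= prefix and the surrounding quotes
urls <- sub("^href=['\"](.*)['\"]$", "\\1", matches)

cat(matches, sep = "\n")
cat(urls, sep = "\n")
```

Note that a regex matches raw text, so its results can differ from an HTML parser's on malformed markup such as unclosed tags.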