How to efficiently extract the URLs for specified content in a web page?

Time:10-09

I just want to search a page for a keyword in the site's specified content and get the matching URLs. The keyword may of course appear more than once in the page, so the search can loop from beginning to end. The URLs are not fixed.

I understand that fetching the page is not difficult and can be done with idhttp; the main problem is the string processing. Thanks in advance!

Keywords: specified content, website, extraction, loop, efficiency

CodePudding user response:

Very interesting topic. You should be able to find some source code for web page parsing, though I have never done it myself. Some of it is written for crawlers, which have very broad functionality and perhaps go beyond what you need.

I think you need to distinguish whether you are looking for link URLs or for URLs that appear in the text. If the former, you need to parse the tags according to the HTML specification, which is a bit like writing a crawler. If the latter, it is a string search. It is not clear, though, what you mean by "content". If it is a keyword, the question comes down to how to determine whether the part after a found keyword is a web address, and how to determine where the address begins and ends. The problem then boils down to recognizing a signature string, which may need a bit of syntax analysis.
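
For the second case (addresses that appear as plain text), a minimal sketch of the boundary idea could look like the following. It treats a fixed prefix such as 'http://' as the signature string and assumes an address ends at whitespace, a quote, or an angle bracket. The names IsUrlEnd and CollectUrls and the sample text are made up for illustration, and newer Delphi versions may need the System. prefix on the unit names.

```pascal
program FindUrlsInText;

{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils, Classes;

// Characters assumed to end a URL; adjust this for the pages you handle.
function IsUrlEnd(C: Char): Boolean;
begin
  Result := (C <= ' ') or (C = '"') or (C = '''') or (C = '<') or (C = '>');
end;

// Appends every substring of Text that starts with Prefix to Urls.
// The match is case-sensitive, so 'HTTP://' would not be found.
procedure CollectUrls(const Text, Prefix: string; Urls: TStrings);
var
  StartPos, EndPos: Integer;
begin
  StartPos := PosEx(Prefix, Text, 1);
  while StartPos > 0 do
  begin
    // Move EndPos forward until a terminating character or the end of Text.
    EndPos := StartPos + Length(Prefix);
    while (EndPos <= Length(Text)) and not IsUrlEnd(Text[EndPos]) do
      Inc(EndPos);
    Urls.Add(Copy(Text, StartPos, EndPos - StartPos));
    // Continue searching after the address that was just found.
    StartPos := PosEx(Prefix, Text, EndPos);
  end;
end;

var
  Urls: TStringList;
  I: Integer;
begin
  Urls := TStringList.Create;
  try
    CollectUrls(
      'See http://www.baidu.com for details, or <a href="http://example.com/a">A</a>.',
      'http://', Urls);
    for I := 0 to Urls.Count - 1 do
      Writeln(Urls[I]);
  finally
    Urls.Free;
  end;
end.
```

This scans the text once per prefix, so the cost grows linearly with the page size. It knows nothing about HTML, which is why it cannot tell a link address from an address that merely appears in the text.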

CodePudding user response:

I am not writing a crawler and don't need to find all the link addresses. As you said, the "specified content" is the "keyword".
Still looking for a solution!

CodePudding user response:

I understand that what you mean is something like <a href="http://www.baidu.com">Baidu once, you'll know</a>: you search for the string "Baidu", and the result should match http://www.baidu.com.

In that case, you can go through all the links on the current page, extract each href and its text, and store them in an array. Then loop through the array looking for the keyword in the text, which gives you a new array. Alternatively, do the check at extraction time, or loop over the DOM and do the check there.

CodePudding user response:

Can't update the data in the lottery hehe

CodePudding user response:

I still think traversing the DOM tree is the way to go. Are you looking for addresses embedded in the text? You might as well leave that problem aside at first and traverse the nodes of the DOM tree. Whenever a node may carry a URL, examine its attributes, such as the href attribute mentioned by the poster above, the src attribute, and so on. Once you have the attribute value, you have the address, and you can then check it against the specified keyword.

Traversing the DOM nodes is simple with a recursive algorithm: write a recursive function that starts at the root node and finds its child nodes; for each child node, the function calls itself to find that child's own children. In this way, a single recursive function completes the traversal.
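
A rough sketch of such a recursive function, assuming the page is already loaded in a TWebBrowser and accessed through the MSHTML interfaces, might look like this. The unit name DomWalk and the procedure name CollectUrlAttributes are invented for the example; the keyword check is left to the caller.

```pascal
unit DomWalk;

interface

uses
  Classes, MSHTML;

// Recursively visits Node and all of its descendants. For every element
// with a non-empty href or src attribute, the value is added to Urls.
procedure CollectUrlAttributes(const Node: IHTMLDOMNode; Urls: TStrings);

implementation

uses
  SysUtils, Variants;

procedure CollectUrlAttributes(const Node: IHTMLDOMNode; Urls: TStrings);
const
  UrlAttributes: array[0..1] of string = ('href', 'src');
var
  Element: IHTMLElement;
  Child: IHTMLDOMNode;
  Value: OleVariant;
  I: Integer;
begin
  if Node = nil then
    Exit;
  // nodeType 1 is an element node; text and comment nodes are skipped here.
  if (Node.nodeType = 1) and Supports(Node, IHTMLElement, Element) then
    for I := Low(UrlAttributes) to High(UrlAttributes) do
    begin
      Value := Element.getAttribute(UrlAttributes[I], 0);
      // A missing attribute comes back as Null, so test for a string first.
      if VarIsStr(Value) and (Value <> '') then
        Urls.Add(Value);
    end;
  // The procedure calls itself for each child, which walks the whole subtree.
  Child := Node.firstChild;
  while Child <> nil do
  begin
    CollectUrlAttributes(Child, Urls);
    Child := Child.nextSibling;
  end;
end;

end.
```

Once the document has finished loading, a call could look like CollectUrlAttributes((WebBrowser1.Document as IHTMLDocument3).documentElement as IHTMLDOMNode, Memo1.Lines), where WebBrowser1 and Memo1 are assumed component names. Each node is visited exactly once, so the traversal time is proportional to the number of nodes in the page.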

CodePudding user response:

I have got a preliminary version working with loops, but the efficiency feels low. With many pages and many keywords, efficiency matters even more. Also, the extraction result cannot be guaranteed across different pages.

None01 and jinghai1776 mentioned DOM tree traversal, which I have not worked with. Could someone give an example to study? Thank you!

CodePudding user response:

From what I have found so far, there are the following ways to extract the link text and link address from a page:
1. String manipulation
2. Regular expressions
3. HTML parsing via MSHTML
4. The DOM tree (I have not worked with it yet)

If any expert has looked into one of the approaches above, please offer advice or comments!

CodePudding user response:

If you are already familiar with regular expressions, using idhttp is naturally fine. If not, open the page directly in a WebBrowser, loop through all the links, and check whether each innerText contains the specified keyword:
// doc is an OleVariant, i is an Integer
doc := webbroser.OleObject.Document.links;
for i := 0 to doc.length - 1 do
begin
  // keep the link when its text contains the keyword
  if Pos('keyword', doc.item(i).innerText) > 0 then
    Memo1.Lines.Add(doc.item(i).innerText + ', ' + doc.item(i).href);
end;

CodePudding user response:

Quoting devhp's reply on the eighth floor:
If you are already familiar with regular expressions, using idhttp is naturally fine. If not, open the page directly, loop through all the links, and check whether each innerText contains the specified keyword.



Thanks devhp, I am studying your code. I have been looking at regular expressions these days, but with a regex I can only extract the href, not the link text. If anyone has time, please help me see how to extract the link and the link text at the same time with a regex.
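
One possible way to get both parts from a single match is to use two capture groups, one for the href value and one for the text between the tags. The sketch below assumes Delphi XE2 or later (for System.RegularExpressions) and double-quoted href values; the pattern, the sample HTML, and the keyword are made up for illustration, and real pages with single-quoted or unquoted attributes would need a looser pattern.

```pascal
program ExtractLinksRegex;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  // Group 1: the href value. Group 2: everything between <a ...> and </a>.
  LinkPattern = '<a\s[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>';
  SampleHtml =
    '<p><a href="http://www.baidu.com">Baidu search</a> and ' +
    '<a class="x" href="http://example.com/">Another site</a></p>';
  Keyword = 'Baidu';

var
  Match: TMatch;
begin
  // roSingleLine lets "." match line breaks inside the link text.
  for Match in TRegEx.Matches(SampleHtml, LinkPattern,
    [roIgnoreCase, roSingleLine]) do
    // Keep only the links whose text contains the keyword.
    if Pos(Keyword, Match.Groups[2].Value) > 0 then
      Writeln(Match.Groups[2].Value, ', ', Match.Groups[1].Value);
end.
```

Group 2 captures the raw inner HTML, so a link that wraps an image or a span will still contain those tags and may need a second pass to strip them.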

CodePudding user response:

I saw your post; I am in the same situation. I am also fetching data from HTML pages.
For a small amount of data, regular expressions will do.
But when fetching a larger amount of data, with many entries to grab, efficiency becomes something that must be ensured.
I don't know whether your problem has been solved. Do you have any relevant material or experience to share? Thank you.

CodePudding user response:

Quoting willhuo's reply on the tenth floor:
I saw your post; I am in the same situation. I am also fetching data from HTML pages.
For a small amount of data, regular expressions will do.
But when fetching a larger amount of data, with many entries to grab, efficiency becomes something that must be ensured.
I don't know whether your problem has been solved. Do you have any relevant material or experience to share? Thank you.


Wu has an open source HTML parser that is quite useful.

CodePudding user response:

You can use IdHTTP.
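
As a small sketch of that step, assuming the Indy components are installed, fetching a page with TIdHTTP could look like this. The function name FetchHtml is made up, the URL is only a sample, and newer Delphi versions may need the System. prefix on SysUtils.

```pascal
program FetchPage;

{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP;

// Downloads the page at URL and returns its HTML as a string.
function FetchHtml(const URL: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    Http.HandleRedirects := True; // follow redirect responses
    Result := Http.Get(URL);
  finally
    Http.Free;
  end;
end;

begin
  try
    Writeln(Length(FetchHtml('http://www.baidu.com')), ' characters downloaded');
  except
    on E: Exception do
      Writeln('Request failed: ', E.Message);
  end;
end.
```

The returned string can then be passed to whichever extraction method is chosen. For https addresses, TIdHTTP additionally needs an SSL IOHandler such as TIdSSLIOHandlerSocketOpenSSL and the OpenSSL libraries.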