How do I build a crawler that will go on infinitely?


I want to make a crawler that just keeps going indefinitely until a page has no links. Every time it crawls a page, it should return the HTML of that page so I can parse it and pull out the title, meta tags, and the text from article or p tags. I basically want it to look like this:

while(num_links_in_page > 0){
 html = page.content
 /* code to parse html */
 insert_in_db(html, meta, title, info, url)
}

I am using PHP, JavaScript, and MySQL for the database, but I have no problem switching to Python or any other language. I don't have much money for distributed systems, but I need it to be fast, unlike my current crawler, which I made from scratch: it takes 20 minutes to crawl 5 links and also stops after about 50 links.

CodePudding user response:

What have you tried so far? You will need a lot more than the loop above. You need something along these lines:

-- Database structure
CREATE TABLE `webpage_details` (
 `link` text NOT NULL,
 `title` text NOT NULL,
 `description` text NOT NULL,
 `internal_link` text NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

<?php
 $main_url = "http://samplesite.com";
 $str = file_get_contents($main_url);

 // Gets Webpage Title
 $title = "";
 if (strlen($str) > 0)
 {
  $str = trim(preg_replace('/\s+/', ' ', $str)); // collapse whitespace; supports line breaks inside <title>
  preg_match("/\<title\>(.*)\<\/title\>/i", $str, $matches); // ignore case
  $title = $matches[1] ?? "";
 }

 // Gets Webpage Description
 $url = parse_url($main_url);
 @$tags = get_meta_tags($url['scheme'] . '://' . $url['host']);
 $description = $tags['description'] ?? "";

 // Gets Webpage Internal Links
 $doc = new DOMDocument;
 @$doc->loadHTML($str);

 $sec_url = [];
 $items = $doc->getElementsByTagName('a');
 foreach ($items as $value)
 {
  $href = $value->attributes->getNamedItem('href');
  if ($href !== null)
  {
   $sec_url[] = $href->nodeValue;
  }
 }
 $all_links = implode(",", $sec_url);

 // Store Data In Database (the old mysql_* API was removed in PHP 7; use mysqli)
 $host = "localhost";
 $username = "root";
 $password = "";
 $databasename = "sample";
 $connect = new mysqli($host, $username, $password, $databasename);

 $stmt = $connect->prepare("INSERT INTO webpage_details (link, title, description, internal_link) VALUES (?, ?, ?, ?)");
 $stmt->bind_param("ssss", $main_url, $title, $description, $all_links);
 $stmt->execute();
?>

http://talkerscode.com/webtricks/create-simple-web-crawler-using-php-and-mysql.php
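Note that the snippet above only processes a single page. To make it "keep going" the way the question asks, you would drive it from a queue of discovered URLs plus a "seen" set. A minimal sketch, where fetch_page() and extract_links() are hypothetical helpers wrapping the fetch/parse/insert logic shown above:

<?php
 $queue = ["http://samplesite.com"];
 $seen  = [];

 while (!empty($queue)) {
  $url = array_shift($queue);
  if (isset($seen[$url])) {
   continue;                 // skip pages we have already crawled
  }
  $seen[$url] = true;

  $html = fetch_page($url);  // e.g. file_get_contents() plus error handling
  if ($html === false) {
   continue;
  }

  // parse title / meta / links and insert into the DB here, as above

  foreach (extract_links($html, $url) as $link) {
   if (!isset($seen[$link])) {
    $queue[] = $link;        // newly discovered links extend the crawl
   }
  }
 }
?>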

CodePudding user response:

The slowest part of your crawler is fetching the page. This can be addressed with multiple processes (or threads) running independently of one another. Perl can spawn processes, PHP 8 has the "parallel" extension, and shell scripts (at least on Linux-like OSs) can run jobs in the background. I recommend about 10 simultaneous processes as a reasonable trade-off among the various competing resource limits.
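If you stay in PHP, another way to attack the same bottleneck is PHP's bundled cURL multi interface, which keeps several transfers in flight inside one process rather than spawning separate ones. A rough sketch with placeholder URLs:

<?php
 // Fetch several URLs concurrently with curl_multi.
 $urls = [
  "http://samplesite.com/page1",
  "http://samplesite.com/page2",
  "http://samplesite.com/page3",
 ];

 $mh = curl_multi_init();
 $handles = [];

 foreach ($urls as $url) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 15);
  curl_multi_add_handle($mh, $ch);
  $handles[$url] = $ch;
 }

 // Run all transfers until every handle has finished.
 do {
  $status = curl_multi_exec($mh, $active);
  if ($active) {
   curl_multi_select($mh);   // wait for activity instead of busy-looping
  }
 } while ($active && $status == CURLM_OK);

 foreach ($handles as $url => $ch) {
  $html = curl_multi_getcontent($ch);
  // ... parse $html and queue the discovered links here ...
  curl_multi_remove_handle($mh, $ch);
  curl_close($ch);
 }
 curl_multi_close($mh);
?>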

For Perl, CPAN's WWW::Mechanize will do all the parsing for you and hand you an array of links. The second slowest part is inserting rows into the table one at a time. Collect them up and build a multi-row "batch" INSERT; I recommend limiting each batch to 100 rows.
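A minimal sketch of that batching in PHP/mysqli, assuming an open $mysqli connection and the webpage_details table from the first answer; $crawled_pages is a hypothetical list of already-parsed results:

<?php
 // Buffer parsed pages and write them out as one multi-row INSERT.
 function flush_rows(mysqli $mysqli, array $rows): void
 {
  if (empty($rows)) {
   return;
  }
  $values = [];
  foreach ($rows as $r) {
   // $r = [link, title, description, internal_link]
   $values[] = "('" . implode("','", array_map([$mysqli, 'real_escape_string'], $r)) . "')";
  }
  $mysqli->query(
   "INSERT INTO webpage_details (link, title, description, internal_link) VALUES "
   . implode(",", $values)
  );
 }

 $buffer = [];
 foreach ($crawled_pages as $page) {   // $crawled_pages: hypothetical parsed results
  $buffer[] = [$page['url'], $page['title'], $page['description'], $page['links']];
  if (count($buffer) >= 100) {         // flush every 100 rows, per the advice above
   flush_rows($mysqli, $buffer);
   $buffer = [];
  }
 }
 flush_rows($mysqli, $buffer);          // flush whatever is left
?>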

You also need to avoid crawling the same site repeatedly. To assist with that, I suggest a TIMESTAMP be included with each row. (Plus other logic.)
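For example, one way to do this (an assumption, not part of the schema above) is to add a last_crawled TIMESTAMP column and skip any URL that was fetched recently:

<?php
 // Assumes the column has been added, e.g.:
 //   ALTER TABLE webpage_details
 //     ADD COLUMN last_crawled TIMESTAMP NOT NULL
 //     DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
 function recently_crawled(mysqli $mysqli, string $url, int $max_age_hours = 24): bool
 {
  // Anything crawled after this cutoff is still "fresh" and gets skipped.
  $cutoff = date('Y-m-d H:i:s', time() - $max_age_hours * 3600);
  $stmt = $mysqli->prepare(
   "SELECT 1 FROM webpage_details WHERE link = ? AND last_crawled > ? LIMIT 1"
  );
  $stmt->bind_param("ss", $url, $cutoff);
  $stmt->execute();
  $fresh = $stmt->get_result()->num_rows > 0;
  $stmt->close();
  return $fresh;
 }
?>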

With the above advice, I would expect at least 10 new links per second. And no "stop after 50". OTOH, there are several things that can cause slowdowns or hiccups -- huge pages, distant domains, access denied, etc.

Also, don't pound on a single domain. Perhaps a DoS monitor saw 50 requests in a few seconds and blacklisted your IP address. So be sure to delay several seconds before following a link to any domain you have recently fetched a page from.
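A small sketch of that per-domain throttling in PHP, where polite_fetch() is a hypothetical wrapper around whatever fetch routine you use:

<?php
 // Enforce a minimum delay between requests to the same host.
 $last_fetch = [];   // host => unix timestamp of the last request to it

 function polite_fetch(string $url, array &$last_fetch, int $min_delay = 5)
 {
  $host = parse_url($url, PHP_URL_HOST);
  if ($host) {
   if (isset($last_fetch[$host])) {
    $wait = $min_delay - (time() - $last_fetch[$host]);
    if ($wait > 0) {
     sleep($wait);   // back off before hitting this domain again
    }
   }
   $last_fetch[$host] = time();
  }
  return @file_get_contents($url);
 }
?>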

Even without the advice above, your "20 minutes", "5 links", and "stops after about 50" figures point to other bugs in your current crawler.
