Working from the examples, this is how far I have gotten with scraping using htmldom (simple_html_dom):
<?php
require 'simple_html_dom.php';
$html = file_get_html('https://nitter.absturztau.be/chillartaholic');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
However, instead of retrieving a title and an image, I'd like to get every line of the target page that begins with:
<a
and display the scraped lines, in their entirety, top to bottom.
The first scraped line would then be:
> <a
> href="/ChillArtaholic/status/1413973360841744390#m"></a>
Is this possible with htmldom, or are there limitations on the number of lines that can be scraped?
CodePudding user response:
Yes, it is possible to scrape the <a elements with htmldom. Use the find() method to select the elements you want (here, by class), then read the outertext property to get each element's full markup. The plaintext property returns only the text inside the element, not the tag itself. Here's an example:
require 'simple_html_dom.php';
$html = file_get_html('https://nitter.absturztau.be/chillartaholic');
// Use find('a') instead to select every <a> element on the page
foreach($html->find('a.tweet-link') as $element) {
// Escape the markup so the browser displays it instead of rendering a link
echo htmlspecialchars($element->outertext) . "<br>\n";
}
This code finds all <a> elements with the class "tweet-link" and prints the full markup of each one, followed by a line break.
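Another method, using a regular expression:
$url = 'https://nitter.absturztau.be/chillartaholic';
$html = file_get_contents($url);
// \b lets the tag name be followed by a space, a newline, or ">"; the s flag lets . span line breaks
preg_match_all('/<a\b.*?<\/a>/is', $html, $matches);
foreach ($matches[0] as $match) {
// Escape the markup so the browser displays it instead of rendering a link
echo htmlspecialchars($match) . "<br>\n";
}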
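If the goal is literally the source lines that start with <a, rather than whole <a>…</a> elements, a line-by-line filter is one option. The following is only a minimal sketch: the helper name linesStartingWithAnchor and the inline sample markup are assumptions made for illustration, and the sample stands in for the page so the code runs without a network request.

```php
<?php
// Hypothetical helper (not part of any library): return every source line
// whose first non-blank characters are "<a" followed by whitespace, ">",
// or the end of the line.
function linesStartingWithAnchor(string $html): array
{
    // Split on any newline style (\r\n, \n or \r).
    $lines = preg_split('/\R/', $html);
    $hits = [];
    foreach ($lines as $line) {
        // The lookahead keeps tags such as <abbr> or <article> out.
        if (preg_match('/^\s*<a(?=[\s>]|$)/i', $line)) {
            $hits[] = trim($line);
        }
    }
    return $hits;
}

// Made-up sample markup, used instead of a live request.
$sample = "<div>\n  <a href=\"/x/status/1#m\"></a>\n  <abbr>n/a</abbr>\n<a\nhref=\"/y\">y</a>\n</div>";

foreach (linesStartingWithAnchor($sample) as $line) {
    // Escape so a browser shows the markup instead of rendering a link.
    echo htmlspecialchars($line), "<br>\n";
}
```

Note that a tag split across several source lines, as in the example quoted in the question, yields only the fragment on the line that starts with <a. If the whole element is needed, an element-based approach is the safer choice.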
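Another one, using DOMDocument and XPath:
$url = 'https://nitter.absturztau.be/chillartaholic';
$html = file_get_contents($url);
$dom = new DOMDocument();
// @ suppresses the warnings libxml emits for real-world HTML
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every <a> element; add a predicate such as [@href] to narrow the match
$nodes = $xpath->query('//a');
foreach ($nodes as $node) {
// saveHTML($node) returns the element's full markup; nodeValue would return only its text
echo htmlspecialchars($dom->saveHTML($node)) . "<br>\n";
}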
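To try the DOM approach without hitting the live page, the same idea can be wrapped in a small function and run against an inline string. This is a sketch under stated assumptions: anchorMarkup is a hypothetical helper name and the sample markup is made up. It uses libxml_use_internal_errors() rather than the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not part of any library): return the full markup
// of every <a> element found in an HTML string.
function anchorMarkup(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $out = [];
    foreach ($xpath->query('//a') as $node) {
        // saveHTML($node) serializes the element itself, tags included.
        $out[] = $dom->saveHTML($node);
    }
    return $out;
}

// Made-up sample markup, used instead of a live request.
$sample = '<html><body><div><a href="/x/status/1#m"></a>'
    . '<p>text <a href="/y">y</a></p></div></body></html>';

foreach (anchorMarkup($sample) as $markup) {
    // Escape so a browser shows the markup instead of rendering a link.
    echo htmlspecialchars($markup), "<br>\n";
}
```

Because the parser re-serializes each element, the output is normalized markup and may not match the original source text character for character (line breaks and attribute quoting can differ).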
I hope this helps.