Selenium C# web scraping - failed to resolve/parse html PageSource

Time:11-07

I wrote a simple .NET console app in C# that uses Selenium to scrape a dynamic page for personal use.

The Selenium navigation works fine, but when I try to read the resulting page and retrieve the list of real estate addresses, I get nothing back (null). On top of that, the console shows warnings and errors relating to the Chrome browser.

Full code:

public static void SeleniumExtract()
{
    // initial setup
    IWebDriver driver = new ChromeDriver();

    driver.Navigate().GoToUrl("https://www.knightfrank.co.uk/");

    WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(1));

    // dropdown
    var dropdown1 = driver.FindElement(By.Id("cpMain_ucc1_ctl00_liResidentialFront"));
    dropdown1.Click();
    
    // enter search query
    var search = driver.FindElement(By.Id("cpMain_ucc1_ctl00_txtResidentialSearchBox"));
    search.Click();
    search.SendKeys("London");

    // select search suggestion from dropdown menu
    wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementExists(By.Id("ui-id-4")));
    var suggestion = driver.FindElement(By.Id("ui-id-4"));
    suggestion.Click();

    // submit search
    var submit = driver.FindElement(By.XPath("//div[@id='cpMain_ucc1_ctl00_pnlContentResidential']//a[@class='search-button']"));
    submit.Click();

    // get the data
    var elements = driver.FindElements(By.XPath("//div[@class='grid-address']"));
    
    foreach(var item in elements)
    {
        Console.WriteLine(item.Text);
    }

}

Problem #1 - Selenium:

Selenium does not print any results to the console. The XPath //div[@class='grid-address'] is correct (I have checked it for typos), yet the foreach loop outputs nothing. This is the part of the code that does not work:

// get the data
var elements = driver.FindElements(By.XPath("//div[@class='grid-address']"));

foreach(var item in elements)
{
    Console.WriteLine(item.Text);
}

Problem #2 - Html Agility Pack:

Alternatively, I tried to use Html Agility Pack to parse the PageSource, but it only produces a null exception.

First, I return the page source from SeleniumExtract():

// export current pagesource
var currentPage = driver.PageSource;
return currentPage;

And then, I load the page source into Html Agility Pack to work with. It returns nothing!

public static void HapParse(string currentPage)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.Load(currentPage);

    var address = htmlDoc.DocumentNode
        .SelectNodes("//div[@class='grid-address']")
        .ToList();

    foreach(var item in address)
    {
        Console.WriteLine(item.InnerText);
    }
    
}
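
Independently of any timing issue, two details in the snippet above can cause a failure on their own: HtmlDocument.Load(string) treats its argument as a file path, whereas LoadHtml(string) parses an HTML string, and SelectNodes returns null (not an empty list) when nothing matches, so calling .ToList() on the result throws. A minimal sketch of the same method with both points addressed, assuming the HtmlAgilityPack NuGet package is referenced (the class name HapExample is only for illustration):

```csharp
using System;
using HtmlAgilityPack;

public static class HapExample
{
    public static void HapParse(string currentPage)
    {
        var htmlDoc = new HtmlDocument();

        // LoadHtml parses an HTML string; Load(string) would expect a file path.
        htmlDoc.LoadHtml(currentPage);

        var nodes = htmlDoc.DocumentNode
            .SelectNodes("//div[@class='grid-address']");

        // SelectNodes returns null when there is no match, so check before iterating.
        if (nodes == null)
        {
            Console.WriteLine("No matching nodes found.");
            return;
        }

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

If the page source was captured before the results had rendered, this version prints "No matching nodes found." instead of throwing, which makes the timing problem easier to spot.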

Problem #3 - AngleSharp:

I tried the same with AngleSharp, and it still does not work.

public static async void AngleSharpParse(string currentPage)
{
    var config = Configuration.Default;
    var context = BrowsingContext.New(config);

    var document = await context.OpenAsync(req => req.Content(currentPage));

    var elements = document.QuerySelectorAll("div.grid-address");

    foreach (var item in elements)
    {
        Console.WriteLine(item.TextContent);
    }
}
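
One thing to watch in the snippet above: because the method is declared async void, the caller cannot await it, so a console app may reach the end of Main before anything is printed. A hedged sketch of the same parsing logic returning a Task instead, assuming the AngleSharp NuGet package is referenced (the class name AngleSharpExample is only for illustration):

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

public static class AngleSharpExample
{
    // Returning Task (instead of void) lets the caller await completion.
    public static async Task AngleSharpParse(string currentPage)
    {
        var context = BrowsingContext.New(Configuration.Default);

        // Parse the HTML string that was captured from driver.PageSource.
        var document = await context.OpenAsync(req => req.Content(currentPage));

        foreach (var item in document.QuerySelectorAll("div.grid-address"))
        {
            Console.WriteLine(item.TextContent.Trim());
        }
    }
}
```

From an async Main, this would be called as await AngleSharpExample.AngleSharpParse(currentPage);. It still prints nothing if the page source was captured before the results had loaded.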

Problem #4 - Warnings and errors relating to the Chrome browser:

The console also shows the following errors every time I execute the code:

[39536:28224:1103/183909.271:ERROR:chrome_browser_main_extra_parts_metrics.cc(230)] crbug.com/1216328: Checking Bluetooth availability started. Please report if there is no report that this ends.
[39536:28224:1103/183909.271:ERROR:chrome_browser_main_extra_parts_metrics.cc(233)] crbug.com/1216328: Checking Bluetooth availability ended.
[39536:28224:1103/183909.271:ERROR:chrome_browser_main_extra_parts_metrics.cc(236)] crbug.com/1216328: Checking default browser status started. Please report if there is no report that this ends.
[39536:39768:1103/183909.275:ERROR:device_event_log_impl.cc(214)] [18:39:09.274] USB: usb_service_win.cc:389 Could not read device interface GUIDs: The system cannot find the file specified. (0x2)
[39536:39768:1103/183909.280:ERROR:device_event_log_impl.cc(214)] [18:39:09.280] USB: usb_device_handle_win.cc:1048 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[39536:28224:1103/183909.291:ERROR:chrome_browser_main_extra_parts_metrics.cc(240)] crbug.com/1216328: Checking default browser status ended.
[39536:39768:1103/183909.310:ERROR:device_event_log_impl.cc(214)] [18:39:09.310] USB: usb_device_handle_win.cc:1048 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
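
Messages like these are commonly reported as Chrome's own diagnostic logging and are probably unrelated to the empty scraping result. A commonly suggested way to quiet them is to exclude Chrome's enable-logging switch and raise the log level when creating the driver. The sketch below assumes the Selenium.WebDriver package and has not been verified against this exact Chrome version; the class name QuietChrome is only for illustration:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class QuietChrome
{
    public static IWebDriver Create()
    {
        var options = new ChromeOptions();

        // Ask ChromeDriver not to pass --enable-logging to the browser.
        options.AddExcludedArgument("enable-logging");

        // Only log fatal messages (assumption: enough to hide the USB/Bluetooth lines).
        options.AddArgument("--log-level=3");

        return new ChromeDriver(options);
    }
}
```

With this in place, IWebDriver driver = new ChromeDriver(); in the full code would become IWebDriver driver = QuietChrome.Create();.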

Can someone please tell me why this very simple code doesn't return any result?

p.s.

I actually tried to code the exact same thing in Python using Selenium and Beautiful Soup and everything works perfectly.

What am I missing here?

User response:

I simply needed to add a wait after Selenium submitted the search query, such as Thread.Sleep(3000), so that the page can fully load before the HTML is parsed.
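
A fixed sleep works, but it either wastes time or is too short on a slow connection. As an alternative, here is a minimal sketch of an explicit wait that polls until at least one result element exists. It assumes the Selenium.Support package (which provides WebDriverWait); the helper name ResultWaiter and the timeout value are illustrative choices, not part of the original code:

```csharp
using System;
using System.Collections.ObjectModel;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class ResultWaiter
{
    // Polls until at least one address element is present, or throws
    // WebDriverTimeoutException once the timeout elapses.
    public static ReadOnlyCollection<IWebElement> WaitForAddresses(
        IWebDriver driver, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);

        return wait.Until(d =>
        {
            var found = d.FindElements(By.XPath("//div[@class='grid-address']"));

            // Returning null tells WebDriverWait to keep polling.
            return found.Count > 0 ? found : null;
        });
    }
}
```

After submit.Click(), the results could then be read with:

```csharp
foreach (var item in ResultWaiter.WaitForAddresses(driver, TimeSpan.FromSeconds(10)))
{
    Console.WriteLine(item.Text);
}
```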
