HTTP client not working for retrieving content from e-commerce website

I am trying to implement a web page watcher in Rust. The basic idea is that when a certain string is not found in the page content, I get a notification.
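
The core check boils down to something like this (simplified; the notification side is reduced to a boolean here, since it is not part of the problem):

use reqwest::Client;

// Returns true when the watched string has disappeared from the page,
// i.e. when a notification should be fired.
async fn should_notify(cli: &Client, url: &str, needle: &str) -> Result<bool, reqwest::Error> {
    let body = cli.get(url).send().await?.text().await?;
    Ok(!body.contains(needle))
}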

The basic logic works in most situations, but for a certain e-commerce site (argos.co.uk in this case) it always returns a page with "You don't have permission to access" in it.

The same page, of course, works fine in Safari. So I did "Copy as cURL" there, which gave me the following:

-X 'GET' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'

Running the copied cURL command works as expected. So I added those two headers into my Rust code:

use reqwest::header::{ACCEPT, USER_AGENT};

let cli = reqwest::Client::new();
let resp = cli
    .get(url)
    .header(USER_AGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15")
    .header(ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .send().await?;

And I still got the same "you don't have permission..." page as above.

Using the above code with httpbin.org/get shows that reqwest is indeed sending the right headers. So I am at a loss as to where to look next. What could have gone wrong in my situation?
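
For reference, the httpbin check is the same request pointed at the echo endpoint, roughly:

// Same client and headers as above, but sent to httpbin,
// which echoes the received request headers back as JSON.
let echo = cli
    .get("https://httpbin.org/get")
    .header(USER_AGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15")
    .header(ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .send().await?
    .text().await?;
println!("{}", echo);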

EDIT

I tried the cURL command against httpbin, as suggested below, and got the following back:

   "Accept": "text/html,application/xhtml xml,application/xml;q=0.9,*/*;q=0.8",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15",
    "X-Amzn-Trace-Id": "Root=1-62011d4b-5ea67cec2c79826d5d00e959"

I don't believe X-Amzn-Trace-Id was sent by cURL, but I am more than willing to be proven wrong.

CodePudding user response:

As already suggested, this is most likely basic header validation meant to deter scraping. The following works for me:

use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let cli = reqwest::Client::new();
    let resp = cli
        .get("https://www.argos.co.uk/product/<YOUR_ID>")
        .header(reqwest::header::ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8")
        // Request an uncompressed body; see the note below.
        .header(reqwest::header::ACCEPT_ENCODING, "identity")
        .header(reqwest::header::ACCEPT_LANGUAGE, "en-US,en;q=0.5")
        .header(reqwest::header::USER_AGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0")
        .send().await?;

    println!("{:?}", resp.status());
    println!("{}", resp.text().await?);

    Ok(())
}

Note that I've switched Accept-Encoding from gzip to identity, because reqwest does not decompress gzip bodies unless its gzip feature is enabled. You can (and most likely should) use that feature instead.
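
If you go that route, it is just a Cargo feature plus an optional builder toggle; a minimal sketch, assuming reqwest 0.11:

// In Cargo.toml: reqwest = { version = "0.11", features = ["gzip"] }
// With the feature compiled in, the client sends Accept-Encoding: gzip
// and transparently decompresses response bodies.
let cli = reqwest::Client::builder()
    .gzip(true) // explicit here; it is on by default once the feature is enabled
    .build()?;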

And of course make sure to respect the site's robots.txt and terms of use when scraping.
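
Checking the robots.txt is itself just one more GET, for example:

// The crawling rules for the whole site live at a well-known path.
let robots = cli
    .get("https://www.argos.co.uk/robots.txt")
    .send().await?
    .text().await?;
println!("{}", robots);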
