I am trying to implement a web page watcher in Rust. The basic idea is that when a certain string is not found in the page content, I get a notification.
The basic logic works in most situations, but for one e-commerce site (argos.co.uk in this case) it always returns a page containing "You don't have permission to access".
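A minimal sketch of that check, using only the standard library. It treats the block page as its own case, so an access-denied response is never mistaken for "string not found". The names `PageState`, `classify_page` and `BLOCK_MARKER` are hypothetical, and the page bodies in `main` are made up for illustration.

```rust
/// Outcome of inspecting one fetched page body.
#[derive(Debug, PartialEq)]
enum PageState {
    /// The site served its access-denied page instead of the real content.
    Blocked,
    /// The watched string is present; nothing to report.
    Present,
    /// The watched string is absent; this is the case that should notify.
    Missing,
}

// Marker text quoted from the block page described above.
const BLOCK_MARKER: &str = "You don't have permission to access";

/// Classify a page body. The block marker is checked first so that a
/// blocked request never triggers a "string not found" notification.
fn classify_page(body: &str, watched: &str) -> PageState {
    if body.contains(BLOCK_MARKER) {
        PageState::Blocked
    } else if body.contains(watched) {
        PageState::Present
    } else {
        PageState::Missing
    }
}

fn main() {
    // Hypothetical page bodies, for illustration only.
    let pages = [
        "<html>Currently out of stock</html>",
        "<html>Add to trolley</html>",
        "<html>You don't have permission to access this page</html>",
    ];
    for body in pages {
        println!("{:?}", classify_page(body, "out of stock"));
    }
}
```

With a check like this, the watcher can log or retry on `Blocked` and only send a notification on `Missing`.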
The same page, of course, works fine in Safari. So I used Copy as cURL, which gave me the following:
-X 'GET' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
Running the copied cURL command works as expected, so I added those two headers to my Rust code:
use reqwest::header::{ACCEPT, USER_AGENT};

let cli = reqwest::Client::new();
let resp = cli
    .get(url)
    // Same two headers as in the copied cURL command.
    .header(USER_AGENT, r#"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15"#)
    .header(ACCEPT, r#"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"#)
    .send()
    .await?;
And I still got the same "You don't have permission..." page as above.
Sending the same request to httpbin.org/get shows that reqwest is indeed sending the right headers, so I am at a loss as to where to look next. What could have gone wrong in my situation?
EDIT
I tried the cURL command against httpbin, as suggested below, and got the following back:
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15",
"X-Amzn-Trace-Id": "Root=1-62011d4b-5ea67cec2c79826d5d00e959"
I don't believe X-Amzn-Trace-Id was sent by cURL, but I am more than willing to be proven wrong.
CodePudding user response:
As already suggested, this is most likely basic header validation to prevent scraping. The following works for me:
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let cli = reqwest::Client::new();
    let resp = cli
        .get("https://www.argos.co.uk/product/<YOUR_ID>")
        .header(reqwest::header::ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8")
        .header(reqwest::header::ACCEPT_ENCODING, "identity")
        .header(reqwest::header::ACCEPT_LANGUAGE, "en-US,en;q=0.5")
        .header(reqwest::header::USER_AGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0")
        .send()
        .await?;
    println!("{:?}", resp.status());
    println!("{}", resp.text().await?);
    Ok(())
}
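Note that I've switched Accept-Encoding from gzip to identity so that the body comes back uncompressed. You can (and most likely should) enable reqwest's gzip feature instead, so that compressed responses are decoded automatically.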
And of course, make sure to respect the site's robots.txt and its usage terms with regard to scraping.
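To see which headers differ between the request in the question and the one in this answer, the two sets can be diffed. This is only a sketch using the standard library: `changed_headers`, `question_headers` and `answer_headers` are hypothetical helpers, and the header values are copied by hand from the two snippets above (in practice they could come from the JSON that httpbin.org/get echoes back).

```rust
use std::collections::BTreeMap;

/// For every header in `new` that is absent from `old` or has a different
/// value there, return (name, old value if any, new value).
fn changed_headers<'a>(
    old: &BTreeMap<&'a str, &'a str>,
    new: &BTreeMap<&'a str, &'a str>,
) -> Vec<(&'a str, Option<&'a str>, &'a str)> {
    let mut diff = Vec::new();
    for (&name, &value) in new {
        let previous = old.get(name).copied();
        if previous != Some(value) {
            diff.push((name, previous, value));
        }
    }
    diff
}

/// Headers set explicitly in the question's request.
fn question_headers() -> BTreeMap<&'static str, &'static str> {
    BTreeMap::from([
        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
        ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15"),
    ])
}

/// Headers set explicitly in this answer's request.
fn answer_headers() -> BTreeMap<&'static str, &'static str> {
    BTreeMap::from([
        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"),
        ("Accept-Encoding", "identity"),
        ("Accept-Language", "en-US,en;q=0.5"),
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0"),
    ])
}

fn main() {
    for (name, old, new) in changed_headers(&question_headers(), &answer_headers()) {
        match old {
            Some(old) => println!("{}: changed\n  was: {}\n  now: {}", name, old, new),
            None => println!("{}: added\n  now: {}", name, new),
        }
    }
}
```

Reverting the reported differences one at a time is one way to find out which header the site actually checks.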