When I visit
https://www.duckduckgo.com/?q=!ducky goodreads quotes A Promised Land Barack Obama&t=h_&ia=web
It ultimately redirects to
https://www.goodreads.com/work/quotes/86336100-a-promised-land
Using Ruby, is there a way to pass in the first duckduckgo page, but collect the final url that this redirects to?
I tried using
res = Net::HTTP.get_response(URI('https://www.duckduckgo.com/?q=!ducky goodreads quotes A Promised Land Barack Obama&t=h_&ia=web'))
puts res['location']
but this only outputs the same duckduckgo link.
CodePudding user response:
TL;DR
DuckDuckGo sends you through multiple redirects, and some of them happen in JavaScript rather than as HTTP redirect responses. You'll need to either follow all of these redirects manually with Net::HTTP and pull the URLs out of the JavaScript, or use a different tool such as Selenium Ruby or Capybara, which can execute the JavaScript.
Using Ruby, is there a way to pass in the first duckduckgo page, but collect the final url that this redirects to?
In short, it would be quite difficult. You'd have to write quite a bit of custom code, and there are much better tools for this.
What the browser is doing (the full story)
Here's the full story of what DuckDuckGo is doing with your requests:
Request #1
URL: https://www.duckduckgo.com/?q=!ducky goodreads quotes A Promised Land Barack Obama&t=h_&ia=web
This returns a 301 redirect to the same URL without the 'www': https://duckduckgo.com/?q=!ducky goodreads quotes A Promised Land Barack Obama&t=h_&ia=web
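One practical detail when reproducing Request #1 from Ruby: the query contains literal spaces, and URI() will typically raise URI::InvalidURIError on an unencoded string like that. A minimal sketch of building an encoded URL follows. The helper name ducky_url is hypothetical, and the t and ia parameters are simply copied from the URL above.

```ruby
require 'uri'

# Hypothetical helper: builds the "!ducky" search URL with an encoded
# query string (spaces become "+", "!" becomes "%21").
def ducky_url(terms)
  query = URI.encode_www_form(q: "!ducky #{terms}", t: 'h_', ia: 'web')
  URI("https://www.duckduckgo.com/?#{query}")
end

uri = ducky_url('goodreads quotes A Promised Land Barack Obama')
puts uri
```

The resulting URI object can be passed straight to Net::HTTP.get_response.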
Request #2
So we make a request to the new URL without the 'www':
redirected_url = res['location']
res = Net::HTTP.get_response(URI(redirected_url))
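The same step can be generalized into a small loop that keeps following Location headers until a non-redirect response arrives. This is only a sketch: the method name follow_http_redirects, the limit default, and the fetch hook (there so the loop can be tried without network access) are illustrative choices, not part of Net::HTTP.

```ruby
require 'net/http'
require 'uri'

# Follows HTTP 3xx responses by re-requesting the Location header, making
# at most `limit` requests. Returns the last URI requested and its response.
# `fetch` is an illustrative hook; by default it performs a real GET.
def follow_http_redirects(url, limit: 5, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  limit.times do
    res = fetch.call(uri)
    # Stop on anything that is not a redirect carrying a Location header.
    return [uri, res] unless res.is_a?(Net::HTTPRedirection) && res['location']

    # Location may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, res['location'])
  end
  raise "gave up after #{limit} redirects"
end
```

Note that this only handles HTTP-level redirects. When the server answers with a 200 page that redirects on the client side, the loop stops and returns that page.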
If we were in the browser, we would be directed onward toward the Goodreads site. However, if we inspect the response (i.e. res), it is not a true 301 redirect: DuckDuckGo returns an HTML page that redirects on the client side, using a meta refresh tag and a JavaScript snippet. Here's the output of res.body:
<html><head><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><meta name='referrer' content='origin'><meta name='robots' content='noindex, nofollow'><meta http-equiv='refresh' content='0; url=/l/?uddg=https://www.goodreads.com/work/quotes/86336100-a-promised-land&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf'></head><body><script language='JavaScript'>function ffredirect(){window.location.replace('/l/?uddg=https://www.goodreads.com/work/quotes/86336100-a-promised-land&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf');}setTimeout('ffredirect()',100);</script></body></html>
If you scroll through the text above, you'll notice a <script> tag containing window.location.replace(...). This uses JavaScript to redirect our browser to another URL.
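If you do want to attempt the manual route, one rough way to pull the target out of a body shaped like the one above is a pair of regular expressions. This is a brittle sketch that assumes the markup stays exactly as shown here. The constant and method names are hypothetical.

```ruby
# Patterns for the two client-side redirects seen in the response body above.
JS_REDIRECT  = /window\.location\.replace\(\s*['"]([^'"]+)['"]\s*\)/
META_REFRESH = /http-equiv=['"]refresh['"][^>]*content=['"]\d+;\s*url=([^'"]+)['"]/i

# Returns the redirect target found in the HTML, or nil if neither
# pattern matches. The target may be a relative path such as "/l/?uddg=...".
def client_side_redirect_target(body)
  body[JS_REDIRECT, 1] || body[META_REFRESH, 1]
end
```

The returned path is relative to the page that served it, so it would still need to be resolved against that page's URL (for example with URI.join) before being requested.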
Request #3
Now, our browsers will follow the URL inside that window.location.replace JavaScript call. However, that's tough to do with Ruby: Net::HTTP does not execute JavaScript, so you would have to extract the URL from the response body yourself. DuckDuckGo may have implemented this as a measure against scraping, or as a way to track data, though that is a guess. Either way, parsing the URL out of the JavaScript is brittle.
The result is that we are sent to a new page, something like: https://duckduckgo.com/l/?uddg=https://www.goodreads.com/work/quotes/86336100-a-promised-land&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf
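In the URL above, the destination appears to be carried in the uddg query parameter. Assuming that holds (it is an observation from this one example, not documented behaviour), the parameter can be read with Ruby's URI helpers. The method name uddg_target is hypothetical.

```ruby
require 'uri'

# Reads the `uddg` query parameter from a DuckDuckGo "/l/" link.
# Assumption: `uddg` holds the destination URL, as in the example above.
# Returns nil when the link has no query string or no `uddg` parameter.
def uddg_target(link)
  query = URI(link).query
  return nil unless query

  # decode_www_form also undoes percent-encoding if the value is encoded.
  URI.decode_www_form(query).to_h['uddg']
end

link = 'https://duckduckgo.com/l/?uddg=https://www.goodreads.com/work/quotes/86336100-a-promised-land&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf'
puts uddg_target(link)
```

This reads the destination from the link itself rather than following it, so it says nothing about any further redirects the destination site might perform.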
Request #4
Finally, this last page is the one that redirects us to Goodreads, again through JavaScript.
In short, check out Selenium Ruby or Capybara. Both have excellent documentation, and either should be able to support what you're trying to do. Good luck!