Is it possible to extract the download syllabus link with requests or Scrapy, without Selenium?

Time:07-22

I am trying to extract the download syllabus link from this website.

The link is hidden behind a '#' in the page source.

So I am not sure how to extract the link. I don't think I can get it with XPath or CSS selectors. Can someone help me?

CodePudding user response:

In this particular case, the URL is Base64-encoded in the data-url attribute of the div with id="DownloadSyllabus2" (right above the a href you are mentioning):

aHR0cHM6Ly93d3cuc2ltcGxpbGVhcm4uY29tL2ljZTkvcGRmcy9hZ2VuZGEvb25saW5lL0RhdGElMjBTY2llbmNlJTIwd2l0aCUyMFB5dGhvbi5wZGY=
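As a rough sketch, the attribute can be read without Selenium using only Python's standard-library html.parser. The HTML snippet below is a made-up stand-in for the real page (only the id and the data-url attribute name come from this answer), and the names DataUrlFinder and find_data_url are hypothetical helpers. With requests you would pass response.text instead of SAMPLE_HTML.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the page source; the real markup may differ.
# Only the id and the data-url attribute name are taken from the answer.
SAMPLE_HTML = """
<div id="DownloadSyllabus2" data-url="aHR0cHM6Ly93d3cuc2ltcGxpbGVhcm4uY29tL2ljZTkvcGRmcy9hZ2VuZGEvb25saW5lL0RhdGElMjBTY2llbmNlJTIwd2l0aCUyMFB5dGhvbi5wZGY=">
  <a href="#">Download Syllabus</a>
</div>
"""


class DataUrlFinder(HTMLParser):
    """Collects the data-url attribute of the first tag with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.data_url = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Keep only the first match so later tags cannot overwrite it.
        if self.data_url is None and attr_map.get("id") == self.target_id:
            self.data_url = attr_map.get("data-url")


def find_data_url(html, target_id="DownloadSyllabus2"):
    """Return the data-url value for target_id, or None if it is absent."""
    finder = DataUrlFinder(target_id)
    finder.feed(html)
    return finder.data_url


encoded = find_data_url(SAMPLE_HTML)
print(encoded)
```

In Scrapy the same lookup would normally be a one-line selector on the response; the parser class here just keeps the example free of third-party dependencies.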

Decode it with Base64 and you get the desired URL: https://www.simplilearn.com/ice9/pdfs/agenda/online/Data%20Science%20with%20Python.pdf (the %20 sequences are URL-encoded spaces, so the file name is "Data Science with Python.pdf").
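The decoding step can be sketched in Python with the standard-library base64 module. The variable names are illustrative, and the unquote call is an optional extra that only turns the %20 sequences back into spaces for display.

```python
import base64
from urllib.parse import unquote

# The data-url value quoted above.
encoded = (
    "aHR0cHM6Ly93d3cuc2ltcGxpbGVhcm4uY29tL2ljZTkvcGRmcy9hZ2VuZGEv"
    "b25saW5lL0RhdGElMjBTY2llbmNlJTIwd2l0aCUyMFB5dGhvbi5wZGY="
)

# b64decode returns bytes; decode them to get the URL as text.
url = base64.b64decode(encoded).decode("utf-8")
print(url)

# Human-readable form with the percent-encoded spaces expanded.
print(unquote(url))
```

The percent-encoded form in url is the one to hand to requests.get or a Scrapy Request.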

This is not a universal fix for every site that hides its links, but it works for this website and is the simplest approach here.

Data extraction is not always about knowing a ready-made solution; often it comes down to reverse engineering the website's logic.

Happy Scraping :)
