I started working on my first web scraper with Python and Selenium. I'm sure it will be painfully obvious without me saying, but I'm still very new to this. This web scraper navigates to a website, performs a search to find a bunch of political candidate committee pages, clicks the first search result, then scrapes some text data into a dictionary and then a CSV. There are 50 results per page, and I could pass 50 different ids into the code and it would work (I've tested it). But of course, I want this to be at least somewhat automated. I'd like it to loop through the 50 search results (candidate committee pages) and scrape them one after the other.
I thought a for loop would work well here. I would just need to loop through each of the 50 search result elements with the code that I know works. This is where I'm having issues. When I copy the html element corresponding to the search result link, this is what I get.
<a id="_ctl0_Content_dgdSearchResults__ctl2_lnkCandidate" href="javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')">ALLEN, KEVIN</a>
As you can see from the html above, the href attribute isn't a normal link. It's some sort of javascript Postback thing that I don't really understand. After some googling, I still don't really get it. Some people are saying this means you have to make the program wait before you click the link, but my original code doesn't do that. My code performs the search and clicks the first link without issue. I just have to pass it the id.
I thought a good first step would be to scrape the search results page to get a list of links. Then I could iterate through a list of links with the rest of the scraping code. After some messing around I tried this:
links = driver.find_elements_by_tag_name('a')
for i in links:
    print(i.get_attribute('href'))
This gives me a list of all the links on the page, and after playing with the list a little bit, it narrows down to a list of 50 of these corresponding to the 50 search results (notice the ids go up by one each time):
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
etc
That's what the href attribute gives me... but are those even links? How do I work with them? Is this the wrong way to go about iterating through search results? I feel like I am so close to getting this to work! I'd appreciate any suggestions you have. Thanks!
CodePudding user response:
__doPostBack() function
Postback is the functionality where the page contents are posted to the server when an event occurs in a page control, for example a button click, or an index change event when the AutoPostBack value is set to true. All the web controls except the Button and ImageButton controls can call a JavaScript function called __doPostBack() to post the form to the server. The Button and ImageButton controls use the browser's own ability to submit the form to the server. The ASP.NET runtime automatically inserts the definition of the __doPostBack() function in the HTML output when the page contains a control that can initiate a postback.
An example definition of __doPostBack:
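<html>
<body>
<form name="form1" method="post" id="form1">
<div>
<!-- hidden fields that __doPostBack() fills in before submitting -->
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
</div>
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>
<div>
<a id="LinkButton1" href="javascript:__doPostBack('LinkButton1','')">LinkButton</a>
</div>
</form>
</body>
</html>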
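In HTTP terms, the function only fills in two hidden fields and submits the form. Here is a minimal Python sketch of the equivalent form body. It assumes the form's other hidden inputs (ASP.NET pages commonly carry fields such as __VIEWSTATE) have already been read from the page; build_postback_body and the sample field value are hypothetical, made up for illustration:

```python
from urllib.parse import urlencode


def build_postback_body(hidden_fields, event_target, event_argument=""):
    """Return the URL-encoded form body that __doPostBack() would submit.

    hidden_fields: dict of the form's existing hidden inputs (assumed to
    have been collected from the current page beforehand).
    """
    fields = dict(hidden_fields)  # copy so the caller's dict is untouched
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical hidden-field value, for illustration only.
body = build_postback_body({"__VIEWSTATE": "abc123"}, "LinkButton1")
print(body)
```

Whether a server accepts a hand-built body like this depends on the site, so treat it as an explanation of the mechanism rather than a tested way to fetch pages.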
This use case
Extracting the value of the href attributes will always give similar output:
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl3$lnkCandidate','')
javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl4$lnkCandidate','')
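These strings are not URLs, but they do carry the postback target, and in the question's HTML the target matches the link's id once each $ is replaced with an underscore. A small sketch of pulling that out with a regular expression; parse_postback_href and target_to_element_id are hypothetical helper names, and the $-to-underscore mapping is an assumption based only on the one element shown in the question:

```python
import re

# Matches __doPostBack('<target>','<argument>') inside an href value.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback_href(href):
    """Return (event_target, event_argument), or None if href is not a postback."""
    match = POSTBACK_RE.search(href or "")
    return match.groups() if match else None


def target_to_element_id(event_target):
    """Guess the element id by swapping '$' for '_', as in the question's HTML."""
    return event_target.replace("$", "_")


href = "javascript:__doPostBack('_ctl0$Content$dgdSearchResults$_ctl2$lnkCandidate','')"
target, argument = parse_postback_href(href)
print(target_to_element_id(target))
```

For the sample href this gives the same id that appears on the `<a>` element in the question, so the list of hrefs can be turned into a list of ids to click.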
A potential solution is to:
- Open the <a> tags in an adjacent tab using Ctrl+Click
- Switch to the adjacent tab
- Extract the current_url
- Switch back to the main tab.
CodePudding user response:
No, these aren't links that you can just load in a browser or with Selenium. From what I can tell with a little googling, the first argument to __doPostBack() is the identifier of a button (or maybe another element) on the page. In your case it matches the link's id, with the $ separators written as underscores.
In more general terms, "post back" refers to making a POST request back to the exact same URL as the current page. From what I can tell, __doPostBack() performs a post back by submitting the page's form with the specified element named as the event target, so the server handles it as if that element had been clicked.
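Given that, one way to loop over the results is to keep clicking the elements themselves, as your working code already does for the first one. Below is a rough sketch, not tested against the real site. It assumes driver is your existing Selenium WebDriver, that every result link has an id ending in lnkCandidate as in your example, and that going back returns to the same results page. scrape_page is a hypothetical stand-in for the scraping code you already have:

```python
def collect_result_ids(driver):
    """Return the id of every search-result link as plain strings."""
    # "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
    links = driver.find_elements("css selector", "a[id$='lnkCandidate']")
    return [link.get_attribute("id") for link in links]


def scrape_all_results(driver, scrape_page):
    """Click each result in turn, scrape it, then go back to the results."""
    rows = []
    for element_id in collect_result_ids(driver):
        # Re-find the link on every pass: elements located before a page
        # load go stale once the browser navigates away.
        driver.find_element("id", element_id).click()
        rows.append(scrape_page(driver))
        driver.back()  # assumption: Back returns to the same results page
    return rows
```

You would call it as rows = scrape_all_results(driver, scrape_page). The ids are collected as strings up front because the element objects themselves stop being usable after the first click. If going back does not restore the results (some postback pages ask to resubmit the form), you would need to re-run the search instead of calling driver.back().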