Life has been tough on me lately. I have been trying to scrape a website (https://osf.io/preprints/discover?subject=bepress|Social and Behavioral Sciences) for a week now, and have tried multiple things to get it to work. Some of its HTML is shown below. I need the div with ID ember499, all the way at the bottom. This div contains the whole website, and if I can't access it, I can't scrape anything. There are 4 divs in the main body tag: MathJax_Message, ember-bootstrap-wormhole, ember-basic-dropdown-wormhole and ember499, as seen below:
<body >
<div id="MathJax_Message" style="display: none;"></div>
<noscript>
<p>
For full functionality of this site it is necessary to enable JavaScript.
Here are
<a href='https://www.enable-javascript.com/' target='_blank'> instructions for enabling JavaScript in your web browser</a>.
</p>
</noscript>
<script> window.prerenderReady = false; </script>
<script src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/ember.js/2.18.0/ember.min.js"></script>
<script>
(function() {
var encodedConfig = document.head.querySelector("meta[name$='/config/environment']").content;
var config = JSON.parse(unescape(encodedConfig));
var assetSuffix = config.ASSET_SUFFIX ? '-' + config.ASSET_SUFFIX : '';
var origin = window.location.origin;
window.isProviderDomain = !~config.OSF.url.indexOf(origin);
var prefix = '/' + (window.isProviderDomain ? '' : 'preprints/') + 'assets/';
[
'vendor',
'preprint-service'
].forEach(function (name) {
var script = document.createElement('script');
script.src = prefix + name + assetSuffix + '.js';
script.async = false;
document.body.appendChild(script);
var link = document.createElement('link');
link.rel = 'stylesheet';
link.href = prefix + name + assetSuffix + '.css';
document.head.appendChild(link);
});
})();
</script><script src="/preprints/assets/vendor-f46d275519d6cf7078493fc4564ccd3c7dc419ed.js"></script><script src="/preprints/assets/preprint-service-f46d275519d6cf7078493fc4564ccd3c7dc419ed.js"></script>
<script src="https://cdn.ravenjs.com/3.22.1/ember/raven.min.js"></script>
<script>
var encodedConfig = document.head.querySelector("meta[name$='/config/environment']").content;
var config = JSON.parse(unescape(encodedConfig));
if (config.sentryDSN) {
Raven.config(config.sentryDSN, config.sentryOptions || {}).install();
}
</script>
<div id="ember-basic-dropdown-wormhole"></div>
<div id="ember-bootstrap-wormhole"></div>
<div id="ember499" >
<!---->
<div id="ember538" ><div >
<nav id="navbarScope" role="navigation" >
<div >
<div >
<a href="/" aria-label="Go home" >
<span ></span>
</a>
<div >
<a href="https://osf.io/preprints/">
<span > OSF </span>
I have tried printing all of the divs that are contained in the main body for example:
wormhole = driver.find_element(By.CLASS_NAME, 'ember-application')
divs = wormhole.find_elements(By.TAG_NAME, 'div')
I have tried finding it via XPATH, ID, more or less everything. When I append the ID of each div to a list and print it, I get this:
['MathJax_Message', 'ember-basic-dropdown-wormhole', 'ember-bootstrap-wormhole']
Funnily enough, when I print len(divs) I get 3 back, but when I don't append them to a list, the script takes an extra 2-3 seconds to finish executing once it reaches div[3]. This generally does not happen with other sites:
OUTPUT
MathJax_Message
ember-basic-dropdown-wormhole
ember-bootstrap-wormhole
Process finished with exit code 0
I have tried scrolling to the middle of the page in case it's hidden, finding out what is in each of the 3 divs above it, going directly to the class names that I want, and finding all elements of the webpage using find_elements(By.XPATH, '//*'). These attempts either return only the same 3 divs mentioned above, or fail with 'element not found'. I can't think of what else to try.
Please guide me, Stack Gods.
CodePudding user response:
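You need to add a delay, for example with time.sleep(), so that the page's JavaScript has time to render the content before you look for it: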
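import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome; any WebDriver should work
driver.get('https://osf.io/preprints/discover?subject=bepress|Social and Behavioral Sciences')
time.sleep(5)  # give the Ember app time to render
print(len(driver.find_elements(By.CSS_SELECTOR, ".ember-application div")))
for ele in driver.find_elements(By.CSS_SELECTOR, ".ember-application div"):
    print(ele.text)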
Output:
275
OSF PREPRINTS
Add a Preprint
Search
Support
Donate
Sign Up Sign In
Preprint Archive Search
powered by
Search
2,365,609 searchable as of November 01, 2022
Partner Repositories
Previous
Next
Sort by: Relevance
Active Filters:
Clear filters
Social and Behavioral Sciences
Refine your search by
Providers
OSF Preprints (50,302)
AfricArXiv (394)
AgriXiv (426)
Arabixiv (328)
arXiv (1,324,846)
BioHackrXiv (29)
bioRxiv (48,109)
BodoArXiv (110)
Cogprints (283)
CoP (1)
EarthArXiv (1,755)
EcoEvoRxiv (940)
ECSarXiv (241)
EdArXiv (1,088)
engrXiv (2,088)
FocUS Archive (52)
Frenxiv (148)
INA-Rxiv (16,605)
IndiaRxiv (148)
LawArXiv (1,374)
LIS Scholarship Archive (310)
MarXiv (456)
MediArXiv (201)
MetaArXiv (457)
MindRxiv (286)
NutriXiv (84)
PaleorXiv (219)
PeerJ (4,747)
Preprints.org (21,880)
PsyArXiv (25,458)
RePEc (848,335)
SocArXiv (11,629)
SportRxiv (386)
Thesis Commons (1,857)
Subject
Architecture
Arts and Humanities
Business
Education
Engineering
Law
Life Sciences
Medicine and Health Sciences
Physical Sciences and Mathematics
Social and Behavioral Sciences
Do you want to add your own research as a preprint?
Add a preprint
The Corporate Social Responsibility is just a twist in a M\"obius Strip
Solferino, NazariaSolferino, Viviana
Last edited: Oct 13, 2015 UTC
Finance Social and Behavioral Sciences Economics
In recent years economics agents and systems have became more and more interacting and juxtaposed, therefore the social sciences need to rely on the studies of physical sciences to analyze this complexity in the relationships. According to this point of view we rely on the geometrical model of the M ...
arXiv
Some suggestions on dealing with measurement error in linkage analyses
Marko BachlMichael Scharkow
Last edited: Jul 2, 2018 UTC
Social and Behavioral Sciences Communication
Linkage analysis is a sophisticated media effect research design that reconstructs the likely exposure to relevant media messages of individual survey respondents by complementing the survey data with a content analysis. It is an important improvement over survey-only designs: Instead of predicting ...
OSF Preprints
Imagined Interdependence: Manipulating Discourse Changes How People Construe Interdependence
Jiří Münich..so on
CodePudding user response:
First of all, the div you are trying to grab has an id that changes every time you load the page. So to grab that particular div you have to use XPATH (you can use any other selector if you want, but I prefer XPATH).
Please find the code below. Let me know if you have any questions.
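from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome; any WebDriver should work
wait = WebDriverWait(driver, 20)  # 20-second timeout is an arbitrary choice
driver.get("https://osf.io/preprints/discover?subject=bepress|Social and Behavioral Sciences")
# Match the body-level div whose id starts with "ember" but is not a wormhole div
ember = wait.until(EC.presence_of_element_located(
    (By.XPATH, '//body/div[starts-with(@id,"ember") and not(contains(@id,"wormhole"))]')))
print(ember.get_attribute('innerHTML'))
driver.quit()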
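The idea of matching the stable part of the id rather than its changing number can be checked without a browser. Below is a minimal sketch, assuming the generated ids always have the form "ember" followed by digits, as ember499 and ember538 do in the question. The name generated_ids is hypothetical, and the list of ids is copied from the question.

```python
import re

# ids of the body-level divs, as listed in the question
div_ids = [
    "MathJax_Message",
    "ember-basic-dropdown-wormhole",
    "ember-bootstrap-wormhole",
    "ember499",
]

# Assumption: generated ids are "ember" followed only by digits
GENERATED_ID = re.compile(r"ember\d+")


def generated_ids(ids):
    """Return the ids that look like generated ember ids (hypothetical helper)."""
    return [i for i in ids if GENERATED_ID.fullmatch(i)]


print(generated_ids(div_ids))
```

A pattern like this keeps working when the number after "ember" changes on the next page load, while still skipping the two wormhole divs.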
You should also use an explicit wait to get your desired HTML element. time.sleep() is not advisable, as it only pauses the process for a fixed time without checking whether the element has appeared.
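To make the difference concrete: conceptually, an explicit wait is a polling loop with a timeout, which returns as soon as the condition holds instead of always sleeping for the full duration. The following is only a sketch of that idea, not Selenium's actual implementation. poll_until and fake_find_element are hypothetical names, and the fake function stands in for a page whose element appears on the third check.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page that only "renders" its element on the third check
attempts = {"count": 0}


def fake_find_element():
    attempts["count"] += 1
    return "ember499" if attempts["count"] >= 3 else None


found = poll_until(fake_find_element, timeout=2.0, interval=0.01)
print(found, attempts["count"])  # prints: ember499 3
```

With a fixed sleep you either wait longer than needed or not long enough. A polling wait stops at the first successful check and fails loudly if the element never shows up.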