I am scraping a series of URLs with this code:
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(path, chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    urls = elem.get_attribute("href")
    print(urls)
This prints the results I want to see. The problem is that when I try to put "urls" into my empty DataFrame "df1" with the following code:
df_test = df1.append({'URLS' : urls}, ignore_index = True)
df_test.head()
it does not show the URLs I want (it doesn't raise an error, but the result doesn't really make sense).
I am a beginner at Python, so there is probably a simple answer to my question. I hope I was clear.
CodePudding user response:
The problem with your code is that you are overwriting the urls variable on every iteration and then appending only the last scraped URL to the DataFrame. Move the df1.append statement inside the for loop:
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(path, chrome_options=options)
driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    url = elem.get_attribute("href")                    # <--- get the url from the <a> tag
    df1 = df1.append({'URLS': url}, ignore_index=True)  # <--- add the url to the dataframe in the URLS column
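As a side note, DataFrame.append is deprecated (and removed in pandas 2.0), and the find_elements_by_* helpers were removed in newer Selenium 4 releases. If you are on current versions of those libraries, a minimal sketch of the same idea (assuming the same page and XPath still match the job links) is to collect the hrefs into a list and build the DataFrame in one call:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
driver = webdriver.Chrome()  # Selenium 4.6+ can locate/manage the chromedriver binary itself
driver.get(url)
time.sleep(3)

# collect every href first, then build the DataFrame in a single call
elems = driver.find_elements(By.XPATH, "//article/div[2]/header/a")
links = [elem.get_attribute("href") for elem in elems]
df1 = pd.DataFrame({'URLS': links})
print(df1.head())

Building the frame once from a list is also faster than appending row by row, since each append creates a new DataFrame.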