Put a set of data (URLs) into an empty DataFrame in Python pandas

Time:10-27

I am scraping a series of URLs with this code:

import time

import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")

for elem in elems:
    urls = elem.get_attribute("href")
    print(urls)

This prints the results I expect. The problem comes when I try to put these "urls" into my empty DataFrame "df1" with the following code:

df_test = df1.append({'URLS' : urls}, ignore_index = True)
df_test.head()

It doesn't show the URLs I want. It doesn't raise an error, but the result doesn't make sense.

I'm a beginner in Python, so there is probably a simple answer to my question. I hope I was clear.

CodePudding user response:

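The problem is that the loop overwrites the urls variable on every iteration, so once it finishes, urls holds only the last scraped URL, and that single value is what gets appended to the DataFrame. Move the df1.append statement inside the for block and assign the result back to df1, since append returns a new DataFrame rather than modifying df1 in place: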

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")

for elem in elems:
    url = elem.get_attribute("href")  # <--- get the url from the <a> tag
    df1 = df1.append({'URLS': url}, ignore_index=True) # <--- add the url to the dataframe in the URLS column
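
One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the loop above raises an AttributeError. A common alternative is to collect the hrefs in a plain list and build the DataFrame once at the end. Here is a minimal sketch of that idea; the helper name urls_to_frame and the example.com addresses are made up for illustration and stand in for the values returned by elem.get_attribute("href"):

```python
import pandas as pd


def urls_to_frame(hrefs):
    """Build a one-column DataFrame from an iterable of URL strings."""
    # Collect everything first, then construct the DataFrame in one go
    # instead of growing it row by row.
    return pd.DataFrame({"URLS": list(hrefs)})


# Hypothetical stand-ins for the scraped href values.
sample_hrefs = [
    "https://example.com/jobs/1",
    "https://example.com/jobs/2",
    "https://example.com/jobs/3",
]

df1 = urls_to_frame(sample_hrefs)
print(df1)
```

With the Selenium code above, the call would look something like urls_to_frame(elem.get_attribute("href") for elem in elems). Building the frame once also avoids copying the whole DataFrame on every iteration, which is what repeated appends did.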