add row values based on a value from a column in Pandas in Selenium web scraping

Time:07-27

I have a dataframe that contains links for movies.

import pandas as pd

data = {"link": ["http://www.boxofficemojo.com/movies/?id=ateam.htm",
    "http://www.boxofficemojo.com/movies/?id=acod.htm", "http://www.boxofficemojo.com/movies/?id=ai.htm",
    "http://www.boxofficemojo.com/movies/?id=axl.htm", "http://www.boxofficemojo.com/movies/?id=aaa.htm"]}

dataframe = pd.DataFrame(data)

I want to loop over the links, find the genres of each movie, and store them in a new column next to the corresponding link. Some movies have two or more genres and some have only one, so the number of genres differs from movie to movie.

This is the code I am using:

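from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

lst = []
for i in dataframe['link']:
    # Raw string so the backslashes in the Windows path are taken literally
    driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")
    driver.get(i)
    tag = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
    "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML")
    a = tag.split("\n")
    for ii in a:
        ii = ii.strip()
        ii = ii.split("\n")
        for o in ii:
            if len(o) > 1:  # skip empty or one-character leftovers from the split
                lst.append(o)
    driver.close()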

This gives me one combined list for all the movies, with no indication of which genre belongs to which link:

['Action',
'Adventure',
'Thriller',
'Comedy',
'Drama',
'Sci-Fi',
'Action',
'Adventure',
'Drama',
'Family',
'Sci-Fi',
'Thriller',
'Comedy',
'Drama',
'Romance']

I want to get the genres for each movie separately and add them to a new column. If a movie has three genres, for instance, I want all three in the row that corresponds to its link.

CodePudding user response:

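Create an empty genre column, loop over the rows of the dataframe, and write each movie's list of genres into the cell for that row. The column has to have object dtype so that a cell can hold a list, and .at is used instead of .loc because it sets exactly one cell. Assigning a list with .loc to a NaN-filled float column typically raises a ValueError, since pandas treats the list as several values to set.

CODE

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Object-dtype column, so that each cell can hold a list of genres
dataframe["genre"] = None

# Raw string so the backslashes in the Windows path are taken literally.
# Older Selenium releases accept the driver path positionally; newer ones
# expect it to be passed through a Service object instead.
driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")

for index, row in dataframe.iterrows():
    driver.get(row["link"])
    tag = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
    "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML")
    # One genre per line; skip empty or one-character leftovers from the split
    temp_list = [g.strip() for g in tag.split("\n") if len(g.strip()) > 1]
    # .at sets a single cell, so the list is stored whole in this row
    dataframe.at[index, "genre"] = temp_list

# Close the browser once, after all links have been visited
driver.quit()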
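
The pandas side of this can be tried without a browser. The following is only a sketch: parse_genres is a hypothetical helper name, and the strings in sample_html are made-up stand-ins for the innerHTML that Selenium would return, not real scraped values. It builds one list per link and assigns the whole column in a single step, which avoids the cell-by-cell assignment altogether.

```python
import pandas as pd


def parse_genres(inner_html):
    """Split the scraped text into one genre per line, dropping blank pieces."""
    return [part.strip() for part in inner_html.split("\n") if len(part.strip()) > 1]


# Hypothetical stand-ins for the innerHTML values Selenium would return
sample_html = {
    "http://www.boxofficemojo.com/movies/?id=ateam.htm": "\n  Action\n  Adventure\n  Thriller\n",
    "http://www.boxofficemojo.com/movies/?id=acod.htm": "\n  Comedy\n",
}

frame = pd.DataFrame({"link": list(sample_html)})

# Build one list per link, then assign the whole column at once
frame["genre"] = [parse_genres(sample_html[link]) for link in frame["link"]]

# Optional: a plain-string version that is easier to read or export
frame["genre_text"] = frame["genre"].str.join(", ")

print(frame[["genre", "genre_text"]])
```

In the real loop, the dictionary lookup would be replaced by the driver.get and WebDriverWait calls. If one row per genre is preferred over a list per movie, frame.explode("genre") reshapes the result that way.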