Add row values based on a value from a column in Pandas in Selenium web scraping

I have a dataframe that contains links for movies.

import pandas as pd

data = {"link": ["http://www.boxofficemojo.com/movies/?id=ateam.htm",
    "http://www.boxofficemojo.com/movies/?id=acod.htm", "http://www.boxofficemojo.com/movies/?id=ai.htm",
    "http://www.boxofficemojo.com/movies/?id=axl.htm", "http://www.boxofficemojo.com/movies/?id=aaa.htm"]}

dataframe = pd.DataFrame(data)

I want to loop over each link, find the genres of that movie, and add them to a new column in the row of the respective movie. Some movies have two or more genres and some have only one, so the number of genres differs from link to link.

The code I am using is the following:

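from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

lst = []
for i in dataframe['link']:
    # Raw string so the backslashes in the Windows path are not read as escapes
    driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")
    driver.get(i)
    tag = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
        "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML")
    a = tag.split("\n")
    for ii in a:
        ii = ii.strip()
        if len(ii) > 1:  ## I use that to remove space that might be included from the splitting
            lst.append(ii)
    driver.close()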

This gives me one overall list for all the movies, with no indication of which genre belongs to which link:

['Action',
'Adventure',
'Thriller',
'Comedy',
'Drama',
'Sci-Fi',
'Action',
'Adventure',
'Drama',
'Family',
'Sci-Fi',
'Thriller',
'Comedy',
'Drama',
'Romance']

I want to get the genres for each movie and add them to a new column. If a movie has three genres, for instance, I want all three in the row that corresponds to its link.

CodePudding user response:

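Create an empty genre column with object dtype. Then loop through each row in the data frame, collect that row's genres in a temporary list, and use .at to write the list into the row's genre cell. Assigning a list through .loc to a float (NaN) column fails, because pandas tries to spread the list across several cells instead of storing it in one.

CODE

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# None gives an object-dtype column, so each cell can hold a Python list
dataframe["genre"] = None

for index, row in dataframe.iterrows():
    link = row["link"]
    temp_list = []  # genres of the current movie only

    # Raw string so the backslashes in the Windows path are not read as escapes
    driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")
    driver.get(link)
    tag = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
        "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML")
    a = tag.split("\n")
    for ii in a:
        ii = ii.strip()
        if len(ii) > 1:  # skip the whitespace-only pieces left over from the splitting
            temp_list.append(ii)
    # .at sets exactly one cell, so the whole list is stored in this row
    dataframe.at[index, "genre"] = temp_list
    driver.close()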
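
As a variation on the loop above, the parsing step can be pulled into a small helper and the column built in one assignment. The following is only a sketch: parse_genres is a hypothetical helper name, and the strings in scraped are illustrative stand-ins for the innerHTML that Selenium would return, not real scrape results. Unlike the len(...) > 1 check in the original code, the helper drops only empty pieces.

```python
import pandas as pd


def parse_genres(inner_html):
    """Split raw innerHTML text into a list of clean genre names."""
    # Each genre sits on its own line, padded with whitespace; drop blank pieces.
    return [part.strip() for part in inner_html.split("\n") if part.strip()]


# Illustrative stand-ins (assumed values) for what get_attribute("innerHTML") returns.
scraped = {
    "http://www.boxofficemojo.com/movies/?id=ateam.htm":
        "Action\n    \n        Adventure\n    \n        Thriller",
    "http://www.boxofficemojo.com/movies/?id=acod.htm": "Comedy",
}

dataframe = pd.DataFrame({"link": list(scraped)})

# Build one list per row, then assign the whole column in a single step.
dataframe["genre"] = [parse_genres(scraped[link]) for link in dataframe["link"]]

# Optional: a comma-separated string is often easier to export than a list.
dataframe["genre_text"] = dataframe["genre"].str.join(", ")

print(dataframe[["link", "genre_text"]])
```

Building the list of lists first avoids writing into the data frame cell by cell, so the dtype of the new column is never an issue. In the real scraper, the dictionary lookup would be replaced by the Selenium call for each link.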