I have a dataframe that contains links for movies.
import pandas as pd
data = {"link": ["http://www.boxofficemojo.com/movies/?id=ateam.htm",
                 "http://www.boxofficemojo.com/movies/?id=acod.htm",
                 "http://www.boxofficemojo.com/movies/?id=ai.htm",
                 "http://www.boxofficemojo.com/movies/?id=axl.htm",
                 "http://www.boxofficemojo.com/movies/?id=aaa.htm"]}
dataframe = pd.DataFrame(data)
I want to loop over each link, find the genre(s) of each movie, and store them in a new column next to the corresponding link. Some movies have one genre and others have two or more, so the number of genres varies from row to row.
The code I am using is the following:
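from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

lst = []
for i in dataframe['link']:
    # Raw string so the backslashes in the Windows path are taken literally
    driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")
    driver.get(i)
    tag = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
        "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML")
    a = tag.split("\n")
    for ii in a:
        ii = ii.strip()
        ii = ii.split("\n")
        for o in ii:
            if len(o) > 1:  # I use that to remove space that might be included from the splitting
                lst.append(o)
    driver.close()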
This gives me one flat list with the genres of all the movies combined:
['Action',
'Adventure',
'Thriller',
'Comedy',
'Drama',
'Sci-Fi',
'Action',
'Adventure',
'Drama',
'Family',
'Sci-Fi',
'Thriller',
'Comedy',
'Drama',
'Romance']
Instead, I want the genres of each movie in a new column. If a movie has three genres, for instance, all three should end up in the row that corresponds to its link.
CodePudding user response:
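Create an empty genre column, then loop through each row of the dataframe and write that row's genres into the column. Start a fresh temp_list for every row so the genres of different movies do not mix. To store a whole list in one cell, make the column object dtype and assign with .at, since .loc may treat a list as several values rather than as a single cell entry.
CODE
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Object-dtype column so that each cell can hold a Python list
dataframe["genre"] = None
for index, row in dataframe.iterrows():
    link = row["link"]
    temp_list = []  # genres of the current movie only
    driver = webdriver.Chrome(r"C:\SeleniumDrivers\chromedriver.exe")
    driver.get(link)
    tag = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
        "//span[text()='Genres']//following::span[1]"))).get_attribute("innerHTML")
    a = tag.split("\n")
    for ii in a:
        ii = ii.strip()
        ii = ii.split("\n")
        for o in ii:
            if len(o) > 1:  # skip the empty strings left over from the splitting
                temp_list.append(o)
    # .at sets exactly one cell, so the whole list lands in this row
    dataframe.at[index, "genre"] = temp_list
    driver.close()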
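The genre clean-up can also be pulled out of the Selenium loop, which makes it possible to check it without opening a browser. Below is a minimal sketch, assuming the innerHTML of the genre span is plain text with one genre per line (which is what the split on "\n" suggests). The names parse_genres, collect_genres and fetch_inner_html are hypothetical helpers, not part of pandas or Selenium, and the fake pages are made-up stand-ins for the real site.

```python
def parse_genres(inner_html):
    """Split the raw text on newlines and keep the non-empty, stripped pieces."""
    pieces = (piece.strip() for piece in inner_html.split("\n"))
    return [piece for piece in pieces if piece]


def collect_genres(links, fetch_inner_html):
    """Return one list of genres per link, in the same order as the links.

    fetch_inner_html is any callable that maps a link to the raw genre text,
    for example a small wrapper around the Selenium calls.
    """
    return [parse_genres(fetch_inner_html(link)) for link in links]


# Made-up pages standing in for the browser, only to show the shape of the result
fake_pages = {
    "link_a": "Action\n    \n    Adventure\n    \n    Thriller",
    "link_b": "Comedy\n    \n    Drama",
}
genres_per_link = collect_genres(["link_a", "link_b"], fake_pages.get)
print(genres_per_link)
```

Because the result has exactly one entry per link, it could be assigned in one step with dataframe["genre"] = collect_genres(dataframe["link"], fetch), where fetch is your own function that returns the innerHTML for a link.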