I am trying to learn Python for data analysis/data science. I'm working on a project where I scrape key movie information (director, original language, budget, revenue, etc.) from TMDb and IMDb using bs4. I would like to do this for a list of movies that I have rated and downloaded into a csv file. The csv file contains columns like "Type" and "TMDb ID" that are needed to construct the URLs that I want to scrape.
Like so:
TMDb ID | IMDb ID | Type | Name |
---|---|---|---|
11282 | tt0366551 | movie | Harold & Kumar Go To White Castle |
The URL would be:
url = "https://api.themoviedb.org/3/" + type + "/" + id + "?api_key=" + API_KEY + "&language=en-US/"
So I'm attempting to do this by iterating through the respective columns, constructing a URL from each row, and using that list of URLs to webscrape. I got stuck on printing all the URLs correctly. Depending on whether I put the print statement inside the for loop or outside of it, I either get:
- the last URL in the csv file printed over and over again (109 of the same last URL) OR
- the correct URLs, except they each get printed as many times as there are rows in the csv file (109 rows x 109 urls)
This is what I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.request import Request, urlopen

API_KEY = 'xxx'
tmdb_export = pd.read_csv('/Users/xxx/Downloads/xxx.csv')
tmdb_export.drop(['Season Number','Episode Number'], axis=1, inplace=True)
tmdb = tmdb_export['TMDb ID']
type = tmdb_export['Type']
urls = []

# pulls TMDb IDs from df column
for i, tmdbID in tmdb.iteritems():
    id = str(tmdbID)
    url = "https://api.themoviedb.org/3/" + type + "/" + id + "?api_key=" + API_KEY + "&language=en-US/"
    urls.append(url)
    print(urls)
Do I have to include a nested for loop around the urls.append(url)? What am I missing? I feel like this is a silly mistake, because I have a hard time with for loops and understanding how they work. So I've decided to stop lurking on here and ask y'all for help! I'm open to any suggestions, guidance, explanations and advice that I can get. Thank you in advance!!
CodePudding user response:
I would recommend converting the dataframe columns into lists, for example:
ids = list(df['id'])
t = list(df['t'])
Then zip the two lists and iterate over them:
for a, b in zip(ids, t):
    # a is the id value and b is the type value; build the URL here
    print(a, b)
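To make the zip idea concrete, here is a minimal sketch applied to the question's URL format. The two lists stand in for list(tmdb_export['Type']) and list(tmdb_export['TMDb ID']); the second entry in each list is invented for illustration, the helper name build_urls is hypothetical, and API_KEY is a placeholder.

```python
# Stand-ins for list(tmdb_export['Type']) and list(tmdb_export['TMDb ID']).
# Only the first entry comes from the question; the second is invented.
types = ["movie", "tv"]
tmdb_ids = [11282, 1399]
API_KEY = "xxx"  # placeholder, as in the question


def build_urls(media_types, ids, api_key):
    """Pair each type with the ID from the same row and build one URL per pair."""
    urls = []
    for media_type, tmdb_id in zip(media_types, ids):
        urls.append(
            "https://api.themoviedb.org/3/" + media_type + "/" + str(tmdb_id)
            + "?api_key=" + api_key + "&language=en-US/"
        )
    return urls


urls = build_urls(types, tmdb_ids, API_KEY)
# Print once, after the loop has finished, so each URL appears a single time.
for url in urls:
    print(url)
```

Because zip walks both lists in step, each URL uses the type and the ID from the same row, and the list ends up with exactly one entry per row.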
CodePudding user response:
You should not use type as a variable name, since it shadows Python's built-in type function. One crisp way of doing this is to use the apply function from pandas to create a URL column in the dataframe, then extract that column and cast it to a list.
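def createUrl(tmdbID, Type):
    # build the URL for a single row
    Tid = str(tmdbID)
    url = "https://api.themoviedb.org/3/" + Type + "/" + Tid + "?api_key=" + API_KEY + "&language=en-US/"
    return url

# axis=1 passes each row to the lambda, so the ID and type always come from the same row
tmdb_export['URL'] = tmdb_export.apply(lambda x: createUrl(x['TMDb ID'], x['Type']), axis=1)
urls = list(tmdb_export['URL'])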
CodePudding user response:
Try taking the print(urls) out of the loop:
# items() replaces iteritems(), which was removed in pandas 2.0
for i, tmdbID in tmdb.items():
    id = str(tmdbID)
    url = "https://api.themoviedb.org/3/" + type + "/" + id + "?api_key=" + API_KEY + "&language=en-US/"
    urls.append(url)
print(urls)
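If the output is still repeated after moving the print, a likely cause (assuming type is still the whole 'Type' column, as in the question) is that adding a string to a pandas Series concatenates element-wise, so every pass through the loop appends a full Series of URLs instead of one string. The sketch below reproduces that on a made-up two-row dataframe (the second row is invented, API_KEY is a placeholder) and then looks up the type for the matching row instead.

```python
import pandas as pd

# Made-up stand-in for the CSV; only the first row comes from the question.
df = pd.DataFrame({"TMDb ID": [11282, 1399], "Type": ["movie", "tv"]})
API_KEY = "xxx"  # placeholder
base = "https://api.themoviedb.org/3/"

media_types = df["Type"]  # a whole column, like `type` in the question

# str + Series is element-wise: the result is a Series with one string
# per row of the column, all of them sharing the same ID.
broadcast = base + media_types + "/" + str(df["TMDb ID"].iloc[0])
print(len(broadcast))

# Looking the type up by row label yields a single string per iteration.
urls = []
for i, tmdb_id in df["TMDb ID"].items():
    url = (base + media_types.loc[i] + "/" + str(tmdb_id)
           + "?api_key=" + API_KEY + "&language=en-US/")
    urls.append(url)
print(urls)
```

With 109 rows, the first form would give 109 Series of 109 strings each, which matches the "109 rows x 109 urls" output described in the question.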