I'm scraping several web pages and want to extract the location name, latitude, longitude and film name from each one. However, when I extract this information across multiple pages, I can't tell which film each set of three values belongs to.
I thought I could work around this by appending an empty string after all of the values for each film, so that I can later split the list into one list per film wherever an empty string occurs.
However, I'm having trouble getting the empty string in the right place. Here's what I have done:
test = ['https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818',
        'https://www.latlong.net/location/12-monkeys-locations-501']
for i in range(0, len(test), 1):
    r = requests.get(test[i])
    testone = {'location name':[],'film':[]}
    soup = BeautifulSoup(r.content, 'lxml')
    for th in soup.select("td"):
        testone['location name'].append(th.text.strip())
        testone['location name'].append('')
    for h in soup.select_one("h3"):
        testone['film'].append(h)
However this seems to append an empty string after each value:
'location name': ["1117 Broadway (Gil's Music Shop)",
'',
'47.252495',
'',
'-122.439644',
'',
"2715 North Junett St (Kat and Bianca's House)",
'',
'47.272591',
'',
'-122.474480', ....
My expectation:
'location name': ["1117 Broadway (Gil's Music Shop)",
'47.252495',
'-122.439644',
"2715 North Junett St (Kat and Bianca's House)",
'47.272591',
'-122.474480',
'Aurora Bridge',
'47.646713',
'-122.347435',
'Buckaroo Tavern (closed)',
'47.657841',
'-122.350327',
'Century Ballroom',
'47.615028',
'-122.319855',
'Fremont Place Books (closed)',
'47.650452',
'-122.350510',
'Fremont Troll',
'47.651093',
'-122.347435',
'Gas Works Park',
'47.645561',
'-122.334496',
'Kerry Park',
'47.629402',
'-122.360008',
'Kingdome',
'47.595993',
'-122.333649',
'Paramount Theatre',
'47.613235',
'-122.331451',
'Seattle',
'47.601871',
'-122.341248',
'Stadium High School',
'47.265991',
'-122.448570',
'Tacoma',
'47.250828',
'-122.449135',
'',
'New York City',
'40.742298',
'-73.982559',
'New York County Courthouse',
'40.714310',
'-74.001930',
'', ................],
'film': ['10 Things I Hate About You Locations Map','12 Angry Men Locations Map'...]}
CodePudding user response:
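Keep append() rather than switching to extend(): strip() returns a str, not a list, so extend() would add the text one character at a time. The empty strings appear because append('') runs inside the inner loop, once per cell. Move it after the inner loop so it runs once per page, and create testone before the outer loop so that earlier pages are not overwritten.
Try this:
import requests
from bs4 import BeautifulSoup

testone = {'location name': [], 'film': []}
for url in test:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    for td in soup.select("td"):
        testone['location name'].append(td.text.strip())
    # One separator per page, after all of that film's cells
    testone['location name'].append('')
    testone['film'].append(soup.select_one("h3").text.strip())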
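The question also plans to split the flat list into one list per film wherever an empty string occurs. A minimal sketch of that step follows, assuming the list has exactly one '' after each film's values, as in the expected output in the question. The helper name split_on_blank is hypothetical, and the sample values are copied from the question rather than fetched from the site.

```python
from itertools import groupby


def split_on_blank(values):
    """Split a flat list into sublists, treating '' as the separator."""
    return [
        list(group)
        for is_blank, group in groupby(values, key=lambda v: v == '')
        if not is_blank
    ]


# Sample values copied from the expected output in the question
flat = ['Tacoma', '47.250828', '-122.449135', '',
        'New York City', '40.742298', '-73.982559',
        'New York County Courthouse', '40.714310', '-74.001930', '']
films = ['10 Things I Hate About You Locations Map',
         '12 Angry Men Locations Map']

# Pair each film title with its own block of values
per_film = dict(zip(films, split_on_blank(flat)))
print(per_film['12 Angry Men Locations Map'])
```

One caveat with this approach: if a page yields no cells at all, two separators end up next to each other and are merged into one, so the film titles would no longer line up with the blocks.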
CodePudding user response:
The problem is that you're appending an empty string '' after each table cell you read. Since there are three separate cells for location name, latitude and longitude, you end up inserting an empty string between each of them.
One option is to group the cells three at a time and store everything in a dict keyed by film name instead of in two lists:
import requests
from bs4 import BeautifulSoup

test = ['https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818',
        'https://www.latlong.net/location/12-monkeys-locations-501']
testone = {}
for url in test:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    film = soup.select_one("h3").text.strip()
    cells = [td.text.strip() for td in soup.select("td")]
    # Group the flat list of cells into [name, latitude, longitude] triples
    testone[film] = [cells[i:i + 3] for i in range(0, len(cells), 3)]
In this way you can have all the information about a film by using testone[<filmname>]
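The grouping of cells in threes can be tried offline on a plain list before running it against the site. The sketch below assumes every location contributes exactly three cells in the order name, latitude, longitude, which matches the output shown in the question. The function name group_locations and the dict keys are hypothetical, and the sample values are copied from the question.

```python
def group_locations(cells):
    """Turn [name, lat, lng, name, lat, lng, ...] into a list of dicts."""
    if len(cells) % 3 != 0:
        # Assumption: a page that breaks the three-cell pattern is an error
        raise ValueError("expected name/latitude/longitude triples")
    return [
        {'name': cells[i],
         'lat': float(cells[i + 1]),
         'lng': float(cells[i + 2])}
        for i in range(0, len(cells), 3)
    ]


# Sample cell texts copied from the question
cells = ['New York City', '40.742298', '-73.982559',
         'New York County Courthouse', '40.714310', '-74.001930']

locations = {'12 Angry Men Locations Map': group_locations(cells)}
for place in locations['12 Angry Men Locations Map']:
    print(place['name'], place['lat'], place['lng'])
```

Converting the coordinates to float here is a design choice for later numeric use; keep them as strings if the exact text from the page matters.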