I have webpages whose table values I would like to extract and store in separate columns. I also want to extract the movie title from each URL and insert it as a new column, repeated for every row that was collected from that page.
For example (expected output):
   Location Name                                  Latitude   Longitude    movie
0  1117 Broadway (Gil's Music Shop)               47.252495  -122.439644  10-things-i-hate-about-you-locations-250
1  2715 North Junett St (Kat and Bianca's House)  47.272591  -122.474480  10-things-i-hate-about-you-locations-250
2  Aurora Bridge                                  47.646713  -122.347435  10-things-i-hate-about-you-locations-250
3  Buckaroo Tavern (closed)                       47.657841  -122.350327  10-things-i-hate-about-you-locations-250
.
.
.
What I have tried:
test = ['https://www.latlong.net/location/10-cloverfield-lane-locations-553',
        'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818']
url_test = []
for i in range(0, len(test), 1):
    df = pd.read_html(test[i])[0]
    df['movie'] = test[i].split('/')[-1]
However, this gives only the output:
   Location Name               Latitude   Longitude   movie
0  New York City               40.742298  -73.982559  12-angry-men-locations-818
1  New York County Courthouse  40.714310  -74.001930  12-angry-men-locations-818
This is missing the results for the other two URLs.
I get the feeling it's because the data is split in the pandas dataframe, so I have tried merging before appending the columns using:
url_test = []
for i in range(0, len(test), 1):
    df = pd.read_html(test[i])[0]
    df = pd.merge(df, how='inner')
    df['movie'] = test[i].split('/')[-1]
But I get the following error:
TypeError: merge() missing 1 required positional argument: 'right'
CodePudding user response:
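Each pass through your loop overwrites df, so only the table from the last URL is left at the end. Append each DataFrame to url_test inside the loop and concatenate them afterwards. Try: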
import pandas as pd

test = ['https://www.latlong.net/location/10-cloverfield-lane-locations-553',
        'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818']
url_test = []
for i in range(0, len(test), 1):
    df = pd.read_html(test[i])[0]         # first table on the page
    df['movie'] = test[i].split('/')[-1]  # last URL segment, repeated on every row
    url_test.append(df)                   # keep this page's rows
# stack all pages into one frame with a fresh 0..n-1 index
final_df = pd.concat(url_test, ignore_index=True)
print(final_df)
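
On the TypeError: pd.merge joins two DataFrames on shared keys, so it always needs both a left and a right frame. Stacking rows from several pages is a job for pd.concat instead. Here is a minimal offline sketch of the same collect-then-concat pattern, in case you want to check the logic without hitting the site. The pages dict is a hypothetical stand-in for pd.read_html(url)[0] and is not part of your code; the rows in it are copied from the outputs in the question.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_html(url)[0]: in-memory tables keyed by URL,
# so the sketch runs without network access. Rows are copied from the question.
pages = {
    'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250': pd.DataFrame({
        'Location Name': ["1117 Broadway (Gil's Music Shop)", 'Aurora Bridge'],
        'Latitude': [47.252495, 47.646713],
        'Longitude': [-122.439644, -122.347435],
    }),
    'https://www.latlong.net/location/12-angry-men-locations-818': pd.DataFrame({
        'Location Name': ['New York City', 'New York County Courthouse'],
        'Latitude': [40.742298, 40.714310],
        'Longitude': [-73.982559, -74.001930],
    }),
}

frames = []
for url, table in pages.items():
    df = table.copy()                 # leave the stand-in tables untouched
    df['movie'] = url.split('/')[-1]  # same title value on every row of this page
    frames.append(df)                 # collect instead of overwriting

# Stack the per-page frames row-wise and renumber the index from 0
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

Iterating over the URLs directly (for url in test:) would also work in the loop above and avoids the range(len(...)) indexing.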