I'm trying to get the webpage titles for a column of URLs in a dataframe.
Using:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def geturl(x):
    return (BeautifulSoup(urlopen(x)).title.get_text())
geturl('https://msn.com')
Returns: 'MSN | Outlook, Office, Skype, Bing, Breaking News, and Latest Videos'
However, when actually working with a dataframe:
import pandas as pd
data = [['1001','https://msn.com'],['1002','https://google.com'],['1003','https://yahoo.com']]
df = pd.DataFrame(data, columns=['ID', 'URL'])
df
     ID                 URL
0  1001     https://msn.com
1  1002  https://google.com
2  1003   https://yahoo.com
df['title'] = df['url'].apply(geturl())
Results in an error. Any help would be greatly appreciated.
CodePudding user response:
When I run your script, I get the following error:
  File "C:\Users\user\PycharmProjects\test\test.py", line 235, in <module>
    df['title'] = df['url'].apply(geturl())
  File "C:\Users\user\PycharmProjects\test\venv\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\user\PycharmProjects\test\venv\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'url'
Your DataFrame defines the column as 'URL', but the line below looks it up as df['url']:
df['title'] = df['url'].apply(geturl())
Column labels are case-sensitive, so this lookup raises the KeyError.
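Once the column name is fixed, the same line has a second problem: geturl() calls the function immediately with no argument, which raises a TypeError, whereas apply expects the function object itself. Below is a minimal sketch of one way to write it. The add_titles helper, the 10-second timeout, the explicit 'html.parser' argument and the None fallback for failed requests are my own assumptions, not part of your original code.

```python
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup


def geturl(url):
    """Fetch a page and return its <title> text, or None if that fails."""
    try:
        with urlopen(url, timeout=10) as response:
            soup = BeautifulSoup(response.read(), "html.parser")
    except (OSError, ValueError):
        # URLError and HTTPError subclass OSError; ValueError covers malformed URLs
        return None
    # Some pages have no <title> tag at all
    return soup.title.get_text() if soup.title else None


def add_titles(df, fetch=geturl):
    """Return a copy of df with a 'title' column built from the 'URL' column."""
    out = df.copy()
    # Pass the function itself (no parentheses); apply calls it once per value
    out["title"] = out["URL"].apply(fetch)
    return out
```

With your DataFrame this would be used as df = add_titles(df). The fetch parameter is only there so a different function can be swapped in, for example to try the column logic without hitting the network.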