Error on Getting Title for URL in Dataframe (Pandas / Python)

Time:08-05

I'm trying to get the webpage titles for a column of URLs in a dataframe.

Using:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def geturl(x):
    return (BeautifulSoup(urlopen(x)).title.get_text())

geturl('https://msn.com')

Returns: 'MSN | Outlook, Office, Skype, Bing, Breaking News, and Latest Videos'

However, when actually working with a dataframe:

import pandas as pd

data = [['1001','https://msn.com'],['1002','https://google.com'],['1003','https://yahoo.com']]
df = pd.DataFrame(data, columns=['ID', 'URL'])
df

ID  URL
0   1001    https://msn.com
1   1002    https://google.com
2   1003    https://yahoo.com

df['title'] = df['url'].apply(geturl())

Results in an error. Any help would be greatly appreciated.

CodePudding user response:

When I run your script, I get the following error:

  File "C:\Users\user\PycharmProjects\test\test.py", line 235, in <module>
    df['title'] = df['url'].apply(geturl())
  File "C:\Users\user\PycharmProjects\test\venv\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\user\PycharmProjects\test\venv\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'url'

In your DataFrame you named the column URL, but the line below looks it up as df['url']:

df['title'] = df['url'].apply(geturl())

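Column labels are case-sensitive, so df['url'] raises a KeyError; use df['URL'] instead. Once that is fixed, also pass the function itself, .apply(geturl), without parentheses: geturl() calls it with no argument and would raise a TypeError.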
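Putting both fixes together, here is a minimal sketch of how the corrected script might look. The helper names title_from_html and add_titles, the 10-second timeout, and the choice to return None for failed requests or pages without a title are assumptions added for illustration; they are not part of the original post.

```python
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup


def title_from_html(html):
    """Return the text of the <title> tag, or None if the page has none."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


def geturl(url):
    """Download one page and return its title; None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return title_from_html(response.read())
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs. Assumption: None is an
        # acceptable placeholder for rows that cannot be fetched.
        return None


def add_titles(df, fetch_title=geturl):
    """Return a copy of df with a 'title' column built from the 'URL' column."""
    out = df.copy()
    # 'URL' matches the column label exactly, and fetch_title is passed
    # as a function object (no parentheses) so apply() calls it per row.
    out["title"] = out["URL"].apply(fetch_title)
    return out


data = [["1001", "https://msn.com"], ["1002", "https://google.com"], ["1003", "https://yahoo.com"]]
df = pd.DataFrame(data, columns=["ID", "URL"])
```

Calling df = add_titles(df) then performs the actual requests, one per row. Catching the error inside geturl means a single unreachable URL leaves a missing value in that row instead of aborting the whole apply.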
