The last column of my Excel file is filled with URL links. I want to read the text from these URLs so that I can search for keywords in it. The problem is that requests.get cannot take a whole column of URLs. Can you help me with this? Thank you!!!
My current code is here:
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = pd.read_excel('/Users/LE/Downloads/url.xlsx')
url = data.URL
res = requests.get(url, headers=headers)
html = res.text
soup = BeautifulSoup(html, 'lxml')
This doesn't work because 'url' is an entire column.
CodePudding user response:
As you noticed, this line will give you the entire column:
url=data.URL
However, you can iterate over the column and access each URL individually, like so:
import pandas
data = pandas.read_excel("PATH/TO/XLSX")
for url in data.URL:
    print(url)
CodePudding user response:
You did great opening the file and extracting the column of URLs;
the last step is to loop through them, repeating the request for each URL:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# open the file
data = pd.read_excel('/Users/LE/Downloads/url.xlsx')
# get the urls
urls = data.URL
# go through every url in the urls
for url in urls:
    # do the request for this url
    res = requests.get(url, headers=headers)
    # soup-it
    html = res.text
    soup = BeautifulSoup(html, 'lxml')
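Since the end goal is keyword search, here is a minimal sketch of how the loop body could check the page text. The helper name find_keywords and the example keyword list are assumptions for illustration, not part of the question:

```python
def find_keywords(text, keywords):
    """Return the keywords that appear in the given text (case-insensitive)."""
    text_lower = text.lower()
    return [kw for kw in keywords if kw.lower() in text_lower]

# Inside the loop, after building the soup, something like:
# hits = find_keywords(soup.get_text(separator=" "), ["python", "pandas"])
# print(url, hits)
```

soup.get_text() flattens the HTML to its visible text, so a plain substring check is usually enough for simple keyword matching.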
CodePudding user response:
This line assigns the URL column of the DataFrame to 'url':
url=data.URL
'url' is now a Pandas Series object and can be iterated through with a for loop:
for u in url:
    # your request here
See the Pandas documentation on Series for more info: https://pandas.pydata.org/docs/reference/series.html
Note: it might be easier to save the content located at each URL to a local file first, and then search those saved files, so you avoid executing multiple requests for the same pages.