I have an Excel file with two columns: one with the URL_ID and the other with the URL itself. The task is to extract the data from each URL and put it in a text file. The name of the text file should be the URL_ID from the first column.
import pandas as pd
import requests
import bs4

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# read the URL_ID and URL columns from the input file
df = pd.read_csv(r'D:\Arhamdocs\Projects\question\Input.csv')
data = df.URL
name = df.URL_ID

for url in data:
    page = requests.get(url, headers=headers).text
    soup = bs4.BeautifulSoup(page, "html.parser")
    article_title = soup.find('h1', class_='entry-title')
    article = soup.find('div', class_='td-post-content')
    # print(article)
    for i in name:
        file = open('%i.txt' % i, 'w')
        for article_body in soup.find_all('p'):
            title = article_title.text
            body = article.text
            file.write(title)
            file.write(body)
        file.close()
I used the above code, but every output file ends up with the article from the last link. Help me out.
Answer:
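The problem is the nested loop: for every URL you loop over all the IDs and rewrite every file, so by the end they all contain the last article. Pairing each URL with its own ID via zip writes one file per URL: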
for url, i in zip(data, name):
    page = requests.get(url, headers=headers).text
    soup = bs4.BeautifulSoup(page, "html.parser")
    article_title = soup.find('h1', class_='entry-title')
    article = soup.find('div', class_='td-post-content')
    with open('%i.txt' % i, 'w') as fp:
        for article_body in soup.find_all('p'):
            title = article_title.text
            body = article.text
            fp.write(title)
            fp.write(body)
I wrote it as in your code, but I think the title should be outside the for loop.
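For what it's worth, here is a minimal sketch with the title and body written once per file, outside the paragraph loop. It keeps the column names (URL, URL_ID), the CSS classes (entry-title, td-post-content), and the file naming from the question; the None checks and the utf-8 encoding argument are added assumptions in case a page is missing either element.

import pandas as pd
import requests
import bs4

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}

df = pd.read_csv(r'D:\Arhamdocs\Projects\question\Input.csv')

for url, url_id in zip(df.URL, df.URL_ID):
    page = requests.get(url, headers=headers).text
    soup = bs4.BeautifulSoup(page, "html.parser")

    article_title = soup.find('h1', class_='entry-title')
    article = soup.find('div', class_='td-post-content')

    with open('%i.txt' % url_id, 'w', encoding='utf-8') as fp:
        # Write the title and the full article body once per URL.
        # No inner loop over <p> tags is needed, because article.text
        # already contains the text of every paragraph inside the div.
        if article_title is not None:
            fp.write(article_title.text + '\n')
        if article is not None:
            fp.write(article.text)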