Scraping data from multiple URLs present in an Excel file

Time:09-18

I have an Excel file with two columns: one with the URL ID and the other with the URL itself. The task is to extract data from each URL and write it to a text file. The name of each text file should be the URL_ID from the first column.

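import bs4
import pandas as pd
import requests

headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
# raw string so the backslashes in the Windows path are not read as escapes
df = pd.read_csv(r'D:\Arhamdocs\Projects\question\Input.csv')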
data = df.URL
name = df.URL_ID
for url in data:
    page = requests.get(url,headers=headers).text
    soup = bs4.BeautifulSoup(page,"html.parser")
    article_title = soup.find('h1',class_='entry-title')
    article = soup.find('div',class_='td-post-content')
    # print(article)
    for i in name:
        file=open('%i.txt'%i,'w')
        for article_body in soup.find_all('p'):
            title = article_title.text
            body = article.text
            file.write(title)
            file.write(body)
        file.close()

I used the code above, but I always get the article from the last link. Please help.

CodePudding user response:

for url, i in zip(data, name):
    page = requests.get(url,headers=headers).text
    soup = bs4.BeautifulSoup(page,"html.parser")
    article_title = soup.find('h1',class_='entry-title')
    article = soup.find('div',class_='td-post-content')
    with open('%i.txt'%i,'w') as fp:
        for article_body in soup.find_all('p'):
            title = article_title.text
            body = article.text
            fp.write(title)
            fp.write(body)

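Pairing each URL with its ID through zip(data, name) fixes the overwrite: in the original, the inner loop over name rewrote every file for each URL, so all files ended up holding the last article. I kept the rest as in your code, but I think the title should be written outside the for loop, since the loop over soup.find_all('p') writes the same title and body once per paragraph.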
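
One way to act on that last point, as a rough sketch rather than a definitive version: drop the loop over the p tags and write the title and body once per file. The helper names fetch_article and save_articles, the fetch parameter, and the timeout value are my own assumptions and not part of the original code. The class names are the ones from the question and may not match every page, so a missing element is written as an empty string.

```python
from pathlib import Path


def fetch_article(url, headers=None):
    """Download one page and return (title, body) as plain text."""
    # Third-party imports live here so save_articles() can be tried
    # with a stand-in fetcher when requests/bs4 are not installed.
    import bs4
    import requests

    response = requests.get(url, headers=headers, timeout=30)  # timeout is an assumption
    response.raise_for_status()
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # Class names come from the question; either element may be missing.
    title = soup.find("h1", class_="entry-title")
    body = soup.find("div", class_="td-post-content")
    return (
        title.get_text(strip=True) if title else "",
        body.get_text("\n", strip=True) if body else "",
    )


def save_articles(url_ids, urls, fetch=fetch_article, out_dir="."):
    """Write one <URL_ID>.txt per (URL_ID, URL) pair and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # zip keeps each URL next to its own ID, so no file is overwritten.
    for url_id, url in zip(url_ids, urls):
        title, body = fetch(url)
        path = out_dir / f"{url_id}.txt"
        # Title and body are each written exactly once per file.
        path.write_text(f"{title}\n\n{body}\n", encoding="utf-8")
        written.append(path)
    return written
```

With the DataFrame from the question, the call would look something like save_articles(df.URL_ID, df.URL, fetch=lambda u: fetch_article(u, headers)). The f-string file name also works when URL_ID is not an integer, which '%i.txt' would reject.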
