How to scrape multiple RSS feeds and store results respectively in their CSVs?


Is there a way one can scrape data from multiple RSS feeds and store results?

I'm scraping data from multiple RSS feeds and storing the results in their respective CSVs in the worst way possible: a separate .py file for each feed, and then running every .py file in the folder.

I have multiple .py files like this in a folder, with only the URL different. I'm not sure how to run them in a loop and store the results in their respective CSVs.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'anyRSSFeedLink.com'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Collect one dict per feed entry
output = []
for entry in soup.find_all('entry'):

    item = {
        'Title': entry.find('title', {'type': 'html'}).text,
        'Pubdate': entry.find('published').text,
        'Content': entry.find('content').text,
        'Link': entry.find('link')['href']
    }

    output.append(item)


df = pd.DataFrame(output)

# Hard-coded output file, one per script
df.to_csv('results/results_feed01.csv', index=False)

How can I read from a CSV that holds all the RSS feed links, like this:

[image: RSSFeedlinks]

and run them from a single scraping file, storing the output in their respective results CSVs, something like this?

[image: RSSResults]

CodePudding user response:

Is there a way one can scrape data from multiple RSS feeds and store results?

Yes, there is. Simply read your URLs into a list, or iterate directly over each line of your CSV.

feeds = ['http://feeds.bbci.co.uk/news/rss.xml', 'http://www.cbn.com/cbnnews/us/feed/']

for url in feeds:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'xml')

    # Reset the output list for each feed
    output = []
    for entry in soup.find_all('item'):

        item = {
            'Title': entry.find('title').text,
            'Pubdate': e.text if (e := entry.find('pubDate')) else None,
            'Content': entry.find('description').text,
            'Link': entry.find('link').text
        }

        output.append(item)

In each iteration you scrape the feed and save it to its own CSV, which could, for example, be named after the feed's domain, ...

 df.to_csv(f'results_feed_{url.split("/")[2]}.csv', index=False)

or use a counter if you like:

for enum, url in enumerate(feeds):
    ...
    df.to_csv(f'results_feed{enum}.csv', index=False)

Be aware: this will only work if all feeds follow the same structure; otherwise you have to make some adjustments. You should also check that the elements you try to find exist before calling methods or accessing properties on them:

'Pubdate': e.text if(e := entry.find('pubDate')) else None
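
If you want to apply that check to every field, a small helper function keeps the dictionary readable. This is only a minimal sketch; the helper name get_text is illustrative and not part of the original answer:

def get_text(entry, name):
    # Return the element's text if it exists, otherwise None
    e = entry.find(name)
    return e.text if e else None

item = {
    'Title': get_text(entry, 'title'),
    'Pubdate': get_text(entry, 'pubDate'),
    'Content': get_text(entry, 'description'),
    'Link': get_text(entry, 'link')
}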

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

feeds = ['http://feeds.bbci.co.uk/news/rss.xml', 'http://www.cbn.com/cbnnews/us/feed/']

for url in feeds:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'xml')

    # Collect one dict per <item> in the current feed
    output = []
    for entry in soup.find_all('item'):

        item = {
            'Title': entry.find('title').text,
            'Pubdate': e.text if (e := entry.find('pubDate')) else None,
            'Content': entry.find('description').text,
            'Link': entry.find('link').text
        }

        output.append(item)


    df = pd.DataFrame(output)

    # One CSV per feed, named after the feed's domain
    df.to_csv(f'results_feed_{url.split("/")[2]}.csv', index=False)
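
To read the feed links from a CSV, as asked in the question, build the feeds list from that file instead of hard-coding it. A minimal sketch, assuming the file is named feeds.csv and the URLs sit in a column called RSSFeedlinks (both names are assumptions, adjust them to your file):

import pandas as pd

# Load the feed URLs from a CSV into a list
# 'feeds.csv' and the column name 'RSSFeedlinks' are assumptions
feeds = pd.read_csv('feeds.csv')['RSSFeedlinks'].dropna().tolist()

Then run the loop from the example above over this list; each feed still ends up in its own results CSV.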