Crawl and download Readme.md files from GitHub using python


I'm trying to do an NLP task. For that purpose I need a considerable amount of Readme.md files from GitHub. This is what I am trying to do:

  1. For a given number n, I want to list the first n GitHub repositories (and their URLs) based on the number of their stars.
  2. I want to download the Readme.md file from those URLs.
  3. I want to save the Readme.md files on my hard drive, each in a separate folder. The folder name should be the name of the repository.

I'm not acquainted with crawling and web scraping, but I am relatively good with Python. I'd be thankful for some guidance on how to accomplish these steps. Any help would be appreciated.

My effort: I've searched a little and found a website (gitstar-ranking.com) that ranks GitHub repos by their stars. But that doesn't solve my problem, because getting the names or URLs of those repos from that site is again a scraping task.

CodePudding user response:

Here's my attempt using the suggestion from @Luke. I changed the minimum stars to 500 since we don't need 5 million results (stars > 500 still yields 66,513 results).
You might not need the ssl workaround (the ssl._create_unverified_context() call in the download step), but since I'm behind a proxy, it's a pain to do it properly.
The script finds files called readme.md in any combination of lower- and uppercase, but nothing else. It saves the file as README.md (uppercase), but this can be adjusted by using the actual filename.
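As an aside, if you also want readmes that aren't literally named readme.md (README.rst, README, and so on), the API has a per-repository readme endpoint that returns whatever file GitHub itself renders as the readme. This is only a rough sketch of that alternative, separate from the main script below; fetch_readme is a hypothetical helper, not something the script defines.

import json
import urllib.request

# Sketch: ask the repos/{owner}/{repo}/readme endpoint for the repo's
# readme instead of scanning the /contents listing.
def fetch_readme(full_name):  # full_name is e.g. 'torvalds/linux'
    url = 'https://api.github.com/repos/' + full_name + '/readme'
    with urllib.request.urlopen(url) as response:
        readme_json = json.loads(response.read().decode())
    # download_url points at the raw file; name keeps the original casing
    with urllib.request.urlopen(readme_json['download_url']) as response:
        return readme_json['name'], response.read().decode()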

import urllib.request
import json
import ssl
import os
import time


n = 5  # number of fetched READMEs
url = 'https://api.github.com/search/repositories?q=stars:>500&sort=stars'
request = urllib.request.urlopen(url)
page = request.read().decode()
api_json = json.loads(page)

repos = api_json['items'][:n]

for repo in repos:
    full_name = repo['full_name']
    print('fetching readme from', full_name)
    
    # find the readme url (case-insensitive match on the file name)
    contents_url = repo['url'] + '/contents'
    request = urllib.request.urlopen(contents_url)
    page = request.read().decode()
    contents_json = json.loads(page)
    readme_urls = [file['download_url'] for file in contents_json
                   if file['name'].lower() == 'readme.md']
    if not readme_urls:
        print('no readme.md found in', full_name)
        continue
    readme_url = readme_urls[0]
    
    # download readme contents
    try:
        context = ssl._create_unverified_context()  # prevent ssl problems
        request = urllib.request.urlopen(readme_url, context=context)
    except urllib.error.HTTPError as error:
        print(error)
        continue  # if the url can't be opened, there's no use to try to download anything
    readme = request.read().decode()
    
    # create folder named after repo's name and save readme.md there
    try:
        os.mkdir(repo['name'])  
    except OSError as error:
        print(error)
    with open(repo['name'] + '/README.md', 'w', encoding='utf-8') as f:
        f.write(readme)
    print('ok')
    
    # only 10 requests per min for unauthenticated requests
    if n >= 9:  # n + 1 initial request
        time.sleep(6)
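
If you need more than a handful of repositories, authenticating the API calls raises the limits (GitHub documents 30 search requests per minute with a token versus 10 without, and a much higher hourly limit for the contents calls). Below is only a sketch of how the same search call could be authenticated with urllib; it assumes you have a personal access token in a GITHUB_TOKEN environment variable, which is not part of the script above.

import json
import os
import urllib.request

# Sketch: the same search request, but with an Authorization header built
# from a personal access token (assumed to be in GITHUB_TOKEN).
token = os.environ['GITHUB_TOKEN']
authed_request = urllib.request.Request(
    'https://api.github.com/search/repositories?q=stars:>500&sort=stars',
    headers={'Authorization': 'token ' + token},
)
with urllib.request.urlopen(authed_request) as response:
    api_json = json.loads(response.read().decode())
print(len(api_json['items']), 'repositories on the first result page')

With a token, the time.sleep(6) throttle in the loop above can probably be relaxed, since the authenticated search limit is three times higher.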