Scraping Google Scholar, article title


I need to create a list of articles for a study using Google Scholar (among others), and I get several thousand results. Copy-pasting these manually would take forever.

I wrote the following code using BeautifulSoup (I'm a noob), but I get no output.

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

article_info = []
pages = np.arange(1, 1770, 10)  # Since I got 17,700 results, and there are 10 articles per page


for page in pages:
    page = requests.get("https://scholar.google.com/scholar?start=" + str(pages) + "&q=mitochondrial synthesis&hl=en&as_sdt=0,5&as_ylo=2020&as_yhi=2022&as_rr=1")

    soup = BeautifulSoup(html, 'html.parser')
    article_names = soup.findAll('div', attrs={'class':'gs_r gs_or gs_scl'})
    for store in article_names:
        name = store.h3.a.text
        article_info.append(name)

article_list = pd.DataFrame({'Article name': article_info})
article_list

I think all article names (text) are in a div with class "gs_r gs_or gs_scl", which contains an h3, which in turn contains an a tag.

(Screenshot: Google Scholar HTML)

But I get no output. (Screenshot: my result)

Grateful for any advice. Thanks and best regards,

CodePudding user response:

Rather than np.arange, try the range function to set the first and last result offsets for your URLs; note that Google Scholar's start parameter is a result offset, not a page number, so it should advance in steps of 10 (as your own comment about 10 articles per page suggests). I added a time.sleep() call to slow down the requests. You can try different values to test the server's tolerance; if you request too quickly you will get HTTP 429 (Too Many Requests) responses. I don't know if you truly need a data frame, but I also added the option to write it to a CSV file. The CSV has two columns: the article title and the URL.

import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup

article_info = []

# start is the result offset: 0, 10, 20, ... (10 results per page)
for start in range(0, 200, 10):
    page = requests.get("https://scholar.google.com/scholar?start=" + str(start) + "&q=mitochondrial synthesis&hl=en&as_sdt=0,5&as_ylo=2020&as_yhi=2022&as_rr=1")
    # http status code
    print(page)
    # sleep between requests
    time.sleep(60)
     
    soup = BeautifulSoup(page.text, 'html.parser')

    article_names = soup.find_all('div', attrs={'class': 'gs_r gs_or gs_scl'})
    for store in article_names:
        title = store.h3.a.text
        url = store.h3.a['href']
        article_info.append([title, url])

article_list = pd.DataFrame(article_info, columns=['Article name', 'URL'])

# write to CSV
article_list.to_csv('scholar_titles.csv', index=False)

Output:

Mechanisms and regulation of protein synthesis in mitochondria   | https://www.nature.com/articles/s41580-021-00332-2
Mitochondrial OXPHOS biogenesis: Co-regulation of protein synthesis, import, and assembly pathways | https://www.mdpi.com/1422-0067/21/11/3820/pdf?version=1591105531
...
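Since the answer mentions HTTP 429 responses, here is a minimal sketch of wrapping the request in an exponential backoff loop instead of a fixed sleep. The fetch_with_backoff helper and its parameter names are my own additions, not part of the original answer:

```python
import time
import requests

def fetch_with_backoff(url, max_retries=5, base_delay=60):
    """Fetch a URL, backing off exponentially on HTTP 429 responses.

    This helper is a sketch: base_delay and max_retries are
    assumptions, not values tested against Google Scholar.
    """
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Rate-limited: wait longer after each 429 before retrying
        delay = base_delay * (2 ** attempt)
        print(f"Got 429, sleeping {delay}s before retry {attempt + 1}")
        time.sleep(delay)
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```

In the loop above you would then call `page = fetch_with_backoff(url)` instead of `requests.get(url)` plus the fixed `time.sleep(60)`.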