Using Python, BeautifulSoup, CSV to scrape a URL

From this URL, https://doc8643.com/aircrafts, I want to scrape all rows.

Then for each individual row, for example https://doc8643.com/aircraft/A139, I want to scrape these three areas of data:

<table>
<h4>Manufacturers</h4>
<h4>Technical Data</h4>

Can this be done in Python?

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://doc8643.com/aircrafts'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

with open('doc8643.csv', "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)

    while True:
        print(url)
        html = requests.get(url, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')

        # Go through the list of aircraft and extract the text of each <h3>
        for row in soup.select('ul.nav.nav-pills.nav-stacked li.aircraft_item'):
            writer.writerow([c.text if c.text else '' for c in row.select('h3')])
            print(row)

        # If there is more than one page, follow the link to the next one
        if soup.select_one('ul.pagination li.active + li a'):
            url = soup.select_one('ul.pagination li.active + li a')['href']
        else:
            break

CodePudding user response:

You should create a function that takes the value of c.text (e.g. A139), builds the full URL like https://doc8643.com/aircraft/A139, and runs requests and BeautifulSoup to get all the needed data.

def scrape_details(number):
    url = 'https://doc8643.com/aircraft/' + number
    print('details:', url)
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # ... scrape details and put in list `results` ...

    return results

and call it in your loop:

        for row in soup.select('ul.nav.nav-pills.nav-stacked li.aircraft_item'):
            data = [c.text if c.text else '' for c in row.select('h3')]
            for item in data:
                values = scrape_details(item)
                writer.writerow([item] + values)

The hardest part is scraping the details.

For some of the details you need to scrape each dl, then all the dt and dd inside it, and use zip() to group them in pairs.

Something like

def scrape_details(number):
    url = 'https://doc8643.com/aircraft/' + number
    print('details:', url)
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    results = []

    all_dl = soup.find_all('dl')

    for item in all_dl:
        all_dt = item.find_all('dt')
        all_dd = item.find_all('dd')
        for dt, dd in zip(all_dt, all_dd):
            pair = f"{dt.string}: {dd.string}" 
            results.append(pair)
            print(pair)

    #print(results)

    return results

but this needs more code, and I skip that part here.
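
As a hedged sketch of what that skipped part might look like for the <h4> sections from the question: the helper name scrape_h4_section is mine, and the assumption that the data sits in the sibling element right after each heading is a guess about the page structure, not verified against the real HTML.

def scrape_h4_section(soup, heading):
    # Find the <h4> with the given text, e.g. 'Manufacturers' or 'Technical Data'
    h4 = soup.find('h4', string=heading)
    if h4 is None:
        return []
    # Assumption: the data lives in the element right after the heading
    section = h4.find_next_sibling()
    if section is None:
        return []
    # Collect the visible text of that section, one string per piece
    return list(section.stripped_strings)

Inside scrape_details it could then be called as results += scrape_h4_section(soup, 'Manufacturers').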


Minimal working code

EDIT: I added url = 'https://doc8643.com' + url to turn the relative pagination link into an absolute URL.

import csv
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

# --- functions ---

def scrape_details(number):
    url = 'https://doc8643.com/aircraft/' + number
    print('details:', url)
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    results = []
    
    all_dl = soup.find_all('dl')
    
    for item in all_dl:
        all_dt = item.find_all('dt')
        all_dd = item.find_all('dd')
        for dt, dd in zip(all_dt, all_dd):
            pair = f"{dt.string}: {dd.string}"
            results.append(pair)
            print(pair)

    #print(results)

    return results

# --- main ---

url = 'https://doc8643.com/aircrafts'

with open('doc8643.csv', "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["data1", "data2", "data3", "etc..."])

    while True:
        print('url:', url)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Go through the list of aircraft and extract the text of each <h3>
        for row in soup.select('ul.nav.nav-pills.nav-stacked li.aircraft_item'):
            data = [c.text if c.text else '' for c in row.select('h3')]
            for item in data:
                values = scrape_details(item)
                writer.writerow([item] + values)

        # If there is more than one page, follow the link to the next one
        if soup.select_one('ul.pagination li.active + li a'):
            url = soup.select_one('ul.pagination li.active + li a')['href']
            url = 'https://doc8643.com' + url  # pagination hrefs are relative
        else:
            break

BTW:

Maybe it would be better to keep results as a dictionary

results[dt.string] = [dd.string]
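
A minimal sketch of that variant (the helper name scrape_details_as_dict is mine; it assumes one dd per dt, as in the code above). With a dictionary, csv.DictWriter could then write one column per dt label:

def scrape_details_as_dict(number):
    url = 'https://doc8643.com/aircraft/' + number
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Key every value by its <dt> label instead of building flat strings
    results = {}
    for dl in soup.find_all('dl'):
        for dt, dd in zip(dl.find_all('dt'), dl.find_all('dd')):
            results[dt.string] = dd.string
    return results

# usage with csv.DictWriter (the fieldnames would have to be collected first)
# row = scrape_details_as_dict('A139')
# writer = csv.DictWriter(f, fieldnames=row.keys())
# writer.writeheader()
# writer.writerow(row)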