Scrapy Crawler: Build JSON file with right filter


I'm using a CSS class selector to help me out with a spider. In the Scrapy shell, the following commands give me the output of all the elements I need:

scrapy shell "https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b"

response.css(".acta-table:nth-child(3) .tc::text , .acta-table a::text").extract()

Now I need to build a JSON file from the information in the 12 tables the webpage is built on. The JSON I'm trying to build should look something like this:

{
  "DadesPartit": {
    "Temporada": "2021-2022",
    "Categoria": "Cadet",
    "Divisio": "Primera",
    "Grup": 2,
    "Jornada": 28
  },
  "TitularsCasa": [
    {
      "Nom": "IGNACIO",
      "Cognom": "FERNÁNDEZ ARTOLA",
      "Link": "https://.."
    },
    {
      "Nom": "JAIME",
      "Cognom": "FERNÁNDEZ ARTOLA",
      "Link": "https://.."
    },
    {
      "Nom": "BRUNO",
      "Cognom": "FERRÉ CORREA",
      "Link": "https://.."
    }
  ],
  "SuplentsCasa": [
    {
      "Nom": "MARC",
      "Cognom": "GIMÉNEZ ABELLA",
      "Link": "https://.."
    }
  ],
  "CosTecnicCasa": [
    {
      "Nom": "JORDI",
      "Cognom": "LORENTE VILLENA",
      "Llicencia": "E"
    }
  ],
  "TargetesCasa": [
    {
      "Nom": "IGNACIO",
      "Cognom": "FERNÁNDEZ ARTOLA",
      "Tipus": "Groga",
      "Minut": 65
    }
  ],
  "Arbitres": [
    {
      "Nom": "ALEJANDRO",
      "Cognom": "ALVAREZ MOLINA",
      "Delegacio": "Barcelona1"
    }
  ],
  "Gols": [
    {
      "Nom": "NATXO",
      "Cognom": "MONTERO RAYA",
      "Minut": 5,
      "Tipus": "Gol de penal"
    }
  ],
  "Estadi": {
    "Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA",
    "Direccio": "C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
  },
  "TitularsFora": [
    {
      "Nom": "MARTI",
      "Cognom": "MOLINA MARTIMPE",
      "Link": "https://.."
    },
    {
      "Nom": "XAVIER",
      "Cognom": "MORA AMOR",
      "Link": "https://.."
    },
    {
      "Nom": "IVAN",
      "Cognom": "ARRANZ MORALES",
      "Link": "https://.."
    }
  ],
  "SuplentsFora": [
    {
      "Nom": "OLIVER",
      "Cognom": "ALCAZAR SANCHEZ",
      "Link": "https://.."
    }
  ],
  "CosTecnicFora": [
    {
      "Nom": "RAFAEL",
      "Cognom": "ESPIGARES MARTINEZ",
      "Llicencia": "D"
    }
  ],
  "TargetesFora": [
    {
      "Nom": "ORIOL",
      "Cognom": "ALCOBA LAGE",
      "Tipus": "Groga",
      "Minut": 34
    }
  ]
}

I would like some guidance on how to build it.

Thanks, Joan

CodePudding user response:

This is much simpler with requests and pandas. You can do the following:

import requests as r
import pandas as pd

# Fetch the page, then let pandas parse every <table> element into a DataFrame
a = r.get("https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b")
table_fb = pd.read_html(a.content)

pd.read_html returns a list of DataFrames, one per table on the page, so you just have to index table_fb to get each table.
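As a minimal, self-contained sketch of that indexing (using an inline HTML snippet in place of the real page, since the table contents here are made up for illustration):

```python
import io
import pandas as pd

# A tiny stand-in for the match page: two <table> elements
html = """
<table><tr><th>Nom</th><th>Cognom</th></tr>
<tr><td>IGNACIO</td><td>FERNANDEZ ARTOLA</td></tr></table>
<table><tr><th>Minut</th><th>Tipus</th></tr>
<tr><td>65</td><td>Groga</td></tr></table>
"""

# pd.read_html returns one DataFrame per <table> element
tables = pd.read_html(io.StringIO(html))
print(len(tables))           # -> 2
print(tables[0]["Nom"][0])   # -> IGNACIO
```

Each DataFrame keeps the table's header row as its columns, so once you know which index corresponds to which table (line-ups, cards, referees, ...) you can pull the fields you need.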

Here is the scrapy alternative:

import scrapy
import pandas as pd

class ActaSpider(scrapy.Spider):

    name = 'test'
    # start_urls is enough on its own: Scrapy requests each URL
    # and calls parse with the response
    start_urls = ["https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b"]

    def parse(self, response):
        # pd.read_html parses every <table> on the page into a DataFrame
        tables = pd.read_html(response.text)
        # Convert each DataFrame to a list of row dicts so the yielded
        # item is JSON-serialisable by Scrapy's feed exports, and loop
        # instead of hard-coding a fixed number of tables
        yield {
            f'table{i}': table.to_dict(orient='records')
            for i, table in enumerate(tables, start=1)
        }
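To get from that flat table1, table2, ... output to the named structure in the question, you could pair each table with a key once you have checked the order in which the tables appear on the page. The key order below is an assumption for illustration, not something verified against the live page:

```python
import pandas as pd

# Hypothetical: this key order must be verified against the actual page
KEYS = [
    "DadesPartit", "TitularsCasa", "SuplentsCasa", "CosTecnicCasa",
    "TargetesCasa", "Arbitres", "Gols", "Estadi",
    "TitularsFora", "SuplentsFora", "CosTecnicFora", "TargetesFora",
]

def tables_to_item(tables):
    """Pair each DataFrame with a key and emit JSON-serialisable rows."""
    return {key: t.to_dict(orient="records") for key, t in zip(KEYS, tables)}

# Demo with two dummy DataFrames standing in for the scraped tables
demo = [
    pd.DataFrame({"Temporada": ["2021-2022"], "Grup": [2]}),
    pd.DataFrame({"Nom": ["IGNACIO"], "Cognom": ["FERNÁNDEZ ARTOLA"]}),
]
item = tables_to_item(demo)
print(item["DadesPartit"])   # -> [{'Temporada': '2021-2022', 'Grup': 2}]
```

Column names and per-field splitting (e.g. separating Nom from Cognom, or the player Link, which pd.read_html does not extract from anchor tags) would still need per-table handling with your CSS selectors.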
