Why is the parsing only happening on the first item of each table

Time:06-27

I'm new to Python and web scraping and would appreciate some advice. I have written the spider below, but the JSON output contains only the first element of each table. Can anyone tell me why?

import scrapy

class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']
    
    def parse (self, response):
        for actaelements in response.css('table.acta-table'):
            try:
              yield {
                'name' : actaelements.css('a::text').get(),
                'link' : actaelements.css('a').attrib['href'],
            }
            except:
              yield {
                'name' : actaelements.css('a::text').get(),
                'link' : 'Link Error',
            }
        

My ultimate goal is to create a JSON file that holds the relevant information from each table, like this:

{
  "DadesPartit":
    {
      "Temporada": "2021-2022",
      "Categoria": "Cadet",
      "Divisio": "Primera",
      "Grup": 2,
      "Jornada": 28
    },
  "TitularsCasa":
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "JAIME",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "BRUNO",
        "Cognom":"FERRÉ CORREA",
        "Link": "https://.."
      }
      
    ],
  "SuplentsCasa":
    [
      {
        "Nom": " MARC",
        "Cognom":"GIMÉNEZ ABELLA",
        "Link": "https://.."
      }
    ],
  "CosTecnicCasa":
    [
      {
        "Nom": " JORDI",
        "Cognom":"LORENTE VILLENA",
        "Llicencia": "E"
      }
    ],
  "TargetesCasa": 
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Tipus": "Groga",
        "Minut": 65
      }
    ],
  "Arbitres":
    [
      {
        "Nom": " ALEJANDRO",
        "Cognom":"ALVAREZ MOLINA",
        "Delegacio": "Barcelona1"
        
      }
    ],
  "Gols":
    [
      {
        "Nom": "NATXO",
        "Cognom":"MONTERO RAYA",
        "Minut": 5,
        "Tipus": "Gol de penal"
      }
    ],
  "Estadi":
    {
      "Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA",
      "Direccio":"C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
    },
  "TitularsFora":
    [
      {
        "Nom": "MARTI",
        "Cognom":"MOLINA MARTIMPE",
        "Link": "https://.."
      },
      {
        "Nom": " XAVIER",
        "Cognom":"MORA AMOR",
        "Link": "https://.."
      },
      {
        "Nom": " IVAN",
        "Cognom":"ARRANZ MORALES",
        "Link": "https://.."
      }
      
    ],
  "SuplentsFora":
    [
      {
        "Nom": "OLIVER",
        "Cognom":"ALCAZAR SANCHEZ",
        "Link": "https://.."
      }
    ],
  "CosTecnicFora":
    [
      {
        "Nom": " RAFAEL",
        "Cognom":"ESPIGARES MARTINEZ",
        "Llicencia": "D"
      }
    ],
  "TargetesFora": 
    [
      {
        "Nom": " ORIOL",
        "Cognom":"ALCOBA LAGE",
        "Tipus": "Groga",
        "Minut": 34
      }
    ]
}

Thanks, Joan

CodePudding user response:

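This happens because your CSS selector matches the tables, not the rows inside them. The loop therefore runs once per table, and .get() returns only the first link in each one. Loop over the rows instead. You can also remove the try/except: .get() accepts a default value that it returns when nothing matches.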

import scrapy


class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):
        for actaelements in response.css('table.acta-table tbody tr'):
            yield {
                'name': actaelements.css('a::text').get(),
                'link': actaelements.css('a::attr(href)').get(default='Link Error'),
            }
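
The difference between looping over tables and looping over rows can be reproduced without Scrapy. The following is a minimal sketch using only the standard library's xml.etree.ElementTree, not the Scrapy selector API; the HTML snippet, the player names, and the first_link helper are made up for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page: two tables, two rows each.
SAMPLE = """
<div>
  <table class="acta-table"><tbody>
    <tr><td><a href="/p/1">PLAYER ONE</a></td></tr>
    <tr><td><a href="/p/2">PLAYER TWO</a></td></tr>
  </tbody></table>
  <table class="acta-table"><tbody>
    <tr><td><a href="/p/3">PLAYER THREE</a></td></tr>
    <tr><td><a href="/p/4">PLAYER FOUR</a></td></tr>
  </tbody></table>
</div>
"""

root = ET.fromstring(SAMPLE)


def first_link(element):
    """Return the first <a> below element as a dict (hypothetical helper)."""
    anchor = element.find('.//a')  # find() returns only the first match
    if anchor is None:
        return {'name': None, 'link': 'Link Error'}
    return {'name': anchor.text, 'link': anchor.get('href', 'Link Error')}


# Looping over tables: one item per table, so only the first player of each.
per_table = [first_link(table) for table in root.iter('table')]

# Looping over rows: one item per row, so every player.
per_row = [first_link(row) for row in root.iter('tr')]

print(len(per_table), len(per_row))  # prints "2 4"
```

In the spider, .get() plays the role of find() here: it returns the first match below whatever element the loop is currently on.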

CodePudding user response:

A CSS selector returns a list of matching elements, and here each match is a whole table. The for loop therefore runs once per table, and .get() retrieves only the first link in it. One minor adjustment is to use XPath to select the rows of each table instead of the tables themselves.

Simply change your for loop to:

for actaelements in response.xpath('//table[contains(@class, "acta-table")]//tr'):

And the rest of your code should work the way you would expect.
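
Yielding one item per row still loses track of which table a row came from, which the nested JSON in the question needs. One possible way to keep that grouping is an outer loop over tables and an inner loop over rows. This is only a sketch with the standard library's xml.etree.ElementTree on a made-up snippet, not Scrapy code; the <th> headings, the section names used as keys, and the group_by_table helper are assumptions, since the real page layout is not shown here.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page; headings are an assumption.
SAMPLE = """
<div>
  <table class="acta-table">
    <thead><tr><th>Titulars</th></tr></thead>
    <tbody>
      <tr><td><a href="/p/1">PLAYER ONE</a></td></tr>
      <tr><td><a href="/p/2">PLAYER TWO</a></td></tr>
    </tbody>
  </table>
  <table class="acta-table">
    <thead><tr><th>Suplents</th></tr></thead>
    <tbody>
      <tr><td><a href="/p/3">PLAYER THREE</a></td></tr>
    </tbody>
  </table>
</div>
"""


def group_by_table(root):
    """Map each table's heading to the links found in its body rows."""
    grouped = {}
    for table in root.iter('table'):  # outer loop: one pass per table
        heading = table.findtext('.//th', default='Unknown').strip()
        rows = []
        for row in table.iterfind('./tbody/tr'):  # inner loop: every row
            anchor = row.find('.//a')
            if anchor is None:
                continue  # skip rows without a link
            rows.append({
                'Nom': (anchor.text or '').strip(),
                'Link': anchor.get('href'),
            })
        grouped[heading] = rows
    return grouped


acta = group_by_table(ET.fromstring(SAMPLE))
print(json.dumps(acta, indent=2, ensure_ascii=False))
```

In a spider the same shape would presumably be an outer loop over response.css('table.acta-table') and an inner loop over each table's rows, yielding a single dict for the page instead of one per row.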
