I scraped table data with cheerio, but it shows all the data in one place

Time:01-03

I'm trying to scrape data from a table, but some of the data all shows up in one place. For more explanation, check my code:

app.get('/test', (req, res) => {
    const fixtures = []
    axios.get('https://www.betstudy.com/soccer-stats/c/albania/1st-division/d/fixtures/', { 
    headers: { "Accept-Encoding": "gzip,deflate,compress"}})
    
    .then((response) => {
        const html = response.data
        const $ = cheerio.load(html)

        $('#leaguesub-tab-1 > div > ul > li', html).each(function(index, element) {
            const date = $('#leaguesub-tab-1 > div').find('div.leaguesub-date').text().replace(/\n/g,'').trim()
            const homeTeam = $(element).find('div.team-title.title-right').find('div').find('a').text().replace(/\n/g,'').trim()
            const awayTeam = $(element).find('div.team-title.title-left').find('div').find('a').text().replace(/\n/g,'').trim()
            const scoreOrtime = $(element).find('div.time').find('div').find('a').text().replace(/\n/g,'').trim() + ' GMT 1'
            fixtures.push({
                date,
                homeTeam,
                scoreOrtime,
                awayTeam
            })
        })
        res.json(fixtures)
    }).catch((err) => res.json(err))
});

Everything works like a charm for homeTeam, awayTeam, and the time, but the problem appears with the date.

Check the JSON response here:

[
  {
    "date": "28.01.2023 Saturday              05.02.2023 Sunday              11.02.2023 Saturday              18.02.2023 Saturday              25.02.2023 Saturday              04.03.2023 Saturday              11.03.2023 Saturday              18.03.2023 Saturday              01.04.2023 Saturday              08.04.2023 Saturday              15.04.2023 Saturday              22.04.2023 Saturday              06.05.2023 Saturday              13.05.2023 Saturday",
    "homeTeam": "Apolonia Fier",
    "scoreOrtime": "12:00 GMT 1",
    "awayTeam": "Flamurtari"
  },

I guess the issue is the table I want to scrape data from: the div tag for the date is outside the ul > li tag that contains the data.

Check the source code of the table in this image:

[Image: source code]

As you can see, each <ul> tag has its own date <div> tag. What I want is to get each ul > li item's data together with its own date.

Answer:

That date selector searches from the root of the document rather than starting from $(element), so it matches every date div on the page, and .text() glues all of their text together.

There's an extra level of hierarchy here that you may be ignoring. The site is laid out like this:

  • DATE 1
  • List of games for DATE 1:
    • game 1 on DATE 1
    • game 2 on DATE 1
    • ...
  • DATE 2
  • List of games for DATE 2:
    • game 1 on DATE 2
    • game 2 on DATE 2
    • ...
  • ...

Given this, my suggestion would be to select all dates and all game lists and glue them together into an array of objects that represent days. For each day, collect all of its games into a subarray. Each game has a home team, an away team, and a time.

For example:

const axios = require("axios");
const cheerio = require("cheerio"); // 1.0.0-rc.12
require("util").inspect.defaultOptions.depth = null;

const url = "<Your URL>";

axios
  .get(url)
  .then(({data: html}) => {
    const $ = cheerio.load(html);
    const dates = [...$("#leaguesub-tab-1 .leaguesub-date")]
      .map(e => $(e).text().trim());
    const data = [...$("#leaguesub-tab-1 .leaguesub-list")].map((e, i) => ({
      date: dates[i],
      games: [...$(e).find("li")].map(e => ({
        homeTeam: $(e).find(".title-right").text().trim(),
        awayTeam: $(e).find(".title-left").text().trim(),
        scoreOrTime: $(e).find(".time").text().trim(),
      }))
    }));
    console.log(data);
  })
  .catch(err => console.error(err));

Output (truncated):

[
  {
    date: '28.01.2023 Saturday',
    games: [
      {
        homeTeam: 'Apolonia Fier',
        awayTeam: 'Flamurtari',
        scoreOrTime: '12:00'
      },
      {
        homeTeam: 'Besëlidhja Lezhë',
        awayTeam: 'Korabi Peshkopi',
        scoreOrTime: '12:00'
      },
      // ...
    ]
  },
  {
    date: '05.02.2023 Sunday',
    games: [
      {
        homeTeam: 'Skënderbeu Korçë',
        awayTeam: 'Tërbuni Pukë',
        scoreOrTime: '13:00'
      },
      {
        homeTeam: 'Oriku',
        awayTeam: 'Dinamo Tirana',
        scoreOrTime: '13:00'
      },
      // ...
    ]
  },
  // ...
]
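Pairing dates[i] with the i-th game list relies on the page having exactly one date header per list. Here is a minimal sketch of that pairing step with a length check, so a layout change fails loudly instead of silently shifting dates. The helper name pairDays and the sample arrays are hypothetical stand-ins for the scraped values:

```javascript
// Hypothetical helper: pair each date with the game list at the same index.
// Throws if the counts differ, since index-based pairing would then be wrong.
function pairDays(dates, gameLists) {
  if (dates.length !== gameLists.length) {
    throw new Error(
      `Expected one date per game list, got ${dates.length} dates ` +
      `and ${gameLists.length} lists`
    );
  }
  return dates.map((date, i) => ({date, games: gameLists[i]}));
}

// Hand-written sample data in the same shape as the scraped values.
const sampleDates = ["28.01.2023 Saturday", "05.02.2023 Sunday"];
const sampleLists = [
  [{homeTeam: "Apolonia Fier", awayTeam: "Flamurtari", scoreOrTime: "12:00"}],
  [{homeTeam: "Oriku", awayTeam: "Dinamo Tirana", scoreOrTime: "13:00"}],
];

const days = pairDays(sampleDates, sampleLists);
console.log(days);
```

In the scraper itself, the two arguments would be the dates array and the mapped .leaguesub-list arrays.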

You could also key that outer array by date for easier lookups:

const byDate = Object.fromEntries(data.map(e => [e.date, e.games]));
//                                          or: [e.date.split(" ")[0], e.games]

Which gives:

{
  '28.01.2023 Saturday': [
    {
      homeTeam: 'Apolonia Fier',
      awayTeam: 'Flamurtari',
      scoreOrTime: '12:00'
    },
    {
      homeTeam: 'Besëlidhja Lezhë',
      awayTeam: 'Korabi Peshkopi',
      scoreOrTime: '12:00'
    },
    // ...
  ],
  '05.02.2023 Sunday': [
    {
      homeTeam: 'Skënderbeu Korçë',
      awayTeam: 'Tërbuni Pukë',
      scoreOrTime: '13:00'
    },
    {
      homeTeam: 'Oriku',
      awayTeam: 'Dinamo Tirana',
      scoreOrTime: '13:00'
    },
    // ...
  ],
  // ...
}
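As a rough illustration of what the keyed form buys you, the sketch below builds the lookup with the shorter date-only key and fetches one day's games. The sample data is hand-written to mirror the output above, and the fallback to an empty array for unknown dates is my own choice rather than part of the original code:

```javascript
// Hand-written sample in the same shape as the scraped `data` array.
const data = [
  {
    date: "28.01.2023 Saturday",
    games: [
      {homeTeam: "Apolonia Fier", awayTeam: "Flamurtari", scoreOrTime: "12:00"},
    ],
  },
  {
    date: "05.02.2023 Sunday",
    games: [
      {homeTeam: "Skënderbeu Korçë", awayTeam: "Tërbuni Pukë", scoreOrTime: "13:00"},
      {homeTeam: "Oriku", awayTeam: "Dinamo Tirana", scoreOrTime: "13:00"},
    ],
  },
];

// Key by the date part only ("05.02.2023"), dropping the weekday.
const byDate = Object.fromEntries(
  data.map(e => [e.date.split(" ")[0], e.games])
);

// Look up one day; fall back to an empty list when the date is not present.
const gamesOn = date => byDate[date] ?? [];

const labels = gamesOn("05.02.2023").map(g => `${g.homeTeam} vs ${g.awayTeam}`);
console.log(labels);
```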

If you're set on your original flattened structure, that can be done as well:

const flattened = data.flatMap(e => e.games.map(f => ({date: e.date, ...f})));

Now the date is attached repeatedly to each game:

[
  {
    date: '28.01.2023 Saturday',
    homeTeam: 'Apolonia Fier',
    awayTeam: 'Flamurtari',
    scoreOrTime: '12:00'
  },
  {
    date: '28.01.2023 Saturday',
    homeTeam: 'Besëlidhja Lezhë',
    awayTeam: 'Korabi Peshkopi',
    scoreOrTime: '12:00'
  },
  // ...
]