I'm trying to scrape data from a table, but one field comes back with the values of every row concatenated together. Here is my code:
app.get('/test', (req, res) => {
  const fixtures = []
  axios.get('https://www.betstudy.com/soccer-stats/c/albania/1st-division/d/fixtures/', {
    headers: { "Accept-Encoding": "gzip,deflate,compress" }
  })
    .then((response) => {
      const html = response.data
      const $ = cheerio.load(html)
      $('#leaguesub-tab-1 > div > ul > li', html).each(function(index, element) {
        const date = $('#leaguesub-tab-1 > div').find('div.leaguesub-date').text().replace(/\n/g,'').trim()
        const homeTeam = $(element).find('div.team-title.title-right').find('div').find('a').text().replace(/\n/g,'').trim()
        const awayTeam = $(element).find('div.team-title.title-left').find('div').find('a').text().replace(/\n/g,'').trim()
        const scoreOrtime = $(element).find('div.time').find('div').find('a').text().replace(/\n/g,'').trim() + ' GMT 1'
        fixtures.push({
          date,
          homeTeam,
          scoreOrtime,
          awayTeam
        })
      })
      res.json(fixtures)
    })
    .catch((err) => res.json(err))
});
Everything works like a charm for homeTeam, awayTeam and time, but the problem appears with date.
Check the JSON response here:
[
  {
    "date": "28.01.2023 Saturday 05.02.2023 Sunday 11.02.2023 Saturday 18.02.2023 Saturday 25.02.2023 Saturday 04.03.2023 Saturday 11.03.2023 Saturday 18.03.2023 Saturday 01.04.2023 Saturday 08.04.2023 Saturday 15.04.2023 Saturday 22.04.2023 Saturday 06.05.2023 Saturday 13.05.2023 Saturday",
    "homeTeam": "Apolonia Fier",
    "scoreOrtime": "12:00 GMT 1",
    "awayTeam": "Flamurtari"
  },
I guess the issue is the table I want to scrape, because the div tag for the date is outside the ul > li tag that contains the data.
Check the source code of the table in this image:
As you can see, each <ul> tag has its own date <div> tag. What I want is to get each ul > li item's data together with its own date.
CodePudding user response:
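Your date selector, $('#leaguesub-tab-1 > div').find('div.leaguesub-date'), searches the whole document from the root rather than relative to the current element the way $(element).find(...) does, so it matches every date on the page and .text() glues all of their text together.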
There's an extra level of hierarchy here that you may be ignoring. The site is laid out like this:
- DATE 1
- List of games for DATE 1:
  - game 1 on DATE 1
  - game 2 on DATE 1
  - ...
- DATE 2
- List of games for DATE 2:
  - game 1 on DATE 2
  - game 2 on DATE 2
  - ...
- ...
Given this, my suggestion would be to select all dates and all game lists and glue them together into an array of objects that represent days. For each day, collect all of its games into a subarray. Each game has a home team, an away team, and a time.
For example:
const axios = require("axios");
const cheerio = require("cheerio"); // 1.0.0-rc.12
require("util").inspect.defaultOptions.depth = null;

const url = "<Your URL>";

axios
  .get(url)
  .then(({data: html}) => {
    const $ = cheerio.load(html);
    const dates = [...$("#leaguesub-tab-1 .leaguesub-date")]
      .map(e => $(e).text().trim());
    const data = [...$("#leaguesub-tab-1 .leaguesub-list")].map((e, i) => ({
      date: dates[i],
      games: [...$(e).find("li")].map(e => ({
        homeTeam: $(e).find(".title-right").text().trim(),
        awayTeam: $(e).find(".title-left").text().trim(),
        scoreOrTime: $(e).find(".time").text().trim(),
      }))
    }));
    console.log(data);
  })
  .catch(err => console.error(err));
Output (truncated):
[
  {
    date: '28.01.2023 Saturday',
    games: [
      {
        homeTeam: 'Apolonia Fier',
        awayTeam: 'Flamurtari',
        scoreOrTime: '12:00'
      },
      {
        homeTeam: 'Besëlidhja Lezhë',
        awayTeam: 'Korabi Peshkopi',
        scoreOrTime: '12:00'
      },
      // ...
    ]
  },
  {
    date: '05.02.2023 Sunday',
    games: [
      {
        homeTeam: 'Skënderbeu Korçë',
        awayTeam: 'Tërbuni Pukë',
        scoreOrTime: '13:00'
      },
      {
        homeTeam: 'Oriku',
        awayTeam: 'Dinamo Tirana',
        scoreOrTime: '13:00'
      },
      // ...
    ]
  },
  // ...
]
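One caveat: the code above pairs dates[i] with the i-th game list, so the result would silently misalign if the page ever contained a date without a list, or the reverse. A minimal sketch of a guard follows; `pairDays` is a hypothetical helper name, not part of cheerio or axios, and the sample values are abbreviated from the output above.

```javascript
// Pair each date with the game list at the same index,
// failing loudly when the two arrays have different lengths.
function pairDays(dates, gameLists) {
  if (dates.length !== gameLists.length) {
    throw new Error(
      `Found ${dates.length} dates but ${gameLists.length} game lists`
    );
  }
  return dates.map((date, i) => ({ date, games: gameLists[i] }));
}

// Sample input standing in for the scraped dates and per-day game arrays.
const days = pairDays(
  ["28.01.2023 Saturday", "05.02.2023 Sunday"],
  [
    [{ homeTeam: "Apolonia Fier", awayTeam: "Flamurtari", scoreOrTime: "12:00" }],
    [{ homeTeam: "Oriku", awayTeam: "Dinamo Tirana", scoreOrTime: "13:00" }],
  ]
);
console.log(days);
```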
You could also key that outer array by date for easier lookups:
const byDate = Object.fromEntries(data.map(e => [e.date, e.games]));
// or: [e.date.split(" ")[0], e.games]
Which gives:
{
  '28.01.2023 Saturday': [
    {
      homeTeam: 'Apolonia Fier',
      awayTeam: 'Flamurtari',
      scoreOrTime: '12:00'
    },
    {
      homeTeam: 'Besëlidhja Lezhë',
      awayTeam: 'Korabi Peshkopi',
      scoreOrTime: '12:00'
    },
    // ...
  ],
  '05.02.2023 Sunday': [
    {
      homeTeam: 'Skënderbeu Korçë',
      awayTeam: 'Tërbuni Pukë',
      scoreOrTime: '13:00'
    },
    {
      homeTeam: 'Oriku',
      awayTeam: 'Dinamo Tirana',
      scoreOrTime: '13:00'
    },
    // ...
  ],
  // ...
}
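With the keyed object, a lookup might look like the sketch below. It uses the bare-date variant from the comment (splitting off the weekday), hand-written sample data in the same shape as data, and a hypothetical `gamesOn` helper that is only there for illustration.

```javascript
// Sample data in the same shape as `data`, abbreviated to two days.
const data = [
  {
    date: "28.01.2023 Saturday",
    games: [{ homeTeam: "Apolonia Fier", awayTeam: "Flamurtari", scoreOrTime: "12:00" }],
  },
  {
    date: "05.02.2023 Sunday",
    games: [{ homeTeam: "Oriku", awayTeam: "Dinamo Tirana", scoreOrTime: "13:00" }],
  },
];

// Key by the bare date, dropping the weekday name after the space.
const byDate = Object.fromEntries(
  data.map(e => [e.date.split(" ")[0], e.games])
);

// Look a day up directly; fall back to an empty list for unknown dates.
const gamesOn = date => byDate[date] ?? [];

console.log(gamesOn("05.02.2023"));
console.log(gamesOn("01.01.2024"));
```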
If you're set on your original flattened structure, that can be done as well:
const flattened = data.flatMap(e => e.games.map(f => ({date: e.date, ...f})));
Now the date is attached repeatedly to each game:
[
  {
    date: '28.01.2023 Saturday',
    homeTeam: 'Apolonia Fier',
    awayTeam: 'Flamurtari',
    scoreOrTime: '12:00'
  },
  {
    date: '28.01.2023 Saturday',
    homeTeam: 'Besëlidhja Lezhë',
    awayTeam: 'Korabi Peshkopi',
    scoreOrTime: '12:00'
  },
  // ...
]