Cheerio get text from non-unique HTML class (JS)

Time:12-01

I am trying to scrape information from a website with the following HTML format:

<tr >
<td >    <table >
        <tbody><tr>
            <td rowspan="2">
                <img src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" data-src="https://img.a.transfermarkt.technology/portrait/medium/881116-1664480529.jpg?lm=1" title="Darío Osorio" alt="Darío Osorio"  data-ll-status="loaded">            </td>
            <td >
                <a title="Darío Osorio" href="/dario-osorio/profil/spieler/881116">Darío Osorio</a>                            </td>
        </tr>
        <tr>
            <td>Right Winger</td>
        </tr>
    </tbody></table>
</td><td >18</td><td ><img src="https://tmssl.akamaized.net/images/flagge/verysmall/33.png?lm=1520611569" title="Chile" alt="Chile" ></td><td ><table >
    <tbody><tr>
        <td rowspan="2">
            <a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037"><img src="https://tmssl.akamaized.net/images/wappen/tiny/1037.png?lm=1420190110" title="Club Universidad de Chile" alt="Club Universidad de Chile" ></a>       </td>
        <td >
            <a title="Club Universidad de Chile" href="/club-universidad-de-chile/startseite/verein/1037">U. de Chile</a>       </td>
    </tr>
    <tr>
        <td>
            <img src="https://tmssl.akamaized.net/images/flagge/tiny/33.png?lm=1520611569" title="Chile" alt="Chile" > <a title="Primera División" href="/primera-division-de-chile/transfers/wettbewerb/CLPD">Primera División</a>        </td>
    </tr>
</tbody></table>
</td><td ><table >
    <tbody><tr>
        <td rowspan="2">
            <a title="Newcastle United" href="/newcastle-united/startseite/verein/762"><img src="https://tmssl.akamaized.net/images/wappen/tiny/762.png?lm=1472921161" title="Newcastle United" alt="Newcastle United" ></a>     </td>
        <td >
            <a title="Newcastle United" href="/newcastle-united/startseite/verein/762">Newcastle</a>        </td>
    </tr>
    <tr>
        <td>
            <img src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England" alt="England" > <a title="Premier League" href="/premier-league/transfers/wettbewerb/GB1">Premier League</a>      </td>
    </tr>
</tbody></table>
</td><td >-</td><td >€3.00m</td><td >? </td><td ><a title="Darío Osorio to Newcastle United?" id="27730/Newcastle United sent scouts to Chile to follow Dario Osorio. the 18-year-old is being monitored by Barcelona, ​​Wolverhampton and Newcastle United./http://www.90min.com//16127/180/Darío Osorio to Newcastle United?"  href="https://www.transfermarkt.co.uk/dario-osorio-to-newcastle-united-/thread/forum/180/thread_id/16127/post_id/27730#27730">&nbsp;&nbsp;&nbsp;</a></td></tr>

I want to scrape "Darío Osorio", "U. de Chile" and "Newcastle", each of which is the text of a different element with [] in the HTML.

I have tried a couple of different things, my most recent attempt looks like this:

$('.odd', html).each((index, el) => {
    const source = $(el)
    const information = source.find('td.main-link').first().text().trim()
    const differentInformation = source.find('a:nth-child(1)').text()
})

But I only manage to scrape "Darío Osorio", using the first() method. With my code, the variable differentInformation currently ends up as "Darío OsorioU. de ChileNewcastle". The result I want in the end is a JSON array like this:

[
  {
    "firstInfo": "Darío Osorio",
    "secondInfo": "U. de Chile",
    "thirdInfo": "Newcastle"
  },
  {
    "firstInfo": "Information",
    "secondInfo": "Different Information",
    "thirdInfo": "More Different Information"
  }
]

CodePudding user response:

After clarification in the comments, it sounds like you're looking for something like this:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "YOUR URL";

(async () => {
  // Download the page and fail early on a non-2xx status.
  const response = await fetch(url);

  if (!response.ok) {
    throw Error(response.statusText);
  }

  const html = await response.text();
  const $ = cheerio.load(html);

  // Build one object per table row (rows alternate between .odd and .even).
  const data = [...$(".items .odd, .items .even")].map(row => {
    // The first three .hauptlink cells hold the player and the two clubs.
    const [player, currentClub, interestedClub] =
      [...$(row).find(".hauptlink")].map(cell => $(cell).text().trim());
    return {player, currentClub, interestedClub};
  });
  console.log(data);
})()
  .catch(error => console.log(error));

This relies on .hauptlink, which exists on the three cells in each row that you're interested in retrieving, so that seems like the most straightforward solution. A more robust solution might be to pick out the specific <td> cells you want.

CodePudding user response:

I'm not sure I understand your requirements correctly, and I'm not familiar with Cheerio, so I couldn't test this. Still, it looks to me like you could use an attribute selector to get the correct elements directly. For example, to find all the elements with title="information", it would look something like this:

$('.odd', html).each((index, el) => {
    const source = $(el)
    const allInformation = source.find('[title="information"]')
    allInformation.each((idx, information) => {
        // `information` is a raw element, so wrap it with $() before calling .text()
        console.log($(information).text().trim())
    })
})

Edit: Thinking about this more, you don't even need to change your query. Just drop the first() and loop over the result as I did above. Your query returns a collection of every element that matches it (which is why first() can pick out the first one), so looping over that collection gives you each element's text separately.
