Excluding quotes while scraping text on a forum site

Time:10-28

I am struggling to exclude quotes when scraping a forum website. I need to scrape message texts, some of which quote an earlier message. Scraping every message together with its quote means the same text appears in the scraped data multiple times, which makes analysis difficult. Can anybody help me skip the quoted text while scraping? Here is an example page: https://forum.donanimhaber.com/toyota-touch-2-ve-touch-go-2-kullanici-tecrubeleri-ve-klavuzda-yazmayanlar--88405838 Here is the HTML of a message with a quotation:

                            <table>
                                <tbody>
                                    <tr>
                                        <td>
                                            <table style="width:100%;"><tbody><tr><td><blockquote ><i>quote:</i><br><br>Orijinalden alıntı:  DBolanci <br>   <br>  Beyler alb&#252;m kapağı ve klas&#246;r listelemeyi bende yapamadım. Mp3leri tek tek d&#252;zenledim en ince ayrıntısına kadar yazdım ama g&#246;stermiyor. Nasıl yapacaz bilgisi olan? Ayrıca ara&#231;ta navigasyon &#246;zelliğini nasıl kazandırabiliriz? servis yazılım i&#231;in &#252;cret istiyor :( <br>  </blockquote></td></tr></tbody></table> <br>  aynı soruların cevabını bende bekliyorum. yardımcı olabilecek kimse yokmu?
                                        </td>
                                    </tr>
                                </tbody>
                            </table>
                    </span>

CodePudding user response:

If you look at the HTML structure of the message, you can see that it follows this format:

<span >
  <table>
    <table> QUOTED TEXT </table>
    TEXT CONTENT
  </table>
</span>

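All you have to do is select the text directly under msg > table and skip anything under msg > table > table. With an XPath selector, this can be done with something like the following, where text() returns only the td's own text nodes and so leaves out the nested table that holds the quote: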

//span[@]/table/tbody/tr/td/text()
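
The same idea can be sketched without XPath, using only Python's standard-library html.parser. This is a rough illustration, not the answerer's code: it assumes the quote always sits in a table nested inside the message's own table, and the names DirectTextParser and message_text, as well as the shortened sample markup, are made up for the example.

```python
from html.parser import HTMLParser


class DirectTextParser(HTMLParser):
    """Collect text inside the outermost <table>, skipping nested tables."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so &#252; becomes ü
        self.table_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # Depth 1 is the message's own table; depth 2 or more is a quoted message.
        text = data.strip()
        if self.table_depth == 1 and text:
            self.chunks.append(text)


def message_text(html):
    """Return the message text with any nested (quoted) table left out."""
    parser = DirectTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Shortened, hypothetical version of the message markup from the question.
sample = """
<table><tbody><tr><td>
  <table style="width:100%;"><tbody><tr><td>
    <blockquote><i>quote:</i><br><br>Orijinalden alıntı: DBolanci<br>
    Beyler alb&#252;m kapağı ve klas&#246;r listelemeyi bende yapamadım.</blockquote>
  </td></tr></tbody></table>
  <br> aynı soruların cevabını bende bekliyorum.
</td></tr></tbody></table>
"""

print(message_text(sample))
```

Counting table depth mirrors the rule above: text at depth 1 belongs to the message, and anything deeper belongs to a quote. If a page nests tables for other reasons, the depth check would need adjusting.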