Python lxml xpath is selecting more content than expected


I'm trying to save the result of a search. A typical result is something like: https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2

In the HTML file, the information I need is within this section:

<table  id="searchResults">
    <thead>
    <tr>
        <th></th>
        <th></th>
        <th>
            <a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Symbol&amp;sortDir=Ascending"
               target="_self">Symbol</a>
        </th>
        <th>Description</th>
        <th>
            <a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Category&amp;sortDir=Ascending"
               target="_self">Category</a>
            <a  data-ga-action="Help Icon Click"
               href="/Guide/GeneCard#tocEl-2" target="_blank" title="Read more about gene categories"></a></th>
        <th>
            <a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Gifts&amp;sortDir=Ascending"
               target="_self">GIFtS</a>
            <a  data-ga-action="Help Icon Click"
               href="/Guide/GeneCard#GIFtS" target="_blank"
               title="Read more about GeneCards Inferred Functionality Scores (GIFtS)"></a></th>
        <th>
            <a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Gcid&amp;sortDir=Ascending"
               target="_self">GC id</a>
            <a  data-ga-action="Help Icon Click"
               href="/Guide/GCids" target="_blank" title="Read more about GeneCards identifiers (GC ids)"></a></th>
        <th>
            <a href="/Search/Keyword?queryString=NONHSAT072848.2&amp;pageSize=25&amp;startPage=0&amp;sort=Score&amp;sortDir=Ascending"
               target="_self">Score</a>
            <a  data-ga-action="Help Icon Click"
               href="/Guide/Search#relevance" target="_blank" title="Read more about search scores"></a></th>
    </tr>
    </thead>
    <tbody>

    <tr>
        <td >1</td>
        <td ><a href="#"></a></td>
        <td >
            <a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&amp;keywords=NONHSAT072848.2" target="_blank"
               data-track-event="Result Clicked" data-ga-label="IL1R1-AS1">IL1R1-AS1</a>
        </td>
        <td >IL1R1 Antisense RNA 1</td>
        <td >RNA Gene</td>
        <td >9</td>
        <td >GC02M102174</td>
        <td >1.29</td>
    </tr>
    </tbody>
</table>

Here is my code:

import lxml.html
import requests

NONCODE_IDs = [
    "NONHSAT072848.2",
    "NONHSAT182278.1",
    "NONHSAG077582.1",
    "NONHSAG028748.2",
    "NONHSAT151221.1",
    "NONHSAT151222.1",
    "NONHSAG000557.2"
]

# query link example: https://www.genecards.org/Search/Keyword?queryString=MAPK

my_header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
}

link_base = "https://www.genecards.org/Search/Keyword?queryString="
query_link = link_base + NONCODE_IDs[0]

response = requests.get(query_link, headers=my_header)
html = lxml.html.fromstring(response.content)
table = html.xpath('//table[@id="searchResults"]')[0]

However,

table = html.xpath('//table[@id="searchResults"]')[0]

is selecting more content than expected: lxml.html.tostring(table) returns everything from the desired <table id="searchResults"> tag all the way to the end of the HTML file.

I'm not sure what I did wrong.
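For what it's worth, a serialization that runs from the opening <table> tag to the end of the file is the typical symptom of a recovering HTML parser hitting malformed markup. Here is a minimal offline sketch (hypothetical markup, not the real page) where a missing </table> makes lxml pull the rest of the body into the table:

```python
import lxml.html

# Hypothetical snippet: the closing </table> tag is missing, so libxml2's
# recovering parser keeps the table open until </body>, and everything
# that follows becomes part of the table element.
broken = """
<html><body>
<table id="searchResults">
  <tbody><tr><td>IL1R1-AS1</td></tr></tbody>
<div>trailing content that belongs outside the table</div>
</body></html>
"""

doc = lxml.html.fromstring(broken)
table = doc.xpath('//table[@id="searchResults"]')[0]

# The stray <div> shows up inside the serialized table:
print(lxml.html.tostring(table, pretty_print=True).decode())
```

If the real page serializes the same way, the raw markup after the table is probably not closed the way the browser's DOM inspector suggests.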

CodePudding user response:

For this particular web page, BeautifulSoup works for me. I'm still looking for a general fix using lxml, though, because I'm a fan of XPath, which BeautifulSoup does not support.

Here is the BeautifulSoup code that extracts the table correctly:

from bs4 import BeautifulSoup
import requests

query_link = "https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"}

response = requests.get(query_link, headers=headers)
html = BeautifulSoup(response.content, "html.parser")
table = html.find_all("table", {"class": "table table-striped table-condensed", "id": "searchResults"})

print(table)
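If you want to keep XPath, one possible general workaround (a sketch, not specific to this site) is to let BeautifulSoup repair the markup first and then re-parse its output with lxml; the snippet below uses stand-in markup rather than the live page:

```python
from bs4 import BeautifulSoup
import lxml.html

# Stand-in markup; in practice this would be response.content.
raw = '<html><body><table id="searchResults"><tbody><tr>' \
      '<td>IL1R1-AS1</td></tr></tbody></table></body></html>'

# Let html.parser clean up the document, then hand the repaired
# HTML to lxml so XPath queries still work.
soup = BeautifulSoup(raw, "html.parser")
doc = lxml.html.fromstring(str(soup))

table = doc.xpath('//table[@id="searchResults"]')[0]
print(table.xpath('.//td/text()'))  # ['IL1R1-AS1']
```

lxml also ships lxml.html.soupparser.fromstring, which does this round trip in one call when BeautifulSoup is installed.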

CodePudding user response:

I'm still not entirely sure why this happens, but it seems that lxml's parser (unlike BeautifulSoup's) recovers from the page's markup differently: the matched <table> ends up holding the <thead> and the <tbody> as its first two children, followed by the rest of the document. So to print just those two children, try:

table = html.xpath('//table[@id="searchResults"]')[0]
print(lxml.html.tostring(table[0]).decode())
print(lxml.html.tostring(table[1]).decode())

The output should match the table in your question.
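However the table gets matched, relative XPath works the same way once the right node is in hand. A self-contained sketch, using markup shaped like the result row in the question, that pulls out the gene symbol and the cell values:

```python
import lxml.html

# Markup shaped like one result row from the question.
snippet = """
<html><body>
<table id="searchResults">
  <tbody>
    <tr>
      <td>1</td>
      <td><a href="#"></a></td>
      <td><a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1">IL1R1-AS1</a></td>
      <td>IL1R1 Antisense RNA 1</td>
      <td>RNA Gene</td>
      <td>9</td>
      <td>GC02M102174</td>
      <td>1.29</td>
    </tr>
  </tbody>
</table>
</body></html>
"""

doc = lxml.html.fromstring(snippet)
table = doc.xpath('//table[@id="searchResults"]')[0]

for row in table.xpath('.//tbody/tr'):
    # The gene symbol sits in the link to carddisp.pl.
    symbol = row.xpath('.//a[contains(@href, "carddisp")]/text()')[0]
    # text_content() flattens each cell, including the ones wrapping links.
    cells = [td.text_content().strip() for td in row.xpath('./td')]
    print(symbol, cells)
```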
