I'm trying to save search results from GeneCards. A typical result page is: https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2
In the HTML file, the information I need is within this section:
<table id="searchResults">
<thead>
<tr>
<th></th>
<th></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Symbol&sortDir=Ascending"
target="_self">Symbol</a>
</th>
<th>Description</th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Category&sortDir=Ascending"
target="_self">Category</a>
<a data-ga-action="Help Icon Click"
href="/Guide/GeneCard#tocEl-2" target="_blank" title="Read more about gene categories"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Gifts&sortDir=Ascending"
target="_self">GIFtS</a>
<a data-ga-action="Help Icon Click"
href="/Guide/GeneCard#GIFtS" target="_blank"
title="Read more about GeneCards Inferred Functionality Scores (GIFtS)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Gcid&sortDir=Ascending"
target="_self">GC id</a>
<a data-ga-action="Help Icon Click"
href="/Guide/GCids" target="_blank" title="Read more about GeneCards identifiers (GC ids)"></a></th>
<th>
<a href="/Search/Keyword?queryString=NONHSAT072848.2&pageSize=25&startPage=0&sort=Score&sortDir=Ascending"
target="_self">Score</a>
<a data-ga-action="Help Icon Click"
href="/Guide/Search#relevance" target="_blank" title="Read more about search scores"></a></th>
</tr>
</thead>
<tbody>
<tr>
<td >1</td>
<td ><a href="#"></a></td>
<td >
<a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1&keywords=NONHSAT072848.2" target="_blank"
data-track-event="Result Clicked" data-ga-label="IL1R1-AS1">IL1R1-AS1</a>
</td>
<td >IL1R1 Antisense RNA 1</td>
<td >RNA Gene</td>
<td >9</td>
<td >GC02M102174</td>
<td >1.29</td>
</tr>
</tbody>
</table>
Here is my code:
import lxml.html
import requests
NONCODE_IDs = [
"NONHSAT072848.2",
"NONHSAT182278.1",
"NONHSAG077582.1",
"NONHSAG028748.2",
"NONHSAT151221.1",
"NONHSAT151222.1",
"NONHSAG000557.2"
]
# query link example: https://www.genecards.org/Search/Keyword?queryString=MAPK
my_header = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
}
link_base = "https://www.genecards.org/Search/Keyword?queryString="
query_link = link_base + NONCODE_IDs[0]
response = requests.get(query_link, headers=my_header)
html = lxml.html.fromstring(response.content)
table = html.xpath('//table[@id="searchResults"]')[0]
However,
table = html.xpath('//table[@id="searchResults"]')[0]
is selecting more content than expected.
lxml.etree.tostring(table)
returns content starting from the desired line <table id="searchResults">
to the end of the html file.
I'm not sure what I did wrong.
CodePudding user response:
For this particular web page, BeautifulSoup works for me. However, I'm still looking for a general fix using lxml, because I'm a fan of XPath, which BeautifulSoup does not support.
Here is the BeautifulSoup code that extracts the table correctly:
from bs4 import BeautifulSoup
import requests
query_link = "https://www.genecards.org/Search/Keyword?queryString=NONHSAT072848.2"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"}
response = requests.get(query_link, headers=headers)
html = BeautifulSoup(response.content, "html.parser")
table = html.find_all("table", {"class": "table table-striped table-condensed", "id": "searchResults"})
print(table)
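If XPath is the main requirement, one option that may be worth trying is lxml's soupparser module, which lets BeautifulSoup do the parsing and returns an lxml tree that supports xpath(). The following is only a minimal sketch: it assumes bs4 is installed alongside lxml, SAMPLE_HTML is a trimmed stand-in for the real page, and whether this avoids the problem on the live GeneCards page has not been checked here.

```python
from lxml.html import soupparser

# Trimmed stand-in for the real page (an assumption for illustration only).
SAMPLE_HTML = """
<html><body>
<table id="searchResults">
<thead><tr><th>Symbol</th><th>Description</th></tr></thead>
<tbody>
<tr>
<td><a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1">IL1R1-AS1</a></td>
<td>IL1R1 Antisense RNA 1</td>
</tr>
</tbody>
</table>
<div id="footer">unrelated content</div>
</body></html>
""".strip()

# BeautifulSoup parses the markup; the result is converted to an lxml tree.
root = soupparser.fromstring(SAMPLE_HTML)

# XPath now works as usual on the converted tree.
tables = root.xpath('//table[@id="searchResults"]')
symbols = tables[0].xpath('.//td/a/text()')
print(len(tables), symbols)
```

With the real page, response.content would be passed to soupparser.fromstring in place of SAMPLE_HTML.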
CodePudding user response:
I'm still not entirely sure why this happens, but it seems that lxml (unlike BeautifulSoup) treats the table as two different tables: one containing the <thead> and the other the <tbody>. So to extract them both, try:
table = html.xpath('//table[@id="searchResults"]')[0]
print(lxml.html.tostring(table[0]).decode())
print(lxml.html.tostring(table[1]).decode())
The output should match the table shown in your question.
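Indexing an lxml element returns its child elements, so table[0] and table[1] above are the table's <thead> and <tbody> children. If the goal is the cell values rather than the serialized markup, XPath expressions relative to the table element avoid depending on tostring() altogether. A minimal sketch, assuming a trimmed copy of the table from the question (SAMPLE_HTML and extract_rows are illustrative names, not part of lxml):

```python
import lxml.html

# Trimmed copy of the table from the question, followed by unrelated markup.
SAMPLE_HTML = """
<html><body>
<table id="searchResults">
<thead><tr><th></th><th></th><th>Symbol</th><th>Description</th></tr></thead>
<tbody>
<tr>
<td>1</td>
<td><a href="#"></a></td>
<td><a href="/cgi-bin/carddisp.pl?gene=IL1R1-AS1">IL1R1-AS1</a></td>
<td>IL1R1 Antisense RNA 1</td>
</tr>
</tbody>
</table>
<div id="footer">unrelated content</div>
</body></html>
"""


def extract_rows(html_text):
    """Return the body rows of the searchResults table as lists of strings."""
    doc = lxml.html.fromstring(html_text)
    table = doc.xpath('//table[@id="searchResults"]')[0]
    rows = []
    # "./tbody/tr" is evaluated relative to the table element, so anything
    # outside the table is never visited.
    for tr in table.xpath('./tbody/tr'):
        rows.append([td.text_content().strip() for td in tr.xpath('./td')])
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

For the live page, response.content would replace SAMPLE_HTML. If the serialized markup is still needed, lxml.html.tostring(table, with_tail=False) drops any text that follows the closing tag, which tostring() includes by default.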