Is there a way to get the img alt text from an image using pandas.read_html? The page I am scraping just replaced some of its text with pictures, and the old text is now the alt text of those images. Here is an example:
<td>
<div>...
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
This is how it looked before, which was perfect for pandas.read_html:
<td>
WF
</td>
CodePudding user response:
As per the thread linked to in the other answer, read_html only extracts the text content of table cells, so anything stored in the attributes of nested tags such as <a>, <div> or <img> (the alt text, in your case) is lost. You'll need to preprocess the HTML you get from scraping before passing it to read_html.
If you're in the market for a quick and dirty solution, you could do this using regex:
import re
from io import StringIO
import pandas as pd
table_html = """
<table>
<tr>
<td>
<div>
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
</tr>
</table>"""
# Delete all <div>, </div>, <a> and </a> tags
# (\b stops the pattern from also matching tags such as <abbr> or <article>)
table_html = re.sub(r'</?div\b[^>]*>', '', table_html)
table_html = re.sub(r'</?a\b[^>]*>', '', table_html)
# Replace <img> tags with the content of their alt attribute
table_html = re.sub(r'<img\b[^>]*alt="([^"]*)"[^>]*>', r'\1', table_html)
# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
print(pd.read_html(StringIO(table_html)))
Outputs:
[       0
0  WFrag]
However, this is not a very robust solution, as it may need to be adapted to any irregular markup (for example a tag written in capital letters, or an alt attribute quoted with single quotes). A better way would be to parse the HTML string and perform the same operations with a library specifically designed for parsing HTML, such as beautifulsoup4.
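For reference, here is a rough sketch of what the beautifulsoup4 version could look like. It assumes the bs4 package is installed, plus a parser that read_html can use (such as lxml). The helper name replace_img_with_alt is made up for this example, and the sample table is the one from the question with the <img> tag closed properly.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

table_html = """
<table>
<tr>
<td>
<div>
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
</tr>
</table>"""


def replace_img_with_alt(html):
    """Return the HTML with every <img> tag replaced by its alt text.

    Hypothetical helper for illustration; images without an alt
    attribute are replaced by an empty string.
    """
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        img.replace_with(img.get("alt", ""))
    return str(soup)


cleaned_html = replace_img_with_alt(table_html)
# read_html only keeps cell text, so the surrounding <div> and <a> tags
# can stay in place; only the <img> needed to be turned into text.
tables = pd.read_html(StringIO(cleaned_html))
print(tables[0])
```

The resulting DataFrame has a single cell containing WFrag. Since the parser normalises tag names, this should also cope with tags written in capital letters, which the regex version above does not.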
CodePudding user response:
It seems like this is not possible with pandas alone. You can check out the answer in this thread: "Pandas read_html to return raw HTML contents [for certain rows/cells/etc.]".