I'm trying to scrape a web table that contains a cell starting with "<=". This cell (the bottom-right cell) is returned as a logical NA. If I change "<=" to ">=", the value is scraped without issue. I see this with rvest 1.0.2 on RStudio Workbench, but not on my laptop's RStudio, which runs rvest 1.0.0.
# Minimal example:
library(rvest)  # provides minimal_html(), html_elements(), html_table() and %>%
sample <-
minimal_html("<table>
<tbody>
<tr>
<th>Col A</th><th>Col B</th>
</tr>
<tr>
<td>>=62.000</td><td><=72.000</td>
</tr>
</tbody>
</table>")
sample %>%
rvest::html_elements("table") %>%
rvest::html_table()
Output:
[[1]]
# A tibble: 1 × 2
`Col A` `Col B`
<chr> <lgl>
1 >=62.000 NA
CodePudding user response:
I have RStudio desktop (R 4.1.1) and rvest 1.0.2. I got the following result without issue:
[[1]]
# A tibble: 1 × 2
`Col A` `Col B`
<chr> <chr>
1 >=62.000 <=72.000
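Since the same rvest version behaves differently on two machines, it may help to compare the full set-up on both. The sketch below is only a suggestion: report_versions() is a hypothetical helper name, and it assumes the difference, if any, sits in xml2 or the libxml2 library it links against rather than in rvest itself.

```r
# Hypothetical helper: collect the versions that could affect HTML parsing.
report_versions <- function() {
  out <- c(R = paste(R.version$major, R.version$minor, sep = "."))
  for (pkg in c("rvest", "xml2")) {
    # Only report packages that are actually installed
    if (requireNamespace(pkg, quietly = TRUE)) {
      out[pkg] <- as.character(utils::packageVersion(pkg))
    }
  }
  if (requireNamespace("xml2", quietly = TRUE)) {
    # Version of the libxml2 C library that xml2 was built against
    out["libxml2"] <- as.character(xml2::libxml2_version())
  }
  out
}

versions <- report_versions()
print(versions)
```

Running this on both the Workbench server and the laptop and comparing the libxml2 entry would show whether the two environments really use the same parser.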
CodePudding user response:
I think you have a set-up where the "<" is being interpreted as the start of a tag, so the sequence <td>< is treated as faulty HTML and cleaned, rather than the "<" being preserved through HTML entity encoding as &lt;.
This would be an issue with the underlying parser, presumably later fixed.
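If that is what is happening, one possible workaround is to escape stray "<" characters in the raw HTML before handing it to the parser. The following is a minimal sketch under that assumption, not something rvest provides: escape_stray_lt() is a hypothetical helper, and the regular expression simply treats any "<" that is not followed by a letter, "/", "!" or "?" as literal text.

```r
# Hypothetical helper (not part of rvest): escape any "<" that cannot start
# an opening tag, closing tag, comment/doctype, or processing instruction.
escape_stray_lt <- function(html) {
  # Negative lookahead needs perl = TRUE
  gsub("<(?![A-Za-z/!?])", "&lt;", html, perl = TRUE)
}

raw_html <- "<table><tbody>
<tr><th>Col A</th><th>Col B</th></tr>
<tr><td>>=62.000</td><td><=72.000</td></tr>
</tbody></table>"

fixed_html <- escape_stray_lt(raw_html)

# The escaping is base R; parse only if rvest is installed
if (requireNamespace("rvest", quietly = TRUE)) {
  page <- rvest::minimal_html(fixed_html)
  tables <- rvest::html_table(rvest::html_elements(page, "table"))
  print(tables[[1]])
}
```

For a real page, the raw HTML would first have to be read as text (for example with readLines() and paste()) instead of being parsed directly from the URL. The pattern is deliberately crude: it will not catch a "<" that is directly followed by a letter (such as "a<b"), and it would also rewrite "<" inside inline scripts, so it is best applied only to pages where the problem cells are known.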
On your set-up, printing sample %>% html_node('body') %>% toString() gives
<tr>\n<td>>=62.000</td>\n<td>\n</td>\n</tr>
which seems to at least align with this reasoning.
I went looking for evidence and came across the following report for the lxml HTML parser: "lxml truncates text that contains 'less than' character", which seems to align with my supposition.