Web scraping of tables with R rvest, bottom right cell starting with "<=" is returned a

Time:12-15

I'm trying to scrape a web table that contains a cell starting with "<=". This cell (the bottom right cell) is returned as a logical NA. If I change "<=" to ">=", the value is scraped without issue. I see the problem with rvest 1.0.2 on RStudio Workbench, but not on my laptop's RStudio running rvest 1.0.0.

# Minimal example:
library(rvest)  # provides minimal_html(), html_elements(), html_table() and %>%
sample <- 
  minimal_html("<table>
               <tbody>
               <tr>
               <th>Col A</th><th>Col B</th>
               </tr>
               <tr>
               <td>>=62.000</td><td><=72.000</td>
               </tr>
               </tbody>
               </table>")
sample %>% 
  rvest::html_elements("table") %>% 
  rvest::html_table()

Output:

[[1]]
# A tibble: 1 × 2
  `Col A`  `Col B`
  <chr>    <lgl>  
1 >=62.000 NA    

CodePudding user response:

I have RStudio desktop (R 4.1.1) and rvest 1.0.2. I got the following result without issue:

[[1]]
# A tibble: 1 × 2
  `Col A`  `Col B` 
  <chr>    <chr>   
1 >=62.000 <=72.000

CodePudding user response:

I think that on your set-up the "<" is being interpreted as the start of a tag. The sequence <td>< is then treated as faulty HTML and cleaned up, rather than the "<" being preserved through HTML entity encoding as &lt;.

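That would be an issue with the underlying parser (rvest parses HTML via xml2, which wraps libxml2) rather than with rvest itself, and it was presumably fixed in a later version.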
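If that is what is happening, one possible workaround is to entity-encode stray "<" characters before the HTML reaches the parser. The sketch below is only an illustration: escape_bare_lt() is a hypothetical helper (not part of rvest or xml2), and it assumes that a "<" not followed by a letter, "/", "!" or "?" is never the start of a real tag. It would also rewrite such characters inside <script> or <style> blocks, so treat it as a starting point rather than a general fix.

```r
# Hypothetical helper (not part of rvest): entity-encode any "<" that cannot
# start a tag, i.e. one that is not followed by a letter, "/", "!" or "?".
escape_bare_lt <- function(html) {
  gsub("<(?![A-Za-z/!?])", "&lt;", html, perl = TRUE)
}

raw_html <- "<table>
  <tr><th>Col A</th><th>Col B</th></tr>
  <tr><td>>=62.000</td><td><=72.000</td></tr>
</table>"

# The bare "<" in "<=72.000" becomes "&lt;"; real tags are left alone.
fixed_html <- escape_bare_lt(raw_html)

# Parse only if rvest is installed, so the helper can be tried on its own.
if (requireNamespace("rvest", quietly = TRUE)) {
  tables <- rvest::html_table(
    rvest::html_elements(rvest::minimal_html(fixed_html), "table")
  )
  print(tables[[1]])
}
```

For a real page, the same idea would mean downloading the page source as text first, escaping it, and only then passing the string to the parser.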

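On your set-up, printing sample %>% html_node('body') %>% toString() results in

<tr>
  \n
  <td>&gt;=62.000</td>
  \n
  <td>\n</td>
  \n
</tr>

which at least seems to align with this reasoning: the ">" in the first cell survives as &gt;, while the second cell has lost its text entirely.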

I went looking for evidence and came across a report for the 'lxml' HTML parser, "lxml truncates text that contains 'less than' character", which seems to support my supposition.
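Since the two machines in the question differ in more than the rvest version, it may help to compare the parser stack on each. A minimal sketch, assuming xml2 is installed alongside rvest (rvest depends on it): xml2::libxml2_version() reports the libxml2 version that xml2 is using, and the versions list is just an illustrative container.

```r
# Collect the versions relevant to HTML parsing on this machine.
versions <- list(R = getRversion())

for (pkg in c("rvest", "xml2")) {
  # Skip packages that are not installed instead of failing.
  if (requireNamespace(pkg, quietly = TRUE)) {
    versions[[pkg]] <- packageVersion(pkg)
  }
}

# libxml2 is the C library that does the actual HTML parsing for xml2.
if (requireNamespace("xml2", quietly = TRUE)) {
  versions$libxml2 <- xml2::libxml2_version()
}

print(versions)
```

Running this on both the RStudio Workbench server and the laptop should show whether the libxml2 versions differ, which would be consistent with the parser explanation above.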
