Placing "NA" into an Empty Position?

I am trying to scrape name and address information from Yellow Pages (https://www.yellowpages.ca/). I have a function (taken from the question "(R) Webscraping Error : arguments imply differing number of rows: 1, 0") that retrieves this information:

library(rvest)
library(dplyr)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

However, the address is not always present. For example, several barbers are listed on this page, https://www.yellowpages.ca/search/si/1/barber/Sudbury ON, and all but one of them have an address. As a result, when I run the function, I get the following error:

scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury ON")

Error:
! Tibble columns must have compatible sizes.
* Size 14: Existing data.
* Size 12: Column `address`.
i Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
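
The size mismatch can be reproduced without the live site. The following is a minimal sketch, assuming a made-up HTML fragment: the ".listing" wrapper class and the sample values are hypothetical, and only the two child class names come from the function above. It suggests that html_elements() simply skips listings that lack the requested child, so the two vectors differ in length.

```r
library(rvest)

# Hypothetical fragment: three listings, the last one has no address.
page <- minimal_html('
  <div class="listing"><a class="jsListingName">barber111</a>
    <span class="listing__address--full">address111</span></div>
  <div class="listing"><a class="jsListingName">barber222</a>
    <span class="listing__address--full">address222</span></div>
  <div class="listing"><a class="jsListingName">barber333</a></div>
')

# html_elements() returns only the nodes that exist on the page,
# so nothing marks the position of the missing address.
names_found <- page %>% html_elements(".jsListingName") %>% html_text2()
addresses_found <- page %>% html_elements(".listing__address--full") %>% html_text2()

length(names_found)      # 3
length(addresses_found)  # 2
```

Passing these two vectors to tibble() would fail in the same way as the real page, because their lengths differ and neither has length one.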

My question: is there some way to modify the "scraper" function so that an NA appears in a row when no address is listed? For example:

     barber    address
1 barber111 address111
2 barber222 address222
3 barber333         NA

Is there some way I could add a statement similar to CASE WHEN that would grab the address or place an NA when the address is not there?

User response:

To match the businesses with their addresses, it is best to find the root node of each listing and take the text from the relevant child node. If the child node is missing, you can insert an NA instead.

library(rvest)
library(dplyr)

scraper <- function(url) {
  # One root node per listing, so names and addresses stay aligned
  nodes <- read_html(url) %>% html_elements(".listing_right_section")

  # Text of the first child matching `css`, or NA when there is none
  text_or_na <- function(node, css) {
    x <- html_text2(html_elements(node, css = css))
    if (length(x)) x[1] else NA_character_
  }

  tibble(
    name    = vapply(nodes, text_or_na, character(1), css = ".jsListingName"),
    address = vapply(nodes, text_or_na, character(1), css = ".listing__address--full")
  )
}

So now we can do:

scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury ON")
#> # A tibble: 14 x 2
#>    name                                      address                            
#>    <chr>                                     <chr>                              
#>  1 Lords'n Ladies Hair Design                1560 Lasalle Blvd, Sudbury, ON P3A~
#>  2 Jo's The Lively Barber                    611 Main St, Lively, ON P3Y 1M9    
#>  3 Hairapy Studio 517 & Barber Shop          517 Notre Dame Ave, Sudbury, ON P3~
#>  4 Nickel Range Unisex Hairstyling           111 Larch St, Sudbury, ON P3E 4T5  
#>  5 Ugo Barber & Hairstyling                  911 Lorne St, Sudbury, ON P3C 4R7  
#>  6 Gordon's Hairstyling                      19 Durham St, Sudbury, ON P3C 5E2  
#>  7 Valley Plaza Barber Shop                  5085 Highway 69 N, Hanmer, ON P3P ~
#>  8 Rick's Hairstyling Shop                   28 Young St, Capreol, ON P0M 1H0   
#>  9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3    
#> 10 Pat's Hairstylists                        33 Godfrey Dr, Copper Cliff, ON P0~
#> 11 WildRootz Hair Studio                     911 Lorne St, Sudbury, ON P3C 4R7  
#> 12 Sleek Barber Bar                          324 Elm St, ON P3C 1V8             
#> 13 Faiella Classic Hair                      <NA>                               
#> 14 Ben's Barbershop & Hairstyling            <NA>

Created on 2022-09-16 with reprex v2.0.2
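
As a possible alternative to the length check, rvest's html_element() (singular) returns exactly one result per input node and yields NA where nothing matches. The following is a minimal sketch on a made-up HTML fragment, not a tested replacement for the function above: the ".listing" wrapper class and the sample values are hypothetical, chosen to mirror the example table in the question.

```r
library(rvest)
library(dplyr)

# Hypothetical fragment: three listings, the last one has no address.
page <- minimal_html('
  <div class="listing"><a class="jsListingName">barber111</a>
    <span class="listing__address--full">address111</span></div>
  <div class="listing"><a class="jsListingName">barber222</a>
    <span class="listing__address--full">address222</span></div>
  <div class="listing"><a class="jsListingName">barber333</a></div>
')

# One root node per listing.
listings <- html_elements(page, ".listing")

# html_element() keeps one slot per listing; a missing child becomes NA.
result <- tibble(
  name    = listings %>% html_element(".jsListingName") %>% html_text2(),
  address = listings %>% html_element(".listing__address--full") %>% html_text2()
)

result
```

On the real page, the ".listing_right_section" selector used above would presumably take the place of ".listing"; that combination was not re-run against the live site here.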