Placing "NA" into an Empty Position?

Time:09-17

I am trying to scrape name and address information from Yellow Pages (https://www.yellowpages.ca/). I have a function (from the question "(R) Webscraping Error : arguments imply differing number of rows: 1, 0") that retrieves this information:

library(rvest)
library(dplyr)

scraper <- function(url) {
  page <- url %>% 
    read_html()
  
  tibble(
    name = page %>%  
      html_elements(".jsListingName") %>% 
      html_text2(),
    address = page %>% 
      html_elements(".listing__address--full") %>% 
      html_text2()
  )
}

However, the address is not always present. For example, several barbers are listed on this page, https://www.yellowpages.ca/search/si/1/barber/Sudbury ON, and all but one of them have an address. As a result, when I run this function, I get the following error:

scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury ON")

Error:
! Tibble columns must have compatible sizes.
* Size 14: Existing data.
* Size 12: Column `address`.
i Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.

My question: is there some way to modify the definition of the "scraper" function so that, when no address is listed, an NA appears in that row? For example:

     barber    address
1 barber111 address111
2 barber222 address222
3 barber333         NA

Is there some way I could add a statement similar to CASE WHEN that would grab the address or place an NA when the address is not there?

Answer:

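To match each business with its address, it is best to find the root node of each listing and extract the text from the relevant child node. Selecting all names and all addresses from the whole page returns vectors of different lengths (14 names but only 12 addresses here), so the two cannot be paired up. Working listing by listing, you can insert an NA whenever the child node is missing: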

library(rvest)
library(dplyr)

scraper <- function(url) {
  # Each listing has its own root node, so names and addresses stay paired
  nodes <- read_html(url) %>% html_elements(".listing_right_section")

  # Text of the first child matching `css`, or NA if the listing has none
  text_or_na <- function(node, css) {
    x <- html_text2(html_elements(node, css = css))
    if (length(x)) x[[1]] else NA_character_
  }

  tibble(
    name    = sapply(nodes, text_or_na, css = ".jsListingName"),
    address = sapply(nodes, text_or_na, css = ".listing__address--full")
  )
}

So now we can do:

scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury ON")
#> # A tibble: 14 x 2
#>    name                                      address                            
#>    <chr>                                     <chr>                              
#>  1 Lords'n Ladies Hair Design                1560 Lasalle Blvd, Sudbury, ON P3A~
#>  2 Jo's The Lively Barber                    611 Main St, Lively, ON P3Y 1M9    
#>  3 Hairapy Studio 517 & Barber Shop          517 Notre Dame Ave, Sudbury, ON P3~
#>  4 Nickel Range Unisex Hairstyling           111 Larch St, Sudbury, ON P3E 4T5  
#>  5 Ugo Barber & Hairstyling                  911 Lorne St, Sudbury, ON P3C 4R7  
#>  6 Gordon's Hairstyling                      19 Durham St, Sudbury, ON P3C 5E2  
#>  7 Valley Plaza Barber Shop                  5085 Highway 69 N, Hanmer, ON P3P ~
#>  8 Rick's Hairstyling Shop                   28 Young St, Capreol, ON P0M 1H0   
#>  9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3    
#> 10 Pat's Hairstylists                        33 Godfrey Dr, Copper Cliff, ON P0~
#> 11 WildRootz Hair Studio                     911 Lorne St, Sudbury, ON P3C 4R7  
#> 12 Sleek Barber Bar                          324 Elm St, ON P3C 1V8             
#> 13 Faiella Classic Hair                      <NA>                               
#> 14 Ben's Barbershop & Hairstyling            <NA>

Created on 2022-09-16 with reprex v2.0.2
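
As a possible alternative, rvest's singular html_element() is documented to return exactly one result per input node, with a missing node where nothing matches, and html_text2() turns such a missing node into NA. The sketch below illustrates this on a small inline HTML snippet rather than the live site. The snippet is hypothetical: it only reuses the class names from the scraper above, and rvest 1.0.0 or later is assumed for minimal_html() and html_element().

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for a results page: the third listing has no address.
page <- minimal_html('
  <div class="listing_right_section">
    <a class="jsListingName">barber111</a>
    <span class="listing__address--full">address111</span>
  </div>
  <div class="listing_right_section">
    <a class="jsListingName">barber222</a>
    <span class="listing__address--full">address222</span>
  </div>
  <div class="listing_right_section">
    <a class="jsListingName">barber333</a>
  </div>
')

# One root node per listing keeps names and addresses aligned.
listings <- page %>% html_elements(".listing_right_section")

# html_element() (singular) yields exactly one result per listing;
# a listing with no matching child becomes NA after html_text2().
result <- tibble(
  name    = listings %>% html_element(".jsListingName") %>% html_text2(),
  address = listings %>% html_element(".listing__address--full") %>% html_text2()
)

print(result)
```

If this behaves the same way on the real page, the body of scraper() could be reduced to the two html_element() pipelines, with read_html(url) in place of minimal_html().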
