Using purrr's possibly over a list to convert empty tibble columns to NA values

Time:07-11

I have a function that uses rvest to extract data from a webpage. The function is shown below (its details are not so important):

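library(rvest)  # html_elements(), html_nodes(), html_attr(), ...
library(dplyr)  # bind_cols(), %>%

# CardPackMinimal is assumed to be a set of nodes scraped earlier (not shown)
processCardPackMinimalRealtorInfo = function(rowPosition){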
  # collect realtor information
  realEstateInformation = bind_cols(
    realEstateCompanyName = CardPackMinimal[rowPosition] %>% 
      html_elements('.re-CardPromotionLogo') %>% 
      html_nodes("a") %>% 
      html_children() %>% 
      html_attr("title"),
    
    realEstatePageLink = CardPackMinimal[rowPosition] %>% 
      html_elements('.re-CardPromotionLogo') %>% 
      html_nodes("a") %>% 
      html_attr('href') %>% 
      paste("https://www.fotocasa.es", ., sep = "")
  )
  return(realEstateInformation)
}

The function runs without error, but when it encounters no information it returns a tibble with 0 rows. I tried wrapping it in purrr's possibly function to return NA values in that case, but I cannot seem to get possibly to return a data frame of NAs when there is no information.

possiblyProcessCardPackMinimalRealtorInfo = possibly(processCardPackMinimalRealtorInfo,
                                                     otherwise = tibble(
                                                       realEstateCompanyName = NA_character_,
                                                       realEstatePageLink = NA_character_
                                                     ))

My question is: how can I correct the possibly function so that it returns NA values when the collected data does not exist, i.e. when the tibble is 0 x 2 (the two columns being realEstateCompanyName and realEstatePageLink, generated in the original function)?

Apologies in advance for the lack of dput output or sample data; the data comes from web scraping and takes a few hours to process.

Answer:

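possibly only substitutes its otherwise value when the wrapped function throws an error, and a 0-row tibble is a valid return value, not an error. So processCardPackMinimalRealtorInfo should throw an error when it produces no rows, which possibly can then handle: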

library(tibble)
library(purrr)

data0 <- tibble(realEstateCompanyName = character(0),
                realEstatePageLink = character(0))

data1 <- tibble(realEstateCompanyName = "a",
                realEstatePageLink = "b")


# Toy stand-in for the scraper: raise an error when the result has 0 rows
processCardPackMinimalRealtorInfo <- function(data) {
  if (nrow(data) == 0) stop("no rows")
  data
}

processCardPackMinimalRealtorInfo(data0)
#> Error in processCardPackMinimalRealtorInfo(data0): no rows

list(data1,data0) %>% map(possibly(processCardPackMinimalRealtorInfo,
                                   otherwise = tibble(
                                     realEstateCompanyName = NA_character_,
                                     realEstatePageLink = NA_character_
                                   )))
#> [[1]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>             
#> 1 a                     b                 
#> 
#> [[2]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>             
#> 1 <NA>                  <NA>
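
To see why the attempt in the question kept returning the empty tibble, it may help to compare a function that returns 0 rows with one that actually errors. This is only a minimal sketch: returnsEmpty, fallback, safeEmpty and safeError are hypothetical names introduced for illustration, and returnsEmpty merely stands in for the real scraper.

```r
library(tibble)
library(purrr)

# Hypothetical stand-in for the scraper: returns a 0 x 2 tibble, no error
returnsEmpty <- function() {
  tibble(realEstateCompanyName = character(0),
         realEstatePageLink = character(0))
}

# The NA row we would like to get back when nothing was found
fallback <- tibble(realEstateCompanyName = NA_character_,
                   realEstatePageLink = NA_character_)

# No error is raised, so possibly() passes the empty tibble through
safeEmpty <- possibly(returnsEmpty, otherwise = fallback)
resultEmpty <- safeEmpty()
nrow(resultEmpty)

# An error is raised, so possibly() returns `otherwise` instead
safeError <- possibly(function() stop("no rows"), otherwise = fallback)
resultError <- safeError()
nrow(resultError)
```

The first call should keep its 0 rows, while the second should come back as the single NA row, which is why the function has to signal an error (or handle the empty case itself) for the fallback to appear.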

Another possibility is to handle the 0-row case in the function itself:

processCardPackMinimalRealtorInfo <- function(data) { 
  if (nrow(data)==0) data = tibble(realEstateCompanyName = NA_character_,
                                   realEstatePageLink = NA_character_)
  data
}

list(data1,data0) %>% map(processCardPackMinimalRealtorInfo)

#> [[1]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>             
#> 1 a                     b                 
#> 
#> [[2]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>             
#> 1 <NA>                  <NA>
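
If editing the body of the scraping function is inconvenient, the same 0-row check could live in a small wrapper instead. The sketch below assumes the wrapped function always returns a data frame; withNaFallback, naRow, fakeScraper and safeScraper are hypothetical names, and fakeScraper is only a placeholder for the real scraper.

```r
library(tibble)
library(purrr)

# Hypothetical helper: wraps a function `f` so that a 0-row result is
# replaced by `fallback`; non-empty results pass through unchanged.
withNaFallback <- function(f, fallback) {
  function(...) {
    out <- f(...)
    if (nrow(out) == 0) fallback else out
  }
}

naRow <- tibble(realEstateCompanyName = NA_character_,
                realEstatePageLink = NA_character_)

# Placeholder for the real scraper, which is not reproduced here
fakeScraper <- function(data) data

safeScraper <- withNaFallback(fakeScraper, naRow)

inputs <- list(
  tibble(realEstateCompanyName = "a", realEstatePageLink = "b"),
  tibble(realEstateCompanyName = character(0),
         realEstatePageLink = character(0))
)

results <- map(inputs, safeScraper)
```

A wrapper like this could in turn be passed to possibly with the same NA row as otherwise, so that genuine errors and empty results both end up as a single row of NAs.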
