I have a function which uses rvest to extract data from a webpage. The function is the following (its details are not so important):
processCardPackMinimalRealtorInfo = function(rowPosition){
  # collect realtor information
  realEstateInformation = bind_cols(
    realEstateCompanyName = CardPackMinimal[rowPosition] %>%
      html_elements('.re-CardPromotionLogo') %>%
      html_nodes("a") %>%
      html_children() %>%
      html_attr("title"),
    realEstatePageLink = CardPackMinimal[rowPosition] %>%
      html_elements('.re-CardPromotionLogo') %>%
      html_nodes("a") %>%
      html_attr('href') %>%
      paste("https://www.fotocasa.es", ., sep = "")
  )
  return(realEstateInformation)
}
The function runs without error, but when it encounters a card with no realtor information it returns a tibble with 0 rows. So I tried to wrap the function in purrr's possibly so that it returns NA values when the tibble is empty, but I cannot seem to get possibly to return a data frame of NA when there is no information.
possiblyProcessCardPackMinimalRealtorInfo = possibly(
  processCardPackMinimalRealtorInfo,
  otherwise = tibble(
    realEstateCompanyName = NA_character_,
    realEstatePageLink = NA_character_
  )
)
My question is: how can I correct the possibly call so that it returns NA values when the collected data does not exist, i.e. when the tibble is 0 x 2 (the 2 columns being realEstateCompanyName and realEstatePageLink, generated in the original function)?
Apologies in advance for the lack of dput output or sample data; the data involves web scraping and takes a few hours to process.
CodePudding user response:
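possibly only substitutes its otherwise value when the wrapped function throws an error; a 0-row tibble is a perfectly valid return value, so it is passed through unchanged. The function processCardPackMinimalRealtorInfo should therefore throw an error when no rows are output, so that this case can be handled by possibly: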
library(tibble)
library(purrr)
data0 <- tibble(realEstateCompanyName = character(0),
                realEstatePageLink = character(0))
data1 <- tibble(realEstateCompanyName = "a",
                realEstatePageLink = "b")
processCardPackMinimalRealtorInfo <- function(data) {
  # signal an error for empty input so that possibly() can catch it
  if (nrow(data) == 0) stop('no rows')
  data
}
processCardPackMinimalRealtorInfo(data0)
#> Error in processCardPackMinimalRealtorInfo(data0): no rows
list(data1, data0) %>%
  map(possibly(processCardPackMinimalRealtorInfo,
               otherwise = tibble(
                 realEstateCompanyName = NA_character_,
                 realEstatePageLink = NA_character_
               )))
#> [[1]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>
#> 1 a                     b
#>
#> [[2]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>
#> 1 <NA>                  <NA>
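To see why the original attempt had no effect, here is a minimal sketch. The function returnsEmpty is a hypothetical stand-in for the scraper (it is not part of the question) that returns a 0-row tibble without raising an error. Because nothing fails, possibly never reaches its otherwise value:

```r
library(tibble)
library(purrr)

# Hypothetical stand-in for the scraper: returns a 0-row tibble, raises no error
returnsEmpty <- function(rowPosition) {
  tibble(realEstateCompanyName = character(0),
         realEstatePageLink = character(0))
}

fallback <- tibble(realEstateCompanyName = NA_character_,
                   realEstatePageLink = NA_character_)

safeReturnsEmpty <- possibly(returnsEmpty, otherwise = fallback)

# No error occurs, so the 0-row tibble is returned as-is and `fallback` is unused
result <- safeReturnsEmpty(1)
nrow(result)
#> [1] 0
```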
Another possibility is to handle the 0-row case in the function itself:
processCardPackMinimalRealtorInfo <- function(data) {
  # replace an empty result with a single row of NA values
  if (nrow(data) == 0) {
    data <- tibble(realEstateCompanyName = NA_character_,
                   realEstatePageLink = NA_character_)
  }
  data
}
list(data1, data0) %>% map(processCardPackMinimalRealtorInfo)
#> [[1]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>
#> 1 a                     b
#>
#> [[2]]
#> # A tibble: 1 × 2
#>   realEstateCompanyName realEstatePageLink
#>   <chr>                 <chr>
#> 1 <NA>                  <NA>
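If you would rather leave the scraping function untouched, the same check can live in a small wrapper, similar in spirit to possibly. The following is only a sketch: otherwiseIfEmpty and fakeScraper are hypothetical names invented for illustration (neither is a purrr function), and fakeScraper merely imitates a scraper that returns a 0-row tibble for one position.

```r
library(tibble)
library(purrr)

# Hypothetical helper (not part of purrr): wraps `f` so that a 0-row
# result is replaced by `otherwise`; any other result passes through unchanged
otherwiseIfEmpty <- function(f, otherwise) {
  function(...) {
    result <- f(...)
    if (nrow(result) == 0) otherwise else result
  }
}

naRow <- tibble(realEstateCompanyName = NA_character_,
                realEstatePageLink = NA_character_)

# Hypothetical stand-in for the real scraper: position 2 has no realtor info
fakeScraper <- function(rowPosition) {
  if (rowPosition == 2) {
    tibble(realEstateCompanyName = character(0),
           realEstatePageLink = character(0))
  } else {
    tibble(realEstateCompanyName = "a",
           realEstatePageLink = "b")
  }
}

safeScraper <- otherwiseIfEmpty(fakeScraper, otherwise = naRow)
results <- map(1:3, safeScraper)

# Every element now has exactly one row, including the empty case
vapply(results, nrow, integer(1))
#> [1] 1 1 1
```

The wrapped function can itself be passed to possibly with the same otherwise value, so that genuine errors and empty results both end up as a row of NA.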