After data scraping, merging content with the original data in R

I scraped the pages behind the URLs listed in a CSV file, and I am trying to attach the results to the original CSV file. However, some of the links were broken and could not be scraped, so they were coded as NA.

As a result, my scraping result and the original CSV file have different numbers of rows. The CSV file has a few more rows, so my result is not compatible and cannot be added as a column.

My original CSV file has some information, such as the date and the name of the publisher, and I'd like to attach my results in line with that information by creating a new column named full_article. The line I used is the very last line of the following code:

fox_data <- news_data %>% filter(media_name  == "Fox News")
fox_urls <- fox_data[,4]
fox_url_xml4 <- apply(fox_urls, 1, readUrl) 
nan_fox_url_xml4 <- fox_url_xml4[!is.na(fox_url_xml4)]


# Extract the article body text from a parsed page; return NA if extraction fails.
textScraper4Fox <- function(x) {
  out <- tryCatch({
    html_text(html_nodes(x, ".article-body")) %>%
      str_replace_all("\n", "") %>%
      str_replace_all("\t", "") %>%
      paste(collapse = '')
  }, error = function(cond) {
    message("=====================================")
    message(paste("Error Occurred :", x))
    message(cond)
    message("=====================================")
    return(NA)
  })
  return(out)
}

fox_article_text <- lapply(nan_fox_url_xml4, textScraper4Fox)
fox_article_text

#create new column "full article"
fox_data$full_article <- fox_article_text

And the error says:

Error: Assigned data fox_article_text must be compatible with existing data.
x Existing data has 887 rows.
x Assigned data has 884 rows.
ℹ Only vectors of size 1 are recycled.

CodePudding user response:

I think you might want to consider a join, which merges the new table into the old one by adding the new column to those rows where the id is the same (maybe the URL in your case):

library(tidyverse)

original_data <- tibble(id = c(1, 2), publisher = c("A", "B"), date = c("2021-11-23", "2021-11-23"))
new_data <- tibble(id = 1, full_article = "Lorem ipsum dolor")

original_data %>%
  left_join(new_data)
#> Joining, by = "id"
#> # A tibble: 2 x 4
#>      id publisher date       full_article     
#>   <dbl> <chr>     <chr>      <chr>            
#> 1     1 A         2021-11-23 Lorem ipsum dolor
#> 2     2 B         2021-11-23 <NA>

Created on 2021-11-23 by the reprex package (v2.0.1)
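Applied to the scraping case in the question, the same idea might look roughly like the sketch below. It is only an illustration under assumptions: the column name `url`, the example URLs, and the helper `scrape_text()` are hypothetical stand-ins for the real URL column and for the `readUrl` / `textScraper4Fox` steps. The point is to keep each URL next to its scraped text, so that dropping the failed rows no longer breaks the alignment.

```r
library(dplyr)

# Hypothetical stand-in for fox_data; the column name `url` is an assumption.
fox_data <- tibble(
  url = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  publisher = "Fox News"
)

# Hypothetical stand-in for the real scraping step: returns NA for a broken link.
scrape_text <- function(url) {
  if (grepl("/b$", url)) NA_character_ else paste("text of", url)
}

# Scrape every URL and store the result together with its URL (the join key).
scraped <- tibble(
  url = fox_data$url,
  full_article = vapply(fox_data$url, scrape_text, character(1), USE.NAMES = FALSE)
) %>%
  filter(!is.na(full_article))  # dropping failures is harmless once the key is kept

# Rows of fox_data without a match in `scraped` get NA in full_article.
fox_joined <- left_join(fox_data, scraped, by = "url")
fox_joined
```

This assumes each URL appears only once in the scraped table; with duplicated URLs a join would repeat rows. If the broken links are not filtered out at all, so that the scraper simply returns NA for them, the result should already have one entry per original row and could be assigned as a column directly.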