Subsetting elements in a list and placing them in a data frame

I have a list ("listanswer") that looks something like this:

> str(listanswer)
List of 100
 $ : chr [1:3] "" "" "\t\t"
 $ : chr [1:5] "" "Dr. Smith" "123 Fake Street" "New York" ...
 $ : chr [1:5] "" "Dr. Jones" "124 Fake Street" "New York" ...


> listanswer
[[1]]
[1] ""   ""   "\t\t"

[[2]]
[1] ""                "Dr. Smith"       "123 Fake Street" "New York"
[5] "ZIPCODE 1"

[[3]]
[1] ""                "Dr. Jones"       "124 Fake Street" "New York"
[5] "ZIPCODE 2"

For each element in this list, I noticed the following pattern within the sub-elements:

# first sub-element is always empty
    > listanswer[[2]][[1]]
    [1] ""
# second sub-element is the name
    > listanswer[[2]][[2]]
    [1] "Dr. Smith"
# third sub-element is always the address 
    > listanswer[[2]][[3]]
    [1] "123 Fake Street"
# fourth sub-element is always the city
    > listanswer[[2]][[4]]
    [1] "New York"
# fifth sub-element is always the ZIP
    > listanswer[[2]][[5]]
    [1] "ZIPCODE 1"

I want to create a data frame that contains the information from this list in row format. For example:

  id      name         address     city       ZIP
1  2 Dr. Smith 123 Fake Street New York ZIPCODE 1
2  3 Dr. Jones 124 Fake Street New York ZIPCODE 2

I thought of the following way to do this:

name = sapply(listanswer,function(x) x[2])
address = sapply(listanswer,function(x) x[3])
city = sapply(listanswer,function(x) x[4])
zip = sapply(listanswer,function(x) x[5])

final_data = data.frame(name, address, city, zip)
id = 1:nrow(final_data)

My question: is this the correct way to reference sub-elements in lists?

CodePudding user response：

If it works, it's the correct way, although there might be a more efficient or more readable way to do the same thing.
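
To illustrate what the indexing in the question does, here is a small sketch on a toy list. The `toy` list is made up for illustration and only mirrors the structure shown in the question. On a character vector, `x[2]` and `x[[2]]` return the same value, and indexing past the end of a short element with `[` yields `NA` rather than an error, which is why a blank record such as the first one ends up as a row of mostly `NA`s.

```r
# Toy list mirroring the structure in the question (values are illustrative)
toy <- list(
  c("", "", "\t\t"),
  c("", "Dr. Smith", "123 Fake Street", "New York", "ZIPCODE 1"),
  c("", "Dr. Jones", "124 Fake Street", "New York", "ZIPCODE 2")
)

# For an atomic character vector, [ and [[ give the same single value
identical(toy[[2]][2], toy[[2]][[2]])

# Indexing past the end of a short vector with [ returns NA, not an error
city <- sapply(toy, function(x) x[4])
city

# vapply does the same job but also checks that each result is one string
name <- vapply(toy, function(x) x[2], character(1))
name
```

`vapply` is a reasonable choice when every result is expected to be a single string, because it fails loudly if that assumption is ever broken.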

Another way to do this is to create an empty data frame with your columns and add rows to it, i.e.:

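#create an empty data frame
df <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df) <- c("name", "address", "city", "zip")

#add rows (a for loop is used so the assignment changes df itself;
#assigning inside an lapply() function would only change a local copy)
for (x in listanswer) {
  df[nrow(df) + 1, ] <- x[2:5]
}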

This is simply another way to solve the same problem. Readability is a personal preference, and there's nothing wrong with your solution either.
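
If the blank records should be dropped and the `id` column should hold each record's position in the list, as in the desired output in the question, one possible sketch is below. It assumes that a valid record has exactly five sub-elements. The `toy` list and the names `keep` and `mat` are illustrative and not from the original data.

```r
# Toy list mirroring the structure in the question (values are illustrative)
toy <- list(
  c("", "", "\t\t"),
  c("", "Dr. Smith", "123 Fake Street", "New York", "ZIPCODE 1"),
  c("", "Dr. Jones", "124 Fake Street", "New York", "ZIPCODE 2")
)

# Keep only the records that have all five sub-elements (an assumption)
keep <- which(lengths(toy) == 5)

# Stack sub-elements 2 to 5 of each kept record into a character matrix
mat <- do.call(rbind, lapply(toy[keep], function(x) x[2:5]))

# Combine the list positions with the extracted fields
final_data <- data.frame(id = keep, mat, stringsAsFactors = FALSE)
colnames(final_data) <- c("id", "name", "address", "city", "zip")
final_data
```

Filtering by length first avoids the `NA` rows that the blank first record would otherwise produce.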

CodePudding user response：

If this is based on your elephant question, for businesses in Vancouver, then this mostly works.

library(rvest)
library(dplyr) #for bind_rows()

url<-"Website/british-columbia/"
page <-read_html(url)

#find the div tags of class=one_third
b <- page %>% html_nodes("div.one_third")

listanswer <- b %>% html_text() %>% strsplit("\\n")
#listanswer2 <- b %>% html_text2() %>% strsplit("\\n")
listanswer[[1]]<-NULL #remove first blank record

rows <- lapply(listanswer, function(element) {
   vect <- element[-1] #remove first blank field
   cityindex <- grep("Vancouver", vect) #find city field
   #add some error checking and corrections
   if (length(cityindex) == 0) {
      cityindex <- length(vect) - 1
   } else if (length(cityindex) > 1) {
      cityindex <- cityindex[2]
   }

   #get the fields of interest
   address <- vect[cityindex - 1]
   city <- vect[cityindex]
   phone <- vect[cityindex + 1]

   if (cityindex < 3) {
      cityindex <- 3
   } #error check
   #first groups combine into 1 name
   name <- toString(vect[1:(cityindex - 2)])
   data.frame(name, address, city, phone)
})

answer<-bind_rows(rows)
#clean up 
answer$phone <- sub("Website", "", answer$phone)
answer

This still needs some cleanup to handle the inconsistencies, but it should be 80-90% complete.
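
The field-locating logic can be tried without scraping anything. The sketch below applies the same idea (find the city field, then read its neighbours) to two made-up records. The `parse_record` helper, its `city_pattern` argument, and the sample values are hypothetical and only stand in for the scraped text. Unlike the code above, it applies the minimum-index guard before reading the fields.

```r
# Hypothetical helper: locate the city field and read the fields around it
parse_record <- function(vect, city_pattern = "Vancouver") {
  cityindex <- grep(city_pattern, vect)
  if (length(cityindex) == 0) {
    cityindex <- length(vect) - 1   # fall back: assume city is second to last
  } else if (length(cityindex) > 1) {
    cityindex <- cityindex[2]       # several matches: take the second one
  }
  cityindex <- max(cityindex, 3)    # guard so name and address indices stay valid
  data.frame(
    name    = toString(vect[1:(cityindex - 2)]),
    address = vect[cityindex - 1],
    city    = vect[cityindex],
    phone   = vect[cityindex + 1],
    stringsAsFactors = FALSE
  )
}

# Made-up records: the second has a two-line name
records <- list(
  c("Dr. Smith", "123 Fake Street", "Vancouver, BC", "604-555-0100"),
  c("Dr. Jones", "Jones Clinic", "124 Fake Street", "Vancouver, BC", "604-555-0101")
)

# One data frame row per record, stacked with base R
answer <- do.call(rbind, lapply(records, parse_record))
answer
```

Anchoring on the city field means records with a varying number of name lines still land in the right columns, since everything before the address is collapsed into `name`.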