Return the same number of elements from HTML using rvest

I am trying to scrape the city name and address of every Apple store in the UK using rvest.

library(rvest)
library(xml2)
library(tidyverse)

my_url <- read_html("https://www.apple.com/uk/retail/storelist/")

# extract city name 
city_name <- my_url %>% html_elements("h2") %>% html_text2()
length(city_name)
# 27 cities

address <- my_url %>% html_elements("address") %>% html_text2()
length(address)
# 38 addresses

I am getting more addresses than city names because some cities have multiple stores. How do I get the same number of city names and addresses so that I can put them in a data frame?

CodePudding user response：

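On this page each city is wrapped in its own div with class "state", containing one h2 and one or more address elements. If you extract per container instead of per page, data.frame() repeats the city name for every address in that group: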

library(rvest)
library(xml2)
library(tidyverse)

read_html("https://www.apple.com/uk/retail/storelist/") %>% 
  # one node per city container
  html_elements(xpath = "//div[@class='state']") %>%
  lapply(function(x) {
    # length-1 city is recycled to match the number of addresses
    data.frame(city = html_element(x, "h2") %>% html_text2(), 
               address = html_elements(x, "address") %>% html_text2())}) %>%
  # stack the per-city data frames
  do.call(rbind, .) %>%
  as_tibble()
#> # A tibble: 38 x 2
#>    city            address                                                      
#>    <chr>           <chr>                                                        
#>  1 Aberdeen        "27/28 Ground Level Mall\nUnion Square\nAberdeen , AB11 ~
#>  2 Antrim          "Upper Ground Floor\n1 Victoria Square\nBelfast , BT1 4Q~
#>  3 Berkshire       "The Oracle Shopping Centre\nUpper Level\nReading , RG1 ~
#>  4 Bristol         "11 Philadelphia Street\nQuakers Friars\nBristol , BS1 3~
#>  5 Bristol         "Upper Mall\nThe Mall at Cribbs Causeway\nBristol , BS34~
#>  6 Buckinghamshire "26 Midsummer Place\nMidsummer Boulevard\nMilton Keynes ~
#>  7 Cambridgeshire  "Grand Arcade Shopping Centre\nCambridge , CB2 3AX\n0122~
#>  8 Cardiff         "63-66 Grand Arcade\nSt David’s Dewi Sant\nCardiff , CF1~
#>  9 Central London  "No. 1-7 The Piazza\nLondon , WC2E 8HB\n020 7447 1400"    
#> 10 Central London  "235 Regent Street\nLondon , W1B 2EL\n020 7153 9000"      
#> # ... with 28 more rows

Created on 2022-04-12 by the reprex package (v2.0.1)
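
As a possible alternative, you could start from the addresses and look up each one's city with a relative XPath, which avoids the lapply/rbind step. The sketch below is only an illustration: it runs on a small made-up HTML snippet built with minimal_html() rather than the live page, and it assumes the same structure the answer relies on (a div with class "state" holding one h2 and its address elements). The store names and addresses in it are invented.

```r
library(rvest)

# Made-up stand-in for the store list page (assumed structure:
# one div.state per city, with one h2 and one or more <address> nodes)
page <- minimal_html('
  <div class="state">
    <h2>Aberdeen</h2>
    <address>1 Example Street</address>
  </div>
  <div class="state">
    <h2>Bristol</h2>
    <address>2 Example Road</address>
    <address>3 Example Lane</address>
  </div>
')

# All addresses on the page, in document order
addresses <- html_elements(page, "address")

# For each address, find the h2 of its enclosing div.state.
# html_element() returns exactly one result per input node,
# so `cities` always has the same length as `addresses`.
cities <- html_element(addresses, xpath = "ancestor::div[@class='state']/h2")

stores <- data.frame(
  city    = html_text2(cities),
  address = html_text2(addresses)
)

stores
```

Because html_element() yields a missing node (and therefore NA text) when nothing matches, an address without an enclosing city container would show up as a row with an NA city instead of breaking the alignment.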