How can I determine the delimiter being used in an infobox-data table on Wikipedia using R?

Time:12-24

I am trying to scrape the infobox data for an Indonesian film from Wikipedia using R. Several fields in the infobox contain multiple lines of data. For example, the "Pemeran" ("Cast") field for the film "Kutunggu di Sudut Semanggi" (https://id.m.wikipedia.org/wiki/Kutunggu_di_Sudut_Semanggi) looks like this in the HTML:

<tr>
  <th scope="row"  style="white-space:nowrap;padding-right:0.65em;">Pemeran</th>
  <td >
    <a href="/w/index.php?title=Marisa_Tompunu&amp;action=edit&amp;redlink=1"  title="Marisa Tompunu (halaman belum tersedia)">Marisa Tompunu</a><br>
    <a href="/wiki/Berliana_Febrianti"  title="Berliana Febrianti">Berliana Febrianti</a><br>
    <a href="/w/index.php?title=Hanna_Wijaya&amp;action=edit&amp;redlink=1"  title="Hanna Wijaya (halaman belum tersedia)">Hanna Wijaya</a><br>
    <a href="/wiki/Slamet_Rahardjo" title="Slamet Rahardjo">Slamet Rahardjo</a><br>
    <a href="/w/index.php?title=Dwi_Asih_Setiawati&amp;action=edit&amp;redlink=1"  title="Dwi Asih Setiawati (halaman belum tersedia)">Dwi Asih Setiawati</a><br>
    <a href="/wiki/Tengku_Firmansyah" title="Tengku Firmansyah">Tengku Firmansyah</a>
  </td>
</tr>

I have written the following code to extract the data from this field and split it into separate lines:

library(rvest)

# Scrape the Wikipedia page for the film
url <- "https://id.wikipedia.org/wiki/Kutunggu_di_Sudut_Semanggi"
page <- read_html(url)

# Extract the infobox
infobox <- html_nodes(page, "table.infobox")

# Extract the "Pemeran" field from the infobox
anchors <- html_nodes(infobox, "th:contains('Pemeran') + td")

# Extract the names of the cast members from the anchor elements
pemeran <- html_text(html_nodes(anchors, "a"))

# Split the text into separate lines
lines <- strsplit(pemeran, "<br>")[[1]]

# Create a new row for each line
rows <- data.frame(Pemeran = lines)

# Check the rows
print(rows)

However, when I run this code, the resulting data frame, rows, contains only one row:

         Pemeran
1 Marisa Tompunu

I expected the data frame to contain one row for each cast member, like this:

             Pemeran
1     Marisa Tompunu
2 Berliana Febrianti
3       Hanna Wijaya
4    Slamet Rahardjo
5 Dwi Asih Setiawati
6  Tengku Firmansyah

I suspect that the issue may be with the delimiter that I am using to split the text into separate lines. Currently, I am using <br> as the delimiter, but it looks like the infobox-data tables in Wikipedia use a different delimiter.

What delimiter is being used in an infobox-data table on Wikipedia, and how can I split the text into separate lines using that delimiter in R?

Answer:

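library(tidyverse)
library(rvest)

# Each cast member is its own <a> element, so no delimiter is needed:
# select the links in the infobox row and take their text.
tibble(
  Pemeran = "https://id.m.wikipedia.org/wiki/Kutunggu_di_Sudut_Semanggi" %>% 
    read_html() %>%
    # tr:nth-child(5) is the row that holds "Pemeran" on this page;
    # the position may differ on other pages
    html_elements("tr:nth-child(5) a") %>% 
    html_text2()
)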

#> # A tibble: 6 × 1
#>   Pemeran           
#>   <chr>             
#> 1 Marisa Tompunu    
#> 2 Berliana Febrianti
#> 3 Hanna Wijaya      
#> 4 Slamet Rahardjo   
#> 5 Dwi Asih Setiawati
#> 6 Tengku Firmansyah

Created on 2022-12-23 by the reprex package (v2.0.1)
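
Some explanation of why the code in the question returns a single row may help. html_text() returns only the text of each node and drops the markup, so the strings it returns never contain "<br>". It already gives one string per <a> element; strsplit() then returns a list with one untouched element per name, and [[1]] keeps only the first. The sketch below reproduces this on an inline, simplified copy of the row from the question (the hrefs are shortened, which is my own simplification), so it runs without fetching the page. It selects the cell by its label rather than by row position, assuming the label cell contains exactly "Pemeran". It also shows html_text2(), which renders <br> as "\n", as one possible way to split cells whose lines are plain text rather than links.

```r
library(rvest)

# Simplified inline copy of the "Pemeran" row from the question (hrefs
# shortened), so the sketch runs without fetching the live page.
html <- '<table class="infobox"><tr>
  <th scope="row">Pemeran</th>
  <td>
    <a href="#">Marisa Tompunu</a><br>
    <a href="#">Berliana Febrianti</a><br>
    <a href="#">Hanna Wijaya</a><br>
    <a href="#">Slamet Rahardjo</a><br>
    <a href="#">Dwi Asih Setiawati</a><br>
    <a href="#">Tengku Firmansyah</a>
  </td>
</tr></table>'
page <- read_html(html)

# Assumption: the label cell contains exactly "Pemeran". Take the <td>
# that follows it instead of relying on the row's position in the table.
cell <- html_element(
  page,
  xpath = paste0(
    "//table[contains(@class, 'infobox')]",
    "//th[normalize-space() = 'Pemeran']/following-sibling::td[1]"
  )
)

# html_text() yields one string per <a>; the tags are already gone.
pemeran <- html_text(html_elements(cell, "a"))

# What the question's code did: nothing is split, because no string
# contains "<br>", and [[1]] keeps only the first list element.
parts <- strsplit(pemeran, "<br>")
first_only <- parts[[1]]

# Using the character vector directly gives one row per cast member.
rows <- data.frame(Pemeran = pemeran)
print(rows)

# Alternative for cells without links: html_text2() turns <br> into "\n",
# so the cell text can be split on newlines.
by_line <- trimws(strsplit(html_text2(cell), "\n", fixed = TRUE)[[1]])
by_line <- by_line[nzchar(by_line)]
```

So there is no delimiter to discover in the extracted text: either use the vector of link texts as it is, or split the html_text2() output on "\n" when the lines are not wrapped in links.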
