I have asked a similar question before and tried to use the answers (which were very good) on my project, but I failed. I have the following dataframe:
library(rvest)
library(dplyr)
library(janitor)
library(dplyr)
library(stringr)
library(tidyverse)
link = "https://www.wahlrecht.de/umfragen/forsa/1998.htm"
page = read_html(link)
data1 = page %>% html_nodes("td") %>% html_text()
data1=data.frame(data1)
data1=data1[!apply(data1 == "–", 1, all),]
data1=data.frame(data1)
data1=data1[!apply(data1 == data1[4,1], 1, all),]
data1=data.frame(data1)
data1=data1[-c(1,2),1]
data1=data.frame(data1)
Data1 has now one column with percentage numbers and dates. I want to split up the (one) column of data 1 and create for each date a new column with the following percentage values. I tried it with the following code, but that does not work:
data1 %>%
group_by(idx = cumsum(x==data1[nchar(data1[,1])==10,1])) %>%
filter(x != data1[nchar(data1[,1])==10,1]) %>%
mutate(rn = row_number()) %>%
pivot_wider(names_from = idx, values_from = x, names_prefix = data1[nchar(data1[,1])==10,1]) %>%
select(-rn)
I would be grateful for any answer / solution. Thanks.
PS: I know that my code can be cut short and might not be perfect, but I can live with that if it works.
CodePudding user response:
You might find that the html_table()
function does most of what you're aiming to do at html extraction point:
library(rvest)
library(janitor)
library(tidyverse)
link = "https://www.wahlrecht.de/umfragen/forsa/1998.htm"
page = read_html(link)
data1 = page %>% html_elements(".wilko") %>% html_table()
data1 <- data1[[1]]
data1
#> # A tibble: 59 x 9
#> `` `` `CDU/CSU` SPD GRÜNE FDP PDS Rechte Sonstige
#> <chr> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "" NA CDU/CSU SPD GRÜNE FDP PDS Rechte Sonstige
#> 2 "17.12.1998" NA 37 % 40 % 6 % 5 % 5 % – 7 %
#> 3 "10.12.1998" NA 39 % 38 % 6 % 5 % 6 % – 6 %
#> 4 "03.12.1998" NA 39 % 38 % 7 % 5 % 5 % – 6 %
#> 5 "26.11.1998" NA 38 % 38 % 7 % 5 % 5 % – 7 %
#> 6 "19.11.1998" NA 37 % 39 % 7 % 6 % 6 % – 5 %
#> 7 "12.11.1998" NA 38 % 39 % 6 % 6 % 6 % – 5 %
#> 8 "05.11.1998" NA 38 % 39 % 7 % 6 % 5 % – 5 %
#> 9 "29.10.1998" NA 39 % 39 % 6 % 6 % 5 % – 5 %
#> 10 "Wahl 1998" NA 35,1 % 40,9 % 6,7 % 6,2 % 5,1 % 3,3 % 2,6 %
#> # ... with 49 more rows
Created on 2022-04-29 by the reprex package (v2.0.1)
CodePudding user response:
You could use split
to divide the column into a list of 7-item vectors, and use janitor
's row_to_names
to change the dates into column names:
split(data1$data1, ceiling(seq_along(data1$data1)/7)) |>
as.data.frame() |>
janitor::row_to_names(row_number = 1)
This would work fine if each date were followed by exactly 6 percentages as you describe, but for example after Wahl 1998
there are 7 percentages, and similarly after 20.08.1998
, so this breaks.
The solution might be helpful to someone with a similar problem but more regular data.