Home > Enterprise >  How to split up a dataframe with one column into a dataframe with different columns?
How to split up a dataframe with one column into a dataframe with different columns?

Time:04-30

I have asked a similar question before and tried to use the answers (which were very good) on my project, but I failed. I have the following dataframe:

library(rvest)
library(dplyr)
library(janitor)
library(dplyr)
library(stringr)
library(tidyverse)



link = "https://www.wahlrecht.de/umfragen/forsa/1998.htm"
page = read_html(link)
data1 = page %>% html_nodes("td") %>% html_text()
data1=data.frame(data1)
data1=data1[!apply(data1 == "–", 1, all),]
data1=data.frame(data1)
data1=data1[!apply(data1 == data1[4,1], 1, all),]
data1=data.frame(data1)
data1=data1[-c(1,2),1]
data1=data.frame(data1)

Data1 has now one column with percentage numbers and dates. I want to split up the (one) column of data 1 and create for each date a new column with the following percentage values. I tried it with the following code, but that does not work:

data1 %>% 
  group_by(idx = cumsum(x==data1[nchar(data1[,1])==10,1])) %>% 
  filter(x != data1[nchar(data1[,1])==10,1]) %>% 
  mutate(rn = row_number()) %>% 
  pivot_wider(names_from = idx, values_from = x, names_prefix = data1[nchar(data1[,1])==10,1]) %>% 
  select(-rn)

I would be grateful for any answer / solution. Thanks.

PS: I know that my code can be cut short and might not be perfect, but I can live with that if it works.

CodePudding user response:

You might find that the html_table() function does most of what you're aiming to do at html extraction point:

library(rvest)
library(janitor)
library(tidyverse)

link = "https://www.wahlrecht.de/umfragen/forsa/1998.htm"
page = read_html(link)
data1 = page %>% html_elements(".wilko") %>% html_table()

data1 <- data1[[1]]
data1

#> # A tibble: 59 x 9
#>    ``           ``    `CDU/CSU` SPD    GRÜNE FDP   PDS   Rechte Sonstige
#>    <chr>        <lgl> <chr>     <chr>  <chr> <chr> <chr> <chr>  <chr>   
#>  1 ""           NA    CDU/CSU   SPD    GRÜNE FDP   PDS   Rechte Sonstige
#>  2 "17.12.1998" NA    37 %      40 %   6 %   5 %   5 %   –      7 %     
#>  3 "10.12.1998" NA    39 %      38 %   6 %   5 %   6 %   –      6 %     
#>  4 "03.12.1998" NA    39 %      38 %   7 %   5 %   5 %   –      6 %     
#>  5 "26.11.1998" NA    38 %      38 %   7 %   5 %   5 %   –      7 %     
#>  6 "19.11.1998" NA    37 %      39 %   7 %   6 %   6 %   –      5 %     
#>  7 "12.11.1998" NA    38 %      39 %   6 %   6 %   6 %   –      5 %     
#>  8 "05.11.1998" NA    38 %      39 %   7 %   6 %   5 %   –      5 %     
#>  9 "29.10.1998" NA    39 %      39 %   6 %   6 %   5 %   –      5 %     
#> 10 "Wahl 1998"  NA    35,1 %    40,9 % 6,7 % 6,2 % 5,1 % 3,3 %  2,6 %   
#> # ... with 49 more rows

Created on 2022-04-29 by the reprex package (v2.0.1)

CodePudding user response:

You could use split to divide the column into a list of 7-item vectors, and use janitor's row_to_names to change the dates into column names:

split(data1$data1, ceiling(seq_along(data1$data1)/7)) |>
  as.data.frame() |>
  janitor::row_to_names(row_number = 1)

This would work fine if each date were followed by exactly 6 percentages as you describe, but for example after Wahl 1998 there are 7 percentages, and similarly after 20.08.1998, so this breaks.

The solution might be helpful to someone with a similar problem but more regular data.

  • Related