How do I separate a character (and number) vector into columns?-CodePudding

I am looking to separate a character and numbers vector into columns. The table is extracted from a pdf that looks like this

I'm looking to create a data frame with the columns unit, year, pct_income, where "unit" is the categories to the left (Government etc).

I have already extracted the text and removed spaces and dots. One problem you might notice is that the number 1 is mistakenly read in as "i" or "I".

How can I proceed from here?

CODE

library(tidyverse)
table_raw <- c("Per cent of Total Income",
               "1938-39 1945 1965",
               "Government                      ................          62     50      6I",
               "Insured          persons            .....   ...........     i6    28    22",
               "Employers                    ................          19      20       15",
               "Other (mainly interest) ......... ....... 3 2 2")


table_raw %>% 
  # replace '.' with one space
  str_replace_all("\\.", " ") %>% 
  
  # replace two spaces or more with one space
  str_replace_all("\\s{2,}", " ")

SOLUTION: Thanks to Peters answer below.

library(tidyverse)

table_raw <- c("Per cent of Total Income",
                     "1938-39 1945 1965",
                     "Government                      ................          62     50      6I",
                     "Insured          persons            .....   ...........     i6    28    22",
                     "Employers                    ................          19      20       15",
                     "Other (mainly interest) ......... ....... 3 2 2")

# --- Cleaning ---#

  # Remove first two rows 
vec <- table_raw [-c(1:2)] %>%
  str_replace_all("\\.", " ") %>% 
  # remove whitespace
  str_squish() %>% 
  # Change 'i' to '1'.
  str_replace_all("[iI](?=\\d)", "1") %>% 
  # Change 'I' to '1'.
str_replace_all("(?<=\\d)[iI]", "1") %>% 
  # Replace space followed by a digit with comma
  str_replace_all("\\s(?=\\d{1})", ",")


final <- data.frame(unit = vec) %>% 
  separate(col = unit, sep = ",", into = c("unit_of_observation", "1938-39", "1945", "1965")) |> 
  pivot_longer(-unit_of_observation, names_to = "year", values_to = "pct_income")

CodePudding user response：

Possibly not the most efficient way this may help:

library(tidyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(stringr)

# initial data wrangling as a vector
vec <- 
  table_raw [-c(1:2)]|> 
  str_replace_all("\\.", " ") |> 
  str_squish() |> 
  str_replace_all("[iI](?=\\d)", "1")|> 
  str_replace_all("(?<=\\d)[iI]", "1") |> 
  str_replace_all("\\s(?=\\d{1})", ",")

#turn into long format as a data frame, I've left the percentages as characters, a simple mutate would resolve this.

data.frame(unit = vec) |> 
  separate(col = unit, sep = ",", into = c("unit", "1938-39", "1945", "1965")) |> 
  pivot_longer(-unit, names_to = "year", values_to = "pct_income")

#> # A tibble: 12 × 3
#>    unit                    year    pct_income
#>    <chr>                   <chr>   <chr>     
#>  1 Government              1938-39 62        
#>  2 Government              1945    50        
#>  3 Government              1965    61        
#>  4 Insured persons         1938-39 16        
#>  5 Insured persons         1945    28        
#>  6 Insured persons         1965    22        
#>  7 Employers               1938-39 19        
#>  8 Employers               1945    20        
#>  9 Employers               1965    15        
#> 10 Other (mainly interest) 1938-39 3         
#> 11 Other (mainly interest) 1945    2         
#> 12 Other (mainly interest) 1965    2

^{Created on 2022-12-27 with reprex v2.0.2}