I am looking to separate a character and numbers vector into columns. The table is extracted from a pdf that looks like this
I'm looking to create a data frame with the columns unit, year, pct_income, where "unit" is the categories to the left (Government etc).
I have already extracted the text and removed spaces and dots. One problem you might notice is that the number 1 is mistakenly read in as "i" or "I".
How can I proceed from here?
CODE
library(tidyverse)
table_raw <- c("Per cent of Total Income",
"1938-39 1945 1965",
"Government ................ 62 50 6I",
"Insured persons ..... ........... i6 28 22",
"Employers ................ 19 20 15",
"Other (mainly interest) ......... ....... 3 2 2")
table_raw %>%
# replace '.' with one space
str_replace_all("\\.", " ") %>%
# replace two spaces or more with one space
str_replace_all("\\s{2,}", " ")
SOLUTION: Thanks to Peters answer below.
library(tidyverse)
table_raw <- c("Per cent of Total Income",
"1938-39 1945 1965",
"Government ................ 62 50 6I",
"Insured persons ..... ........... i6 28 22",
"Employers ................ 19 20 15",
"Other (mainly interest) ......... ....... 3 2 2")
# --- Cleaning ---#
# Remove first two rows
vec <- table_raw [-c(1:2)] %>%
str_replace_all("\\.", " ") %>%
# remove whitespace
str_squish() %>%
# Change 'i' to '1'.
str_replace_all("[iI](?=\\d)", "1") %>%
# Change 'I' to '1'.
str_replace_all("(?<=\\d)[iI]", "1") %>%
# Replace space followed by a digit with comma
str_replace_all("\\s(?=\\d{1})", ",")
final <- data.frame(unit = vec) %>%
separate(col = unit, sep = ",", into = c("unit_of_observation", "1938-39", "1945", "1965")) |>
pivot_longer(-unit_of_observation, names_to = "year", values_to = "pct_income")
CodePudding user response:
Possibly not the most efficient way this may help:
library(tidyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(stringr)
# initial data wrangling as a vector
vec <-
table_raw [-c(1:2)]|>
str_replace_all("\\.", " ") |>
str_squish() |>
str_replace_all("[iI](?=\\d)", "1")|>
str_replace_all("(?<=\\d)[iI]", "1") |>
str_replace_all("\\s(?=\\d{1})", ",")
#turn into long format as a data frame, I've left the percentages as characters, a simple mutate would resolve this.
data.frame(unit = vec) |>
separate(col = unit, sep = ",", into = c("unit", "1938-39", "1945", "1965")) |>
pivot_longer(-unit, names_to = "year", values_to = "pct_income")
#> # A tibble: 12 × 3
#> unit year pct_income
#> <chr> <chr> <chr>
#> 1 Government 1938-39 62
#> 2 Government 1945 50
#> 3 Government 1965 61
#> 4 Insured persons 1938-39 16
#> 5 Insured persons 1945 28
#> 6 Insured persons 1965 22
#> 7 Employers 1938-39 19
#> 8 Employers 1945 20
#> 9 Employers 1965 15
#> 10 Other (mainly interest) 1938-39 3
#> 11 Other (mainly interest) 1945 2
#> 12 Other (mainly interest) 1965 2
Created on 2022-12-27 with reprex v2.0.2