I'm scraping old scanned PDFs that I converted to readable text. The columns aren't lined up very well, so some of the information from column two ends up in column one. It looks something like this:
df1 <- data.frame(A = c("1253", "3534 n", "9348", "0945", "9457 h", "89745 g"), B = c("uiop", "iud", "eidj", "iodw", "ops", "ios"))
df1
        A    B
1    1253 uiop
2  3534 n  iud
3    9348 eidj
4    0945 iodw
5  9457 h  ops
6 89745 g  ios
And I want it to end up like this:
df2 <- data.frame(A = c("1253", "3534", "9348", "0945", "9457", "89745"), B = c("uiop", "niud", "eidj", "iodw", "hops", "gios"))
df2
      A    B
1  1253 uiop
2  3534 niud
3  9348 eidj
4  0945 iodw
5  9457 hops
6 89745 gios
I'm not sure how (or whether it's possible) to write something that goes through the column (the actual data has around 2,000 rows) and prepends any letter at the end of column 1 to the letters in column 2.
CodePudding user response:
I'd paste it all into a single column, then extract numbers into one and letters into another:
library(dplyr)
library(stringr)
df1 %>%
  mutate(
    AB = paste0(A, B),
    # "+" matches the whole run of digits / letters, not just one character
    numbers = as.numeric(str_extract(AB, pattern = "[:digit:]+")),
    letters = str_extract(AB, pattern = "[:alpha:]+")
  )
#         A    B         AB numbers letters
# 1    1253 uiop   1253uiop    1253    uiop
# 2  3534 n  iud  3534 niud    3534    niud
# 3    9348 eidj   9348eidj    9348    eidj
# 4    0945 iodw   0945iodw     945    iodw
# 5  9457 h  ops  9457 hops    9457    hops
# 6 89745 g  ios 89745 gios   89745    gios
# Note: as.numeric() drops the leading zero of "0945"; leave it out to keep the values as character.
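If the leading zero in "0945" matters, one possible variant keeps both columns as character and skips the combined column. This is only a sketch in base R; it assumes column A always starts with digits and that any stray text is a run of ASCII letters at the end, optionally preceded by whitespace. The names stray and fixed are made up for illustration.

```r
df1 <- data.frame(
  A = c("1253", "3534 n", "9348", "0945", "9457 h", "89745 g"),
  B = c("uiop", "iud", "eidj", "iodw", "ops", "ios")
)

# What is left of A after removing the leading digits and any whitespace
# (an empty string when A holds digits only)
stray <- sub("^[0-9]+\\s*", "", df1$A)

fixed <- data.frame(
  A = sub("\\s*[A-Za-z]+$", "", df1$A),  # drop the trailing letters from A
  B = paste0(stray, df1$B)               # and put them in front of B
)
fixed
```

Because nothing is converted with as.numeric(), "0945" stays "0945".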
CodePudding user response:
You could write a small sanitize function that strsplit()s column a on the space and, where the split yields a second piece, paste0()s that piece onto the front of b.
sanitize_cols <- \(a, b, df) {
  # Turn the unquoted column names into strings
  a <- deparse(substitute(a)); b <- deparse(substitute(b))
  s <- strsplit(df[[a]], ' ')
  # Rows where a held two pieces, e.g. "3534 n"
  l2 <- lengths(s) == 2L
  # Prepend the second piece to b, then keep only the first piece in a
  df[[b]][l2] <- paste0(sapply(s[l2], `[`, 2L), df[[b]][l2])
  df[[a]] <- sapply(s, `[`, 1L)
  df
}
sanitize_cols(A, B, df1)
#       A    B
# 1  1253 uiop
# 2  3534 niud
# 3  9348 eidj
# 4  0945 iodw
# 5  9457 hops
# 6 89745 gios
Data:
df1 <- structure(list(A = c("1253", "3534 n", "9348", "0945", "9457 h",
"89745 g"), B = c("uiop", "iud", "eidj", "iodw", "ops", "ios"
)), class = "data.frame", row.names = c(NA, -6L))
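With around 2,000 rows it may be worth checking afterwards whether any row still looks misaligned. A minimal sketch, assuming the cleaned column A should contain only digits and column B only ASCII letters (adjust the patterns if the real data differs; cleaned and bad_rows are illustrative names):

```r
cleaned <- data.frame(
  A = c("1253", "3534", "9348", "0945", "9457", "89745"),
  B = c("uiop", "niud", "eidj", "iodw", "hops", "gios")
)

# Indices of rows where A is not purely digits or B is not purely letters
bad_rows <- which(!grepl("^[0-9]+$", cleaned$A) | !grepl("^[A-Za-z]+$", cleaned$B))
bad_rows
```

An empty result means every row matches the expected shape; any indices returned point at rows to inspect by hand.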