Extract characters from a string of numbers in a column

Time:12-13

Okay, so I'm scraping some old scanned .pdfs that I converted to readable text. The columns aren't lined up very well, so some of the information from column two has ended up in column one. It looks something like this:

df1 <- data.frame(A = c("1253", "3534 n", "9348", "0945", "9457 h", "89745 g"), B = c("uiop", "iud", "eidj", "iodw", "ops", "ios"))
df1
        A    B
1    1253 uiop
2  3534 n  iud
3    9348 eidj
4    0945 iodw
5  9457 h  ops
6 89745 g  ios

And I want it to end up like this:

df2 <- data.frame(A = c("1253", "3534", "9348", "0945", "9457", "89745"), B = c("uiop", "niud", "eidj", "iodw", "hops", "gios"))
df2
      A    B
1  1253 uiop
2  3534 niud
3  9348 eidj
4  0945 iodw
5  9457 hops
6 89745 gios

I'm not sure how to write something (or whether it's even possible) that goes through the column (the actual data has around 2,000 rows) and prepends any letter at the end of column 1 to the letters in column 2.

CodePudding user response:

I'd paste it all into a single column, then extract numbers into one and letters into another:

library(dplyr)
library(stringr)

df1 %>%
  mutate(
    AB = paste0(A, B),
    # "+" matches the whole run; kept as character so "0945" keeps its leading zero
    numbers = str_extract(AB, pattern = "[:digit:]+"),
    letters = str_extract(AB, pattern = "[:alpha:]+")
  )
#         A    B         AB numbers letters
# 1    1253 uiop   1253uiop    1253    uiop
# 2  3534 n  iud  3534 niud    3534    niud
# 3    9348 eidj   9348eidj    9348    eidj
# 4    0945 iodw   0945iodw    0945    iodw
# 5  9457 h  ops  9457 hops    9457    hops
# 6 89745 g  ios 89745 gios   89745    gios
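
If loading dplyr and stringr is not an option, the same paste-then-extract idea can be sketched in base R. This is only a sketch: it assumes every value in A starts with its digits and that whatever follows them belongs in B. The name cleaned is made up for illustration.

```r
df1 <- data.frame(
  A = c("1253", "3534 n", "9348", "0945", "9457 h", "89745 g"),
  B = c("uiop", "iud", "eidj", "iodw", "ops", "ios")
)

# Glue both columns together, as in the answer above
AB <- paste0(df1$A, df1$B)

cleaned <- data.frame(
  # keep only the leading run of digits (stays character, so "0945" is preserved)
  A = sub("^([0-9]+).*$", "\\1", AB),
  # drop the leading digits and any whitespace after them, keep the rest
  B = sub("^[0-9]+[[:space:]]*", "", AB)
)

cleaned
```

Because A is never converted to a number, leading zeros survive, which matches the desired df2 in the question.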

CodePudding user response:

You could write a small sanitize function that splits a on the space and, where a second piece exists, pastes it onto the front of b.

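sanitize_cols <- \(a, b, df) {
  # capture the unquoted column names as strings
  a <- deparse(substitute(a)); b <- deparse(substitute(b))
  # split each value of a on the space
  s <- strsplit(df[[a]], ' ')
  # rows where a stray second piece was found
  l2 <- lengths(s) == 2L
  # prepend that piece to b
  df[[b]][l2] <- paste0(sapply(s[l2], `[`, 2L), df[[b]][l2])
  # keep only the first piece in a
  df[[a]] <- sapply(s, `[`, 1L)
  df
}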

sanitize_cols(A, B, df1)
#       A    B
# 1  1253 uiop
# 2  3534 niud
# 3  9348 eidj
# 4  0945 iodw
# 5  9457 hops
# 6 89745 gios

Data:

df1 <- structure(list(A = c("1253", "3534 n", "9348", "0945", "9457 h", 
"89745 g"), B = c("uiop", "iud", "eidj", "iodw", "ops", "ios"
)), class = "data.frame", row.names = c(NA, -6L))
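
Since the real data comes from scanned pages, it may be worth checking first that column A really has the shape both answers assume: digits, optionally followed by one space and some letters. The sketch below flags rows that do not fit. The pattern and the names pattern and expected are assumptions for illustration, not part of either answer.

```r
df1 <- data.frame(
  A = c("1253", "3534 n", "9348", "0945", "9457 h", "89745 g"),
  B = c("uiop", "iud", "eidj", "iodw", "ops", "ios")
)

# digits, then optionally a single space followed by letters
pattern <- "^[0-9]+( [[:alpha:]]+)?$"

# TRUE where A has the expected shape
expected <- grepl(pattern, df1$A)

# rows that would need a manual look before cleaning (none in this sample)
df1[!expected, ]

# how the pattern treats a few made-up values
grepl(pattern, c("3534 n", "35 34", "12a"))
# -> TRUE FALSE FALSE
```

A value such as "35 34 n" would split into three pieces, so sanitize_cols would keep "35" and silently drop the rest; a check like this catches such rows beforehand.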