Extracting numerical portion of string occurring in the entire data frame-CodePudding

I have a data set which is read into a dataframe as characters. The characters have numerical portions, and some include a "/T". For example:

df <- data.frame(col1 = c("237.7", "191.8", "95.4/T", "48.1"), 
                 col2 = c("234.3", "766.4/T", "167.6/T", "401.6"))
> df
    col1    col2
1  237.7   234.3
2  191.8 766.4/T
3 95.4/T 167.6/T
4   48.1   401.6

What command can I use to identify the rows and columns in the dataframe which contain data with "/T"? The following will clearly not work:

which(df == "/T", arr.ind = TRUE)

Also, I want to extract only the numerical portion of the data from the entire dataframe, or remove all "/T". My actual example has far more columns and rows than the example I provide above. I have seen the following examples in Stack Overflow:

Extracting decimal numbers from a string

Extracting numeric portion from a character in the Data Frame

The above links provide solutions to vectors or a column of the data frame. How can I generalize these solutions to go systemically throughout the entire dataframe?

Thanks.

CodePudding user response：

Depending on what your data looks like and how you are using it, there are multiple ways to address this.

If you need to identify which rows/columns contain "/T", you could do something like this (using the dplyr package's mutate function along with base R's grepl and gsub).

library(dplyr)

df %>% 
  mutate(
    col1_check = ifelse(grepl("/T", col1), "problem", "okay"), 
    col2_check = ifelse(grepl("/T", col2), "problem", "okay"), 
    col1_clean = gsub("/T", "", col1), 
    col2_clean = gsub("/T", "", col2)
    )

Which results in this output:

    col1    col2 col1_check col2_check col1_clean col2_clean
1  237.7   234.3       okay       okay      237.7      234.3
2  191.8 766.4/T       okay    problem      191.8      766.4
3 95.4/T 167.6/T    problem    problem       95.4      167.6
4   48.1   401.6       okay       okay       48.1      401.6

If you just need to remove the "/T" portion, you can use dplyr's mutate_at function with gsub (applied to whatever columns you want to fix):

library(dplyr)

df %>% 
  mutate_at(vars(col1, col2), list(~gsub("/T", "", .)))

And if you need to extract only the numeric elements and the period (say you have other symbols than just "/T"), you can use the stringr package's str_extract function with some regex:

library(dplyr)
library(stringr)

df %>% 
  mutate_at(vars(col1, col2), list(~str_extract(., "[0-9.] ")))