Home > OS >  Extracting numerical portion of string occurring in the entire data frame
Extracting numerical portion of string occurring in the entire data frame

Time:09-26

I have a data set which is read into a dataframe as characters. The characters have numerical portions, and some include a "/T". For example:

df <- data.frame(col1 = c("237.7", "191.8", "95.4/T", "48.1"), 
                 col2 = c("234.3", "766.4/T", "167.6/T", "401.6"))
> df
    col1    col2
1  237.7   234.3
2  191.8 766.4/T
3 95.4/T 167.6/T
4   48.1   401.6

What command can I use to identify the rows and columns in the dataframe which contain data with "/T"? The following will clearly not work:

which(df == "/T", arr.ind = TRUE)

Also, I want to extract only the numerical portion of the data from the entire dataframe, or remove all "/T". My actual example has far more columns and rows than the example I provide above. I have seen the following examples in Stack Overflow:

Extracting decimal numbers from a string

Extracting numeric portion from a character in the Data Frame

The above links provide solutions to vectors or a column of the data frame. How can I generalize these solutions to go systemically throughout the entire dataframe?

Thanks.

CodePudding user response:

Depending on what your data looks like and how you are using it, there are multiple ways to address this.

If you need to identify which rows/columns contain "/T", you could do something like this (using the dplyr package's mutate function along with base R's grepl and gsub).

library(dplyr)

df %>% 
  mutate(
    col1_check = ifelse(grepl("/T", col1), "problem", "okay"), 
    col2_check = ifelse(grepl("/T", col2), "problem", "okay"), 
    col1_clean = gsub("/T", "", col1), 
    col2_clean = gsub("/T", "", col2)
    )

Which results in this output:

    col1    col2 col1_check col2_check col1_clean col2_clean
1  237.7   234.3       okay       okay      237.7      234.3
2  191.8 766.4/T       okay    problem      191.8      766.4
3 95.4/T 167.6/T    problem    problem       95.4      167.6
4   48.1   401.6       okay       okay       48.1      401.6

If you just need to remove the "/T" portion, you can use dplyr's mutate_at function with gsub (applied to whatever columns you want to fix):

library(dplyr)

df %>% 
  mutate_at(vars(col1, col2), list(~gsub("/T", "", .)))

And if you need to extract only the numeric elements and the period (say you have other symbols than just "/T"), you can use the stringr package's str_extract function with some regex:

library(dplyr)
library(stringr)

df %>% 
  mutate_at(vars(col1, col2), list(~str_extract(., "[0-9.] ")))
  • Related