I have a data set which is read into a dataframe as characters. The characters have numerical portions, and some include a "/T". For example:
df <- data.frame(col1 = c("237.7", "191.8", "95.4/T", "48.1"),
col2 = c("234.3", "766.4/T", "167.6/T", "401.6"))
> df
col1 col2
1 237.7 234.3
2 191.8 766.4/T
3 95.4/T 167.6/T
4 48.1 401.6
What command can I use to identify the rows and columns in the dataframe which contain data with "/T"? The following will clearly not work:
which(df == "/T", arr.ind = TRUE)
Also, I want to extract only the numerical portion of the data from the entire dataframe, or remove all "/T". My actual example has far more columns and rows than the example I provide above. I have seen the following examples in Stack Overflow:
Extracting decimal numbers from a string
Extracting numeric portion from a character in the Data Frame
The above links provide solutions to vectors or a column of the data frame. How can I generalize these solutions to go systemically throughout the entire dataframe?
Thanks.
CodePudding user response:
Depending on what your data looks like and how you are using it, there are multiple ways to address this.
If you need to identify which rows/columns contain "/T", you could do something like this (using the dplyr package's mutate
function along with base R's grepl
and gsub
).
library(dplyr)
df %>%
mutate(
col1_check = ifelse(grepl("/T", col1), "problem", "okay"),
col2_check = ifelse(grepl("/T", col2), "problem", "okay"),
col1_clean = gsub("/T", "", col1),
col2_clean = gsub("/T", "", col2)
)
Which results in this output:
col1 col2 col1_check col2_check col1_clean col2_clean
1 237.7 234.3 okay okay 237.7 234.3
2 191.8 766.4/T okay problem 191.8 766.4
3 95.4/T 167.6/T problem problem 95.4 167.6
4 48.1 401.6 okay okay 48.1 401.6
If you just need to remove the "/T" portion, you can use dplyr's mutate_at
function with gsub
(applied to whatever columns you want to fix):
library(dplyr)
df %>%
mutate_at(vars(col1, col2), list(~gsub("/T", "", .)))
And if you need to extract only the numeric elements and the period (say you have other symbols than just "/T"), you can use the stringr package's str_extract
function with some regex:
library(dplyr)
library(stringr)
df %>%
mutate_at(vars(col1, col2), list(~str_extract(., "[0-9.] ")))