Home > Enterprise >  Extract string and put into a column in the same dataframe efficiently - R
Extract string and put into a column in the same dataframe efficiently - R

Time:04-12

So, I have a column in a dataframe that contains longitude and latitude information, but it is dirty. It looks like this:

"STRMEV=(-117.54352 32.12345)" 

I can extract it easily with something like this:

df$Latitude <- strsplit(qdapRegex::ex_between(df$DirtyGeoInfo[1], "(", ")")[[1]][1], " ")[[1]][1] 

And the output is:

"-117.54352". 

Which is the latitude contained within the string in the first row.

Now, I want to do this over an entire dataframe, so I just put it in a for loop, with the [i] replacing the [1] after "df$DirtyGeoInfo" and an [i] after "df$Latitude". I know this is inefficient, but when I try to do it outside of a for loop, it just takes the first row and uses that latitude for the entire dataframe, however the output needs to be a function of a variable in the same row. It works fine in a for loop, it just takes a massive amount of time.

Any suggestions?

CodePudding user response:

We could extract and create the columns by capturing the digits along with any . or prefix - as groups ((...))

library(tidyr)
library(dplyr)
df <- df %>% 
    extract(DirtyGeoInfo, into = c("Latitude", "Longitude"), 
    ".*=\\(([0-9.-] )\\s ([0-9.-] )\\)", remove = FALSE, convert = TRUE)

-output

df
                  DirtyGeoInfo  Latitude Longitude
1 STRMEV=(-117.54352 32.12345) -117.5435  32.12345
2 STRMEV=(-117.54352 32.12345) -117.5435  32.12345

In the OP's code, it is using only the first element from the column. Instead, we can read with read.table after extracting the characters within the ()

df <- cbind(df, read.table(text = unlist(qdapRegex::ex_between(df$DirtyGeoInfo, 
   "(", ")")), header = FALSE, col.names = c("Latitude", "Longitude")))

-output

df
                  DirtyGeoInfo  Latitude Longitude
1 STRMEV=(-117.54352 32.12345) -117.5435  32.12345
2 STRMEV=(-117.54352 32.12345) -117.5435  32.12345

data

df <- structure(list(DirtyGeoInfo = c("STRMEV=(-117.54352 32.12345)", 
"STRMEV=(-117.54352 32.12345)")), class = "data.frame",
 row.names = c(NA, 
-2L))

CodePudding user response:

Here is an alternative: Data from akrun many thanks. In case your data is structured as provided, then we could:

library(readr)
library(dplyr)
library(tidyr)

df %>% 
  separate(DirtyGeoInfo, into = c( "Longitude", "Latitude"), sep = " ") %>% 
  mutate(across(everything(), parse_number))
  Longitude Latitude
1 -117.5435 32.12345
2 -117.5435 32.12345
  • Related