So, I have a column in a dataframe that contains longitude and latitude information, but it is dirty. It looks like this:
"STRMEV=(-117.54352 32.12345)"
I can extract it easily with something like this:
df$Latitude <- strsplit(qdapRegex::ex_between(df$DirtyGeoInfo[1], "(", ")")[[1]][1], " ")[[1]][1]
And the output is:
"-117.54352".
Which is the latitude contained within the string in the first row.
Now, I want to do this over an entire dataframe, so I just put it in a for loop, with the [i] replacing the [1] after "df$DirtyGeoInfo" and an [i] after "df$Latitude". I know this is inefficient, but when I try to do it outside of a for loop, it just takes the first row and uses that latitude for the entire dataframe, however the output needs to be a function of a variable in the same row. It works fine in a for loop, it just takes a massive amount of time.
Any suggestions?
CodePudding user response:
We could extract
and create the columns by capturing the digits along with any .
or prefix -
as groups ((...)
)
library(tidyr)
library(dplyr)
df <- df %>%
extract(DirtyGeoInfo, into = c("Latitude", "Longitude"),
".*=\\(([0-9.-] )\\s ([0-9.-] )\\)", remove = FALSE, convert = TRUE)
-output
df
DirtyGeoInfo Latitude Longitude
1 STRMEV=(-117.54352 32.12345) -117.5435 32.12345
2 STRMEV=(-117.54352 32.12345) -117.5435 32.12345
In the OP's code, it is using only the first element from the column. Instead, we can read with read.table
after extracting the characters within the ()
df <- cbind(df, read.table(text = unlist(qdapRegex::ex_between(df$DirtyGeoInfo,
"(", ")")), header = FALSE, col.names = c("Latitude", "Longitude")))
-output
df
DirtyGeoInfo Latitude Longitude
1 STRMEV=(-117.54352 32.12345) -117.5435 32.12345
2 STRMEV=(-117.54352 32.12345) -117.5435 32.12345
data
df <- structure(list(DirtyGeoInfo = c("STRMEV=(-117.54352 32.12345)",
"STRMEV=(-117.54352 32.12345)")), class = "data.frame",
row.names = c(NA,
-2L))
CodePudding user response:
Here is an alternative: Data from akrun many thanks. In case your data is structured as provided, then we could:
library(readr)
library(dplyr)
library(tidyr)
df %>%
separate(DirtyGeoInfo, into = c( "Longitude", "Latitude"), sep = " ") %>%
mutate(across(everything(), parse_number))
Longitude Latitude
1 -117.5435 32.12345
2 -117.5435 32.12345