Home > Enterprise >  Split string column on non-numeric characters in R
Split string column on non-numeric characters in R

Time:10-04

I have to convert coordinates from DMS (degrees, minutes and seconds) to DegDec (real numbers). However, the DMS separators are messed up, so I would like to split the columns on non-numeric characters excluding ..

My data looks as follows:

lat long
22ª29'56.06" 105º21'37.27"
22°29`53.14" 105°21'29.48"
22º30'00.43" 105ª21'37.46''
105'29'27.17" 105°21'39.68"

The only pattern I can spot in the separators is that they are non-numeric and different from .. I think I could solve this relatively easily if I could split these columns on non-numeric or period characters.

Once I have all three elements, I can concatenate them together to form a predictable pattern with the correct separators.

Here's a replicable sample from the data:

df <- tibble(lat=c("22ª29'56.06\"","22°29`53.14\"","22º30'00.43\"","105'29'27.17\""),
             long=c("105º21'37.27\"","105°21'29.48\"","105°21'37.46\''","105°21'39.68"))

CodePudding user response:

You want to do something like this?

library(dplyr)
library(tidyr)

df %>% 
  separate(lat,into = paste0("lat",1:4),sep = "[^0-9]",remove = FALSE) %>% 
  separate(long,into = paste0("long",1:4),sep = "[^0-9]",remove = FALSE)


# A tibble: 4 x 10
  lat              lat1  lat2  lat3  lat4  long             long1 long2 long3 long4
  <chr>            <chr> <chr> <chr> <chr> <chr>            <chr> <chr> <chr> <chr>
1 "22ª29'56.06\""  22    29    56    06    "105º21'37.27\"" 105   21    37    27   
2 "22°29`53.14\""  22    29    53    14    "105°21'29.48\"" 105   21    29    48   
3 "22º30'00.43\""  22    30    00    43    "105°21'37.46''" 105   21    37    46   
4 "105'29'27.17\"" 105   29    27    17    "105°21'39.68"   105   21    39    68 

CodePudding user response:

Here is a way with cSplit from splitstackshape.

splitstackshape::cSplit(df, c('lat', 'long'), sep = '[^0-9]', fixed = FALSE)

#   lat_1 lat_2 lat_3 lat_4 lat_5 long_1 long_2 long_3 long_4 long_5
#1:    22    29    56     6    NA    105     21     37     27     NA
#2:    22    29    53    14    NA    105     21     29     48     NA
#3:    22    30     0    43    NA    105     21     37     46     NA
#4:   105    29    27    17    NA    105     21     39     68     NA

Since the string ends with non-numeric data it splits and creates a 5th column with NA. You can remove them with Remove columns from dataframe where ALL values are NA

  • Related