R convert character string to a dataframe-CodePudding

Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.

Each individual is separated by a . There are 10 attributes for each individual.

txt = "EREKSON(Andrew,Peter),male10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.OBAMA(Barack,Hussian),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997.CLINTON(Hillary),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998.GATES(Melinda),female03/03/1980@Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."

I'd like to parse this into a dataframe, with as many observations as there are individuals and 10 columns for each variable.

I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.

This is the final dataframe I have in mind, based on the character string input -

result = data.frame(first_names = c('Andrew Peter','Barack Hussain','Hillary','Melinda'),
                    family_name = c('EREKSON','OBAMA','CLINTON','GATES'),
                    gender = c('male','male','female','female'),
                    birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
                    birth_city = c('Geneva','London','Paris','Berlin'),
                    birth_country = c('Switzerland','England','France','Germany'),
                    acc_type = c('PPF','PPF','PPF','VAT'),
                    acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
                    district = c('dist.093','dist.097','dist.088','dist.078'),
                    code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))

Any help would be much appreciated

CodePudding user response：

Here's a tidy solution with tidyr's functions separate_rows and extract:

library(tidyr)
data.frame(txt) %>%
  # separate `txt` into rows using the dot `.` *if* 
  # preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
  separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
  extract(
          # select column from which to extract:
          txt,
          # define column names into which to extract:
          into = c("family_name","first_names","gender",
                   "birthday","birth_city","birth_country",
                   "acc_type","acc_num","district","code"),
          # describe the string exhaustively using capturing groups
          # `(...)` to delimit what's to be extracted:
          regex = "([A-Z] )\\(([\\w,] )\\),([a-z] )([\\d/] )@(\\w )\\((\\w )\\),([A-Z] ),(\\w ),dist.(\\d ),Dt\\.([\\d/] )")
# A tibble: 4 × 10
  family_name first_names    gender birthday   birth_city birth_country acc_type acc_num  
  <chr>       <chr>          <chr>  <chr>      <chr>      <chr>         <chr>    <chr>    
1 EREKSON     Andrew,Peter   male   10/06/2011 Geneva     Switzerland   PPF      2000X007…
2 OBAMA       Barack,Hussian male   04/12/1956 London     England       PPF      2001X005…
3 CLINTON     Hillary        female 25/06/2013 Paris      France        PPF      2009X005…
4 GATES       Melinda        female 03/03/1980 Berlin     Germany       VAT      2010X006…
# … with 2 more variables: district <chr>, code <chr>

CodePudding user response：

Here is one solution using the tidyverse (stringr and readr):

library(dplyr, warn.conflicts = FALSE) # for pipes

df <- 
  txt %>% 
  
  # Replace "." sep with newline
  stringr::str_replace_all(
    "\\.[A-Z]", 
    function(x) stringr::str_replace(x, "\\.", "\n")
  ) %>% 
  
  # Replace (First, Middle) with (First Middle)
  stringr::str_replace_all(
    "\\([a-zA-Z]*,[a-zA-Z]*\\)", 
    function(x) stringr::str_replace(x, ",", " ")
  ) %>%
  
  # Replace ( with ,
  stringr::str_replace_all("\\(", ",") %>%
  
  # Remove )
  stringr::str_remove_all("\\)") %>%
  
  # Replace @ with ,
  stringr::str_replace_all("@", ",") %>%
  
  # Remove the last "."
  stringr::str_replace_all("\\.$", "\n") %>% 
  
  # Add , after female/male
  stringr::str_replace_all("male", "male,") %>% 
  
  # Read as comma delimited file (works since string contains "\n")
  readr::read_delim(
    file = .,
    delim = ",",
    col_names = FALSE
  )
#> Rows: 4 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Add names
names(df) <- c(
  "family_name",
  "first_names",
  "gender",
  "birthday",
  "birth_city",
  "birth_country",
  "acc_type",
  "acc_num",
  "district",
  "code"
)

df
#> # A tibble: 4 × 10
#>   family_name first_names    gender birthday   birth_city birth_country acc_type
#>   <chr>       <chr>          <chr>  <chr>      <chr>      <chr>         <chr>   
#> 1 EREKSON     Andrew Peter   male   10/06/2011 Geneva     Switzerland   PPF     
#> 2 OBAMA       Barack Hussian male   04/12/1956 London     England       PPF     
#> 3 CLINTON     Hillary        female 25/06/2013 Paris      France        PPF     
#> 4 GATES       Melinda        female 03/03/1980 Berlin     Germany       VAT     
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>

Note that there probably exists more efficient regex'es one could use, but I believe this is simpler and enough to get your started.