Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.
Each individual is separated by a .
There are 10 attributes for each individual.
txt = "EREKSON(Andrew,Peter),male10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.OBAMA(Barack,Hussian),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997.CLINTON(Hillary),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998.GATES(Melinda),female03/03/1980@Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."
I'd like to parse this into a dataframe, with as many observations as there are individuals and 10 columns for each variable.
I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.
This is the final dataframe I have in mind, based on the character string input -
result = data.frame(first_names = c('Andrew Peter','Barack Hussain','Hillary','Melinda'),
family_name = c('EREKSON','OBAMA','CLINTON','GATES'),
gender = c('male','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))
Any help would be much appreciated
CodePudding user response:
Here's a tidy solution with tidyr
's functions separate_rows
and extract
:
library(tidyr)
data.frame(txt) %>%
# separate `txt` into rows using the dot `.` *if*
# preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
extract(
# select column from which to extract:
txt,
# define column names into which to extract:
into = c("family_name","first_names","gender",
"birthday","birth_city","birth_country",
"acc_type","acc_num","district","code"),
# describe the string exhaustively using capturing groups
# `(...)` to delimit what's to be extracted:
regex = "([A-Z] )\\(([\\w,] )\\),([a-z] )([\\d/] )@(\\w )\\((\\w )\\),([A-Z] ),(\\w ),dist.(\\d ),Dt\\.([\\d/] )")
# A tibble: 4 × 10
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>
CodePudding user response:
Here is one solution using the tidyverse (stringr and readr):
library(dplyr, warn.conflicts = FALSE) # for pipes
df <-
txt %>%
# Replace "." sep with newline
stringr::str_replace_all(
"\\.[A-Z]",
function(x) stringr::str_replace(x, "\\.", "\n")
) %>%
# Replace (First, Middle) with (First Middle)
stringr::str_replace_all(
"\\([a-zA-Z]*,[a-zA-Z]*\\)",
function(x) stringr::str_replace(x, ",", " ")
) %>%
# Replace ( with ,
stringr::str_replace_all("\\(", ",") %>%
# Remove )
stringr::str_remove_all("\\)") %>%
# Replace @ with ,
stringr::str_replace_all("@", ",") %>%
# Remove the last "."
stringr::str_replace_all("\\.$", "\n") %>%
# Add , after female/male
stringr::str_replace_all("male", "male,") %>%
# Read as comma delimited file (works since string contains "\n")
readr::read_delim(
file = .,
delim = ",",
col_names = FALSE
)
#> Rows: 4 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add names
names(df) <- c(
"family_name",
"first_names",
"gender",
"birthday",
"birth_city",
"birth_country",
"acc_type",
"acc_num",
"district",
"code"
)
df
#> # A tibble: 4 × 10
#> family_name first_names gender birthday birth_city birth_country acc_type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 EREKSON Andrew Peter male 10/06/2011 Geneva Switzerland PPF
#> 2 OBAMA Barack Hussian male 04/12/1956 London England PPF
#> 3 CLINTON Hillary female 25/06/2013 Paris France PPF
#> 4 GATES Melinda female 03/03/1980 Berlin Germany VAT
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>
Note that there probably exists more efficient regex'es one could use, but I believe this is simpler and enough to get your started.