Home > Back-end >  Splitting text by "|" and populating a table
Splitting text by "|" and populating a table

Time:09-11

I have some data which looks like:

                          cardCharacteristics
1        2 habs.|2 baños|72 m²|Bajos|Ascensor
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción
3                     3 habs.|70 m²|2ª Planta
4       2 habs.|2 baños|160 m²|Terraza|Balcón
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor
6            3 habs.|2 baños|187 m²|4ª Planta

I am trying to split the column into an unspecified number of columns by |. Using the following cSplit_e(., split.col = "cardCharacteristics", sep = "|", type = "character") doesn't get me the result since it splits on all unique values and returns a binary output.

The expected output would be:

tibble(
  "habs" = c(2, 3, 3, 2, 5, 3),
  "baños" = c(2, 2, NA, 2, 2, 2),
  "m^2" = c(72, 110, 70, 160, 176, 187),
  "Floor" = c("Bajos", NA, "2ª Planta", NA, "7ª Planta", "4ª Planta"),
  "Lift" = c("Ascensor", "Ascensor", NA, NA, "Ascensor", NA),
  "Heating" = c(NA, "Calefacción", NA, NA, NA, NA),
  "Terraza" = c(NA, NA, NA, "Terraza", NA, NA),
  "Balcón" = c(NA, NA, NA, "Balcón", NA, NA)
)

or:

   habs baños `m^2` Floor     Lift     Heating     Terraza Balcón
  <dbl> <dbl> <dbl> <chr>     <chr>    <chr>       <chr>   <chr> 
1     2     2    72 Bajos     Ascensor NA          NA      NA    
2     3     2   110 NA        Ascensor Calefacción NA      NA    
3     3    NA    70 2ª Planta NA       NA          NA      NA    
4     2     2   160 NA        NA       NA          Terraza Balcón
5     5     2   176 7ª Planta Ascensor NA          NA      NA    
6     3     2   187 4ª Planta NA       NA          NA      NA

Data:

data = structure(list(cardCharacteristics = c("2 habs.|2 baños|72 m²|Bajos|Ascensor", 
"3 habs.|2 baños|110 m²|Ascensor|Calefacción", "3 habs.|70 m²|2ª Planta", 
"2 habs.|2 baños|160 m²|Terraza|Balcón", "5 habs.|2 baños|176 m²|7ª Planta|Ascensor", 
"3 habs.|2 baños|187 m²|4ª Planta")), row.names = c(NA, 6L
), class = "data.frame")

EDIT:

My progress, is to do the following:

data %>%
mutate(
    habs = str_extract(cardCharacteristics, "(\\d) (?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d) (?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d) (?= m²)"),
    floor = str_extract(cardCharacteristics, "(\\d) (?= 4ª Planta)")
  )

EDIT 2:

The following:

  mutate(
    habs = str_extract(cardCharacteristics, "(\\d) (?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d) (?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d) (?= m²)"),
    Terraza = str_extract(cardCharacteristics, "Terraza"),
    Calefacción = str_extract(cardCharacteristics, "Calefacción"),
    Floor = str_extract(cardCharacteristics, "(\\d) (?=ª Planta)|Bajos"),
  )

Gets me:

                          cardCharacteristics habs baños mts2 Terraza Calefacción Floor
1        2 habs.|2 baños|72 m²|Bajos|Ascensor    2     2   72    <NA>        <NA> Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción    3     2  110    <NA> Calefacción  <NA>
3                     3 habs.|70 m²|2ª Planta    3  <NA>   70    <NA>        <NA>     2
4       2 habs.|2 baños|160 m²|Terraza|Balcón    2     2  160 Terraza        <NA>  <NA>
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor    5     2  176    <NA>        <NA>     7
6            3 habs.|2 baños|187 m²|4ª Planta    3     2  187    <NA>        <NA>     4

Which is almost what I need.

CodePudding user response:

This is probably solvable using the tidyr package. This could be a starting point for you. sep accepts regex, so you could maybe utilize your regex from the str_extract above.

  library(dplyr)
  library(tidyr)
  data %>% separate(cardCharacteristics,
   sep = "\\|",
   into = c(
     "habs", "baños", "m^2", "Floor",
     "Lift"
   )
 )  

Results in

habs   baños       m^2     Floor        Lift
1 2 habs. 2 baños     72 m²     Bajos    Ascensor
2 3 habs. 2 baños    110 m²  Ascensor Calefacción
3 3 habs.   70 m² 2ª Planta      <NA>        <NA>
4 2 habs. 2 baños    160 m²   Terraza      Balcón
5 5 habs. 2 baños    176 m² 7ª Planta    Ascensor
6 3 habs. 2 baños    187 m² 4ª Planta        <NA>

CodePudding user response:

Returning to your dictionary idea, you might want a tag:value approach as might be had with write.dcf. My system is a different encoding that yours that make some things hard to test, or with your encoding lead to undesired results, though will likely work on your system. I make the assumption that 'habs.', 'banos' 'm2' are the basics, further entries are additional amenities, and each 'record' starts with 'habs.' Using your data:

# split1 <- strsplit(...
strsplit(unname(unlist(data)), '|', fixed = TRUE)
[[1]]
[1] "2 habs."  "2 baños"  "72 m²"    "Bajos"    "Ascensor"

[[2]]
[1] "3 habs."     "2 baños"     "110 m²"      "Ascensor"    "Calefacción"

[[3]]
[1] "3 habs."   "70 m²"     "2ª Planta"

[[4]]
[1] "2 habs." "2 baños" "160 m²"  "Terraza" "Balcón" 

[[5]]
[1] "5 habs."   "2 baños"   "176 m²"    "7ª Planta" "Ascensor" 

[[6]]
[1] "3 habs."   "2 baños"   "187 m²"    "4ª Planta"

These are our records and we want to append \n\n to each as record separator for .dcf

# split1 <- lapply(...
lapply(split1, function(x) c(x, '\n\n'))
[[1]]
[1] "2 habs."  "2 baños"  "72 m²"    "Bajos"    "Ascensor" "\n\n"    

[[2]]
[1] "3 habs."     "2 baños"     "110 m²"      "Ascensor"    "Calefacción"
[6] "\n\n"       

[[3]]
[1] "3 habs."   "70 m²"     "2ª Planta" "\n\n"     

[[4]]
[1] "2 habs." "2 baños" "160 m²"  "Terraza" "Balcón"  "\n\n"   

[[5]]
[1] "5 habs."   "2 baños"   "176 m²"    "7ª Planta" "Ascensor"  "\n\n"     

[[6]]
[1] "3 habs."   "2 baños"   "187 m²"    "4ª Planta" "\n\n"

At the moment, the 'tags' are after the 'values', so we have to swap them:

# split1 <- sub(...
sub('(.*) (.*)', '\\2 \\1', unlist(split1))
 [1] "habs. 2"     "baños 2"     "m² 72"       "Bajos"       "Ascensor"   
 [6] "\n\n"        "habs. 3"     "baños 2"     "m² 110"      "Ascensor"   
[11] "Calefacción" "\n\n"        "habs. 3"     "m² 70"       "Planta 2ª"  
[16] "\n\n"        "habs. 2"     "baños 2"     "m² 160"      "Terraza"    
[21] "Balcón"      "\n\n"        "habs. 5"     "baños 2"     "m² 176"     
[26] "Planta 7ª"   "Ascensor"    "\n\n"        "habs. 3"     "baños 2"    
[31] "m² 187"      "Planta 4ª"   "\n\n"

.dcf tag:value so replace ' ' with ':'

# split1 <- gsub(
gsub(' ', ':', split1)
 [1] "habs.:2"     "baños:2"     "m²:72"       "Bajos"       "Ascensor"   
 [6] "\n\n"        "habs.:3"     "baños:2"     "m²:110"      "Ascensor"   
[11] "Calefacción" "\n\n"        "habs.:3"     "m²:70"       "Planta:2ª"  
[16] "\n\n"        "habs.:2"     "baños:2"     "m²:160"      "Terraza"    
[21] "Balcón"      "\n\n"        "habs.:5"     "baños:2"     "m²:176"     
[26] "Planta:7ª"   "Ascensor"    "\n\n"        "habs.:3"     "baños:2"    
[31] "m²:187"      "Planta:4ª"   "\n\n"

We're close at this point, but I'm pretty sure that our un-terminated amenities (Bajos, Ascensor & etc) will be seen as 'malformed', though appears to work with cat

cat(split1)
habs.:2
baños:2
m²:72
Bajos
Ascensor

habs.:3
baños:2
m²:110
Ascensor
Calefacción
# --snip--
# but
c(read.dcf(textConnection(split1)))
Error in read.dcf(textConnection(split1)) : 
  Line starting 'Bajos ...' is malformed! # first non ':' terminated

Append ':', plays poorly with my encoding, but should work for yours, my bad results below

cat(gsub("(\\p{L} )\\b(?![\\p{P}\\p{S}])", '\\1:', split1, perl = TRUE))
habs.:2
 bañ:os:2
 m:²:72
 Bajos:
 Ascensor:
 


 habs.:3
 bañ:os:2
 m:²:110
 Ascensor:
 Calefacción:

Bad things happen to 'banos' and 'm2'. And perhaps it's all just a bad idea... But, changing the enyey in banos and exponent in m2, things start to get better, but the amenities need either a number or double listing (Ascensor Ascensor)

data$cardCharacteristics[6] <- "3 habs.|2 banos|187 m2|4ª Planta"
split3 <- strsplit(unname(unlist(data)), '|', fixed = TRUE)
split3 <- lapply(split3, function(x) c(x, '\n\n'))
split3 <- sub('(.*) (.*)', '\\2 \\1', unlist(split3))
split3 <- gsub(' ', ':', split3)
split3 <- gsub("(\\p{L} )\\b(?![\\p{P}\\p{S}])", '\\1:', split3, perl = TRUE)> split3
 [1] "habs.:2"      "banos:2"      "m2:72"        "Bajos:"       "Ascensor:"   
 [6] "\n\n"         "habs.:3"      "banos:2"      "m2:110"       "Ascensor:"   
[11] "Calefacción:" "\n\n"         "habs.:3"      "m2:70"        "Planta:2ª"   
[16] "\n\n"         "habs.:2"      "banos:2"      "m2:160"       "Terraza:"    
[21] "Balcón:"      "\n\n"         "habs.:5"      "banos:2"      "m2:176"      
[26] "Planta:7ª"    "Ascensor:"    "\n\n"         "habs.:3"      "banos:2"     
[31] "m2:187"       "Planta:4ª"    "\n\n"        
c(read.dcf(textConnection(split3), fields='habs.'))
[1] "2" "3" "3" "2" "5" "3"
c(read.dcf(textConnection(split3), fields='banos'))
[1] "2" "2" NA  "2" "2" "2"
c(read.dcf(textConnection(split3), fields='Ascensor'))
[1] "" "" NA NA "" NA

And a tibble might result, and things are nicely organized.

CodePudding user response:

You were very close indeed, just the way you extracted Floorwas contrary to your aim (the lookahead prevented the extraction of the ª Plantasubstring!):

data %>% 
  mutate(
    habs = str_extract(cardCharacteristics, "(\\d) (?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d) (?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d) (?= m²)"),
    Terraza = str_extract(cardCharacteristics, "Terraza"),
    Calefacción = str_extract(cardCharacteristics, "Calefacción"),
    Floor = str_extract(cardCharacteristics, "Bajos|\\d ª Planta"),  # <--- Corrected here
  )
                          cardCharacteristics habs baños mts2 Terraza Calefacción     Floor
1        2 habs.|2 baños|72 m²|Bajos|Ascensor    2     2   72    <NA>        <NA>     Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción    3     2  110    <NA> Calefacción      <NA>
3                     3 habs.|70 m²|2ª Planta    3  <NA>   70    <NA>        <NA> 2ª Planta
4       2 habs.|2 baños|160 m²|Terraza|Balcón    2     2  160 Terraza        <NA>      <NA>
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor    5     2  176    <NA>        <NA> 7ª Planta
6            3 habs.|2 baños|187 m²|4ª Planta    3     2  187    <NA>        <NA> 4ª Planta
  •  Tags:  
  • r
  • Related