I have some data which looks like:
cardCharacteristics
1 2 habs.|2 baños|72 m²|Bajos|Ascensor
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción
3 3 habs.|70 m²|2ª Planta
4 2 habs.|2 baños|160 m²|Terraza|Balcón
5 5 habs.|2 baños|176 m²|7ª Planta|Ascensor
6 3 habs.|2 baños|187 m²|4ª Planta
I am trying to split the column into an unspecified number of columns by |
. Using the following cSplit_e(., split.col = "cardCharacteristics", sep = "|", type = "character")
doesn't get me the result since it splits on all unique values and returns a binary output.
The expected output would be:
tibble(
"habs" = c(2, 3, 3, 2, 5, 3),
"baños" = c(2, 2, NA, 2, 2, 2),
"m^2" = c(72, 110, 70, 160, 176, 187),
"Floor" = c("Bajos", NA, "2ª Planta", NA, "7ª Planta", "4ª Planta"),
"Lift" = c("Ascensor", "Ascensor", NA, NA, "Ascensor", NA),
"Heating" = c(NA, "Calefacción", NA, NA, NA, NA),
"Terraza" = c(NA, NA, NA, "Terraza", NA, NA),
"Balcón" = c(NA, NA, NA, "Balcón", NA, NA)
)
or:
habs baños `m^2` Floor Lift Heating Terraza Balcón
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2 2 72 Bajos Ascensor NA NA NA
2 3 2 110 NA Ascensor Calefacción NA NA
3 3 NA 70 2ª Planta NA NA NA NA
4 2 2 160 NA NA NA Terraza Balcón
5 5 2 176 7ª Planta Ascensor NA NA NA
6 3 2 187 4ª Planta NA NA NA NA
Data:
data = structure(list(cardCharacteristics = c("2 habs.|2 baños|72 m²|Bajos|Ascensor",
"3 habs.|2 baños|110 m²|Ascensor|Calefacción", "3 habs.|70 m²|2ª Planta",
"2 habs.|2 baños|160 m²|Terraza|Balcón", "5 habs.|2 baños|176 m²|7ª Planta|Ascensor",
"3 habs.|2 baños|187 m²|4ª Planta")), row.names = c(NA, 6L
), class = "data.frame")
EDIT:
My progress, is to do the following:
data %>%
mutate(
habs = str_extract(cardCharacteristics, "(\\d) (?= habs.)"),
baños = str_extract(cardCharacteristics, "(\\d) (?= baños)"),
mts2 = str_extract(cardCharacteristics, "(\\d) (?= m²)"),
floor = str_extract(cardCharacteristics, "(\\d) (?= 4ª Planta)")
)
EDIT 2:
The following:
mutate(
habs = str_extract(cardCharacteristics, "(\\d) (?= habs.)"),
baños = str_extract(cardCharacteristics, "(\\d) (?= baños)"),
mts2 = str_extract(cardCharacteristics, "(\\d) (?= m²)"),
Terraza = str_extract(cardCharacteristics, "Terraza"),
Calefacción = str_extract(cardCharacteristics, "Calefacción"),
Floor = str_extract(cardCharacteristics, "(\\d) (?=ª Planta)|Bajos"),
)
Gets me:
cardCharacteristics habs baños mts2 Terraza Calefacción Floor
1 2 habs.|2 baños|72 m²|Bajos|Ascensor 2 2 72 <NA> <NA> Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción 3 2 110 <NA> Calefacción <NA>
3 3 habs.|70 m²|2ª Planta 3 <NA> 70 <NA> <NA> 2
4 2 habs.|2 baños|160 m²|Terraza|Balcón 2 2 160 Terraza <NA> <NA>
5 5 habs.|2 baños|176 m²|7ª Planta|Ascensor 5 2 176 <NA> <NA> 7
6 3 habs.|2 baños|187 m²|4ª Planta 3 2 187 <NA> <NA> 4
Which is almost what I need.
CodePudding user response:
This is probably solvable using the tidyr
package. This could be a starting point for you. sep
accepts regex, so you could maybe utilize your regex from the str_extract
above.
library(dplyr)
library(tidyr)
data %>% separate(cardCharacteristics,
sep = "\\|",
into = c(
"habs", "baños", "m^2", "Floor",
"Lift"
)
)
Results in
habs baños m^2 Floor Lift
1 2 habs. 2 baños 72 m² Bajos Ascensor
2 3 habs. 2 baños 110 m² Ascensor Calefacción
3 3 habs. 70 m² 2ª Planta <NA> <NA>
4 2 habs. 2 baños 160 m² Terraza Balcón
5 5 habs. 2 baños 176 m² 7ª Planta Ascensor
6 3 habs. 2 baños 187 m² 4ª Planta <NA>
CodePudding user response:
Returning to your dictionary
idea, you might want a tag:value
approach as might be had with write.dcf
. My system is a different encoding that yours that make some things hard to test, or with your encoding lead to undesired results, though will likely work on your system.
I make the assumption that 'habs.', 'banos' 'm2' are the basics, further entries are additional amenities, and each 'record' starts with 'habs.'
Using your data
:
# split1 <- strsplit(...
strsplit(unname(unlist(data)), '|', fixed = TRUE)
[[1]]
[1] "2 habs." "2 baños" "72 m²" "Bajos" "Ascensor"
[[2]]
[1] "3 habs." "2 baños" "110 m²" "Ascensor" "Calefacción"
[[3]]
[1] "3 habs." "70 m²" "2ª Planta"
[[4]]
[1] "2 habs." "2 baños" "160 m²" "Terraza" "Balcón"
[[5]]
[1] "5 habs." "2 baños" "176 m²" "7ª Planta" "Ascensor"
[[6]]
[1] "3 habs." "2 baños" "187 m²" "4ª Planta"
These are our records and we want to append \n\n
to each as record separator for .dcf
# split1 <- lapply(...
lapply(split1, function(x) c(x, '\n\n'))
[[1]]
[1] "2 habs." "2 baños" "72 m²" "Bajos" "Ascensor" "\n\n"
[[2]]
[1] "3 habs." "2 baños" "110 m²" "Ascensor" "Calefacción"
[6] "\n\n"
[[3]]
[1] "3 habs." "70 m²" "2ª Planta" "\n\n"
[[4]]
[1] "2 habs." "2 baños" "160 m²" "Terraza" "Balcón" "\n\n"
[[5]]
[1] "5 habs." "2 baños" "176 m²" "7ª Planta" "Ascensor" "\n\n"
[[6]]
[1] "3 habs." "2 baños" "187 m²" "4ª Planta" "\n\n"
At the moment, the 'tags' are after the 'values', so we have to swap them:
# split1 <- sub(...
sub('(.*) (.*)', '\\2 \\1', unlist(split1))
[1] "habs. 2" "baños 2" "m² 72" "Bajos" "Ascensor"
[6] "\n\n" "habs. 3" "baños 2" "m² 110" "Ascensor"
[11] "Calefacción" "\n\n" "habs. 3" "m² 70" "Planta 2ª"
[16] "\n\n" "habs. 2" "baños 2" "m² 160" "Terraza"
[21] "Balcón" "\n\n" "habs. 5" "baños 2" "m² 176"
[26] "Planta 7ª" "Ascensor" "\n\n" "habs. 3" "baños 2"
[31] "m² 187" "Planta 4ª" "\n\n"
.dcf
tag:value so replace ' ' with ':'
# split1 <- gsub(
gsub(' ', ':', split1)
[1] "habs.:2" "baños:2" "m²:72" "Bajos" "Ascensor"
[6] "\n\n" "habs.:3" "baños:2" "m²:110" "Ascensor"
[11] "Calefacción" "\n\n" "habs.:3" "m²:70" "Planta:2ª"
[16] "\n\n" "habs.:2" "baños:2" "m²:160" "Terraza"
[21] "Balcón" "\n\n" "habs.:5" "baños:2" "m²:176"
[26] "Planta:7ª" "Ascensor" "\n\n" "habs.:3" "baños:2"
[31] "m²:187" "Planta:4ª" "\n\n"
We're close at this point, but I'm pretty sure that our un-terminated amenities (Bajos, Ascensor & etc) will be seen as 'malformed', though appears to work with cat
cat(split1)
habs.:2
baños:2
m²:72
Bajos
Ascensor
habs.:3
baños:2
m²:110
Ascensor
Calefacción
# --snip--
# but
c(read.dcf(textConnection(split1)))
Error in read.dcf(textConnection(split1)) :
Line starting 'Bajos ...' is malformed! # first non ':' terminated
Append ':', plays poorly with my encoding, but should work for yours, my bad results below
cat(gsub("(\\p{L} )\\b(?![\\p{P}\\p{S}])", '\\1:', split1, perl = TRUE))
habs.:2
bañ:os:2
m:²:72
Bajos:
Ascensor:
habs.:3
bañ:os:2
m:²:110
Ascensor:
Calefacción:
Bad things happen to 'banos' and 'm2'. And perhaps it's all just a bad idea... But, changing the enyey in banos and exponent in m2, things start to get better, but the amenities need either a number or double listing (Ascensor Ascensor)
data$cardCharacteristics[6] <- "3 habs.|2 banos|187 m2|4ª Planta"
split3 <- strsplit(unname(unlist(data)), '|', fixed = TRUE)
split3 <- lapply(split3, function(x) c(x, '\n\n'))
split3 <- sub('(.*) (.*)', '\\2 \\1', unlist(split3))
split3 <- gsub(' ', ':', split3)
split3 <- gsub("(\\p{L} )\\b(?![\\p{P}\\p{S}])", '\\1:', split3, perl = TRUE)> split3
[1] "habs.:2" "banos:2" "m2:72" "Bajos:" "Ascensor:"
[6] "\n\n" "habs.:3" "banos:2" "m2:110" "Ascensor:"
[11] "Calefacción:" "\n\n" "habs.:3" "m2:70" "Planta:2ª"
[16] "\n\n" "habs.:2" "banos:2" "m2:160" "Terraza:"
[21] "Balcón:" "\n\n" "habs.:5" "banos:2" "m2:176"
[26] "Planta:7ª" "Ascensor:" "\n\n" "habs.:3" "banos:2"
[31] "m2:187" "Planta:4ª" "\n\n"
c(read.dcf(textConnection(split3), fields='habs.'))
[1] "2" "3" "3" "2" "5" "3"
c(read.dcf(textConnection(split3), fields='banos'))
[1] "2" "2" NA "2" "2" "2"
c(read.dcf(textConnection(split3), fields='Ascensor'))
[1] "" "" NA NA "" NA
And a tibble might result, and things are nicely organized.
CodePudding user response:
You were very close indeed, just the way you extracted Floor
was contrary to your aim (the lookahead prevented the extraction of the ª Planta
substring!):
data %>%
mutate(
habs = str_extract(cardCharacteristics, "(\\d) (?= habs.)"),
baños = str_extract(cardCharacteristics, "(\\d) (?= baños)"),
mts2 = str_extract(cardCharacteristics, "(\\d) (?= m²)"),
Terraza = str_extract(cardCharacteristics, "Terraza"),
Calefacción = str_extract(cardCharacteristics, "Calefacción"),
Floor = str_extract(cardCharacteristics, "Bajos|\\d ª Planta"), # <--- Corrected here
)
cardCharacteristics habs baños mts2 Terraza Calefacción Floor
1 2 habs.|2 baños|72 m²|Bajos|Ascensor 2 2 72 <NA> <NA> Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción 3 2 110 <NA> Calefacción <NA>
3 3 habs.|70 m²|2ª Planta 3 <NA> 70 <NA> <NA> 2ª Planta
4 2 habs.|2 baños|160 m²|Terraza|Balcón 2 2 160 Terraza <NA> <NA>
5 5 habs.|2 baños|176 m²|7ª Planta|Ascensor 5 2 176 <NA> <NA> 7ª Planta
6 3 habs.|2 baños|187 m²|4ª Planta 3 2 187 <NA> <NA> 4ª Planta