I have a large dataset of equivalent structure to the below.
input.data <- data.frame(Continent = rep("Asia", 5),
Country = rep("India", 5),
Country.lat = rep(20.5937, 5),
Country.long = rep(78.9629, 5),
State = rep("Punjab", 5),
Region = rep("Ludhiana", 5),
Animal.group = rep("Mammal", 5),
Species.guild = c("Carnivore", "Omnivore\nOmnivore", "Omnivore\nOmnivore\nCarnivore", "Omnivore\nOmnivore\nCarnivore", "Herbivore\nOmnivore\nCarnivore\nCarnivore"),
Family = c("Canidae", "Muridae\nMuridae", "Muridae\nMuridae\nCanidae", "Muridae\nMuridae\nCanidae", "Leporidae\nMuridae\nCanidae\nCanidae"),
Scientific.name = c("Canis familiaris", "Mus musculus\nRattus rattus", "Mus musculus\nRattus rattus\nCanis familiaris", "Mus musculus\nRattus rattus\nCanis familiaris", "Oryctolagus cuniculus\nRattus rattus\nCanis familiaris\nCanis familiaris"),
Common.name = c("Dog", "House mouse\nBlack rat", "House mouse\nBlack rat\nDog", "House mouse\nBlack rat\nDog", "European rabbit\nBlack rat\nDog\nDog"),
Num.species = c(1, 2, 3, 3, 4),
Bait.state = rep("Solid", 5),
Bait.deliv.broad = rep("Toxin", 5),
Bait.deliv.specific = "1080\n1080\n1080\n1080\n1080")
input.data
has 5 observations, each of which represents a separate experiment. However, some (but not all) experiments have focused on more than one species and hence have multiple entries in a single cell. This makes it very difficult to summarise species related information, including input.data$Species.guild
, input.data$Family
, input.data$Scientific.name
and input.data$Common.name
. To be able to summarise this data I need to expend input.data
so that the new data frame (output.data
) includes only a single piece of information per cell but otherwise all other associated information on the same row remains the same.
Note that where multiple pieces of info exist in a single cell they are separated by a carriage return \n
The resulting data frame should look like the below (rows don't need to be in same order necessarily).
Assistance much appreciated
output.data <- data.frame(Continent = rep("Asia", 13),
Country = rep("India", 13),
Country.lat = rep(20.5937, 13),
Country.long = rep(78.9629, 13),
State = rep("Punjab", 13),
Region = rep("Ludhiana", 13),
Animal.group = rep("Mammal", 13),
Species.guild = c("Carnivore", "Omnivore", "Omnivore", "Omnivore", "Omnivore", "Carnivore", "Omnivore", "Omnivore", "Carnivore", "Herbivore", "Omnivore", "Carnivore", "Carnivore"),
Family = c("Canidae", "Muridae", "Muridae", "Muridae", "Muridae", "Canidae", "Muridae", "Muridae", "Canidae", "Leporidae", "Muridae", "Canidae", "Canidae"),
Scientific.name = c("Canis familiaris", "Mus musculus", "Rattus rattus", "Mus musculus", "Rattus rattus", "Canis familiaris", "Mus musculus", "Rattus rattus", "Canis familiaris", "Oryctolagus cuniculus", "Rattus rattus", "Canis familiaris", "Canis familiaris"),
Common.name = c("Dog", "House mouse", "Black rat", "House mouse", "Black rat", "Dog", "House mouse", "Black rat", "Dog", "European rabbit", "Black rat", "Dog", "Dog"),
Num.species = rep(1, 13),
Bait.state = rep("Solid", 13),
Bait.deliv.broad = rep("Toxin", 13),
Bait.deliv.specific = c("1080", "1080", "1080", "1080", "1080", "1080", "1080", "1080", "1080", "1080", "1080", "1080", "1080"))
CodePudding user response:
input.data %>%
separate_rows(Species.guild, Family, Scientific.name, Common.name, sep = '\n')
# A tibble: 13 x 15
Continent Country Country.lat Country.long State Region Animal.group
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 Asia India 20.6 79.0 Punjab Ludhiana Mammal
2 Asia India 20.6 79.0 Punjab Ludhiana Mammal
3 Asia India 20.6 79.0 Punjab Ludhiana Mammal
4 Asia India 20.6 79.0 Punjab Ludhiana Mammal
5 Asia India 20.6 79.0 Punjab Ludhiana Mammal
6 Asia India 20.6 79.0 Punjab Ludhiana Mammal
7 Asia India 20.6 79.0 Punjab Ludhiana Mammal
8 Asia India 20.6 79.0 Punjab Ludhiana Mammal
9 Asia India 20.6 79.0 Punjab Ludhiana Mammal
10 Asia India 20.6 79.0 Punjab Ludhiana Mammal
11 Asia India 20.6 79.0 Punjab Ludhiana Mammal
12 Asia India 20.6 79.0 Punjab Ludhiana Mammal
13 Asia India 20.6 79.0 Punjab Ludhiana Mammal
#..with 8 more variables: Species.guild <chr>, Family <chr>,
# Scientific.name <chr>, Common.name <chr>, Num.species <dbl>, Bait.state <chr>,
# Bait.deliv.broad <chr>, Bait.deliv.specific <chr>
CodePudding user response:
Since the last column has additional values that make the separation uneven, you could use splitstackshape::cSplit
and zoo::na.locf
.
library(splitstackshape)
library(zoo)
df <- cSplit(input.data, c(8:11, 15), "long", sep = "\n", drop = TRUE)
na.locf(df[rowSums(is.na(df)) != ncol(df),])
Output
Continent Country Country.lat Country.long State Region Animal.group Species.guild Family Scientific.name Common.name Num.species Bait.state Bait.deliv.broad Bait.deliv.specific
1: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 1 Solid Toxin 1080
2: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 1 Solid Toxin 1080
3: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 1 Solid Toxin 1080
4: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 1 Solid Toxin 1080
5: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 1 Solid Toxin 1080
6: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Mus musculus House mouse 2 Solid Toxin 1080
7: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 2 Solid Toxin 1080
8: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 2 Solid Toxin 1080
9: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 2 Solid Toxin 1080
10: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 2 Solid Toxin 1080
11: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Mus musculus House mouse 3 Solid Toxin 1080
12: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 3 Solid Toxin 1080
13: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
14: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
15: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
16: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Mus musculus House mouse 3 Solid Toxin 1080
17: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 3 Solid Toxin 1080
18: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
19: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
20: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
21: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Herbivore Leporidae Oryctolagus cuniculus European rabbit 4 Solid Toxin 1080
22: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 4 Solid Toxin 1080
23: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 4 Solid Toxin 1080
24: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 4 Solid Toxin 1080
25: Asia India 20.5937 78.9629 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 4 Solid Toxin 1080
Alternatively, if you only need the same amount of values, as is represented in other columns, then as mentioned in the comments you can use separate_rows
from tidyverse
. But we can trim the amount of values in Bait.deliv.specific
according to another column like Species.guild
, then we can separate all the needed columns at the same time.
library(tidyverse)
input.data %>%
mutate(x = str_count(Species.guild, "\n")) %>%
rowwise %>%
mutate(Bait.deliv.specific = paste0(strsplit(Bait.deliv.specific, "\n")[[1]][1:(x 1)], collapse = "\n")) %>%
ungroup %>%
separate_rows(., c(8:11, 15), sep = "\n")
Output
# A tibble: 13 × 15
Continent Country Country.lat Country.long State Region Animal.group Species.guild Family Scientific.name Common.name Num.species Bait.state Bait.deliv.broad Bait.deliv.specific
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Asia India 20.6 79.0 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 1 Solid Toxin 1080
2 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Mus musculus House mouse 2 Solid Toxin 1080
3 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 2 Solid Toxin 1080
4 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Mus musculus House mouse 3 Solid Toxin 1080
5 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 3 Solid Toxin 1080
6 Asia India 20.6 79.0 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
7 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Mus musculus House mouse 3 Solid Toxin 1080
8 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 3 Solid Toxin 1080
9 Asia India 20.6 79.0 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 3 Solid Toxin 1080
10 Asia India 20.6 79.0 Punjab Ludhiana Mammal Herbivore Leporidae Oryctolagus cuniculus European rabbit 4 Solid Toxin 1080
11 Asia India 20.6 79.0 Punjab Ludhiana Mammal Omnivore Muridae Rattus rattus Black rat 4 Solid Toxin 1080
12 Asia India 20.6 79.0 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 4 Solid Toxin 1080
13 Asia India 20.6 79.0 Punjab Ludhiana Mammal Carnivore Canidae Canis familiaris Dog 4 Solid Toxin 1080