I'm rather new to R and struggling through data tidying. I have a problem, where I don't find an answer to, but maybe I'm searching with the wrong terms.
I have a table (df_samples
) in the following format:
species | gender | group | sample1 | sample2 | sample n |
---|---|---|---|---|---|
penguin | m | i. | 20 | 21 | n |
penguin | f | i. | NA | 18 | n |
lion | m | ii. | 5 | 4 | n |
lion | f | ii. | 2 | 9 | n |
penguin | f | ii. | 22 | NA | n |
tiger | m | ii. | 7 | 6 | n |
tiger | f | ii. | 6 | 8 | n |
Now, the problem here is the penguin with group ii. which is wrong and should be i. In my table there are several hundred different species and samples. I have this problem with several rows, where species have the wrong group.
I was able to find the specific rows with the problems using the following code:
n_occur <- data.frame(table(df_samples$species))
df_samples_2 <- df_samples[df_samples$species %in% n_occur$Var1[n_occur$Freq > 2],]
This gives me the problematic rows and I can view them in an own dataframe. There I am able identify the rows with the mistakes and could correct them. But I have two problems where I'm stuck.
First I don't know how to index the problematic value to change it directly in my original data frame.
Second I have no idea how to bring the data stored in the row with the mistake to the "correct" row.
I am sure, there are answers on the web, but I am really struggling to express my problem in a way, which allows me to find them.
I would be grateful if somebody is able to help, either by pointing out how to search or by solving the problem.
CodePudding user response:
There are a few ways for this.
Assume all species have the same group
If all species belong to the same group, you can use a vector that stores the species
and group
information to replace the current group.
Again, this will replace ALL group
s within the same species
.
base R
correct_group <- c("penguin" = "i.", "tiger" = "ii.", "lion" = "ii.")
df$group <- correct_group[match(df$species, names(correct_group))]
dplyr
library(dplyr)
df %>% mutate(group = correct_group[match(species, names(correct_group))])
If you are doing it by hand:
We can also do it one by one if the species do not belong to the same group (only if you have a few records to change).
First identify the row index where species
is "penguin" and group
is "ii.". This is the record that you would like to change. Then simply replace the group
value with "i.".
base R
df[which(df$species == "penguin" & df$group == "ii."), "group"] <- "i."
dplyr
library(dplyr)
df %>% mutate(group = ifelse(species == "penguin" & group == "ii.", "i.", group))
Output
All of the above methods produce the same output.
species gender group sample1 sample2 sample.n
1 penguin m i. 20 21 n
2 penguin f i. NA 18 n
3 lion m ii. 5 4 n
4 lion f ii. 2 9 n
5 penguin f i. 22 NA n
6 tiger m ii. 7 6 n
7 tiger f ii. 6 8 n
Remember for the dplyr
methods, you have to "save" the df
back to it (df <- df %>% dplyr::method
), otherwise, it will only output the results to the console without actually changing anything.
CodePudding user response:
Using your process you can try the following steps.
Add unique ID to the rows so that it can be filtered later.
df_samples<-df_samples %>%
rowid_to_column()
Remove problem rows from df_samples based on the rowid in df_samples_2
df_samples<-df_samples[-df_samples_2$rowid,]
Update df_samples_2 as per your requirements, row by row mutates based on rowid.
Merge corrected rows back to df_samples
df_samples<-bind_rows(df_samples,df_samples_2)
Also if your end goal & data is as mentioned above you could also try this on your original df_samples
df_samples <-df_samples %>%
group_by(species) %>% #this will create internal groups
arrange(species,group) %>% # Will ensure i. will be before ii.
mutate(group=lag(group,default=first(group))) # lag() will copy earlier row values to current row per group.