Home > OS >  How to correct misspelling in column and collapse values into correct row in R
How to correct misspelling in column and collapse values into correct row in R

Time:02-19

I'm rather new to R and struggling through data tidying. I have a problem, where I don't find an answer to, but maybe I'm searching with the wrong terms.

I have a table (df_samples) in the following format:

species gender group sample1 sample2 sample n
penguin m i. 20 21 n
penguin f i. NA 18 n
lion m ii. 5 4 n
lion f ii. 2 9 n
penguin f ii. 22 NA n
tiger m ii. 7 6 n
tiger f ii. 6 8 n

Now, the problem here is the penguin with group ii. which is wrong and should be i. In my table there are several hundred different species and samples. I have this problem with several rows, where species have the wrong group.

I was able to find the specific rows with the problems using the following code:

n_occur <- data.frame(table(df_samples$species))
df_samples_2 <- df_samples[df_samples$species %in% n_occur$Var1[n_occur$Freq > 2],]

This gives me the problematic rows and I can view them in an own dataframe. There I am able identify the rows with the mistakes and could correct them. But I have two problems where I'm stuck.

First I don't know how to index the problematic value to change it directly in my original data frame.

Second I have no idea how to bring the data stored in the row with the mistake to the "correct" row.

I am sure, there are answers on the web, but I am really struggling to express my problem in a way, which allows me to find them.

I would be grateful if somebody is able to help, either by pointing out how to search or by solving the problem.

CodePudding user response:

There are a few ways for this.

Assume all species have the same group

If all species belong to the same group, you can use a vector that stores the species and group information to replace the current group.

Again, this will replace ALL groups within the same species.

base R

correct_group <- c("penguin" = "i.", "tiger" = "ii.", "lion" = "ii.")

df$group <- correct_group[match(df$species, names(correct_group))]

dplyr

library(dplyr)

df %>% mutate(group = correct_group[match(species, names(correct_group))])

If you are doing it by hand:

We can also do it one by one if the species do not belong to the same group (only if you have a few records to change).

First identify the row index where species is "penguin" and group is "ii.". This is the record that you would like to change. Then simply replace the group value with "i.".

base R

df[which(df$species == "penguin" & df$group == "ii."), "group"] <- "i."

dplyr

library(dplyr)

df %>% mutate(group = ifelse(species == "penguin" & group == "ii.", "i.", group))

Output

All of the above methods produce the same output.

species gender group sample1 sample2 sample.n
1 penguin      m    i.      20      21        n
2 penguin      f    i.      NA      18        n
3    lion      m   ii.       5       4        n
4    lion      f   ii.       2       9        n
5 penguin      f    i.      22      NA        n
6   tiger      m   ii.       7       6        n
7   tiger      f   ii.       6       8        n

Remember for the dplyr methods, you have to "save" the df back to it (df <- df %>% dplyr::method), otherwise, it will only output the results to the console without actually changing anything.

CodePudding user response:

Using your process you can try the following steps.

Add unique ID to the rows so that it can be filtered later.

df_samples<-df_samples %>% 
  rowid_to_column()

Remove problem rows from df_samples based on the rowid in df_samples_2

df_samples<-df_samples[-df_samples_2$rowid,]

Update df_samples_2 as per your requirements, row by row mutates based on rowid.

Merge corrected rows back to df_samples

df_samples<-bind_rows(df_samples,df_samples_2)

Also if your end goal & data is as mentioned above you could also try this on your original df_samples

df_samples <-df_samples %>% 
  group_by(species) %>%  #this will create internal groups
  arrange(species,group) %>%  # Will ensure i. will be before ii.
  mutate(group=lag(group,default=first(group))) # lag() will copy earlier row values to current row per group.
  • Related