Home > Blockchain >  Using a key to clean data in two corresponding columns
Using a key to clean data in two corresponding columns

Time:04-07

I have a large data frame (6 million rows, 20 columns) where data in one column corresponds to data in another column. I created a key that I now want to use to fix rows that have the wrong value. As a small example:

key = data.frame(animal = c('dog', 'cat', 'bird'), 
                  sound = c('bark', 'meow', 'chirp'))

The data frame looks like this (minus the other columns of data):

 df = data.frame(id = c(1, 2, 3, 4),
                     animal = c('dog', 'cat', 'bird', 'cat'), 
                     sound = c('meow', 'bark', 'chirp', 'chirp'))

I swear I have done this before but can't remember my solution. Any ideas?

CodePudding user response:

Using dplyr. If you want to fix sound according to animal,

library(dplyr)
df <- df %>% 
  mutate(sound = sapply(animal, function(x){key %>% filter(animal==x) %>% pull(sound)}))

should do the trick. If you want to fix animal according to sound:

df <- df %>% 
  mutate(animal = sapply(sound, function(x){key %>% filter(sound==x) %>% pull(animal)}))

CodePudding user response:

I'm not sure about relative efficiency, but it's simpler to replace the partially incorrect column completely. It may not even cost you very much time (since you have to look up values anyway to determine that an animal/sound pair is mismatched).

library(tidyverse)
df %>% select(-sound) %>% full_join(key, by = "animal")

For 6 million rows, you may be better off using data.table. If you convert df and key to data tables (as.data.table()) that will take some up-front computational time but may speed up subsequent operations; you can use tidyverse operations on data.table objects without doing any further modifications, but native data.table operations might be faster:

library(data.table
dft <- as.data.table(df)
k <- as.data.table(key)
merge(dft[,-"sound"], k, by = "animal")

I haven't bothered to do any benchmarking (would need much larger examples to be able to measure any differences).

  • Related