Home > Mobile >  How to get new logical column comparing 2 dataframe in r
How to get new logical column comparing 2 dataframe in r

Time:09-01

So I have two datasets:

  • First one has 42 columns(id number and 41 chr columns which are all like "Susan, Math, 73") and around 20 million obs.

  • Second one has 4 columns(Reference number and 3 chr columns which are all like "Max, English, 89") and around 10 thousand obs.

I would like to compare these two dataset and create new logical column(True, False or 1, 0) to first one with 42 columns if any values from first dataset exist in second dataset.

I have tried to use %in% then there are so many combination I should use for example

First_dataset$log_1 <- First_dataset$first_key %in% second_dataset$key_first

First_dataset$log_2 <- First_dataset$first_key %in% second_dataset$key_second

First_dataset$log_3 <- First_dataset$first_key %in% second_dataset$key_third

First_dataset$log_4 <- First_dataset$second_key %in% second_dataset$key_first

First_dataset$log_5 <- First_dataset$second_key %in% second_dataset$key_second

First_dataset$log_6 <- First_dataset$second_key %in% second_dataset$key_third

until 42 columns...

Is there any simpler method(or function) I can employ to get new logical columns with less lines of code?

CodePudding user response:

Giving the number of observations in your datasets, I doubt the following solution would be suitable, but I assume this would work for smaller datasets.

The trick is to reshape your data, that is currently stored in a wide format, to a long format, see ?reshape. You can then perform your comparison.

If you want to go back to a wide format, you can do so, adding as much columns as your original first dataset.

Here is a little reproducible example.

# Your dataset with 42 columns
df1 <- data.frame(id = 1:5,
                  col1 = c("a", "b", "c", "d", "e"),
                  col2 = c("f", "g", "h", "i", "j"),
                  col3 = c("k", "l", "m", "n", "o"))

# Your dataset with 4 columns
df2 <- data.frame(id = 1:3,
                  col1 = c("a", "b", "e"),
                  col2 = c("g", "h", "m"))

# Reshape your datasets to long format
df1_long <- reshape(df1,
                    varying = list(paste0("col", 1:3)),
                    direction = "long",
                    idvar = "id",
                    v.names = "col")

df2_long <- reshape(df2,
                    varying = list(paste0("col", 1:2)),
                    direction = "long",
                    idvar = "id",
                    v.names = "col")

# Perform your comparison
df1_long$log <- df1_long$col %in% df2_long$col

# Go back to your initial wide format
df1_wide <- reshape(df1_long,
                    direction = "wide",
                    timevar = "time",
                    v.names = c("col", "log"),
                    idvar = "id",
                    sep = "")
  • Related