After years of relying on the advice you have given to other users, here is an issue I have not been able to solve so far...
I have a dataset with thousands of rows and hundreds of columns, in which several rows can share the same value in one column. Here is a subset of my dataset:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("x1", "x2", "x3", "x2", "x3")
mat <- cbind(ID, Dose, Value)
What I want is to assign a code to each row such that rows sharing the same entry in the "Value" column get the same code, and each distinct entry gets a different one, like this:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("153254", "258634", "896411", "258634", "896411")
Code <- c("1", "2", "3", "2", "3")
mat <- cbind(ID, Dose, Value, Code)
Does anyone have an idea that could help me?
Thanks !
CodePudding user response:
You should consider using a data.frame:
mat <- data.frame(ID, Dose, Value)
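As a rough illustration of why the data.frame helps (a sketch only; the Code column is filled with base R's match purely for demonstration): cbind() on character vectors produces a character matrix, so any numeric code added to it ends up stored as text, whereas a data.frame keeps each column's own type.

```r
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("x1", "x2", "x3", "x2", "x3")

# cbind() on character vectors gives a character matrix
m <- cbind(ID, Dose, Value)
m <- cbind(m, Code = 1:5)    # the integers are coerced to character
is.character(m)              # TRUE

# a data.frame keeps per-column types
df <- data.frame(ID, Dose, Value)
df$Code <- match(df$Value, unique(df$Value))
is.integer(df$Code)          # TRUE
```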
Using dplyr, you could create the desired output:
library(dplyr)
mat %>%
  group_by(Value) %>%
  mutate(Code = cur_group_id()) %>%
  ungroup()
This returns
# A tibble: 5 x 4
  ID    Dose  Value   Code
  <chr> <chr> <chr>  <int>
1 A     1     153254     1
2 B     5     258634     2
3 C     3     896411     3
4 D     4     258634     2
5 E     5     896411     3
CodePudding user response:
We may use match here:
library(dplyr)
mat %>%
  mutate(Code = match(Value, unique(Value)))
-output
  ID Dose  Value Code
1  A    1 153254    1
2  B    5 258634    2
3  C    3 896411    3
4  D    4 258634    2
5  E    5 896411    3
data
mat <- data.frame(ID, Dose, Value)
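One detail that may be worth checking, sketched here in base R: match(Value, unique(Value)) numbers the groups in order of first appearance, while approaches that sort the group keys number them in sorted order (as.integer(factor(...)) below; as far as I know, cur_group_id() after group_by() behaves the same way). The two agree on the example data only because the values happen to first appear in sorted order. The vector v is made up for this illustration.

```r
# hypothetical values that do not first appear in sorted order
v <- c("x3", "x1", "x3", "x2")

by_appearance <- match(v, unique(v))   # numbered by first appearance
by_sorted <- as.integer(factor(v))     # numbered by sorted group key

by_appearance   # 1 2 1 3
by_sorted       # 3 1 3 2
```

Either way, rows with the same value share a code; only the numbering differs.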
CodePudding user response:
To generate unique values, we could use a hash function. Here is one approach using the fst package, which implements xxHash. The benefit of hashing is that the values are nicely spaced out and the probability of collisions is extremely low, while still being very fast. When the data reaches a few million different groups, the [1] subsetting should be removed to make use of the full 64-bit key.
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("x1", "x2", "x3", "x2", "x3")
mat <- cbind(ID, Dose, Value)
# Hash the raw bytes of each Value; [1] keeps only the first element of the hash
mat[, "Value"] <-
  lapply(mat[, "Value"], charToRaw) |>
  lapply(\(x) fst::hash_fst(x, block_hash = FALSE)[1]) |>
  unlist(use.names = FALSE)
     ID  Dose Value
[1,] "A" "1"  "1212139790"
[2,] "B" "5"  "1379455937"
[3,] "C" "3"  "756640974"
[4,] "D" "4"  "1379455937"
[5,] "E" "5"  "756640974"
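To get a feel for when a single 32-bit integer stops being enough, here is a back-of-the-envelope sketch using the usual birthday approximation. It assumes the hash behaves like a uniform random draw over 2^bits values, which is an idealisation and not something guaranteed by fst; the helper collision_prob is a made-up name for this illustration.

```r
# Approximate probability of at least one collision among n_groups distinct
# inputs hashed uniformly into 2^bits values (birthday approximation).
collision_prob <- function(n_groups, bits = 32) {
  -expm1(-n_groups * (n_groups - 1) / (2 * 2^bits))
}

collision_prob(5)                # the five-row example: negligible
collision_prob(1e5)              # 32-bit key, one hundred thousand groups
collision_prob(1e5, bits = 64)   # same number of groups, 64-bit key
```

Under that assumption the 32-bit key is comfortable for small data but becomes risky well before the number of distinct groups reaches the millions, so comparing the number of distinct hashes with the number of distinct original values is a cheap sanity check.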