In R, I'm trying to get the frequency and the first occurence for each item of a column in a data frame.
I have this:
df_input <- data.frame(observation_source = c("AB","CD","EF","GH","IJ","KL","MN"), observation_value = c(15,17,17,17,21,32,32))
observation_source observation_value
1 AB 15
2 CD 17
3 EF 17
4 GH 17
5 IJ 21
6 KL 32
7 MN 32
And I'm trying to get this:
observation_source observation_value value_frequency value_first_row
1 AB 15 1 1
2 CD 17 3 2
3 EF 17 3 2
4 GH 17 3 2
5 IJ 21 1 5
6 KL 32 2 6
7 MN 32 2 6
Such that, on row 4 for example, the value 17 occurs 3 times in total and occurs on row 2 for the first time.
I know how to do this with a for
loop, but it gets extremely slow as the number of row increases (e.g. 100,000). Any idea how else I can do that? Many thanks!!
CodePudding user response:
You could use add_count
and match
for the both goals:
library(dplyr)
df_input %>%
add_count(observation_value, name = "value_frequency") %>%
mutate(value_first_row = observation_value %>% match(., .))
# observation_source observation_value value_frequency value_first_row
# 1 AB 15 1 1
# 2 CD 17 3 2
# 3 EF 17 3 2
# 4 GH 17 3 2
# 5 IJ 21 1 5
# 6 KL 32 2 6
# 7 MN 32 2 6
CodePudding user response:
library(dplyr)
df_input %>%
mutate(
value_first_row = row_number()
) %>%
group_by(observation_value) %>%
mutate(
value_frequency = n(),
value_first_row = min(value_first_row)
)
# # A tibble: 7 × 4
# # Groups: observation_value [4]
# observation_source observation_value value_first_row value_frequency
# <chr> <dbl> <int> <int>
# 1 AB 15 1 1
# 2 CD 17 2 3
# 3 EF 17 2 3
# 4 GH 17 2 3
# 5 IJ 21 5 1
# 6 KL 32 6 2
# 7 MN 32 6 2
You can adjust the column order too, if that's important, with ... %>% relocate(value_first_row, .after = last_col())
CodePudding user response:
Using base R
transform(df_input, value_frequency = ave(observation_value,
observation_value, FUN = length),
value_first_row = ave(seq_along(observation_value),
observation_value, FUN = min))
-output
observation_source observation_value value_frequency value_first_row
1 AB 15 1 1
2 CD 17 3 2
3 EF 17 3 2
4 GH 17 3 2
5 IJ 21 1 5
6 KL 32 2 6
7 MN 32 2 6