I am trying to figure out how to identify name changes within a group.
For example, I have a dataframe that looks like this:
df <- data.frame(
state = rep(c("CA", "WI", "NY"), each = 2),
year = rep(c(2000, 2001), each = 9),
name = c("John", "Paul", "Sally",
"Mary", "Fred", "Jane",
"Linda", "Carl", "Jim",
"Peter", "Paul", "Sally",
"Mary", "Kate", "Jane",
"Linda", "Carl", "Jim")
)
> df
state year name
1 CA 2000 John
2 CA 2000 Paul
3 WI 2000 Sally
4 WI 2000 Mary
5 NY 2000 Fred
6 NY 2000 Jane
7 CA 2000 Linda
8 CA 2000 Carl
9 WI 2000 Jim
10 WI 2001 Peter
11 NY 2001 Paul
12 NY 2001 Sally
13 CA 2001 Mary
14 CA 2001 Kate
15 WI 2001 Jane
16 WI 2001 Linda
17 NY 2001 Carl
18 NY 2001 Jim
As you can see, "Peter" replaced "John" in 2001, and "Kate" replaced "Fred" in 2001.
So I want the output to look like:
df <- data.frame(
state = rep(c("CA", "WI", "NY"), each = 2),
year = rep(c(2000, 2001), each = 9),
name = c("John", "Paul", "Sally",
"Mary", "Fred", "Jane",
"Linda", "Carl", "Jim",
"Peter", "Paul", "Sally",
"Mary", "Kate", "Jane",
"Linda", "Carl", "Jim"),
change = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
1, 0, 0, 0, 1, 0, 0, 0, 0)
)
df
state year name change
1 CA 2000 John NA
2 CA 2000 Paul NA
3 WI 2000 Sally NA
4 WI 2000 Mary NA
5 NY 2000 Fred NA
6 NY 2000 Jane NA
7 CA 2000 Linda NA
8 CA 2000 Carl NA
9 WI 2000 Jim NA
10 WI 2001 Peter 1
11 NY 2001 Paul 0
12 NY 2001 Sally 0
13 CA 2001 Mary 0
14 CA 2001 Kate 1
15 WI 2001 Jane 0
16 WI 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
As you can see, Peter in 2001 and Kate in 2001 are both marked as "1" in the "change" column because they replaced "John" and "Fred" in 2000-CA and 2000-NY, respectively.
I've been looking at using lag(), but it seems to just return the value from the previous row rather than comparing names across years within each state:
library(dplyr)
df2 <- df %>%
  group_by(state, year) %>%
  mutate(change = lag(name, order_by = year))
Any help would be appreciated!
CodePudding user response:
Based on the expected output, maybe this helps: create a logical column based on the duplicated 'name' in the entire data. Then, grouped by 'year', if all values are FALSE (!change), replace with NA; otherwise negate (!) and convert the logical to binary (0/1).
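library(dplyr)
df %>%
  # TRUE where the name already appeared earlier in the data
  mutate(change = duplicated(name)) %>%
  group_by(year) %>%
  mutate(
    # no repeated names in this year (the first year): NA; otherwise 1 = new name, 0 = repeated
    change = if (all(!change)) NA_integer_ else as.integer(!change)) %>%
  ungroup()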
-output
# A tibble: 18 × 4
state year name change
<chr> <dbl> <chr> <int>
1 CA 2000 John NA
2 CA 2000 Paul NA
3 WI 2000 Sally NA
4 WI 2000 Mary NA
5 NY 2000 Fred NA
6 NY 2000 Jane NA
7 CA 2000 Linda NA
8 CA 2000 Carl NA
9 WI 2000 Jim NA
10 WI 2001 Peter 1
11 NY 2001 Paul 0
12 NY 2001 Sally 0
13 CA 2001 Mary 0
14 CA 2001 Kate 1
15 WI 2001 Jane 0
16 WI 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
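To make the two steps described above easier to follow, here is a minimal base R sketch of the same logic, using only the name and year vectors from the question. The helper names seen_before and change are just for illustration, and using ave() in place of group_by() is a substitution on my part, not part of the answer above.

```r
# Names in row order: nine for 2000 followed by nine for 2001 (copied from the question).
name <- c("John", "Paul", "Sally", "Mary", "Fred", "Jane", "Linda", "Carl", "Jim",
          "Peter", "Paul", "Sally", "Mary", "Kate", "Jane", "Linda", "Carl", "Jim")
year <- rep(c(2000, 2001), each = 9)

# Step 1: TRUE where a name has already appeared earlier in the vector.
seen_before <- duplicated(name)

# Step 2: within each year, return NA when every name is new (the first year);
# otherwise keep 1 for a new name and 0 for a repeated one.
change <- ave(as.integer(!seen_before), year,
              FUN = function(x) if (all(x == 1L)) rep(NA_integer_, length(x)) else x)

print(change)
```

Note that duplicated() looks at the whole column, so this flags a name as "new" only if it has never appeared anywhere earlier in the data. That fits the example, but it may not be what you want if the same name can legitimately show up in different states.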
CodePudding user response:
A base R approach that leaves out NAs
df2 <- split(df, df$year)
cbind(df, change=rep((!(df2$"2000"$name == df2$"2001"$name))*1, length(df2)))
state year name change
1 CA 2000 John 1
2 CA 2000 Paul 0
3 WI 2000 Sally 0
4 WI 2000 Mary 0
5 NY 2000 Fred 1
6 NY 2000 Jane 0
7 CA 2000 Linda 0
8 CA 2000 Carl 0
9 WI 2000 Jim 0
10 WI 2001 Peter 1
11 NY 2001 Paul 0
12 NY 2001 Sally 0
13 CA 2001 Mary 0
14 CA 2001 Kate 1
15 WI 2001 Jane 0
16 WI 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
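If the NAs for the first year are wanted, as in the question's expected output, the same positional comparison could be adapted along these lines. This is only a sketch: it assumes exactly two years, rows sorted by year, and the same row order within each year. The names by_year and changed are made up for illustration.

```r
# Rebuild the example data from the question.
df <- data.frame(
  state = rep(c("CA", "WI", "NY"), each = 2),
  year = rep(c(2000, 2001), each = 9),
  name = c("John", "Paul", "Sally", "Mary", "Fred", "Jane", "Linda", "Carl", "Jim",
           "Peter", "Paul", "Sally", "Mary", "Kate", "Jane", "Linda", "Carl", "Jim")
)

# One data frame per year; list elements are named "2000" and "2001".
by_year <- split(df, df$year)

# Compare names position by position: 1 where the 2001 name differs from the 2000 name.
changed <- as.integer(by_year[["2001"]]$name != by_year[["2000"]]$name)

# The first year has nothing to compare against, so it gets NA.
df$change <- c(rep(NA_integer_, nrow(by_year[["2000"]])), changed)

print(df)
```

Because the comparison is purely positional, it breaks if a year has a different number of rows or a different ordering, so sorting the data first would be a sensible precaution.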