I have a dataset (MN_Census) that has information for all census tracts for the following years: 1990, 2000, 2010, and 2020. The variable ID that identifies the census tract is "GISJOIN". My dataset looks like this:
GISJOIN | GEOGYEAR | STATE | STATEA | COUNTY | COUNTYA | TRACTA | CL8AA | CL9AA | DATAYEAR |
---|---|---|---|---|---|---|---|---|---|
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 1954.45 | 0 | 1990 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2045.99 | 0 | 1990 |
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2361.32 | 0 | 2000 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2862.97 | 0 | 2000 |
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2327 | 0 | 2010 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3262 | 0 | 2010 |
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2244 | NA | 2020 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3120 | NA | 2020 |
I'm trying to use either join/merge/rbind to add foreach census tract, the years in between. This is, I want to add the following dataset foreach of the census tract:
YearTransaction |
---|
1990 |
1991 |
1992 |
1993 |
1994 |
1995 |
1996 |
1997 |
1998 |
1999 |
2000 |
2001 |
2002 |
2003 |
2004 |
So at the end, my desired dataset will have all years (from 1990 to 2020) foreach one of the census tract. I'm trying using "multidplyr
" to make a partition by census tract ID (which is represented by the variable "GISJOIN"), and join each one of these partitions with the Years dataset. I'm using the following code:
library("parallel")
library("modeest")
library("multidplyr")
library("doParallel")
library("dtplyr")
detectCores()
cluster <- new_cluster(4)
base <- MN_Census %>% group_by(GISJOIN) %>% partition(cluster)
system.time(final <- dplyr::left_join(Years,base, by=c("YearTransaction"="DATAYEAR"),copy=TRUE) %>% collect())
final <- final %>% dplyr::ungroup()
However, I'm not getting the desired results. What I'm getting is a dataaset with all years (which is good), but these years are not duplicated foreach census tract (GISJOIN):
The desired dataset should look like this:
DATAYEAR | GISJOIN | GEOGYEAR | STATE | STATEA | COUNTY | COUNTYA | TRACTA | CL8AA | CL9AA |
---|---|---|---|---|---|---|---|---|---|
1990 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 1954.45 | 0 |
1991 | G2700010770100 | ||||||||
… | G2700010770100 | ||||||||
2000 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2361.32 | 0 |
2001 | G2700010770100 | ||||||||
… | G2700010770100 | ||||||||
2010 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2327 | 0 |
2011 | G2700010770100 | ||||||||
… | G2700010770100 | ||||||||
2020 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2244 | NA |
1990 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2045.99 | 0 |
1991 | G2700010790502 | ||||||||
… | G2700010790502 | ||||||||
2000 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2862.97 | 0 |
2001 | G2700010790502 | ||||||||
… | G2700010790502 | ||||||||
2010 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3262 | 0 |
2011 | G2700010790502 | ||||||||
… | G2700010790502 | ||||||||
2020 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3120 | NA |
How can I fix this using multidplyr or any kind of parallel processing? (My original dataset is huge). Any help using dplyr is really appreciated it!
CodePudding user response:
Update:
library(dplyr)
library(tidyr)
df %>%
group_by(TRACTA) %>%
tidyr::complete(DATAYEAR= full_seq(DATAYEAR, period = 1)) %>%
data.frame()
TRACTA DATAYEAR A GISJOIN GEOGYEAR STATE STATEA COUNTY COUNTYA CL8AA CL9AA
1 770100 1990 G2700010770100 2010 Minnesota 27 Aitkin County 1 1954.45 0
2 770100 1991 <NA> NA <NA> NA <NA> <NA> NA NA NA
3 770100 1992 <NA> NA <NA> NA <NA> <NA> NA NA NA
4 770100 1993 <NA> NA <NA> NA <NA> <NA> NA NA NA
5 770100 1994 <NA> NA <NA> NA <NA> <NA> NA NA NA
6 770100 1995 <NA> NA <NA> NA <NA> <NA> NA NA NA
7 770100 1996 <NA> NA <NA> NA <NA> <NA> NA NA NA
8 770100 1997 <NA> NA <NA> NA <NA> <NA> NA NA NA
9 770100 1998 <NA> NA <NA> NA <NA> <NA> NA NA NA
10 770100 1999 <NA> NA <NA> NA <NA> <NA> NA NA NA
11 770100 2000 G2700010770100 2010 Minnesota 27 Aitkin County 1 2361.32 0
12 770100 2001 <NA> NA <NA> NA <NA> <NA> NA NA NA
13 770100 2002 <NA> NA <NA> NA <NA> <NA> NA NA NA
14 770100 2003 <NA> NA <NA> NA <NA> <NA> NA NA NA
15 770100 2004 <NA> NA <NA> NA <NA> <NA> NA NA NA
16 770100 2005 <NA> NA <NA> NA <NA> <NA> NA NA NA
17 770100 2006 <NA> NA <NA> NA <NA> <NA> NA NA NA
18 770100 2007 <NA> NA <NA> NA <NA> <NA> NA NA NA
19 770100 2008 <NA> NA <NA> NA <NA> <NA> NA NA NA
20 770100 2009 <NA> NA <NA> NA <NA> <NA> NA NA NA
21 770100 2010 G2700010770100 2010 Minnesota 27 Aitkin County 1 2327.00 0
22 770100 2011 <NA> NA <NA> NA <NA> <NA> NA NA NA
23 770100 2012 <NA> NA <NA> NA <NA> <NA> NA NA NA
24 770100 2013 <NA> NA <NA> NA <NA> <NA> NA NA NA
25 770100 2014 <NA> NA <NA> NA <NA> <NA> NA NA NA
26 770100 2015 <NA> NA <NA> NA <NA> <NA> NA NA NA
27 770100 2016 <NA> NA <NA> NA <NA> <NA> NA NA NA
28 770100 2017 <NA> NA <NA> NA <NA> <NA> NA NA NA
29 770100 2018 <NA> NA <NA> NA <NA> <NA> NA NA NA
30 770100 2019 <NA> NA <NA> NA <NA> <NA> NA NA NA
31 770100 2020 G2700010770100 2010 Minnesota 27 Aitkin County 1 2244.00 NA
32 790502 1990 G2700010790502 2010 Minnesota 27 Aitkin County 1 2045.99 0
33 790502 1991 <NA> NA <NA> NA <NA> <NA> NA NA NA
34 790502 1992 <NA> NA <NA> NA <NA> <NA> NA NA NA
35 790502 1993 <NA> NA <NA> NA <NA> <NA> NA NA NA
36 790502 1994 <NA> NA <NA> NA <NA> <NA> NA NA NA
37 790502 1995 <NA> NA <NA> NA <NA> <NA> NA NA NA
38 790502 1996 <NA> NA <NA> NA <NA> <NA> NA NA NA
39 790502 1997 <NA> NA <NA> NA <NA> <NA> NA NA NA
40 790502 1998 <NA> NA <NA> NA <NA> <NA> NA NA NA
41 790502 1999 <NA> NA <NA> NA <NA> <NA> NA NA NA
42 790502 2000 G2700010790502 2010 Minnesota 27 Aitkin County 1 2862.97 0
43 790502 2001 <NA> NA <NA> NA <NA> <NA> NA NA NA
44 790502 2002 <NA> NA <NA> NA <NA> <NA> NA NA NA
45 790502 2003 <NA> NA <NA> NA <NA> <NA> NA NA NA
46 790502 2004 <NA> NA <NA> NA <NA> <NA> NA NA NA
47 790502 2005 <NA> NA <NA> NA <NA> <NA> NA NA NA
48 790502 2006 <NA> NA <NA> NA <NA> <NA> NA NA NA
49 790502 2007 <NA> NA <NA> NA <NA> <NA> NA NA NA
50 790502 2008 <NA> NA <NA> NA <NA> <NA> NA NA NA
51 790502 2009 <NA> NA <NA> NA <NA> <NA> NA NA NA
52 790502 2010 G2700010790502 2010 Minnesota 27 Aitkin County 1 3262.00 0
53 790502 2011 <NA> NA <NA> NA <NA> <NA> NA NA NA
54 790502 2012 <NA> NA <NA> NA <NA> <NA> NA NA NA
55 790502 2013 <NA> NA <NA> NA <NA> <NA> NA NA NA
56 790502 2014 <NA> NA <NA> NA <NA> <NA> NA NA NA
57 790502 2015 <NA> NA <NA> NA <NA> <NA> NA NA NA
58 790502 2016 <NA> NA <NA> NA <NA> <NA> NA NA NA
59 790502 2017 <NA> NA <NA> NA <NA> <NA> NA NA NA
60 790502 2018 <NA> NA <NA> NA <NA> <NA> NA NA NA
61 790502 2019 <NA> NA <NA> NA <NA> <NA> NA NA NA
62 790502 2020 G2700010790502 2010 Minnesota 27 Aitkin County 1 3120.00 NA