Home > Software engineering >  How can I add additional years to my dataset using multidplyr or parallel processing?
How can I add additional years to my dataset using multidplyr or parallel processing?

Time:02-03

I have a dataset (MN_Census) that has information for all census tracts for the following years: 1990, 2000, 2010, and 2020. The variable ID that identifies the census tract is "GISJOIN". My dataset looks like this:

GISJOIN GEOGYEAR STATE STATEA COUNTY COUNTYA TRACTA CL8AA CL9AA DATAYEAR
G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 1954.45 0 1990
G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 2045.99 0 1990
G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 2361.32 0 2000
G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 2862.97 0 2000
G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 2327 0 2010
G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 3262 0 2010
G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 2244 NA 2020
G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 3120 NA 2020

I'm trying to use either join/merge/rbind to add foreach census tract, the years in between. This is, I want to add the following dataset foreach of the census tract:

YearTransaction
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004

So at the end, my desired dataset will have all years (from 1990 to 2020) foreach one of the census tract. I'm trying using "multidplyr" to make a partition by census tract ID (which is represented by the variable "GISJOIN"), and join each one of these partitions with the Years dataset. I'm using the following code:

library("parallel")
library("modeest")
library("multidplyr")
library("doParallel")
library("dtplyr")


detectCores()
cluster <- new_cluster(4)
base <- MN_Census %>% group_by(GISJOIN) %>% partition(cluster)
system.time(final <- dplyr::left_join(Years,base, by=c("YearTransaction"="DATAYEAR"),copy=TRUE) %>% collect())
final <- final %>% dplyr::ungroup()

However, I'm not getting the desired results. What I'm getting is a dataaset with all years (which is good), but these years are not duplicated foreach census tract (GISJOIN): enter image description here

The desired dataset should look like this:

DATAYEAR GISJOIN GEOGYEAR STATE STATEA COUNTY COUNTYA TRACTA CL8AA CL9AA
1990 G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 1954.45 0
1991 G2700010770100
G2700010770100
2000 G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 2361.32 0
2001 G2700010770100
G2700010770100
2010 G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 2327 0
2011 G2700010770100
G2700010770100
2020 G2700010770100 2010 Minnesota 27 Aitkin County 1 770100 2244 NA
1990 G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 2045.99 0
1991 G2700010790502
G2700010790502
2000 G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 2862.97 0
2001 G2700010790502
G2700010790502
2010 G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 3262 0
2011 G2700010790502
G2700010790502
2020 G2700010790502 2010 Minnesota 27 Aitkin County 1 790502 3120 NA

How can I fix this using multidplyr or any kind of parallel processing? (My original dataset is huge). Any help using dplyr is really appreciated it!

CodePudding user response:

Update:

library(dplyr)
library(tidyr)

df %>% 
  group_by(TRACTA) %>% 
  tidyr::complete(DATAYEAR= full_seq(DATAYEAR, period = 1)) %>% 
  data.frame()
 TRACTA DATAYEAR              A GISJOIN  GEOGYEAR STATE STATEA COUNTY COUNTYA   CL8AA CL9AA
1  770100     1990 G2700010770100    2010 Minnesota    27 Aitkin County       1 1954.45     0
2  770100     1991           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
3  770100     1992           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
4  770100     1993           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
5  770100     1994           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
6  770100     1995           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
7  770100     1996           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
8  770100     1997           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
9  770100     1998           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
10 770100     1999           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
11 770100     2000 G2700010770100    2010 Minnesota    27 Aitkin County       1 2361.32     0
12 770100     2001           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
13 770100     2002           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
14 770100     2003           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
15 770100     2004           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
16 770100     2005           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
17 770100     2006           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
18 770100     2007           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
19 770100     2008           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
20 770100     2009           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
21 770100     2010 G2700010770100    2010 Minnesota    27 Aitkin County       1 2327.00     0
22 770100     2011           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
23 770100     2012           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
24 770100     2013           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
25 770100     2014           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
26 770100     2015           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
27 770100     2016           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
28 770100     2017           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
29 770100     2018           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
30 770100     2019           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
31 770100     2020 G2700010770100    2010 Minnesota    27 Aitkin County       1 2244.00    NA
32 790502     1990 G2700010790502    2010 Minnesota    27 Aitkin County       1 2045.99     0
33 790502     1991           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
34 790502     1992           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
35 790502     1993           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
36 790502     1994           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
37 790502     1995           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
38 790502     1996           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
39 790502     1997           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
40 790502     1998           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
41 790502     1999           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
42 790502     2000 G2700010790502    2010 Minnesota    27 Aitkin County       1 2862.97     0
43 790502     2001           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
44 790502     2002           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
45 790502     2003           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
46 790502     2004           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
47 790502     2005           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
48 790502     2006           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
49 790502     2007           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
50 790502     2008           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
51 790502     2009           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
52 790502     2010 G2700010790502    2010 Minnesota    27 Aitkin County       1 3262.00     0
53 790502     2011           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
54 790502     2012           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
55 790502     2013           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
56 790502     2014           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
57 790502     2015           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
58 790502     2016           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
59 790502     2017           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
60 790502     2018           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
61 790502     2019           <NA>      NA      <NA>    NA   <NA>   <NA>      NA      NA    NA
62 790502     2020 G2700010790502    2010 Minnesota    27 Aitkin County       1 3120.00    NA
  • Related