Home > Software engineering >  Join data.frame with repeated lines in R
Join data.frame with repeated lines in R

Time:11-30

I have two dataframes. One is quality with different dates throughout the year of collection points at station st. The other is the land use of each st season for that year.

Example

library(dplyr)

water_df=read.table(text="st  Date    OD  pH  DBO
A 01/07/2005  8   6.3 3
A 02/06/2005  7   6.2 2.2
A 01/01/2005  7.3 6.5 3.1
A 03/05/2006  6   6.3 4
A 09/08/2006  6.8 7.1 1.1
A 12/12/2006  7.3 8.1 2.9
B 02/07/2005  6.8 5.4 2.6
B 03/06/2005  6.0 5.3 1.9
B 02/01/2005  6.2 5.5 2.6
B 04/05/2006  5.1 5.4 3.4
B 10/08/2006  5.8 6.0 0.9
B 13/12/2006  6.2 6.9 2.5
C 20/12/2006  6.5 7.2 2.6
C 27/12/2006  6.8 7.6 2.7
C 03/01/2007  7.2 8.0 2.9
C 10/01/2007  7.5 8.4 3.0
C 17/01/2007  7.9 8.8 3.1
C 24/01/2007  8.3 9.2 3.3
C 31/01/2007  8.7 9.7 3.5
C 07/02/2007  9.2 10.2    3.6", sep="", header=TRUE)%>%as.data.frame()



land_df=read.table(text = "st year    Veg Water   Soil    Crop    Grass
A 2005    100 200 80  130 70
B 2006    98  180 84  132 86
C 2007    93  175 79  127 106", sep="", header = TRUE)%>%as.data.frame()

I would like to add to the quality data.frame the land use values of that station even if for the same station st is repeated.

enter image description here

I tried some things but it didn't work

#I tryed cbind
join_df<-cbind(water_df, land_df)

#I tryed
library(purrr)
joind_df2<-purrr::reduce(water_df, land_df)

CodePudding user response:

This operation it commonly called a "join" or "merge". You need columns to join on, which means we need to extract the year from your date, and then it's a left_join command. See this FAQ for more information about joining data in R.

library(dplyr)
library(lubridate)

water_df %>%
  mutate(year = year(dmy(Date))) %>%
  left_join(land_df, by = c("st", "year"))
#    st       Date  OD   pH DBO year Veg Water Soil Crop Grass
# 1   A 01/07/2005 8.0  6.3 3.0 2005 100   200   80  130    70
# 2   A 02/06/2005 7.0  6.2 2.2 2005 100   200   80  130    70
# 3   A 01/01/2005 7.3  6.5 3.1 2005 100   200   80  130    70
# 4   A 03/05/2006 6.0  6.3 4.0 2006  NA    NA   NA   NA    NA
# 5   A 09/08/2006 6.8  7.1 1.1 2006  NA    NA   NA   NA    NA
# 6   A 12/12/2006 7.3  8.1 2.9 2006  NA    NA   NA   NA    NA
# 7   B 02/07/2005 6.8  5.4 2.6 2005  NA    NA   NA   NA    NA
# 8   B 03/06/2005 6.0  5.3 1.9 2005  NA    NA   NA   NA    NA
# 9   B 02/01/2005 6.2  5.5 2.6 2005  NA    NA   NA   NA    NA
# 10  B 04/05/2006 5.1  5.4 3.4 2006  98   180   84  132    86
# 11  B 10/08/2006 5.8  6.0 0.9 2006  98   180   84  132    86
# 12  B 13/12/2006 6.2  6.9 2.5 2006  98   180   84  132    86
# 13  C 20/12/2006 6.5  7.2 2.6 2006  NA    NA   NA   NA    NA
# 14  C 27/12/2006 6.8  7.6 2.7 2006  NA    NA   NA   NA    NA
# 15  C 03/01/2007 7.2  8.0 2.9 2007  93   175   79  127   106
# 16  C 10/01/2007 7.5  8.4 3.0 2007  93   175   79  127   106
# 17  C 17/01/2007 7.9  8.8 3.1 2007  93   175   79  127   106
# 18  C 24/01/2007 8.3  9.2 3.3 2007  93   175   79  127   106
# 19  C 31/01/2007 8.7  9.7 3.5 2007  93   175   79  127   106
# 20  C 07/02/2007 9.2 10.2 3.6 2007  93   175   79  127   106

There's some missing values where your land_df didn't have observations for a particular station in a particular year. Have a look at ?tidyr::fill and this FAQ if you want to fill those in with, e.g., the previous observation.

  • Related