Combine datasets-CodePudding

I'm supposed to combine all four datasets from https://www.drroyspencer.com/latest-global-temperatures/ into one dataframe. I've tried to merge it without luck. Any help?

What I've tried so far.

# Lower troposphere data.
lower_trop <- fread("https://www.nsstc.uah.edu/data/msu/v6.0/tlt/uahncdc_lt_6.0.txt", sep = " ")

# Mid-Troposphere data.
mid_trop <- fread("https://www.nsstc.uah.edu/data/msu/v6.0/tmt/uahncdc_mt_6.0.txt", sep = " ")

# Tropopause data.
trop <- fread("https://www.nsstc.uah.edu/data/msu/v6.0/ttp/uahncdc_tp_6.0.txt", sep = " ")

# Lower Stratosphere data.
lower_strat <- fread("https://www.nsstc.uah.edu/data/msu/v6.0/tls/uahncdc_ls_6.0.txt" , sep = " ")

# 
lower_trop1 <- lower_trop[-c(518)]
mid_trop1 <- Mid_trop[-c(518)]
trop1 <- trop[-c(518)]
lower_strat1 <- lower_strat[-c(518)]

# 
lower_trop <- as.data.frame(apply(lower_trop1, 2, as.numeric))
mid_trop <- as.data.frame(apply(mid_trop1, 2, as.numeric))
trop_paus <- as.data.frame((apply(trop1, 2, as.numeric)))
lower_strat <- as.data.frame(apply(lower_strat1, 2, as.numeric))

all<- rbind(lower_trop,mid_trop,trop_paus,lower_strat)

CodePudding user response：

Don't ignore warnings. Your fread calls are warning

lower_trop <- fread("https://www.nsstc.uah.edu/data/msu/v6.0/tlt/uahncdc_lt_6.0.txt", sep = " ")
#  [100%] Downloaded 89265 bytes...
# Warning in fread("https://www.nsstc.uah.edu/data/msu/v6.0/tlt/uahncdc_lt_6.0.txt",  :
#   Stopped early on line 520. Expected 29 fields but found 0. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<Trend    0.14  0.18  0.12  0.16  0.19  0.14  0.11  0.16  0.10  0.12  0.16  0.11  0.18  0.21  0.17  0.10  0.15  0.09  0.25  0.23  0.27  0.01  0.09 -0.02  0.18  0.18  0.18>>

This is because those files have trailing material that are not "proper CSV" (as far as R cares, that is):

 Year Mo Globe  Land Ocean   NH   Land Ocean   SH   Land Ocean Trpcs  Land Ocean NoExt  Land Ocean SoExt  Land Ocean NoPol  Land Ocean SoPol  Land Ocean USA48 USA49  AUST
 1978 12 -0.48 -0.51 -0.47 -0.44 -0.46 -0.42 -0.52 -0.62 -0.50 -0.60 -0.62 -0.59 -0.37 -0.44 -0.30 -0.46 -0.55 -0.45 -0.39 -0.68 -0.06 -0.45 -0.38 -0.49 -1.29 -1.15 -1.29
 1979  1 -0.47 -0.64 -0.41 -0.64 -0.86 -0.50 -0.31 -0.13 -0.34 -0.47 -0.54 -0.45 -0.73 -0.93 -0.55 -0.23  0.12 -0.29 -0.46 -0.95  0.10 -0.16 -0.15 -0.16 -3.22 -2.42  0.92
...snip...
 2021 11  0.08  0.22  0.03  0.11  0.32 -0.02  0.06 -0.00  0.07  0.14  0.35  0.08  0.09  0.31 -0.10  0.02 -0.24  0.06 -0.43 -0.19 -0.70 -0.72 -0.67 -0.75  0.50 -0.06 -0.29
 2021 12  0.21  0.17  0.22  0.27  0.19  0.32  0.15  0.12  0.15  0.03  0.30 -0.04  0.40  0.19  0.59  0.20 -0.07  0.25  0.01 -0.25  0.31 -0.42 -0.24 -0.51  1.63  1.42 -0.06
 Year Mo Globe  Land Ocean   NH   Land Ocean   SH   Land Ocean Trpcs  Land Ocean NoExt  Land Ocean SoExt  Land Ocean NoPol  Land Ocean SoPol  Land Ocean USA48 USA49  AUST

 Trend    0.14  0.18  0.12  0.16  0.19  0.14  0.11  0.16  0.10  0.12  0.16  0.11  0.18  0.21  0.17  0.10  0.15  0.09  0.25  0.23  0.27  0.01  0.09 -0.02  0.18  0.18  0.18

  NOTE:  Version 6.0 as of April 2015
  NOTE:  New Reference for annual cycle 1991-2020
  NOTE:  Version 5.6 as of Jun 2013
  NOTE:  3/13 More accurate grid weighting for USA48
  NOTE:  3/13 Alaska added in USA49, Australia added

  GL 90S-90N, NH 0-90N, SH 90S-0, TRPCS 20S-20N
  NoExt 20N-90N, SoExt 90S-20S, NoPol 60N-90N, SoPol 90S-60S

There is no easy way to just "ignore" those lines, so I suggest you be very clear about it: state that that warning is expected and not a concern here. Your future self (and colleagues) will thank you. You can use this to silence just that warning (and allow fread to continue normal operations):

lower_trop <- withCallingHandlers(
    fread("https://www.nsstc.uah.edu/data/msu/v6.0/tlt/uahncdc_lt_6.0.txt", sep = " "),
    warning = function(w) {
      if (grepl("Stopped early on line", as.character(w))) 
        invokeRestart("muffleWarning")
    })

R dislikes non-unique column names. I don't know anything about the data, but it appears that Land and Ocean are grouped by the preceding columns Globe, NH, SH, etc. Let's rename them to ensure they are unique and unambiguous (not just numbers):

names(lower_trop1)[lands] <- paste0(names(lower_trop1)[lands-1], "_", names(lower_trop1)[lands])
names(lower_trop1)[lands 1] <- paste0(names(lower_trop1)[lands-1], "_", names(lower_trop1)[lands 1])
head(lower_trop1)
#      Year     Mo  Globe Globe_Land Globe_Ocean     NH NH_Land NH_Ocean     SH SH_Land SH_Ocean  Trpcs Trpcs_Land
#    <char> <char> <char>     <char>      <char> <char>  <char>   <char> <char>  <char>   <char> <char>     <char>
# 1:   1978     12  -0.48      -0.51       -0.47  -0.44   -0.46    -0.42  -0.52   -0.62    -0.50  -0.60      -0.62
# 2:   1979      1  -0.47      -0.64       -0.41  -0.64   -0.86    -0.50  -0.31   -0.13    -0.34  -0.47      -0.54
# 3:   1979      2  -0.43      -0.56       -0.39  -0.47   -0.57    -0.41  -0.39   -0.53    -0.37  -0.36      -0.25
# 4:   1979      3  -0.38      -0.51       -0.33  -0.46   -0.51    -0.44  -0.30   -0.53    -0.26  -0.36      -0.43
# 5:   1979      4  -0.40      -0.57       -0.34  -0.47   -0.62    -0.37  -0.34   -0.46    -0.31  -0.35      -0.37
# 6:   1979      5  -0.40      -0.56       -0.33  -0.52   -0.54    -0.52  -0.27   -0.62    -0.19  -0.46      -0.55
# 16 variables not shown: [Trpcs_Ocean <char>, NoExt <char>, NoExt_Land <char>, NoExt_Ocean <char>, SoExt <char>, SoExt_Land <char>, SoExt_Ocean <char>, NoPol <char>, NoPol_Land <char>, NoPol_Ocean <char>, ...]

(Almost) Never use apply on a data.frame. Okay, almost never, there are legitimate times to do it. However, by default, apply is going to return a matrix when you really need it to remain a data.frame (or at least list-like). Use lapply instead:
```
lower_trop1[] <- lapply(lower_trop1, as.numeric)
```
Two tricks to know about this and how it works with frames:
1. A data.frame is just a glorified list with some added conditions: all elements should be named uniquely; all elements (typically) must be the same length; it may have row-names, some tools use/honor/keep them, some tools (dplyr) ignore or actively remove them. Oh, and the class is slightly different. But most things that work on lists work just as well on data.frames, column-wise.
2. Even though we're working on columns of a frame, in one sense we're just working on elements of a list, and we want it to stay list-like, so we should be using lapply.
3. Unfortunately, lapply is going to return a list not a data.frame, so the reassignment into lower_trop1[] <- ... says to replace the contents of lower_trop1 without changing its class. There is likely more that can be said here, but know that frame[] <- lapply(frame, ...) preserves the data.frame properties we need while allowing very simple lapply-operations.

From here, rbind should work as intended.

FYI, you can combine your middle two steps with:

lower_trop1 <- lower_trop[-518,][, lapply(.SD, as.numeric)]

though you still need to do the renaming for unique columns names.

CodePudding user response：

I guess you want some sort of join (so joining the columns from the 4 datasets). Here some information on the different join options in R: https://dplyr.tidyverse.org/reference/mutate-joins.html

Now in your case, you have the Year and the Mo columns that appear in your different datasets. So you could join the datasets by these columns. Here an example of a left join (just for the first two dataframes):

First, make sure that you have unique column names in each single dataframe (you do currently not have unique names, so this step is necessary):

names(lower_trop) <- make.unique(names(lower_trop), sep="_")
names(mid_trop) <- make.unique(names(lower_trop), sep="_")

Then you join these with a left join (so you keep all the rows from the left dataset):

new_df <- left_join(lower_trop, mid_trop, by = c("Year", "Mo"), suffix = c(".lower_trop", ".mid_trop"))