Before getting to the question, I browsed StackOverflow and found similar related questions to my own, though I could not find something that is specific to my issue.
I recently used R to pull some satellite data from an API that provide monthly summary statistics based on locations derived from a shapefile. Unfortunately, I could not submit the entire shapefile to the API as the request was simply too large, resulting in having to split the shapefile into 200 row segments.
The data returned from the API identifies each location in the "location" column with values labeled "location0001", "location0002", "location0003", etc. The dataset looks as follows:
location date data.stats1 data.stats2
------------ ------------ ------------- --------------
location0001 2012-01-01 data data
location0001 2012-02-01 data data
location0001 2012-03-01 data data
location0001 2012-04-01 etc. etc.
location0001 2012-05-01
location0001 2012-06-01
location0001 2012-07-01
location0001 2012-08-01
location0001 2012-09-01
location0001 2012-10-01
location0001 2012-11-01
location0001 2012-12-01
location0002 2012-01-01
location0002 2012-02-01
location0002 2012-03-01
You can see there are 12 iterations of "location0001" for each month of data throughout the year, followed by "location0002" for the next set of monthly data. This continues up to "location0200." This is fine for the first dataset, however, for subsequent datasets, I need the values to continue past 200 (e.g. I need "location0201", "location0202", "location0203" etc.).
What I would like to do is update the values in the "location" column to start on "0201" all the way to "0400" so that it looks like this:
location date data.stats1 data.stats2
------------ ------------ ------------- --------------
location0201 2012-01-01 data data
location0201 2012-02-01 data data
location0201 2012-03-01 data data
location0201 2012-04-01 etc. etc.
location0201 2012-05-01
location0201 2012-06-01
location0201 2012-07-01
location0201 2012-08-01
location0201 2012-09-01
location0201 2012-10-01
location0201 2012-11-01
location0201 2012-12-01
location0202 2012-01-01
location0202 2012-02-01
location0202 2012-03-01
I have tried mutating a new column using a combination of paste0()
, seq()
, and str_pad()
:
my.dataset <- my.dataset %>%
group_by(location) %>%
mutate(location_2 = paste0('location', str_pad(seq(201,400), 4, side = 'left', pad = '0'))) %>%
ungroup()
This returns an error stating that "location_2 must be size 12 or 1, not 200" which seems to indicate that problem is occurring in the seq(201,400)
portion (length is 200).
Is there a way I can update these values by group to reflect the numeric sequence I need?
Thanks!
CodePudding user response:
One option with tidyverse
would be to separate
the number, then change the numbers (here I just add 200), then put the columns back together.
library(tidyverse)
df %>%
separate(location, c("location", "number"), "(?<=[A-Za-z])(?=[0-9])") %>%
mutate(number = paste0("0", as.numeric(number) 200)) %>%
unite("location", location:number, sep = "")
Output
location date
1 location0201 2012-01-01
2 location0201 2012-02-01
3 location0201 2012-03-01
4 location0201 2012-04-01
5 location0201 2012-05-01
6 location0201 2012-06-01
7 location0201 2012-07-01
8 location0201 2012-08-01
9 location0201 2012-09-01
10 location0201 2012-10-01
11 location0201 2012-11-01
12 location0201 2012-12-01
13 location0202 2012-01-01
14 location0202 2012-02-01
15 location0202 2012-03-01
Update
If you would like to turn this into a function and have a different starting number for each dataframe in a list, then we can use map2
to pass a vector of starting numbers.
sep_unite <- function(x, y) {
x %>%
separate(location, c("location", "number"), "(?<=[A-Za-z])(?=[0-9])") %>%
mutate(number = paste0("0", as.numeric(number) y)) %>%
unite("location", location:number, sep = "")
}
df.list <- list(df, df, df, df)
start_num <- seq(200, 800, 200)
output.list <- map2(.x = df.list, .y = start_num, .f = sep_unite)
Output
map(output.list, head, 2)
[[1]]
location date
1 location0201 2012-01-01
2 location0201 2012-02-01
[[2]]
location date
1 location0401 2012-01-01
2 location0401 2012-02-01
[[3]]
location date
1 location0601 2012-01-01
2 location0601 2012-02-01
[[4]]
location date
1 location0801 2012-01-01
2 location0801 2012-02-01
Data
df <- structure(list(location = c("location0001", "location0001", "location0001",
"location0001", "location0001", "location0001", "location0001",
"location0001", "location0001", "location0001", "location0001",
"location0001", "location0002", "location0002", "location0002"
), date = c("2012-01-01", "2012-02-01", "2012-03-01", "2012-04-01",
"2012-05-01", "2012-06-01", "2012-07-01", "2012-08-01", "2012-09-01",
"2012-10-01", "2012-11-01", "2012-12-01", "2012-01-01", "2012-02-01",
"2012-03-01")), class = "data.frame", row.names = c(NA, -15L))