I'm working in R and have data on station names and coordinates (as well as the number of trips to respective stations). There are approximately 1,600 distinct station names, but some stations have many rows with slight variations in coordinates associated with them, resulting in thousands of rows showing with the same station with slightly different coordinates,(and only one trip taken when there are in fact many trips to that station).
I've attached a picture of the data to hopefully help convey my issue.screenshot of data frame in R
My end goal is to create a map visualization (which requires the coordinates data), and within that visual I want to display the number of trips to each distinct station. To do that accurately it seems I should create a data frame that has all the distinct station names along with only one set of coordinates for each station.
I'm just having a hard time figuring out how to do that. I already have a data frame with just the distinct station names, but I need one set of coordinates for each station to properly make the map visualization.
My apologies for not being able to ask more clearly, but any help would be greatly appreciated.
dput(head(popular_starting_routes))
structure(list(member_casual = c("casual", "casual", "casual",
"casual", "casual", "casual"), start_station_name = c("Streeter Dr & Grand Ave",
"Theater on the Lake", "Millennium Park", "Michigan Ave & Oak St",
"Lake Shore Dr & Monroe St", "Wells St & Concord Ln"), start_lat = c(41.892278,
41.926277, 41.8810317, 41.90096039, 41.880958, 41.912133), start_lng = c(-87.612043,
-87.630834, -87.62408432, -87.62377664, -87.616743, -87.634656
), number_of_rides = c(54377L, 17621L, 17004L, 16928L, 16853L,
13941L), average_duration = c(40.9590884013462, 35.1187673798309,
40.4876038971222, 36.4537137681159, 44.4814147827291, 21.9252014441814
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), groups = structure(list(member_casual = c("casual", "casual",
"casual", "casual", "casual", "casual"), start_station_name = c("Lake Shore Dr & Monroe St",
"Michigan Ave & Oak St", "Millennium Park", "Streeter Dr & Grand Ave",
"Theater on the Lake", "Wells St & Concord Ln"), start_lat = c(41.880958,
41.90096039, 41.8810317, 41.892278, 41.926277, 41.912133), .rows = structure(list(
5L, 4L, 3L, 1L, 2L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), .drop = TRUE))
CodePudding user response:
Try this:
library(dplyr)
df <- top_50000_popular_ending_routes %>% dplyr::select(end_station_name, end_lat, end_lng) %>% group_by(end_station_name) %>% slice_head(n=1)
CodePudding user response:
You could do something like this:
This will give you the most common end_lat
and end_Ing
per station.
library(dplyr)
df %>%
group_by(end_station_name) %>%
count(end_station_name, end_lat, end_Ing, sort = TRUE) %>%
slice_max(n)
end_station_name end_lat end_Ing n
<chr> <dbl> <dbl> <int>
1 Adler Planetarium 41.9 -87.6 5
data:
df <- structure(list(member_casual = c("casual", "casual", "casual",
"casual", "casual", "casual", "casual", "casual", "casual", "casual",
"casual", "casual", "casual", "casual", "casual", "casual", "casual",
"casual", "casual", "casual", "casual", "casual", "casual", "casual",
"casual", "casual", "casual", "casual", "casual", "casual"),
end_station_name = c("Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium", "Adler Planetarium", "Adler Planetarium",
"Adler Planetarium"), end_lat = c(41.86607, 41.86607, 41.86607,
41.86607, 41.86607, 41.86607, 41.86607, 41.86607, 41.86607,
41.86607, 41.86607, 41.86607, 41.86608, 41.86608, 41.86608,
41.86608, 41.86608, 41.86608, 41.86608, 41.86608, 41.86608,
41.86608, 41.86608, 41.86608, 41.86608, 41.86608, 41.86608,
41.86608, 41.86608, 41.86608), end_Ing = c(-87.60737, -87.60738,
-87.60738, -87.60742, -87.60743, -87.60731, -87.60739, -87.60738,
-87.60736, -87.6074, -87.60738, -87.60738, -87.60726, -87.60734,
-87.60732, -87.60736, -87.60734, -87.6074, -87.60736, -87.60732,
-87.60732, -87.60617, -87.60742, -87.60734, -87.60737, -87.60736,
-87.60735, -87.60744, -87.60727, -87.60734), number_of_rides = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
average_duration = c(28.866667, 5, 34.55, 9.766667, 28.883333,
19.65, 24.883333, 23.533333, 51.183333, 46.983333, 38.366667,
1.066667, 27.483333, 15.783333, 56.25, 19.016667, 1.083333,
9.683333, 32, 13.95, 32.25, 31.116667, 12.783333, 55.3, 10.733333,
1.45, 59.066667, 51.883333, 49.666667, 113.033333)), class = "data.frame", row.names = c(NA,
-30L))