How to merge duplicate rows in R

Time:04-04

I am new to R and very stuck on a problem which I've tried to solve in various ways.

I have data I want to plot to a graph that shows twitter engagements per day.

To do this, I need to merge all the rows that share the same 'created at' value, so that there is only one row per date and each date has its 'total engagements' assigned to it.

This is the data:


So far, I've tried to do this, but can't seem to get the grouping to work.

I mutated the data to get a new 'total engage' column:

lgbthm_data_2 <- lgbthm_data %>%
  mutate(
    total_engage = favorite_count + retweet_count
  )

Then I've tried to merge the dates:

only_one_date <- lgbthm_data_2 %>%
  group_by(created_at) %>%
 summarise_all(na.omit)

But no idea!

Any help would be great

Thanks

CodePudding user response:

You are looking for:

library(dplyr)
only_one_date <- lgbthm_data_2 %>%
  group_by(created_at) %>%
  summarise(n = n())

And there is even a shorthand for this in dplyr:

only_one_date <- lgbthm_data_2 %>%
  count(created_at)

group_by + summarise can be used for anything that reduces all values in a group to a single value, for example the mean, max or min of a column. Here I think you simply want to know how many rows each group has, i.e., how many tweets were created on each day. The special function n() tells you exactly that.
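To illustrate the "many things" that group_by + summarise can do, here is a minimal sketch that computes several per-day summaries at once. The data frame tweets and its values are made up for illustration; only the column names follow the question.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the question, values are invented.
tweets <- data.frame(
  created_at     = as.Date(c("2022-02-01", "2022-02-01", "2022-02-02")),
  favorite_count = c(0, 1, 2)
)

# One row per date, with several summaries computed for each group.
per_day <- tweets %>%
  group_by(created_at) %>%
  summarise(
    n_tweets  = n(),                  # number of rows (tweets) in the group
    mean_favs = mean(favorite_count), # average favourites per tweet
    max_favs  = max(favorite_count)   # most favourites on a single tweet
  )

per_day
```

Each argument to summarise() becomes one column in the result, so a sum such as sum(favorite_count) can be added in the same call if a total is what you need.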

From experience with Twitter, I also know that the column created_at is usually a date-time (timestamp), not a plain date. In this case, it makes sense to use count(day = as.Date(created_at)) to convert it to a date first.
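If created_at really is a timestamp and the goal is total engagements per day rather than a tweet count, the date conversion can be combined with a sum. The following is a sketch under those assumptions: the toy data, the treatment of NA as zero via coalesce(), and the names day and daily are all illustrative choices, not something taken from the asker's data.

```r
library(dplyr)

# Hypothetical toy data with timestamps instead of plain dates.
tweets <- data.frame(
  created_at = as.POSIXct(
    c("2022-02-01 09:15:00", "2022-02-01 18:40:00", "2022-02-02 07:05:00"),
    tz = "UTC"
  ),
  favorite_count = c(0, 1, 2),
  retweet_count  = c(2, 3, NA)
)

daily <- tweets %>%
  mutate(
    day = as.Date(created_at),  # drop the time of day
    # Assumption: a missing count is treated as 0 engagements.
    total_engage = coalesce(favorite_count, 0) + coalesce(retweet_count, 0)
  ) %>%
  group_by(day) %>%
  summarise(
    total_engage = sum(total_engage),  # engagements per day
    n_tweets     = n()                 # tweets per day
  )

daily
```

Note that as.Date() on a timestamp depends on the time zone, so tweets close to midnight may land on a different day than expected if the zone is not what you assume.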

CodePudding user response:

library(tidyverse)

data <- tribble(
  ~created_at, ~favorite_count, ~retweet_count,
  "2022-02-01", 0, 2,
  "2022-02-01", 1, 3,
  "2022-02-02", 2, NA
)

summary_data <-
  data %>%
  type_convert() %>%
  group_by(created_at) %>%
  summarise(total_engage = sum(favorite_count, retweet_count, na.rm = TRUE))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   created_at = col_date(format = "")
#> )
summary_data
#> # A tibble: 2 × 2
#>   created_at total_engage
#>   <date>            <dbl>
#> 1 2022-02-01            6
#> 2 2022-02-02            2

qplot(created_at, total_engage, geom = "col", data = summary_data)

Created on 2022-04-04 by the reprex package (v2.0.0)
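As a side note, qplot() has been deprecated in more recent ggplot2 releases, so the same column chart may be better written with ggplot() directly. A minimal sketch, with summary_data rebuilt by hand here so the snippet runs on its own, and with axis labels that are only a suggestion:

```r
library(ggplot2)

# Same values as the summary_data printed above, recreated for a standalone example.
summary_data <- data.frame(
  created_at   = as.Date(c("2022-02-01", "2022-02-02")),
  total_engage = c(6, 2)
)

# One bar per date, bar height = total engagements on that date.
p <- ggplot(summary_data, aes(x = created_at, y = total_engage)) +
  geom_col() +
  labs(x = "Date", y = "Total engagements")

print(p)
```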
