Home > Back-end >  How to merge data frame columns based on value in R
How to merge data frame columns based on value in R

Time:03-13

I am trying to perform some analysis on the Online New Popularity dataset from the UCI Open Data Repo here: https://archive.ics.uci.edu/ml/datasets/online news popularity

The dataset has a set of 7 boolean attributes that denote the day of the week that the article was published on. For example, the column weekday_is_monday will have the value 1 if the article was published on a Monday and so on. For my analysis, I am trying to merge these fields into a single field that contains the string literal of publishing day.

So I load this dataset then go through and replace each true value with the string literal:

news <- read.csv("path_to_my_dataset",
                 header=TRUE,
                 sep=",",
                 fill=F,
                 strip.white = T,
                 stringsAsFactors=FALSE)
news$weekday_is_monday <- gsub('^1', 'Monday', news$weekday_is_monday)
news$weekday_is_tuesday <- gsub('^1', 'Tuesday', news$weekday_is_tuesday)
news$weekday_is_wednesday <- gsub('^1', 'Wednesday', news$weekday_is_wednesday)
news$weekday_is_thursday <- gsub('^1', 'Thusday', news$weekday_is_thursday)
news$weekday_is_friday <- gsub('^1', 'Friday', news$weekday_is_friday)
news$weekday_is_saturday <- gsub('^1', 'Saturday', news$weekday_is_saturday)
news$weekday_is_sunday <- gsub('^1', 'Sunday', news$weekday_is_sunday)

Next I found a solution in this thread that used the dpyler::coalesce function to merge all the fields. I adapted this to my dataset as follows:

news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
  mutate(news, publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
                                   weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)

However, this only merges the fields from the first column (i.e. Monday):

     0 Monday 
 32983   6661 

Where am I going wrong here?

CodePudding user response:

In the pipe operation, do not repeat inputting the data on the left hand side of mutate into the right hand side of it. That is the cause of your problem. You just need to remove news in mutate(news, publishing_day = ...)

news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
  mutate(publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
                                         weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)
#  Friday    Monday  Saturday    Sunday   Thusday   Tuesday Wednesday 
#    5701      6661      2453      2737      7267      7390      7435 

CodePudding user response:

Here's one technique that uses reshaping pivot_longer, then gsub the literal out of the weekday_ column, then joining it back in.

quux <- read.csv("OnlineNewsPopularity.csv") # from your link

library(dplyr)
library(tidyr) # pivot_longer
quux2 <- quux %>%
  select(url, starts_with("weekday_is")) %>%
  pivot_longer(-url) %>%
  dplyr::filter(value > 0) %>%
  mutate(weekday = gsub("weekday_is_", "", name)) %>%
  left_join(quux, by = "url") %>%
  select(-name, -starts_with("weekday_is_"))
quux2
# # A tibble: 39,644 x 56
#    url          value weekday timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_uniq~ num_hrefs
#    <chr>        <dbl> <chr>       <dbl>          <dbl>            <dbl>           <dbl>            <dbl>            <dbl>     <dbl>
#  1 http://mash~     1 monday        731             12              219           0.664             1.00            0.815         4
#  2 http://mash~     1 monday        731              9              255           0.605             1.00            0.792         3
#  3 http://mash~     1 monday        731              9              211           0.575             1.00            0.664         3
#  4 http://mash~     1 monday        731              9              531           0.504             1.00            0.666         9
#  5 http://mash~     1 monday        731             13             1072           0.416             1.00            0.541        19
#  6 http://mash~     1 monday        731             10              370           0.560             1.00            0.698         2
#  7 http://mash~     1 monday        731              8              960           0.418             1.00            0.550        21
#  8 http://mash~     1 monday        731             12              989           0.434             1.00            0.572        20
#  9 http://mash~     1 monday        731             11               97           0.670             1.00            0.837         2
# 10 http://mash~     1 monday        731             10              231           0.636             1.00            0.797         4
# # ... with 39,634 more rows, and 46 more variables: num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>,
# #   average_token_length <dbl>, num_keywords <dbl>, data_channel_is_lifestyle <dbl>, data_channel_is_entertainment <dbl>,
# #   data_channel_is_bus <dbl>, data_channel_is_socmed <dbl>, data_channel_is_tech <dbl>, data_channel_is_world <dbl>,
# #   kw_min_min <dbl>, kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
# #   kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
# #   self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>, LDA_04 <dbl>,
# #   global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>, ...

Proof of contents:

table(quux2$weekday)
#    friday    monday  saturday    sunday  thursday   tuesday wednesday 
#      5701      6661      2453      2737      7267      7390      7435 

You might consider converting that into a factor if you intend to ever arrange on the weekday, since otherwise it'll sort them lexicographically (as shown above).

... %>%
  mutate(weekday = factor(weekday, levels = c("monday", "tuesday", "wednesday", ..., "sunday")))

FYI, the only reason I assigned the pipe from quux into a new variable quux2 was so that during your testing and evaluation of this, you would not inadvertently irreversibly overwrite your main dataset. Feel free to overwrite back onto itself.

  • Related