I am trying to perform some analysis on the Online New Popularity dataset from the UCI Open Data Repo here: https://archive.ics.uci.edu/ml/datasets/online news popularity
The dataset has a set of 7 boolean attributes that denote the day of the week that the article was published on. For example, the column weekday_is_monday
will have the value 1
if the article was published on a Monday and so on. For my analysis, I am trying to merge these fields into a single field that contains the string literal of publishing day.
So I load this dataset then go through and replace each true value with the string literal:
news <- read.csv("path_to_my_dataset",
header=TRUE,
sep=",",
fill=F,
strip.white = T,
stringsAsFactors=FALSE)
news$weekday_is_monday <- gsub('^1', 'Monday', news$weekday_is_monday)
news$weekday_is_tuesday <- gsub('^1', 'Tuesday', news$weekday_is_tuesday)
news$weekday_is_wednesday <- gsub('^1', 'Wednesday', news$weekday_is_wednesday)
news$weekday_is_thursday <- gsub('^1', 'Thusday', news$weekday_is_thursday)
news$weekday_is_friday <- gsub('^1', 'Friday', news$weekday_is_friday)
news$weekday_is_saturday <- gsub('^1', 'Saturday', news$weekday_is_saturday)
news$weekday_is_sunday <- gsub('^1', 'Sunday', news$weekday_is_sunday)
Next I found a solution in this thread that used the dpyler::coalesce
function to merge all the fields. I adapted this to my dataset as follows:
news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
mutate(news, publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)
However, this only merges the fields from the first column (i.e. Monday):
0 Monday
32983 6661
Where am I going wrong here?
CodePudding user response:
In the pipe operation, do not repeat inputting the data on the left hand side of mutate
into the right hand side of it. That is the cause of your problem. You just need to remove news
in mutate(news, publishing_day = ...)
news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
mutate(publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)
# Friday Monday Saturday Sunday Thusday Tuesday Wednesday
# 5701 6661 2453 2737 7267 7390 7435
CodePudding user response:
Here's one technique that uses reshaping pivot_longer
, then gsub
the literal out of the weekday_
column, then joining it back in.
quux <- read.csv("OnlineNewsPopularity.csv") # from your link
library(dplyr)
library(tidyr) # pivot_longer
quux2 <- quux %>%
select(url, starts_with("weekday_is")) %>%
pivot_longer(-url) %>%
dplyr::filter(value > 0) %>%
mutate(weekday = gsub("weekday_is_", "", name)) %>%
left_join(quux, by = "url") %>%
select(-name, -starts_with("weekday_is_"))
quux2
# # A tibble: 39,644 x 56
# url value weekday timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_uniq~ num_hrefs
# <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 http://mash~ 1 monday 731 12 219 0.664 1.00 0.815 4
# 2 http://mash~ 1 monday 731 9 255 0.605 1.00 0.792 3
# 3 http://mash~ 1 monday 731 9 211 0.575 1.00 0.664 3
# 4 http://mash~ 1 monday 731 9 531 0.504 1.00 0.666 9
# 5 http://mash~ 1 monday 731 13 1072 0.416 1.00 0.541 19
# 6 http://mash~ 1 monday 731 10 370 0.560 1.00 0.698 2
# 7 http://mash~ 1 monday 731 8 960 0.418 1.00 0.550 21
# 8 http://mash~ 1 monday 731 12 989 0.434 1.00 0.572 20
# 9 http://mash~ 1 monday 731 11 97 0.670 1.00 0.837 2
# 10 http://mash~ 1 monday 731 10 231 0.636 1.00 0.797 4
# # ... with 39,634 more rows, and 46 more variables: num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>,
# # average_token_length <dbl>, num_keywords <dbl>, data_channel_is_lifestyle <dbl>, data_channel_is_entertainment <dbl>,
# # data_channel_is_bus <dbl>, data_channel_is_socmed <dbl>, data_channel_is_tech <dbl>, data_channel_is_world <dbl>,
# # kw_min_min <dbl>, kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
# # kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
# # self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>, LDA_04 <dbl>,
# # global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>, ...
Proof of contents:
table(quux2$weekday)
# friday monday saturday sunday thursday tuesday wednesday
# 5701 6661 2453 2737 7267 7390 7435
You might consider converting that into a factor
if you intend to ever arrange
on the weekday
, since otherwise it'll sort them lexicographically (as shown above).
... %>%
mutate(weekday = factor(weekday, levels = c("monday", "tuesday", "wednesday", ..., "sunday")))
FYI, the only reason I assigned the pipe from quux
into a new variable quux2
was so that during your testing and evaluation of this, you would not inadvertently irreversibly overwrite your main dataset. Feel free to overwrite back onto itself.