Using the file `posts.csv` in the data folder of this repo (the sample of 10,000 public Facebook posts by members of the US Congress from 2017), solve the following with `dplyr`:

- Do not consider posts with zero likes.
- Compute the comment-to-like ratio (i.e. `comments_count / likes_count`) for each post and store it in a column `clr`.
- For each `screen_name`, compute `normaliser_based_on_even_months = max(clr) - min(clr)`, i.e. the maximum minus the minimum `clr` value of posts by that `screen_name`, however only taking into account the posts made in even months (February, April, June, August, October, December) when computing `max(clr) - min(clr)` for each `screen_name`.
- Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them.
- Afterwards create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than those with zero likes, which were deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name. The only exception are posts from screen names that had a `normaliser_based_on_even_months` value of zero and were deleted/set to NA before -- for these posts just set the value in `normalised_clr` to NA as well, or drop the post from the final data frame. In other words, the value of a single post/line `i` (written by a politician `p`) in that `normalised_clr` column can be computed as `normalised_clr_{i,p} = clr_{i} / normaliser_based_on_even_months_{p}` for all observations for which there is a non-NA `normaliser_based_on_even_months` (no need to use a loop for this; `dplyr` allows you to compute it in a vectorised way).
- Keep only those rows with `normalised_clr > 0`.
- Arrange the data frame according to `normalised_clr` in ascending order.
- Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`.
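To make sure I read the formula right, here is a toy sketch with made-up numbers (invented purely for illustration, not values from `posts.csv`): if a `screen_name`'s even-month posts have `clr` values 0.10 and 0.50, its `normaliser_based_on_even_months` is 0.50 - 0.10 = 0.40, and the `clr` of every post by that `screen_name` gets divided by 0.40.

library(dplyr)
# Toy data, invented for illustration only (not from posts.csv)
toy <- tibble(
  screen_name = "example_rep",
  month       = c(2, 4, 5),            # two even-month posts, one odd-month post
  clr         = c(0.10, 0.50, 0.20)
)
toy %>%
  group_by(screen_name) %>%
  mutate(
    normaliser_based_on_even_months = max(clr[month %% 2 == 0]) - min(clr[month %% 2 == 0]),
    normalised_clr = clr / normaliser_based_on_even_months
  )
# normalised_clr: 0.25, 1.25, 0.50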
I have written an R program for this. However, my task is to convert that into a SINGLE SQLite query. Here's what I've done:
library(DBI)
library(RSQLite)

# Create / connect to the SQLite database
posts_db <- dbConnect(RSQLite::SQLite(), "C:/Users/HP/Documents/posts.sqlite")

# Read the CSV into R and write it into the database as table "posts"
data <- read.csv("C:/Users/HP/Documents/posts.csv",
                 stringsAsFactors = FALSE)
dbWriteTable(posts_db, "posts", data, overwrite = TRUE)
dbListFields(posts_db, "posts")
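As a quick sanity check that the import worked (just a row count; I expect 10,000 from the sample):

# Should match the number of rows in posts.csv
dbGetQuery(posts_db, "SELECT COUNT(*) AS n_posts FROM posts")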
dbGetQuery(posts_db,"WITH
cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr,
strftime('%m', date) AS month FROM posts WHERE likes_count>0),
cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
FROM cte1
WHERE month % 2 = 0
GROUP BY screen_name),
cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
clr/normaliser_based_on_even_months AS normalized_clr FROM cte1, cte2
WHERE normaliser_based_on_even_months>0)
SELECT screen_name, normalized_clr FROM cte3
WHERE normalized_clr>0
ORDER BY normalized_clr")
Here's the output:
screen_name normalized_clr
1 repmarkpocan 0.0002546821
2 repmarkpocan 0.0002690018
3 CongressmanRalphAbraham 0.0002756995
4 CongressmanRalphAbraham 0.0002912010
5 derek.kilmer 0.0003549631
6 RepJimMcGovern 0.0003664136
7 CongresswomanNorton 0.0003687929
8 derek.kilmer 0.0003749212
9 RepJimMcGovern 0.0003870155
10 CongresswomanNorton 0.0003895286
For reference, here is the R code that produces the result I want the SQLite query to produce:
library(dplyr)
library(lubridate)

posts <- read.csv("C:/Users/HP/Documents/posts.csv")

# Remove rows with zero likes
posts <- posts %>% filter(likes_count > 0)

# Create `clr`, the comment-to-like ratio
posts <- posts %>% mutate(clr = comments_count / likes_count)

# Compute max(clr) - min(clr) per `screen_name`, only taking into account posts made in
# even months (February, April, June, August, October, December). Code adapted from
# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
posts$date <- ymd(posts$date)
posts$date <- month(posts$date)  # replace the date with its month number (1-12)
posts_normaliser <- posts %>%
  group_by(screen_name) %>%
  mutate(normaliser_based_on_even_months = case_when(date %% 2 == 0 ~ (max(clr) - min(clr))))

# Set all `normaliser_based_on_even_months` that are zero (or NA) to NA / delete them
posts_normaliser <- posts_normaliser %>% filter(normaliser_based_on_even_months > 0)

# Create `normalised_clr`: the `clr` of each post divided by the
# `normaliser_based_on_even_months` of the associated screen name
merged_df <- merge(posts, posts_normaliser)
merged_df <- merged_df %>% group_by(screen_name) %>% mutate(normalised_clr = clr / normaliser_based_on_even_months)

# Keep only those rows with `normalised_clr` > 0
merged_df <- merged_df %>% filter(normalised_clr > 0)

# Arrange the data frame according to `normalised_clr` in ascending order
merged_df <- merged_df %>% arrange(normalised_clr)

# Print `screen_name` and `normalised_clr` for the posts with the 10 lowest `normalised_clr`
merged_df[1:10, c("screen_name", "normalised_clr")]
Here's the output from R:
> merged_df[1:10, c("screen_name", "normalised_clr")]
# A tibble: 10 × 2
# Groups: screen_name [5]
screen_name normalised_clr
<chr> <dbl>
1 CongresswomanSheilaJacksonLee 0.00214
2 CongresswomanSheilaJacksonLee 0.00218
3 CongresswomanSheilaJacksonLee 0.00277
4 RepMullin 0.00342
5 SenDuckworth 0.00342
6 CongresswomanSheilaJacksonLee 0.00357
7 replahood 0.00477
8 SenDuckworth 0.00488
9 SenDuckworth 0.00505
10 RepSmucker 0.00516
I keep going through the statements but I can't figure out why the outputs are different.
CodePudding user response:
If you look at the output from your SQL query, it results in 4,543,128 rows out of the original 10,000, indicating that your join is exploding into a cross product. You need to join on something; since all of the tables you're joining are either (a) original rows or (b) summarized with a unique `screen_name`, I think we can use that column.
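If you want to see the blow-up for yourself, you can count the rows that cross join produces instead of selecting them (a sketch against your own CTEs; the exact number depends on your data, but it will be roughly the cte1 row count times the number of cte2 groups):

dbGetQuery(posts_db, "WITH
  cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr,
                  strftime('%m', date) AS month
           FROM posts WHERE likes_count > 0),
  cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
           FROM cte1
           WHERE month % 2 = 0
           GROUP BY screen_name)
  -- every row of cte1 is paired with every row of cte2
  SELECT COUNT(*) AS n FROM cte1, cte2
  WHERE normaliser_based_on_even_months > 0")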
I modified your query to do three things:

- add `screen_name` to `cte2`;
- update the join inside `cte3`;
- add `LIMIT 10`, since you said "all into a single query" and "lowest 10" is part of the constraints.
WITH
cte1 AS (
SELECT screen_name, comments_count*1.0/likes_count AS clr,
strftime('%m', date) AS month
FROM posts WHERE likes_count > 0
),
cte2 AS (
SELECT screen_name, (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
FROM cte1
WHERE month % 2 = 0
GROUP BY screen_name
),
cte3 AS (
SELECT cte1.screen_name, clr, normaliser_based_on_even_months,
clr/normaliser_based_on_even_months AS normalized_clr
FROM cte1
LEFT JOIN cte2 on cte1.screen_name = cte2.screen_name
WHERE normaliser_based_on_even_months > 0
)
SELECT screen_name, normalized_clr
FROM cte3
WHERE normalized_clr > 0
ORDER BY normalized_clr
LIMIT 10
This produces:
screen_name normalized_clr
1 SenDuckworth 0.002318900
2 RepMullin 0.003415301
3 SenDuckworth 0.003425708
4 repmikecoffman 0.003861004
5 SenDuckworth 0.004173566
6 SenDuckworth 0.004880627
7 SenDuckworth 0.005035819
8 SenDuckworth 0.005051889
9 SenDuckworth 0.005112120
10 RepBost 0.005600815