Can't write an SQLite Query


Using the file posts.csv in the data folder of this repo (a sample of 10,000 public Facebook posts by members of the US Congress from 2017), solve the following with dplyr:

  • Do not consider posts with zero likes

  • Compute the comment to like ratio (i.e. comments_count / likes_count) for each post and store it in a column clr

  • For each screen_name, compute normaliser_based_on_even_months = max(clr) - min(clr), i.e. the maximum minus the minimum clr value of posts by that screen_name, taking into account only the posts made in even months (February, April, June, August, October, December) when computing max(clr) - min(clr) for each screen_name

  • Set all normaliser_based_on_even_months that have a value of zero to NA or delete them

  • Afterwards, create a column normalised_clr which stores the clr of all posts from the original data frame (other than those with zero likes, which were deleted in the first step) divided by the normaliser_based_on_even_months of the associated screen_name. The only exceptions are posts from screen names whose normaliser_based_on_even_months was zero and was deleted/set to NA before -- for these posts, set the value in normalised_clr to NA as well, or drop them from the final data frame.

  • In other words, the value for a single post/line i (written by a politician p) in that normalised_clr column can be computed as normalised_clr_{i,p} = clr_{i} / normaliser_based_on_even_months_{p} for all observations with a non-NA normaliser_based_on_even_months (no need to use a loop for this; dplyr can compute it in a vectorised way)

  • Keep only those rows with normalised_clr > 0

  • Arrange the data frame according to normalised_clr in ascending order

  • Print out only screen_name and normalised_clr for the first 10 rows, i.e. the posts with the 10 lowest normalised_clr

I have written an R program for this. However, my task is to convert that into a SINGLE SQLite query. Here's what I've done:

library(DBI)      # database interface (dbConnect, dbWriteTable, dbGetQuery)
library(RSQLite)  # SQLite driver

# Create database
posts_db <- dbConnect(RSQLite::SQLite(), "C:/Users/HP/Documents/posts.sqlite")

# Read the CSV into R
data <- read.csv("C:/Users/HP/Documents/posts.csv",
    stringsAsFactors = FALSE)

dbWriteTable(posts_db, "posts", data, overwrite = TRUE)

dbListFields(posts_db, "posts")

dbGetQuery(posts_db,"WITH 
cte1 AS (SELECT screen_name, comments_count*1.0/likes_count AS clr, 
strftime('%m', date) AS month FROM posts WHERE likes_count>0), 
cte2 AS (SELECT (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months 
FROM cte1 
WHERE month % 2 = 0
GROUP BY screen_name),
cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
clr/normaliser_based_on_even_months AS normalized_clr FROM cte1, cte2 
WHERE normaliser_based_on_even_months>0)
SELECT screen_name, normalized_clr FROM cte3 
WHERE normalized_clr>0 
ORDER BY normalized_clr")

Here's the output:

               screen_name normalized_clr
1             repmarkpocan   0.0002546821
2             repmarkpocan   0.0002690018
3  CongressmanRalphAbraham   0.0002756995
4  CongressmanRalphAbraham   0.0002912010
5             derek.kilmer   0.0003549631
6           RepJimMcGovern   0.0003664136
7      CongresswomanNorton   0.0003687929
8             derek.kilmer   0.0003749212
9           RepJimMcGovern   0.0003870155
10     CongresswomanNorton   0.0003895286

Sample Data

For reference here is the R code that produces the result I want the SQLite Query to produce:

library(dplyr)      # filter, mutate, group_by, arrange
library(lubridate)  # ymd, month

posts <- read.csv("C:/Users/HP/Documents/posts.csv")

#Remove rows with zero likes
posts <- posts %>% filter(likes_count > 0)

#create 'clr' which is the comment to like ratio
posts <- posts %>% mutate(clr = comments_count / likes_count) 

#Compute the maximum minus the minimum clr value of posts by each screen_name, however only taking into account the posts made in even months, i.e. February, April, June, August, October, December, when computing max(clr) - min(clr). Code adapted from https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group

# Parse the date, then overwrite the column with the month number (1-12)
posts$date <- ymd(posts$date)
posts$date <- month(posts$date)


posts_normaliser <- posts %>%
  group_by(screen_name) %>%
  mutate(normaliser_based_on_even_months = case_when(date %% 2 == 0 ~ max(clr) - min(clr)))


#Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them
posts_normaliser <- posts_normaliser %>% filter(normaliser_based_on_even_months > 0)

#Afterwards create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than those with zero likes which were deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name. 

merged_df <- merge(posts, posts_normaliser)

merged_df <- merged_df %>% group_by(screen_name)%>% mutate(normalised_clr = clr / normaliser_based_on_even_months)

#Keep only those rows with `normalised_clr` > 0

merged_df <- merged_df %>% filter(normalised_clr > 0)

#Arrange the data frame according to `normalised_clr` in ascending order

merged_df <- merged_df %>% arrange(normalised_clr)

#Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`

merged_df[1:10, c("screen_name", "normalised_clr")]

Here's the Output from R:

> merged_df[1:10, c("screen_name", "normalised_clr")]
# A tibble: 10 × 2
# Groups:   screen_name [5]
   screen_name                   normalised_clr
   <chr>                                  <dbl>
 1 CongresswomanSheilaJacksonLee        0.00214
 2 CongresswomanSheilaJacksonLee        0.00218
 3 CongresswomanSheilaJacksonLee        0.00277
 4 RepMullin                            0.00342
 5 SenDuckworth                         0.00342
 6 CongresswomanSheilaJacksonLee        0.00357
 7 replahood                            0.00477
 8 SenDuckworth                         0.00488
 9 SenDuckworth                         0.00505
10 RepSmucker                           0.00516

I keep going through the statements but I can't figure out why the outputs are different.

CodePudding user response:

If you look at the output from your SQL query, it returns 4,543,128 rows from the original 10,000, indicating that your join is exploding into a Cartesian product. You need to join on something; since the tables you're joining contain either (a) the original rows or (b) one summarised row per screen_name, I think we can join on that column (see the sketch below).
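To see what goes wrong, here is a minimal sketch of what a comma join with no join condition (the FROM cte1, cte2 in your query) does in SQLite; the tiny CTEs a and b below are hypothetical stand-ins for your tables:

WITH
a AS (SELECT 1 AS x UNION ALL SELECT 2),
b AS (SELECT 10 AS y UNION ALL SELECT 20)
SELECT * FROM a, b;
-- Cartesian product: 2 x 2 = 4 rows: (1,10), (1,20), (2,10), (2,20)

Every row of a is paired with every row of b; with thousands of filtered posts on one side and hundreds of per-screen_name rows on the other, that multiplication is how you end up with millions of rows.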

I modified your query to do three things:

  • add screen_name to cte2
  • update the join inside cte3 to join on screen_name
  • add LIMIT 10, since you asked for a single query and "the 10 lowest" is part of the constraints.
WITH
cte1 AS (
  SELECT screen_name, comments_count*1.0/likes_count AS clr,
    strftime('%m', date) AS month
  FROM posts WHERE likes_count > 0
),
cte2 AS (
  SELECT screen_name, (MAX(clr) - MIN(clr)) AS normaliser_based_on_even_months
  FROM cte1
  WHERE month % 2 = 0
  GROUP BY screen_name
),
cte3 AS (
  SELECT cte1.screen_name, clr, normaliser_based_on_even_months,
    clr/normaliser_based_on_even_months AS normalized_clr
  FROM cte1
    LEFT JOIN cte2 ON cte1.screen_name = cte2.screen_name
  WHERE normaliser_based_on_even_months > 0
)
SELECT screen_name, normalized_clr
FROM cte3
  WHERE normalized_clr > 0
ORDER BY normalized_clr
LIMIT 10

This produces:

      screen_name normalized_clr
1    SenDuckworth    0.002318900
2       RepMullin    0.003415301
3    SenDuckworth    0.003425708
4  repmikecoffman    0.003861004
5    SenDuckworth    0.004173566
6    SenDuckworth    0.004880627
7    SenDuckworth    0.005035819
8    SenDuckworth    0.005051889
9    SenDuckworth    0.005112120
10        RepBost    0.005600815
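One robustness note on the even-month test: strftime('%m', date) returns two-character text such as '02', and month % 2 = 0 works only because SQLite coerces that text to a number. An explicit cast makes the intent unambiguous; here is a minimal sketch against the posts table, assuming date is stored in a format strftime understands (e.g. 2017-02-14):

SELECT date,
       CAST(strftime('%m', date) AS INTEGER) AS month_num,
       CAST(strftime('%m', date) AS INTEGER) % 2 = 0 AS is_even_month
FROM posts
LIMIT 5;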