Home > Software design >  Querying Stackoverflow public dataset on BigQuery on Q&A SQL
Querying Stackoverflow public dataset on BigQuery on Q&A SQL

Time:12-26

i have a home work assignment : We want to find all questions about the python pandas library, as well as their answers. Write a query that retrieves all the questions for which the title contains the word "pandas" from the posts_questions table, as well as all the appropriate answers for each such question from the post_answers , where each row in the returned table will represent a pair of (question answer). If the question has a number Answers, the same question will appear in multiple rows in the returned table. returned and the of the question as well as the following fields: the id, title, tag, answer_count score, creation time (creation_date) and the body of the text (the body) of both the question and the answer. For the body, all slash characters must be removed the line '\n'.

for this i wrote the following SQL code:

SELECT  tb1.id as q_id,tb1.title as q_title,tb1.tags as q_tags
,tb1.creation_date as q_creation_date,tb1.score as q_score,tb1.answer_count as q_answer_count
,REPLACE(tb1.body,'\n',' ') as body_qustion,REPLACE(tb2.body,'\n',' ') as body_answer
from  `bigquery-public-data.stackoverflow.posts_questions` as tb1 
left join  `bigquery-public-data.stackoverflow.posts_answers`  as tb2 
on tb1.id=tb2.id
 where( tb1.title like "%pandas%" or tb1.title like "%Pandas%" or tb1.title like "%PANDAS%")
 group by tb1.id ,tb1.title ,tb1.tags,tb1.creation_date,tb1.score
,tb1.answer_count,body_qustion,body_answer

but the problem is that when for example for a question i have 3 answers i expect it to return 3 rows for the question instead it returns only one and i dont know what is the problem .

the data is : bigquery-public-data.stackoverflow.posts_questions

enter image description here

and bigquery-public-data.stackoverflow.posts_answers :

enter image description here

CodePudding user response:

You have joined with the wrong ID of the answer table. In the answer table ID column represents the ID of the answer itself whereas parent_id represents the question id. You can play with the below query to have more understanding.

Query:

SELECT
  q.id AS q_id   #id of the question in question table
  ,
  a.id AS a_id  #id of the answer in answer table
  ,
  q.title AS q_title,
  q.tags AS q_tags,
  q.creation_date AS q_creation_date,
  q.score AS q_score,
  q.answer_count AS q_answer_count,
  REPLACE(q.body,'\n',' ') AS body_qustion,
  REPLACE(a.body,'\n',' ') AS body_answer
FROM
  `bigquery-public-data.stackoverflow.posts_questions` q
LEFT JOIN
  `bigquery-public-data.stackoverflow.posts_answers` a
ON
  q.id = a.parent_id #Joining with quesiton Ids
WHERE
  LOWER(q.title) LIKE '%pandas%'
  AND q.creation_date BETWEEN '2021-01-01'
  AND '2021-01-31'
  AND q.answer_count >1

Output:

enter image description here

  • Related