SQL query to extract metadata based on two conditions in second dataset-CodePudding

I have two data sets with the following format:

m -  acc | sra_study | bioproject | 
tax - acc | tax_id | total_count

Each row of m represents a biological sample (acc). The tax table represents which organisms (tax_id) were found in each biological sample (acc) and how many times they were observed (total_count). The tax_id and total_count columns values are integers, while the rest of the columns are strings.

I would like to filter rows in m based on whether tax_id=9606 is present with a total_count > 10000000 AND tax_id=2 is present with a total_count>1000000 for a given sample (acc).

I have tried to do this with the following SQL query:

SELECT m.acc, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m,
     `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc = tax.acc
  AND (tax.tax_id = 9606 AND tax.total_count > 10000000)
  AND (tax.tax_id = 2 AND tax.total_count > 1000000)

However, the query does not return any results. I suspect this is because of a problem with the syntax my SQL query.

CodePudding user response：

Switch to explicit JOIN syntax.

OR the WHERE clause conditions.

If you want a m.acc having both tax_id 9606 and 2 (in separate rows), do a GROUP BY:

SELECT m.acc, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m
JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
    ON m.acc = tax.acc
WHERE (tax.tax_id = 9606 AND tax.total_count > 10000000)
   OR (tax.tax_id = 2    AND tax.total_count > 1000000)   
GROUP BY m.acc, m.sra_study, m.bioproject
HAVING COUNT(DISTINCT tax.tax_id) = 2

CodePudding user response：

You have specified what appears to be a mutually exclusive condition in your where clause tax.tax_id = 9606 .. AND .. tax.tax_id = 2

Modify Like

SELECT m.acc, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc=tax.acc AND ((tax.tax_id=9606 AND tax.total_count > 1000000) OR (tax.tax_id=2 AND tax.total_count>1000000))