I have two data sets with the following format:
m - acc | sra_study | bioproject |
tax - acc | tax_id | total_count
Each row of m
represents a biological sample (acc
). The tax
table represents which organisms (tax_id
) were found in each biological sample (acc
) and how many times they were observed (total_count
).
The tax_id
and total_count
columns values are integers, while the rest of the columns are strings.
I would like to filter rows in m
based on whether tax_id=9606
is present with a total_count > 10000000
AND tax_id=2
is present with a total_count>1000000
for a given sample (acc
).
I have tried to do this with the following SQL query:
SELECT m.acc, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m,
`nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc = tax.acc
AND (tax.tax_id = 9606 AND tax.total_count > 10000000)
AND (tax.tax_id = 2 AND tax.total_count > 1000000)
However, the query does not return any results. I suspect this is because of a problem with the syntax my SQL query.
CodePudding user response:
Switch to explicit JOIN
syntax.
OR
the WHERE
clause conditions.
If you want a m.acc having both tax_id 9606 and 2 (in separate rows), do a GROUP BY
:
SELECT m.acc, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m
JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
ON m.acc = tax.acc
WHERE (tax.tax_id = 9606 AND tax.total_count > 10000000)
OR (tax.tax_id = 2 AND tax.total_count > 1000000)
GROUP BY m.acc, m.sra_study, m.bioproject
HAVING COUNT(DISTINCT tax.tax_id) = 2
CodePudding user response:
You have specified what appears to be a mutually exclusive condition in your where clause
tax.tax_id = 9606 .. AND .. tax.tax_id = 2
Modify Like
SELECT m.acc, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc=tax.acc AND ((tax.tax_id=9606 AND tax.total_count > 1000000) OR (tax.tax_id=2 AND tax.total_count>1000000))