Compute averages for two distinct sets from one table in a single query-CodePudding

I have a query to show the average length of all the proteins which also exists in a second table mrna_pieces:

SELECT AVG(protein_length)
FROM protein
WHERE exists
(SELECT protein_id FROM mrna_pieces WHERE mrna_brokstukken.protein_id = protein.protein_id)

I also want to show the average for all the proteins which don't exist in the second table.

SELECT AVG(protein_length)
FROM protein
WHERE exists
(SELECT protein_id FROM mrna_pieces WHERE mrna_brokstukken.protein_id != protein.protein_id)

But I want these 2 parts in one table like this table example
I tried:

SELECT AVG(eiwit_lengte) AS avglengthwith, AVG(eiwit_lengte) AS avglengthwithout
FROM eiwit
WHERE exists
(SELECT eiwit_id FROM mrna_brokstukken WHERE mrna_brokstukken.eiwit_id != eiwit.eiwit_id)
WHERE exists
(SELECT eiwit_id FROM mrna_brokstukken WHERE mrna_brokstukken.eiwit_id = eiwit.eiwit_id)

But that gives me the following error:

ERROR: pq: syntax error at or near "WHERE"

CodePudding user response：

Just select the two at the same time:

select (
  select AVG(protein_length)
  FROM protein
  WHERE exists
  (select protein_id from mrna_pieces where mrna_brokstukken.protein_id = protein.protein_id)
) as avgwith,
(
  select AVG(protein_length)
  FROM protein
  WHERE exists
  (select protein_id from mrna_pieces where mrna_brokstukken.protein_id != protein.protein_id)
)
as avgwithout

CodePudding user response：

as both queries will gove back a scalar value you can simply use bot as subqueries

SELECT

    (select AVG(protein_length)
    FROM protein
    WHERE exists
    (select protein_id from mrna_pieces where mrna_brokstukken.protein_id = protein.protein_id) ) AVG_1,
    (select AVG(protein_length)
    FROM protein
    WHERE exists
    (select protein_id from mrna_pieces where mrna_brokstukken.protein_id != protein.protein_id)) AVG_2

CodePudding user response：

select AVG(CASE WHEN m.protein_id IS NULL THEN p.protein_length END) AS ExistsInTable,
       AVG(CASE WHEN m.protein_id IS NOT NULL THEN p.protein_length END) AS ExistsNOTInTable
FROM protein p
LEFT JOIN mrna_pieces m ON m.protein_id = p.protein_id

CodePudding user response：

It should be cheapest to compute the "exists" value once per row in a subquery, then two conditional aggregates in the outer SELECT:

SELECT avg(protein_length) FILTER (WHERE p_exists) AS avg_len_exists
     , avg(protein_length) FILTER (WHERE NOT p_exists) AS avg_len_not_exists
FROM  (
   SELECT protein_length
        , EXISTS (SELECT FROM mrna_pieces m WHERE m.protein_id = p.protein_id) AS p_exists
   FROM   protein p
   ) sub;

About the aggregate FILTER clause (which can be combined with the OVER clause of window functions):

Aggregate columns with additional (distinct) filters

Or, shorter if two result rows are acceptable:

SELECT EXISTS (SELECT FROM mrna_pieces m WHERE m.protein_id = p.protein_id) AS p_exists
     , avg(protein_length) avg_len
FROM   protein p
GROUP  BY 1;

Should be slightly faster, yet.

Note that the second query in your question is not doing what you are asking for. EXISTS ( ... WHERE m.protein_id != p.protein_id) is true if at least one row in table mrna_pieces does not match, which is always the case if there are at least two distinct values in m.protein_id. You would want NOT EXISTS ( ... WHERE m.protein_id = p.protein_id). But use the cheaper query I suggested instead.

CodePudding user response：

You can use the SQL JOIN statement to combine rows from two or more tables based on a related column between them. To get the average value from each table, you can use the AVG() function. Example query:

SELECT AVG(table1.column), AVG(table2.column)
FROM table1
JOIN table2
ON table1.related_column = table2.related_column;