Home > other >  Count pairs in the same column based on a common value in another column
Count pairs in the same column based on a common value in another column

Time:11-06

To simplify, I have a table with two columns, the first one being "article doi number" and the other being "publishing company". Here is a long example cut from the table:

   article_doi_number      publishing_company
1            10.24397                elsevier
2            10.24397         emerald insight
3            10.24397                   jstor
4            10.24397             tandfonline
5            10.24397          wiley and sons
6            10.24397           sage journals
7            10.18150                elsevier
8            10.18150          wiley and sons
9            10.18150         emerald insight
10           10.60920           gale academic
11           10.60920                proquest
12           10.60920           springer link
13           10.60920                   jstor
14           10.80442         ingenta connect
15           10.80442                  pubmed
16           10.21337 harvard business review
17           10.21337                proquest
18           10.80416                    plos
19           10.80431           research gate
20           10.80431                proquest
21           10.70125                    ieee
22           10.81093               frontiers
23           10.33542                  scielo
24           10.33542                  pubmed
25           10.81100 harvard business review
26           10.81100                proquest
27           10.33586                elsevier
28           10.33586          wiley and sons
29           10.33586           sage journals
30           10.33586           gale academic
31           10.33586                proquest
32           10.33586           research gate

I want to return name of the pair of publishing companies (in the same column or different columns, doesn't matter) and the count of how frequently they appear together in distinct/unique article doi numbers. For example, elsevier and wiley and sons is the most frequently appearing pair (3 times).

Another thing I want to establish is, again returning the name of the pair of publishing companies (again doesn't matter whether this happens in different columns or in a single one), but now I want to return the count of how frequently they appear together in unique article doi numbers which are solely published by the respective pair of publishing companies. For example, harvard business review and proquest are the most popular pair that solely publishes an article (2 times).

I tried some queries (using SQLite) with group by, count, etc. but nowhere near. Also tried, pivot table but the data is too big (max. column limit in excel gets exceeded). I am also open to answers using R, but prefer an SQL query the most.

CodePudding user response:

here is how you can do it :

select a.publishing_company , b.publishing_company , count(*) cnt
from tablename a 
join tablename b
   on a.article_doi_number = b.article_doi_number
   and a.id < b.id 
group by a.publishing_company , b.publishing_company
order by cnt desc

db<>fiddle here

you can limit 1 at the end to show the first row only.

  • Related