To simplify, I have a table with two columns, the first one being "article doi number" and the other being "publishing company". Here is a long example cut from the table:
article_doi_number publishing_company
1 10.24397 elsevier
2 10.24397 emerald insight
3 10.24397 jstor
4 10.24397 tandfonline
5 10.24397 wiley and sons
6 10.24397 sage journals
7 10.18150 elsevier
8 10.18150 wiley and sons
9 10.18150 emerald insight
10 10.60920 gale academic
11 10.60920 proquest
12 10.60920 springer link
13 10.60920 jstor
14 10.80442 ingenta connect
15 10.80442 pubmed
16 10.21337 harvard business review
17 10.21337 proquest
18 10.80416 plos
19 10.80431 research gate
20 10.80431 proquest
21 10.70125 ieee
22 10.81093 frontiers
23 10.33542 scielo
24 10.33542 pubmed
25 10.81100 harvard business review
26 10.81100 proquest
27 10.33586 elsevier
28 10.33586 wiley and sons
29 10.33586 sage journals
30 10.33586 gale academic
31 10.33586 proquest
32 10.33586 research gate
I want to return name of the pair of publishing companies (in the same column or different columns, doesn't matter) and the count of how frequently they appear together in distinct/unique article doi numbers. For example, elsevier and wiley and sons is the most frequently appearing pair (3 times).
Another thing I want to establish is, again returning the name of the pair of publishing companies (again doesn't matter whether this happens in different columns or in a single one), but now I want to return the count of how frequently they appear together in unique article doi numbers which are solely published by the respective pair of publishing companies. For example, harvard business review and proquest are the most popular pair that solely publishes an article (2 times).
I tried some queries (using SQLite) with group by, count, etc. but nowhere near. Also tried, pivot table but the data is too big (max. column limit in excel gets exceeded). I am also open to answers using R, but prefer an SQL query the most.
CodePudding user response:
here is how you can do it :
select a.publishing_company , b.publishing_company , count(*) cnt
from tablename a
join tablename b
on a.article_doi_number = b.article_doi_number
and a.id < b.id
group by a.publishing_company , b.publishing_company
order by cnt desc
db<>fiddle here
you can limit 1
at the end to show the first row only.