List all name pairs that appear in a line together and count their frequency

Time:06-18

I have the following file, 2016.csv (its first few lines are shown below):

Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi

I am using the following awk loop to find all possible pairs of names that appear together on a line of the above file:

awk -F, '{for(i=1;i<NF;i++){for(j=i+1;j<=NF;j++){if($i > $j){k[$i][$j]}else{k[$j][$i]}}}}END{for(n in k){for (l in k[n]){print n,",",l}}}' 2016.csv

The output of this awk loop is the following:

Zhichen Gong , Huanhuan Chen
Zhichuan Huang , Tiantian Xie
Zhichuan Huang , Ting Zhu
Zhichuan Huang , Jianwu Wang
Zhichuan Huang , Qingquan Zhang
Zhifei Zhang , Yang Song
Zhifei Zhang , Wei Wang 0063
Zhifei Zhang , Hairong Qi

This loop works fine and finds all the pairs that appear in a line of the initial file together. The only thing I want to add is a counter next to each line of the awk output, which will show how many times this pair exists in the initial file.

For example, for the above awk output, I want it to be like:

Zhichen Gong , Huanhuan Chen , 1
Zhichuan Huang , Tiantian Xie , 1
Zhichuan Huang , Ting Zhu , 2
Zhichuan Huang , Jianwu Wang , 1
Zhichuan Huang , Qingquan Zhang , 1
Zhifei Zhang , Yang Song , 1
Zhifei Zhang , Wei Wang 0063 , 1
Zhifei Zhang , Hairong Qi , 1

Where 1 in the first line (Zhichen Gong , Huanhuan Chen, 1) shows that this pair of names exists 1 time in the initial file.

I assume that I just have to add a counter to the awk loop, but so far I haven't managed to do it properly.

CodePudding user response:

find all possible pairs of names that appear together, plus their counts

You may use this awk solution:

awk -F, -v OFS=" , " '
{
   # note: as written, this pairs each field only with the field right after it
   for (i=1; i<NF; i++)
      ++fq[$i OFS $(i+1)]
}
END {
   for (i in fq) print i, fq[i]
}' file
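
As a quick sanity check (a sketch, not part of the answer above): on a single hypothetical line `a,b,c`, a loop that pairs `$i` with `$(i+1)` only sees neighbouring fields, so the pair a,c is never counted. The second command is one possible nested-loop variant for comparison. The variable names `adjacent_pairs` and `all_pairs` are made up for illustration, and neither command normalizes the order of the names within a pair.

```shell
# Hypothetical one-line input: a,b,c

# Pairs each field with its right-hand neighbour only.
adjacent_pairs=$(printf 'a,b,c\n' | awk -F, -v OFS=" , " '
{
    for (i = 1; i < NF; i++)
        ++fq[$i OFS $(i+1)]
}
END { for (p in fq) print p, fq[p] }' | LC_ALL=C sort)

# Pairs each field with every later field on the same line.
all_pairs=$(printf 'a,b,c\n' | awk -F, -v OFS=" , " '
{
    for (i = 1; i < NF; i++)
        for (j = i + 1; j <= NF; j++)
            ++fq[$i OFS $j]
}
END { for (p in fq) print p, fq[p] }' | LC_ALL=C sort)

printf 'adjacent only:\n%s\n\nall pairs:\n%s\n' "$adjacent_pairs" "$all_pairs"
```

The first result holds two pairs (a,b and b,c), the second holds three (a,b, a,c and b,c).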

CodePudding user response:

Using OP's 11-line sample as our input:

$ cat 2016.csv
Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi
Zhihao Huang,Hui Li,Xin Li,Wei He
Zhijun Yin,You Chen,Daniel Fabbri,Jimeng Sun,Bradley A. Malin
Zhipeng Huang 0001,Bogdan Cautis,Reynold Cheng,Yudian Zheng
Zhipeng Huang 0001,Yudian Zheng,Reynold Cheng,Yizhou Sun,Nikos Mamoulis,Xiang Li 0067
Zhiqiang Tao,Hongfu Liu,Sheng Li 0001,Yun Fu 0001
Zhiqiang Xu,Yiping Ke
Zhiyuan Chen 0001,Estevam R. Hruschka Jr.,Bing Liu 0001

Making a few tweaks to OP's current code to keep track of counts, then ordering the output first by count (descending) and then by name:

awk '
BEGIN { FS=","; OFS=" , " }
      { for (i=1;i<NF;i++)
            for(j=i+1;j<=NF;j++)
                if   ($i > $j) k[$i][$j]++            # increment counter
                else           k[$j][$i]++            # increment counter
      }
END   { # to sort by count we will create a new 3-dimensional array with the count as the 1st dimension
        for (i in k)
            for (j in k[i]) {
                arr[k[i][j]][i][j]                    # arr[count][i][j]
                delete k[i][j]                        # delete old array entry to limit memory usage
            }
        PROCINFO["sorted_in"]="@ind_num_desc"         # sort 1st index by count/descending
        for (cnt in arr) {
            PROCINFO["sorted_in"]="@ind_str_asc"      # sort 2nd/3rd indices by name/ascending
            for (i in arr[cnt])
                for (j in arr[cnt][i])
                    print i,j,cnt
        }
      }
' 2016.csv

NOTES:

  • requires GNU awk (gawk) 4.0 or later for arrays of arrays and PROCINFO["sorted_in"]
  • assumes we have enough memory for the 3-dimensional array; then again ...
  • memory usage for these 2-/3-dimensional arrays should be significantly smaller than that of the other answers, which use a compound index for a 1-dimensional array, i.e., ...
  • [bob][smith] and [bob][jones] will require bob to be stored once in memory while [bob,smith] and [bob,jones] will require bob to be stored twice in memory
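
For comparison, here is a minimal sketch of the compound-index form that the notes contrast with: a 1-dimensional array keyed by `$i SUBSEP $j`, with `split()` used to get the two names back. It should run on any POSIX awk. The input is a hypothetical three-line sample and the variable name `pair_counts` is made up. Nothing here measures memory use, so the size comparison in the notes remains the author's estimate.

```shell
# Hypothetical input:
#   a,b,c
#   c,a
#   e,d
pair_counts=$(printf 'a,b,c\nc,a\ne,d\n' | awk -F, -v OFS=" , " '
{
    for (i = 1; i < NF; i++)
        for (j = i + 1; j <= NF; j++)
            if ($i > $j) cnt[$i, $j]++      # key is stored as $i SUBSEP $j
            else         cnt[$j, $i]++      # larger name first, as in the code above
}
END {
    for (key in cnt) {
        split(key, name, SUBSEP)            # recover the two names from the compound key
        print name[1], name[2], cnt[key]
    }
}' | LC_ALL=C sort)

printf '%s\n' "$pair_counts"
```

With a compound key, each full key string is stored once per pair, which is the repetition the last bullet describes.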

This generates the following 61-line output:

Yudian Zheng , Reynold Cheng , 2
Zhichuan Huang , Ting Zhu , 2
Zhipeng Huang 0001 , Reynold Cheng , 2
Zhipeng Huang 0001 , Yudian Zheng , 2
Daniel Fabbri , Bradley A. Malin , 1
Estevam R. Hruschka Jr. , Bing Liu 0001 , 1
Jimeng Sun , Bradley A. Malin , 1
Jimeng Sun , Daniel Fabbri , 1
Qingquan Zhang , Jianwu Wang , 1
Reynold Cheng , Bogdan Cautis , 1
Reynold Cheng , Nikos Mamoulis , 1
Sheng Li 0001 , Hongfu Liu , 1
Tiantian Xie , Jianwu Wang , 1
Tiantian Xie , Qingquan Zhang , 1
Ting Zhu , Jianwu Wang , 1
Ting Zhu , Qingquan Zhang , 1
Ting Zhu , Tiantian Xie , 1
Wei He , Hui Li , 1
Wei Wang 0063 , Hairong Qi , 1
Xiang Li 0067 , Nikos Mamoulis , 1
Xiang Li 0067 , Reynold Cheng , 1
Xin Li , Hui Li , 1
Xin Li , Wei He , 1
Yang Song , Hairong Qi , 1
Yang Song , Wei Wang 0063 , 1
Yizhou Sun , Nikos Mamoulis , 1
Yizhou Sun , Reynold Cheng , 1
Yizhou Sun , Xiang Li 0067 , 1
You Chen , Bradley A. Malin , 1
You Chen , Daniel Fabbri , 1
You Chen , Jimeng Sun , 1
Yudian Zheng , Bogdan Cautis , 1
Yudian Zheng , Nikos Mamoulis , 1
Yudian Zheng , Xiang Li 0067 , 1
Yudian Zheng , Yizhou Sun , 1
Yun Fu 0001 , Hongfu Liu , 1
Yun Fu 0001 , Sheng Li 0001 , 1
Zhichen Gong , Huanhuan Chen , 1
Zhichuan Huang , Jianwu Wang , 1
Zhichuan Huang , Qingquan Zhang , 1
Zhichuan Huang , Tiantian Xie , 1
Zhifei Zhang , Hairong Qi , 1
Zhifei Zhang , Wei Wang 0063 , 1
Zhifei Zhang , Yang Song , 1
Zhihao Huang , Hui Li , 1
Zhihao Huang , Wei He , 1
Zhihao Huang , Xin Li , 1
Zhijun Yin , Bradley A. Malin , 1
Zhijun Yin , Daniel Fabbri , 1
Zhijun Yin , Jimeng Sun , 1
Zhijun Yin , You Chen , 1
Zhipeng Huang 0001 , Bogdan Cautis , 1
Zhipeng Huang 0001 , Nikos Mamoulis , 1
Zhipeng Huang 0001 , Xiang Li 0067 , 1
Zhipeng Huang 0001 , Yizhou Sun , 1
Zhiqiang Tao , Hongfu Liu , 1
Zhiqiang Tao , Sheng Li 0001 , 1
Zhiqiang Tao , Yun Fu 0001 , 1
Zhiqiang Xu , Yiping Ke , 1
Zhiyuan Chen 0001 , Bing Liu 0001 , 1
Zhiyuan Chen 0001 , Estevam R. Hruschka Jr. , 1

If the order of the output does not matter the END{...} block can be simplified to the following:

END   { for (i in k)
            for (j in k[i])
                print i,j,k[i][j]
      }

CodePudding user response:

Using a sensible sample input file so we can tell at a glance if the script worked or not because the expected output is obvious:

$ cat file
a,b,c
c,a
e,d

This will do what you want using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    for (i=1; i<NF; i++) {
        for (j=i+1; j<=NF; j++) {
            cnt[( $i < $j ? $i FS $j : $j FS $i )]++
        }
    }
}
END {
    for ( pair in cnt ) {
        print pair, cnt[pair]
    }
}

$ awk -f tst.awk file
a,b,1
a,c,2
d,e,1
b,c,1

or if you want it sorted:

$ awk -f tst.awk file | sort
a,b,1
a,c,2
b,c,1
d,e,1
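
If you would rather have the most frequent pairs first, one option is to give `sort` explicit keys: the third comma-separated field numerically in reverse, then the two names. This is a sketch that assumes no name contains a comma. It inlines the same logic as tst.awk so it can be pasted as is, and the variable name `by_count` is made up for illustration.

```shell
# Same three-line sample as above, fed through the same pair-counting logic,
# then ordered by count (descending) and by name (ascending).
by_count=$(printf 'a,b,c\nc,a\ne,d\n' | awk '
BEGIN { FS = OFS = "," }
{
    for (i = 1; i < NF; i++)
        for (j = i + 1; j <= NF; j++)
            cnt[( $i < $j ? $i FS $j : $j FS $i )]++
}
END { for (pair in cnt) print pair, cnt[pair] }' |
    LC_ALL=C sort -t, -k3,3nr -k1,1 -k2,2)

printf '%s\n' "$by_count"
```

On the sample this puts a,c,2 on the first line, followed by the count-1 pairs in name order.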

With the OP's provided sample input:

$ cat file2
Zhichen Gong,Huanhuan Chen
Zhichuan Huang,Tiantian Xie,Ting Zhu,Jianwu Wang,Qingquan Zhang
Zhichuan Huang,Ting Zhu
Zhifei Zhang,Yang Song,Wei Wang 0063,Hairong Qi

we have:

$ awk -f tst.awk file2 | sort
Hairong Qi,Wei Wang 0063,1
Hairong Qi,Yang Song,1
Hairong Qi,Zhifei Zhang,1
Huanhuan Chen,Zhichen Gong,1
Jianwu Wang,Qingquan Zhang,1
Jianwu Wang,Tiantian Xie,1
Jianwu Wang,Ting Zhu,1
Jianwu Wang,Zhichuan Huang,1
Qingquan Zhang,Tiantian Xie,1
Qingquan Zhang,Ting Zhu,1
Qingquan Zhang,Zhichuan Huang,1
Tiantian Xie,Ting Zhu,1
Tiantian Xie,Zhichuan Huang,1
Ting Zhu,Zhichuan Huang,2
Wei Wang 0063,Yang Song,1
Wei Wang 0063,Zhifei Zhang,1
Yang Song,Zhifei Zhang,1