How to group duplicates in second column and assign them a new number in a third column?-CodePudding

I have a file that looks like this

1, C10 C11 N3 O1
2, C19 C23 O2
3, C19 N2 O2
4, C10 C11 O1
5, C11 N3 O1
6, C13 C8 O3
7, C8 N5 O3

The first column is the group number and the second column is the items in that group. I want to search through the second column and see how many times one of the strings in the row matches any of the strings in the other rows. Then, I need to put a new number in the third column that corresponds to groups that have TWO OR MORE matches.

Example:

1, C10 C11 N3 O1
4, C10 C11 O1
5, C11 N3 O1

Groups 1, 4, and 5 all have two or more strings that match. Therefore, they would all get assigned to a new groups like so:

1, C10 C11 N3 O1, 1
4, C10 C11 O1, 1
5, C11 N3 O1, 1

I am new to this and stumped. Any help is appreciated. Thanks.

Edit: I'm trying to have my code something like this, but I can't get it to work.

while read -r line; do 
     awk '$1 !=$1 && $2 == $2 {print $0}'
done

If another row belongs to two groups, then I would just put it with the first corresponding group. That info won't matter too much, as long as it is in with a matching group. Preferably, the group would encompass all 2 matching strings though.

CodePudding user response：

Here's a python answer that may be helpful. I can explain the logic so that it can be coded in awk if you wish:

#!/usr/bin/env python3
import sys
import collections

groups = {}
items_in_groups = collections.defaultdict(list)
with open(sys.argv[1]) as f:
  for line in f:
    group_num,items = line.split(",")
    if items:
      groups[ group_num ] = items.split()
      for item in groups[ group_num ]: 
        items_in_groups[ item ].append(group_num)

for group,items in groups.items():
  for item in items:
    if len( items_in_groups[ item ] ) >= 2: 
      print( f'{group},{" ".join(items)},{items_in_groups[ item ][0]}' )
      break
  else:
      print( f'{group},{" ".join(items)},N/A' )

The above code (seems like there is a lot, doesn't it?) produces the following output:

1 , C10 C11 N3 O1 , 1
2 , C19 C23 O2 , 2
3 , C19 N2 O2 , 2
4 , C10 C11 O1 , 1
5 , C11 N3 O1 , 1
6 , C13 C8 O3 , 6
7 , C8 N5 O3 , 6

Essentially, I'm doing a double pass through your data. The first pass, we are reading the data into a very simple associative array (well, dict in python) called groups. groups maps the group number to a list of items (e.g. 3 --> C19,N2,02 ). Here's the group dict after it is populated:

Group:

{'1': ['C10', 'C11', 'N3', 'O1'], '2': ['C19', 'C23', 'O2'], '3': ['C19', 'N2', 'O2'], '4': ['C10', 'C11', 'O1'], '5': ['C11', 'N3', 'O1'], '6': ['C13', 'C8', 'O3'], '7': ['C8', 'N5', 'O3']}

items_in_groups:

{'C10': ['1', '4'], 'C11': ['1', '4', '5'], 'N3': ['1', '5'], 'O1': ['1', '4', '5'], 'C19': ['2', '3'], 'C23': ['2'], 'O2': ['2', '3'], 'N2': ['3'], 'C13': ['6'], 'C8': ['6', '7'], 'O3': ['6', '7'], 'N5': ['7']}

items_in_groups is also created on the first pass of the data - for every item we find, we add the group we found it in to an associated list. For instance, we found 'C10' in both group 1 and group 4.

Finally, with items_in_groups computed, we can look for the matches of 2 or more. We loop through the groups (in the order they were read from the file - in recent versions of python the dict ordering is preserved). Then, we loop through each item in that group, checking to see if that item appears in more than one group. If we find an item that appears in more than one group, we stop, printing out the current group, it's items, and the first group in list of matching groups.

EDIT: Simplified the code - the reading of the input file is done line by line now and the print statements at the end are easier to read. Lastly, added a bit of code to handle the situation of no matches.

CodePudding user response：

Here's another answer, but this time written in awk. I basically did a translation from python to awk.

awk -F, '
{ # first pass 
  groups[ $1 ] =  $2
  split( $2, items, " " ) 
  for (item in items) {
    items_in_groups[ items[item] ] = items_in_groups[ items[item] ] " " $1
  }
}
END {  # second pass
  for ( group in groups ) {
    split( groups[group], items, " " ) 
    for ( item in items ) { 
      split( items_in_groups[ items[item] ] , in_groups, " " ) 
      if (length( in_groups ) >= 2)  {
        print group,groups[group],in_groups[1]
        break
      }
    }
  }
}
' $1