Python iterator based on a sorted column with groups of varying length

I am trying to write an iterator class in Python that loops over a txt file, in which I would like to group all lines with an identical value in the second column:


1	A
2	A
3	B
4	B
5	B
6	C
7	C
8	C
9	C
10	D
11	D
12	D

So I would like my iterator to return four lists/tuples, one at a time:

[[1,A],[2,A]]
[[3,B],[4,B],[5,B]]
[[6,C],[7,C],[8,C],[9,C]]
[[10,D],[11,D],[12,D]]

Here is my code:

#%% Iterator
class sequence(object):
    def __init__(self, filePath):
        self.file = open(filePath, 'r')
        self.last = []

    def __iter__(self):
        return self

    def __next__(self):
        self.trunk = [self.last]
        stop_checker = False
        while not stop_checker:
            line = self.file.readline()
            if line:  # a solid line
                line = line.strip('\n').split('\t')
                # Check whether the current line belongs to a different contig
                if self.trunk == [[]]:  # empty trunk, add a new line to it, read next
                    self.trunk=[line]
                elif self.trunk[-1][1] == line[1]:  # contig names matched:
                    self.trunk.append(line)
                else:  # First line of a different contig: return the previous trunk
                    self.last = line
                    return self.trunk
            else:
                raise StopIteration
                return self.trunk
 
a = sequence('tst.txt')
for i in a:
    print(i)

However, the iterator stops before returning the last list, and the result is:

[['1', 'A'], ['2', 'A']]
[['3', 'B'], ['4', 'B'], ['5', 'B']]
[['6', 'C'], ['7', 'C'], ['8', 'C'], ['9', 'C']]
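
The last group goes missing because, when readline() returns an empty string at the end of the file, the code raises StopIteration straight away and the rows still held in self.trunk are never returned (the return after the raise is unreachable). A minimal sketch of one possible fix follows, assuming every line has at least two tab-separated fields; the names GroupedLines, _pending and _exhausted are made up for illustration, and the class takes any iterable of lines rather than a file path:

```python
class GroupedLines:
    """Iterate over runs of consecutive lines sharing the same second column."""

    def __init__(self, lines):
        # `lines` is any iterable of tab-separated strings, e.g. an open file.
        self._lines = iter(lines)
        self._pending = None      # first row of the next group, read ahead
        self._exhausted = False   # set once the input has been used up

    def __iter__(self):
        return self

    def __next__(self):
        if self._exhausted:
            raise StopIteration
        # Start the group with the row read ahead on the previous call, if any.
        group = [] if self._pending is None else [self._pending]
        self._pending = None
        for line in self._lines:
            row = line.rstrip('\n').split('\t')
            if group and row[1] != group[-1][1]:
                # A new key starts here: keep the row for the next call.
                self._pending = row
                return group
            group.append(row)
        # Input is used up: return the final group once, then stop.
        self._exhausted = True
        if group:
            return group
        raise StopIteration
```

With the file from the question this would be used as `with open('tst.txt') as f:` followed by `for group in GroupedLines(f): print(group)`, and the D group is returned on the final call before the iteration ends.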

CodePudding user response：

Thanks to the comment from Blckknight, I worked it out with itertools.groupby:

import itertools

# Key function
key_func = lambda x: x.strip('\n').split('\t')[1]

with open('tst.txt', 'r') as f:
    for key, group in itertools.groupby(f, key_func):
        print(key + " :", [i.strip('\n').split('\t') for i in list(group)])

The output:

A : [['1', 'A'], ['2', 'A']]
B : [['3', 'B'], ['4', 'B'], ['5', 'B']]
C : [['6', 'C'], ['7', 'C'], ['8', 'C'], ['9', 'C']]
D : [['10', 'D'], ['11', 'D'], ['12', 'D']]
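
If an iterator that returns one list per group is still wanted, as in the question, the same groupby call can be wrapped in a generator. This is only a sketch: the function name iter_groups and its column and sep parameters are hypothetical. Note that groupby only merges consecutive equal keys, so this relies on the column being sorted, as it is in the sample file.

```python
import itertools


def iter_groups(lines, column=1, sep='\t'):
    """Yield a list of split rows for each run of equal values in `column`."""
    # Split each line once, lazily, so the whole file is never held in memory.
    rows = (line.rstrip('\n').split(sep) for line in lines)
    for _key, group in itertools.groupby(rows, key=lambda row: row[column]):
        # `group` is a one-shot iterator; materialise it before moving on.
        yield list(group)
```

It would be used as `with open('tst.txt') as f:` followed by `for group in iter_groups(f): print(group)`. If the same value appeared again later in an unsorted file, it would come back as a separate group.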

CodePudding user response：

Grouping can be done using pandas:

import pandas as pd

df = pd.DataFrame({"num": range(1, 13), "value": ["A"] * 2 + ["B"] * 3 + ["C"] * 4 + ["D"] * 3})

res = [list(zip(item["num"], item["value"])) for i, item in df.groupby("value")]
for item in res:
    print(item)

OUTPUT:

[(1, 'A'), (2, 'A')]
[(3, 'B'), (4, 'B'), (5, 'B')]
[(6, 'C'), (7, 'C'), (8, 'C'), (9, 'C')]
[(10, 'D'), (11, 'D'), (12, 'D')]