I was trying to write an iterator class in Python that can do a loop for a txt file, in while I would like to group all lines with identical value in the second column:
1 | A |
2 | A |
3 | B |
4 | B |
5 | B |
6 | C |
7 | C |
8 | C |
9 | C |
10 | D |
11 | D |
12 | D |
So I would like my iterator to return four list/tuple one at a time:
[[1,A],[2,A]]
[[3,B],[4,B],[5,B]]
[[6,C],[7,C],[8,C],[9,C]]
[[10,D],[11,D],[12,D]]
Here is my code:
#%% Iterator
class sequence(object):
def __init__(self, filePath):
self.file = open(filePath, 'r')
self.last = []
def __iter__(self):
return self
def __next__(self):
self.trunk = [self.last]
stop_checker = False
while not stop_checker:
line = self.file.readline()
if line: # a solid line
line = line.strip('\n').split('\t')
# Check if current line contains a difference contigs
if self.trunk == [[]]: # empty trunk, add a new line to it, read next
self.trunk=[line]
elif self.trunk[-1][1] == line[1]: # contig names matched:
self.trunk.append(line)
else: # First encounter of a difference contigs, reture th lastt trunk
self.last = line
return self.trunk
else:
raise StopIteration
return self.trunk
a = sequence('tst.txt')
for i in a:
print(i)
However, the iterator stops before return the last list, and the result is:
[['1', 'A'], ['2', 'A']]
[['3', 'B'], ['4', 'B'], ['5', 'B']]
[['6', 'C'], ['7', 'C'], ['8', 'C'], ['9', 'C']]
CodePudding user response:
Thanks for the comment of Blckknight, I work it out with itertools.groupby:
import itertools
# Key function
key_func = lambda x: x.strip('\n').split('\t')[1]
with open('tst.txt', 'r') as f:
for key, group in itertools.groupby(f, key_func):
print(key " :", [i.strip('\n').split('\t') for i in list(group)])
The output:
A : [['1', 'A'], ['2', 'A']]
B : [['3', 'B'], ['4', 'B'], ['5', 'B']]
C : [['6', 'C'], ['7', 'C'], ['8', 'C'], ['9', 'C']]
D : [['10', 'D'], ['11', 'D'], ['12', 'D']]
CodePudding user response:
Grouping can be done using pandas:
import pandas as pd
df = pd.DataFrame({"num": range(1, 13), "value": ["A"] * 2 ["B"] * 3 ["C"] * 4 ["D"] * 3})
res = [list(zip(item["num"], item["value"])) for i, item in df.groupby("value")]
for item in res:
print(item)
OUTPUT:
[(1, 'A'), (2, 'A')]
[(3, 'B'), (4, 'B'), (5, 'B')]
[(6, 'C'), (7, 'C'), (8, 'C'), (9, 'C')]
[(10, 'D'), (11, 'D'), (12, 'D')]