Is there an efficient way to split a sequence like this without using [:] slicing?
GATAAG
G ATAAG
GA TAAG
GAT AAG
GATA AG
GATAA G
I found something in itertools, but it does not do what I want:
import itertools
import operator

def subslices(seq):
    "Return all contiguous non-empty subslices of a sequence"
    # subslices('ABCD') --> A AB ABC ABCD B BC BCD C CD D
    slices = itertools.starmap(slice, itertools.combinations(range(len(seq) + 1), 2))
    return map(operator.getitem, itertools.repeat(seq), slices)

s = 'GATAAG'
list(subslices(s))
['G', 'GA', 'GAT', 'GATA', 'GATAA', 'GATAAG', 'A', 'AT', 'ATA', 'ATAA', 'ATAAG', 'T', 'TA', 'TAA', 'TAAG', 'A', 'AA', 'AAG', 'A', 'AG', 'G']
It returns every substring rather than just the two-part splits, and it is also not very readable. Another attempt:
def splitting_kmer(s):
    n = len(s)
    print(n)
    for i, _ in enumerate(s, 1):
        if i == n:
            break
        print(s[:n-i], s[n-i:])
Paulo
CodePudding user response:
A simple and efficient way to get all unique substrings of a string:
sample = 'GATAAG'
slices = set(sample[i:j] for i in range(len(sample)) for j in range(i + 1, len(sample) + 1))
print(slices)
Result:
{'AA', 'AT', 'GATA', 'A', 'GATAA', 'G', 'GA', 'TA', 'T', 'ATA', 'TAA', 'ATAA', 'GAT', 'GATAAG', 'ATAAG', 'TAAG', 'AAG', 'AG'}
The substrings come out in arbitrary order because a set is unordered, and a set is used here so that duplicates are dropped. If you want to keep duplicates and order, build a list instead:
sample = 'GATAAG'
slices = [sample[i:j] for i in range(len(sample)) for j in range(i + 1, len(sample) + 1)]
print(slices)
Result:
['G', 'GA', 'GAT', 'GATA', 'GATAA', 'GATAAG', 'A', 'AT', 'ATA', 'ATAA', 'ATAAG', 'T', 'TA', 'TAA', 'TAAG', 'A', 'AA', 'AAG', 'A', 'AG', 'G']
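If what you actually need is only the two-part splits shown in the question (G ATAAG, GA TAAG, and so on), plain slicing inside a generator is probably the simplest option. This is a minimal sketch; the name two_way_splits is my own and does not come from itertools or any other library:

```python
def two_way_splits(seq):
    """Yield every (prefix, suffix) pair where both parts are non-empty."""
    # Cut positions run from 1 to len(seq) - 1, so neither side is ever empty.
    for i in range(1, len(seq)):
        yield seq[:i], seq[i:]


for left, right in two_way_splits('GATAAG'):
    print(left, right)
```

Each slice copies its characters, so a string of length n costs on the order of n * n character copies in total. For short k-mers that should be negligible, and the generator only builds one pair at a time.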