I want to find the same sequences and their rows among multiple pandas data frames by a column.
For example, find them in column3.
Table A
column1, column2, column3
1, a, p1
2, c, p3
3, d, P4
4, b, p2
Table B
column1, column2, column3
1, x, p20
2, x, p20
3, y, p3
4, z, P4
5, w, p7
You can see the P3 and P4 are in the same arrangement in both tables. I want to identify and select their rows, respectively. Please consider I have more than 100 tables, that it has to be usable for more than the two.
Anyone could help me? Thanks!
CodePudding user response:
Here some code that does what you want. In this case, I have generated 40 random columns with 100 entries and search for sequence matches. Because pandas iterations are so slow, you might want to pull that last column out from each of the tables into a separate list-of-lists.
import numpy as np
def findall(row, value):
for i,v in enumerate(row):
if v == value:
yield i
table_list = list( np.random.randint(0,25,size=100) for _ in range(40))
# For each table in the list:
for idxA, tblA in enumerate(table_list):
# For each remaining table:
for idxB, tblB in enumerate(table_list[idxA 1:]):
# For each starting point in the first table:
for a0 in range(len(tblA)-1):
# Search for that value.
for b0 in findall( tblB, tblA[a0] ):
match = 1
# Count how long the match is.
a99 = len(tblA) - a0
b99 = len(tblB) - b0
for dx in range(min(a99,b99)):
if tblA[a0 dx] != tblB[b0 dx]:
break
match = 1
if match > 2:
print( f"Sequence of {match} at table {idxA} row {a0} and table {idxB idaX 1} row {b0}" )