Home > Blockchain >  Pandas Dataframe of Unique Triples
Pandas Dataframe of Unique Triples

Time:10-01

I'm currently working on some python dataframes over on pandas. And I'm not sure how this operation can be done. For example, I have an empty dataframe df and list of the following triples:

L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]

I wish to add all these triples into the dataframe df with columns ['id', 'a', 'b', 'c'] according to some constraint. The id is simply a counter that determines how many items have been added so far and a, b, and c are columns for the triples (but they would be commutative with each other). So the idea is to linearly traverse all items in L and then add each one to the df according to the restriction:

  1. It is ok to add (1,2,3) since df is still empty. (id=0)
  2. It is ok to add (2,5,4) since it or any of its permutation has not appeared yet in df. (id=1)
  3. We then see (2,5,4) but this already exists in df, hence we cannot add it.
  4. Next is (3,2,0) and we can clearly add this for the same reason as #2. (id=2)
  5. Finally, it's (2,1,3). While this triple has not existed yet in df but since it's a permutation to an existing triplet in df (which is the (1,2,3)), then we cannot add it to df.

In the end, the final df should look something like this.

id  a  b  c
 0  1  2  3
 1  2  5  4
 2  3  2  0

Anyone knows how this can be done? My idea is to first curate an auxiliary list LL that would contain these "unique" triples and then just transform it into a pandas df. But I'm not sure if it's a fast and elegant efficient approach.

CodePudding user response:

Fast solution

Create a numpy array from the list, then sort the array along axis=1 and use duplicated to create a boolean mask to identify dupes, then remove the duplicate rows from the array and create a new dataframe

a = np.array(L)
m = pd.DataFrame(np.sort(a, axis=1)).duplicated()
pd.DataFrame(a[~m], columns=['a', 'b', 'c'])

Result

   a  b  c
0  1  2  3
1  2  5  4
2  3  2  0

CodePudding user response:

You can use a dictionary comprehension with a frozenset of the tuple as key to eliminate the duplicated permutations, then feed the values to the DataFrame constructor:

L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]

df = pd.DataFrame({frozenset(t): t for t in L[::-1]}.values(),
                  columns=['a', 'b', 'c'])

output:

   a  b  c
0  1  2  3
1  3  2  0
2  2  5  4

If order is important, you can use a set to collect the seen values instead:

seen = set()

df = pd.DataFrame([t for t in L if (f:=frozenset(t)) not in seen
                                   and not seen.add(f)],
                   columns=['a', 'b', 'c'])

output:

   a  b  c
0  1  2  3
1  2  5  4
2  3  2  0
  • Related