I'm currently working on some python dataframes over on pandas. And I'm not sure how this operation can be done. For example, I have an empty dataframe df
and list of the following triples:
L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]
I wish to add all these triples into the dataframe df
with columns ['id', 'a', 'b', 'c']
according to some constraint. The id
is simply a counter that determines how many items have been added so far and a
, b
, and c
are columns for the triples (but they would be commutative with each other). So the idea is to linearly traverse all items in L
and then add each one to the df
according to the restriction:
- It is ok to add
(1,2,3)
sincedf
is still empty. (id=0
) - It is ok to add
(2,5,4)
since it or any of its permutation has not appeared yet indf
. (id=1
) - We then see
(2,5,4)
but this already exists indf
, hence we cannot add it. - Next is
(3,2,0)
and we can clearly add this for the same reason as #2. (id=2
) - Finally, it's
(2,1,3)
. While this triple has not existed yet indf
but since it's a permutation to an existing triplet indf
(which is the(1,2,3)
), then we cannot add it todf
.
In the end, the final df
should look something like this.
id a b c
0 1 2 3
1 2 5 4
2 3 2 0
Anyone knows how this can be done? My idea is to first curate an auxiliary list LL
that would contain these "unique" triples and then just transform it into a pandas df
. But I'm not sure if it's a fast and elegant efficient approach.
CodePudding user response:
Fast solution
Create a numpy array from the list, then sort the array along axis=1 and use duplicated to create a boolean mask to identify dupes, then remove the duplicate rows from the array and create a new dataframe
a = np.array(L)
m = pd.DataFrame(np.sort(a, axis=1)).duplicated()
pd.DataFrame(a[~m], columns=['a', 'b', 'c'])
Result
a b c
0 1 2 3
1 2 5 4
2 3 2 0
CodePudding user response:
You can use a dictionary comprehension with a frozenset
of the tuple as key to eliminate the duplicated permutations, then feed the values to the DataFrame
constructor:
L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]
df = pd.DataFrame({frozenset(t): t for t in L[::-1]}.values(),
columns=['a', 'b', 'c'])
output:
a b c
0 1 2 3
1 3 2 0
2 2 5 4
If order is important, you can use a set to collect the seen values instead:
seen = set()
df = pd.DataFrame([t for t in L if (f:=frozenset(t)) not in seen
and not seen.add(f)],
columns=['a', 'b', 'c'])
output:
a b c
0 1 2 3
1 2 5 4
2 3 2 0