How to reduce repeated elements in a Pandas dataframe with Python-CodePudding

I'm working with a dataframe that looks like this:

A                       B           C       D       E   F   G   H
ctg.s1.000000F_arrow    CDS gene    21215   22825   0       .   DAFEIOHN_00017
ctg.s1.000000F_arrow    CDS gene    21215   22825   0       .   DAFEIOHN_00017
ctg.s1.000000F_arrow    CDS gene    64501   66033   0   -   .   DAFEIOHN_00049
ctg.s1.000000F_arrow    CDS gene    70234   78846   0       .   DAFEIOHN_00053
ctg.s1.000000F_arrow    CDS gene    103455  106526  0       .   DAFEIOHN_00074
ctg.s1.000000F_arrow    CDS gene    161029  161712  0       .   DAFEIOHN_00132
ctg.s1.000000F_arrow    CDS gene    170711  171520  0       .   DAFEIOHN_00142
ctg.s1.000000F_arrow    CDS gene    203959  204450  0   -   .   DAFEIOHN_00174
ctg.s1.000000F_arrow    CDS gene    211381  212196  0       .   DAFEIOHN_00184
ctg.s1.000000F_arrow    CDS gene    236673  238499  0       .   DAFEIOHN_00209
ctg.s1.000000F_arrow    CDS gene    533077  533850  0       .   DAFEIOHN_00475
ctg.s1.000000F_arrow    CDS gene    533995  535194  0       .   DAFEIOHN_00572
ctg.s1.000000F_arrow    CDS gene    641146  643083  0       .   DAFEIOHN_00572

As you can see, in the H column there are repeated elements like DAFEIOHN_00017 or DAFEIOHN_00572. I'd like to modify this dataframe in order to obtain something like this:

A                       B           C       D       E   F   G   H                I
ctg.s1.000000F_arrow    CDS gene    21215   22825   0       .   DAFEIOHN_00017   2
ctg.s1.000000F_arrow    CDS gene    64501   66033   0   -   .   DAFEIOHN_00049   1
ctg.s1.000000F_arrow    CDS gene    70234   78846   0       .   DAFEIOHN_00053   1
ctg.s1.000000F_arrow    CDS gene    103455  106526  0       .   DAFEIOHN_00074   1
ctg.s1.000000F_arrow    CDS gene    161029  161712  0       .   DAFEIOHN_00132   1
ctg.s1.000000F_arrow    CDS gene    170711  171520  0       .   DAFEIOHN_00142   1
ctg.s1.000000F_arrow    CDS gene    203959  204450  0   -   .   DAFEIOHN_00174   1
ctg.s1.000000F_arrow    CDS gene    211381  212196  0       .   DAFEIOHN_00184   1
ctg.s1.000000F_arrow    CDS gene    236673  238499  0       .   DAFEIOHN_00209   1
ctg.s1.000000F_arrow    CDS gene    533077  533850  0       .   DAFEIOHN_00475   1
ctg.s1.000000F_arrow    CDS gene    533995  535194  0       .   DAFEIOHN_00572   2

In this second dataframe the repeated elements are shown only one time and there's a new column I in which the occurrence of each element of the H column is provided.

How can I do that?

Thank you.

CodePudding user response：

You can use drop_duplicates to remove rows that are duplicated in a specific column, and use assign to create a new column with values returned from a combination of groupby('H') and transform('count') to determine the count of each unique value of H:

df = df.drop_duplicates(subset='H').assign(I=df.groupby('H')['H'].transform('count'))

Output:

>>> df
                       A         B       C       D  E  F  G               H  I
0   ctg.s1.000000F_arrow  CDS-gene   21215   22825  0     .  DAFEIOHN_00017  2
2   ctg.s1.000000F_arrow  CDS-gene   64501   66033  0  -  .  DAFEIOHN_00049  1
3   ctg.s1.000000F_arrow  CDS-gene   70234   78846  0     .  DAFEIOHN_00053  1
4   ctg.s1.000000F_arrow  CDS-gene  103455  106526  0     .  DAFEIOHN_00074  1
5   ctg.s1.000000F_arrow  CDS-gene  161029  161712  0     .  DAFEIOHN_00132  1
6   ctg.s1.000000F_arrow  CDS-gene  170711  171520  0     .  DAFEIOHN_00142  1
7   ctg.s1.000000F_arrow  CDS-gene  203959  204450  0  -  .  DAFEIOHN_00174  1
8   ctg.s1.000000F_arrow  CDS-gene  211381  212196  0     .  DAFEIOHN_00184  1
9   ctg.s1.000000F_arrow  CDS-gene  236673  238499  0     .  DAFEIOHN_00209  1
10  ctg.s1.000000F_arrow  CDS-gene  533077  533850  0     .  DAFEIOHN_00475  1
11  ctg.s1.000000F_arrow  CDS-gene  533995  535194  0     .  DAFEIOHN_00572  2

CodePudding user response：

We can use a groupby and count the element like so :

df.groupby('H').count()