Home > Mobile >  How to delete rows with the same value? Merge column with same prefix
How to delete rows with the same value? Merge column with same prefix

Time:12-13

Hi everyone, I have two question need helping

Question 1: I have df results after merge and groupby from multiple other df, after merge I have data as belows:

Name Column A Column B Column C Column D Column E Column F
NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
NameA ValueA ValueB ValueC Value_D002 Value_E06 Value_F4

I want keep value at column Name, ColumnD, ColumnE, ColumnF. And remove value at row 2 at colum ColumnA, ColumnB, ColumnC because duplicated value at row 1

  • DF expected:

    Name Column A Column B Column C Column D Column E Column F
    NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
    NameA NaN NaN NaN Value_D002 Value_E06 Value_F4

Question 2 I have df with data as belows:

ABC_x Quantity silent ABC_y Quantity noirse
A 05 NaN NaN
B 03 NaN NaN
NaN NaN D 08
NaN NaN E 09
G 01 NaN NaN

How to merge two column ABC_x and ABC_y (same prefix ABC) to one column ABC, and merge data of two column special quantity to one column Quantity?

  • DF expected:

    ABC Quantity
    A 05
    B 03
    D 08
    E 09
    G 01

Thank you for reading and help me troubleshoot problem, Have a nice day <3

I have try but unsuccessful

CodePudding user response:

Question 1

pandas has a function duplicated that gives you true for duplicates and false otherwise

In [40]: df.duplicated(["Column A"])
Out[40]: 
0    False
1     True
dtype: bool

You can use this for boolean indexing

In [43]: df.loc[df.duplicated(["Column A"]), "Column A"] = np.nan

In [44]: df
Out[44]: 
     Name Column A Column B Column C     Column D    Column E  Column F
0  NameA   ValueA   ValueB   ValueC   Value_D001   Value_E01   Value_F3
1  NameA       NaN  ValueB   ValueC   Value_D002   Value_E06   Value_F4

and the same for the other columns.

Note

You can also pass multiple columns with

In [52]: df.loc[
    ...:     df.duplicated(["Column A", "Column B", "Column C"]),
    ...:     ["Column A", "Column B", "Column C"],
    ...: ] = np.nan

In [53]: df
Out[53]: 
     Name Column A Column B Column C     Column D    Column E  Column F
0  NameA   ValueA   ValueB   ValueC   Value_D001   Value_E01   Value_F3
1  NameA       NaN      NaN      NaN  Value_D002   Value_E06   Value_F4

However, this would replace only where all three columns are duplicated at the same time.

CodePudding user response:

Question1

when column name have 'Column', chk duplicated to NaN

cond1 = df.columns.str.contains('Column')
df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated()))

result:

    Column A    Column B    Column C    Column D    Column E    Column F
0   ValueA      ValueB      ValueC      Value_D001  Value_E01   Value_F3
1   NaN         NaN         NaN         Value_D002  Value_E06   Value_F4

make result to join to name


  • full code

cond1 = df.columns.str.contains('Column')
df.loc[:, ~cond1].join(df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated())))

    Name    Column A    Column B    Column C    Column D    Column E    Column F
0   NameA   ValueA      ValueB      ValueC      Value_D001  Value_E01   Value_F3
1   NameA   NaN         NaN         NaN         Value_D002  Value_E06   Value_F4

Question2

df.set_axis(df.columns.str.split('[ _]').str[0], axis=1).groupby(level=0, axis=1).first()

result

    ABC Quantity
0   A   05
1   B   03
2   D   08
3   E   09
4   G   01
  • Related