Hi everyone, I have two question need helping
Question 1: I have df results after merge and groupby from multiple other df, after merge I have data as belows:
Name | Column A | Column B | Column C | Column D | Column E | Column F |
---|---|---|---|---|---|---|
NameA | ValueA | ValueB | ValueC | Value_D001 | Value_E01 | Value_F3 |
NameA | ValueA | ValueB | ValueC | Value_D002 | Value_E06 | Value_F4 |
I want keep value at column Name, ColumnD, ColumnE, ColumnF. And remove value at row 2 at colum ColumnA, ColumnB, ColumnC because duplicated value at row 1
DF expected:
Name Column A Column B Column C Column D Column E Column F NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
Question 2 I have df with data as belows:
ABC_x | Quantity silent | ABC_y | Quantity noirse |
---|---|---|---|
A | 05 | NaN | NaN |
B | 03 | NaN | NaN |
NaN | NaN | D | 08 |
NaN | NaN | E | 09 |
G | 01 | NaN | NaN |
How to merge two column ABC_x and ABC_y (same prefix ABC) to one column ABC, and merge data of two column special quantity to one column Quantity?
DF expected:
ABC Quantity A 05 B 03 D 08 E 09 G 01
Thank you for reading and help me troubleshoot problem, Have a nice day <3
I have try but unsuccessful
CodePudding user response:
Question 1
pandas has a function duplicated
that gives you true
for duplicates and false
otherwise
In [40]: df.duplicated(["Column A"])
Out[40]:
0 False
1 True
dtype: bool
You can use this for boolean indexing
In [43]: df.loc[df.duplicated(["Column A"]), "Column A"] = np.nan
In [44]: df
Out[44]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN ValueB ValueC Value_D002 Value_E06 Value_F4
and the same for the other columns.
Note
You can also pass multiple columns with
In [52]: df.loc[
...: df.duplicated(["Column A", "Column B", "Column C"]),
...: ["Column A", "Column B", "Column C"],
...: ] = np.nan
In [53]: df
Out[53]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
However, this would replace only where all three columns are duplicated at the same time.
CodePudding user response:
Question1
when column name have 'Column', chk duplicated to NaN
cond1 = df.columns.str.contains('Column')
df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated()))
result:
Column A Column B Column C Column D Column E Column F
0 ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NaN NaN NaN Value_D002 Value_E06 Value_F4
make result to join to name
- full code
cond1 = df.columns.str.contains('Column')
df.loc[:, ~cond1].join(df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated())))
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
Question2
df.set_axis(df.columns.str.split('[ _]').str[0], axis=1).groupby(level=0, axis=1).first()
result
ABC Quantity
0 A 05
1 B 03
2 D 08
3 E 09
4 G 01