Home > Software engineering >  Pandas drop duplicates and keep first entry when fulfilling multi column condition
Pandas drop duplicates and keep first entry when fulfilling multi column condition

Time:11-01

I have a DataFrame on the following format.

  id    |   value    |   more columns

  1          ''
  1          '' 
  1          ''
 1.1     "some text"
 1.1        "xxx"
 1.2        "xxx"
  2          ''
 2.1         ''
2.1.1       "xxx"
  1          ''
  3          ''
  3          ''
 3.1        "hi"

I want to drop duplicates and keep the first instance/row but ONLY consider the group where value is an empty string and with same id so the following output is created.

  id    |   value    |   more columns

  1          ''
 1.1     "some text"
 1.1        "xxx"
 1.2        "xxx"
  2          ''
 2.1         ''
2.1.1       "xxx"
  3          ''
 3.1        "hi"

Is there any smart / easy way to do this?

CodePudding user response:

df = pd.DataFrame({'id': ['1', '1', '1', '1.1', '1.1', '1.2', '2', '2.1', '2.1.1', '1', '3', '3', '3.1'],
                   'value': ['', '', '', 'some text', 'xxx', 'xxx', '', '', 'xxx', '', '', '', 'hi']})


def fun(x):
    if x.value.eq('').all():
        return x.drop_duplicates('value')
    else:
        return x


df = df.groupby('id').apply(fun).reset_index(level=0,drop=True)
print(df)
       id      value
0       1           
3     1.1  some text
4     1.1        xxx
5     1.2        xxx
6       2           
7     2.1           
8   2.1.1        xxx
10      3           
12    3.1         hi
  • Related