Home > Mobile >  completely remove rows that contain ID associate with more than one industry
completely remove rows that contain ID associate with more than one industry

Time:11-04

I've seen a lot of post on removing duplicates but those don't apply to my case. The idea is I only care about whether the dataset contain IDs associate with more than one industries, if an ID has more than one industry, completely remove that ID and rows associate with it from the dataset. Can this be done with SQL? Python?

For example:

    ID    |    Date    |   Industry   |
  S000123 |  oct/1/22  |     Media    |
  S000123 |  oct/1/22  |    Education |
  S000456 |  oct/4/22  |     Auto     |
  S000789 |  oct/4/22  |    Beverage  |

becomes

     ID   |    Date    |   Industry   |
  S000456 |  oct/4/22  |     Auto     |
  S000789 |  oct/4/22  |    Beverage  |

CodePudding user response:

You can do:

delete from t
where id in (
  select id from t group by id having count(distinct industry) > 1
)

See fiddle.

CodePudding user response:

This will select only the rows you are looking for:

select *
from data
where ID in (
    select ID 
    from data 
    group by ID
    having count(distinct Industry) <= 1
)

The inner query selects only IDs with one or fewer (in case of a NULL Industry) different values for Industry.

  • Related