I've seen a lot of post on removing duplicates but those don't apply to my case. The idea is I only care about whether the dataset contain IDs associate with more than one industries, if an ID has more than one industry, completely remove that ID and rows associate with it from the dataset. Can this be done with SQL? Python?
For example:
ID | Date | Industry |
S000123 | oct/1/22 | Media |
S000123 | oct/1/22 | Education |
S000456 | oct/4/22 | Auto |
S000789 | oct/4/22 | Beverage |
becomes
ID | Date | Industry |
S000456 | oct/4/22 | Auto |
S000789 | oct/4/22 | Beverage |
CodePudding user response:
You can do:
delete from t
where id in (
select id from t group by id having count(distinct industry) > 1
)
See fiddle.
CodePudding user response:
This will select only the rows you are looking for:
select *
from data
where ID in (
select ID
from data
group by ID
having count(distinct Industry) <= 1
)
The inner query selects only IDs with one or fewer (in case of a NULL Industry) different values for Industry.