I have a products
table, which has the product description in two languages, one in English and one in an alternate language.
Let's say:
Product_Desc | Product_Desc_Alt |
---|---|
A | A1 |
A | A2 |
A | A1 |
A | A3 |
B | B1 |
B | B2 |
B | B2 |
C | C1 |
If I do a GROUP BY
statement, there are multiple alternate language occurrences for the same product. So, let's say:
SELECT Product_Desc, COUNT(DISTINCT Product_Desc_Alt) AS CNT
FROM products
GROUP BY Product_Desc
ORDER BY CNT DESC
Product_Desc | CNT |
---|---|
A | 3 |
B | 2 |
C | 1 |
I would like to replace the Product_Desc_Alt
occurrences with the most frequent ones,
so for example I would like the output to be:
Product_Desc | Product_Desc_Alt |
---|---|
A | A1 |
A | A1 |
A | A1 |
A | A1 |
B | B2 |
B | B2 |
B | B2 |
C | C1 |
Obviously, if a product has only one alter lang description, just keep that one.
There may be lots of ways to do that, but I can't think of one.
I am using Azure Databricks so this could also happen with PySpark, but I am interested in doing this the SQL way.
Thanks a lot!
CodePudding user response:
Is this what you mean?
UPDATE products
SET Product_Desc_Alt = (
SELECT TOP 1 Product_Desc_Alt
FROM products P2
WHERE P2.Product_Desc = products.Product_Desc
GROUP BY Product_Desc_Alt
ORDER BY COUNT(*) DESC
)