I have a DataFrame in Spark with the following format.
+----------+--------+
| Column 1 | Values |
+----------+--------+
| A        | value1 |
| B        | value2 |
| C        | value2 |
| A        | value1 |
| B        | value3 |
| C        | value1 |
| A        | value1 |
| B        | value1 |
| C        | value2 |
+----------+--------+
I would like to transform it into the following by counting the number of occurrences of each value:
+----------+--------+--------+--------+
| Column 1 | value1 | value2 | value3 |
+----------+--------+--------+--------+
| A        | 3      | 0      | 0      |
| B        | 1      | 1      | 1      |
| C        | 1      | 2      | 0      |
+----------+--------+--------+--------+
CodePudding user response:
You can use the pivot method as follows:
# Sample data from the question (values lower-cased)
df = spark.createDataFrame(
    [("a", "value1"), ("b", "value2"), ("c", "value2"), ("a", "value1"), ("b", "value3"),
     ("c", "value1"), ("a", "value1"), ("b", "value1"), ("c", "value2")],
    ["col1", "col2"])
df.show()
# Pivot on col2, count occurrences per col1, and fill missing combinations with 0
pivotDF = df.groupBy("col1").pivot("col2").count().na.fill(0)
pivotDF.show()
Here is the output I get for the code above with Spark 2.3:
+----+------+------+------+
|col1|value1|value2|value3|
+----+------+------+------+
|   c|     1|     2|     0|
|   b|     1|     1|     1|
|   a|     3|     0|     0|
+----+------+------+------+
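As a side note, if you already know the distinct values of the pivot column, you can pass them to pivot explicitly, which saves Spark the extra job it would otherwise run to discover them, and an orderBy gives a deterministic row order. A minimal sketch, assuming the value list from the sample data:
# Passing the pivot values explicitly skips the extra pass Spark makes
# to collect the distinct values; orderBy fixes the displayed row order.
pivotDF = (df.groupBy("col1")
             .pivot("col2", ["value1", "value2", "value3"])
             .count()
             .na.fill(0)
             .orderBy("col1"))
pivotDF.show()
This produces the same table as above, with the rows sorted a, b, c.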