I have a dataframe like this. I want to split the column values on a delimiter and turn each piece into its own row in the same column using PySpark.
Input:
| Name        | Country |
|-------------|---------|
| A;B;C       | USA     |
| X;Y         | IND     |
| W;D;F;G     | UK      |
| H           | IND     |
| J;K;L;S;I;O | USA     |
Expected output:
| Name | Country |
|------|---------|
| A    | USA     |
| B    | USA     |
| C    | USA     |
| X    | IND     |
| Y    | IND     |
| W    | UK      |
| D    | UK      |
| F    | UK      |
| G    | UK      |
| H    | IND     |
| J    | USA     |
| K    | USA     |
| L    | USA     |
| S    | USA     |
| I    | USA     |
| O    | USA     |
CodePudding user response:
Split the string on `;` using `split()`, then use `explode()` to convert each element of the resulting array into its own row.
from pyspark.sql import functions as F

data = [("A;B;C", "USA"),
        ("X;Y", "IND"),
        ("W;D;F;G", "UK"),
        ("H", "IND"),
        ("J;K;L;S;I;O", "USA")]
df = spark.createDataFrame(data, ("Name", "Country"))

# Split Name into an array, then explode the array into one row per element
df.withColumn("Name", F.explode(F.split(F.col("Name"), ";"))).show()
Output:
+----+-------+
|Name|Country|
+----+-------+
|   A|    USA|
|   B|    USA|
|   C|    USA|
|   X|    IND|
|   Y|    IND|
|   W|     UK|
|   D|     UK|
|   F|     UK|
|   G|     UK|
|   H|    IND|
|   J|    USA|
|   K|    USA|
|   L|    USA|
|   S|    USA|
|   I|    USA|
|   O|    USA|
+----+-------+
CodePudding user response:
The code below is an example of splitting the column values and creating new rows with pandas:
from pandas import DataFrame

df = DataFrame([{'Name': 'a;b;c', 'Country': 1},
                {'Name': 'd;e;f', 'Country': 2}])

# Split each Name into a list, build a frame indexed by Country,
# then stack it so every list element becomes its own row
new_df = DataFrame(df.Name.str.split(';').tolist(), index=df.Country).stack()
new_df = new_df.reset_index(level=0)  # move Country out of the index
new_df.columns = ['Country', 'Name']
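On pandas 0.25 and later, the same result is available more directly via `Series.explode`. A minimal sketch using the column names from the question (the sample data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A;B;C', 'X;Y'], 'Country': ['USA', 'IND']})

# Turn each Name into a list, then explode so each element gets its own row;
# the Country value is repeated for every element of its list
out = df.assign(Name=df['Name'].str.split(';')).explode('Name').reset_index(drop=True)
```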