Split column and append to existing column Pyspark-CodePudding

I have a dataframe like this I want to split the colum values based on delimiter and append it to the same column using PySpark

Input:

--------------------------
| Name          | Country |
|-------------------------|
| A;B;C         |  USA    |
| X;Y           |  IND    |
| W;D;F;G       |  UK     |
| H             |  IND    |
| J;K;L;S;I;O   |  USA    |
---------------------------

Expected output:

 ----------------
|Name|  Country |
|---------------|
|A   |    USA   |
|B   |    USA   |
|C   |    USA   |
|X   |    IND   |
|Y   |    IND   |
|W   |    UK    |
|D   |    UK    |
|F   |    UK    |
|G   |    UK    |
|H   |    IND   |
|J   |    USA   |
|K   |    USA   |
|L   |    USA   |
|S   |    USA   |
|I   |    USA   |
|O   |    USA   |
-----------------

CodePudding user response：

After splitting the string by ; using split. Then with explode each element in the array can be converted to rows.

from pyspark.sql import functions as F

data = [("A;B;C", "USA",),
        ("X;Y", "IND",),
        ("W;D;F;G", "UK",),
        ("H", "IND",),
        ("J;K;L;S;I;O", "USA",), ]
df = spark.createDataFrame(data, ("Name", "Country",))

df.withColumn("Name", F.explode(F.split(F.col("Name"), ";"))).show()

Output

 ---- ------- 
|Name|Country|
 ---- ------- 
|   A|    USA|
|   B|    USA|
|   C|    USA|
|   X|    IND|
|   Y|    IND|
|   W|     UK|
|   D|     UK|
|   F|     UK|
|   G|     UK|
|   H|    IND|
|   J|    USA|
|   K|    USA|
|   L|    USA|
|   S|    USA|
|   I|    USA|
|   O|    USA|
 ---- -------

CodePudding user response：

Below code is an example of splitting the columns values and creating new rows

from pandas import DataFrame

df = DataFrame([{'Name': 'a;b;c', 'Country': 1},
               {'Name': 'd;e;f', 'Country': 2}])
new_df = DataFrame(df.Name.str.split(';').tolist(), df.Country).stack()
new_df = new_df.reset_index([0, 'Country'])
new_df.columns = ['Country', 'Name']