I am working on a data frame that has columns like this
| col_a | col_b |
| -------- | -----------------|
| First | name=x, address=Y|
| Second | name=x, address=Y|
I want to convert it in a data frame with col_b converted into two columns like this
| col_a | name | address |
| -------- | -----|---------|
| First | x | Y |
| Second | x | Y |
is it possible to do?
Thanks
CodePudding user response:
You can use split since your column is interest is StringType
and extract the relevant indexes using getItem -
Data Preparation
input_str = """
First name=x,address=Y
Second name=x,address=Y
""".split()
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "col_a col_b".split()))
n = len(input_values)
n_col = 2
input_list = [tuple(input_values[i:i n_col]) for i in range(0,n,n_col)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF.show()
------ ----------------
| col_a| col_b|
------ ----------------
| First|name=x,address=Y|
|Second|name=x,address=Y|
------ ----------------
Split
first_split_col = F.split(sparkDF['col_b'], ',')
sparkDF = sparkDF.withColumn('name', F.split(first_split_col.getItem(0),'=').getItem(1))
sparkDF = sparkDF.withColumn('address', F.split(first_split_col.getItem(1),'=').getItem(1))
sparkDF.show()
------ ---------------- ---- -------
| col_a| col_b|name|address|
------ ---------------- ---- -------
| First|name=x,address=Y| x| Y|
|Second|name=x,address=Y| x| Y|
------ ---------------- ---- -------