pyspark.pandas API - How to separate a column of lists into multiple columns?

Time:10-11

I'm trying to separate a column that holds lists such as [599086.9706961295, 4503107.843920314] into two columns ("x" and "y") in my Databricks notebook.

In my Jupyter notebook, the column gets separated like this:

# code from my Jupyter notebook
import numpy as np
import pandas as pd

# the column holding the lists is: xy
# Method 1
complete[['x', 'y']] = pd.Series(np.stack(complete['xy'].values).T.tolist())

# column is also getting separated using this method

# Method 2
def sepXY(xy):
    return xy[0],xy[1]

complete['x'],complete['y'] = zip(*complete['xy'].apply(sepXY))

In my Databricks notebook, I tried both methods and both raise errors:

import numpy as np
import pyspark.pandas as ps

# Method 1
complete[['x', 'y']] = ps.Series(np.stack(complete['xy'].values).T.tolist())

AssertionError:

If I only run ps.Series(np.stack(complete['xy'].values).T.tolist()), I get output with two lists, one for x and one for y:

0    [599086.9706961295, 599079.1456765212, 599059....
1    [4503107.843920314, 4503083.465809557, 4503024...

But when I assign it to complete[['x','y']], it throws the error.

# Method 2
def sepXY(xy):
    return xy[0],xy[1]

complete['x'],complete['y'] = zip(*complete['xy'].apply(sepXY))

ArrowInvalid: Could not convert (599086.9706961295, 4503107.843920314) with type tuple: did not recognize Python value type when inferring an Arrow data type

I checked the data type, and it is not tuple.

I also tried:

complete[['x','y']] = pd.DataFrame(complete.xy.tolist(), index= complete.index)

My kernel restarts if I use this.

# Sample of the column

xy
[599086.9706961295, 4503107.843920314]
[599088.5389507986, 4503112.7796745915]
[599072.8088083105, 4503064.139248001]
[599090.0996424126, 4503117.721156018]
[599074.3909188313, 4503068.925677084]

Answer:

Input:

complete = spark.createDataFrame(
    [([599086.9706961295, 4503107.843920314],),
     ([599088.5389507986, 4503112.7796745915],),
     ([599072.8088083105, 4503064.139248001],),
     ([599090.0996424126, 4503117.721156018],),
     ([599074.3909188313, 4503068.925677084],)],
    ['xy']
).pandas_api()

With the above example, it can be done by extracting one list element per apply call:

complete['x'] = complete['xy'].apply(lambda x: x[0])
complete['y'] = complete['xy'].apply(lambda x: x[1])

print(complete)
#                                         xy              x             y
# 0   [599086.9706961295, 4503107.843920314]  599086.970696  4.503108e+06
# 1  [599088.5389507986, 4503112.7796745915]  599088.538951  4.503113e+06
# 2   [599072.8088083105, 4503064.139248001]  599072.808808  4.503064e+06
# 3   [599090.0996424126, 4503117.721156018]  599090.099642  4.503118e+06
# 4   [599074.3909188313, 4503068.925677084]  599074.390919  4.503069e+06

print(complete.dtypes)
# xy     object
# x     float64
# y     float64
# dtype: object
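
The ArrowInvalid error in the question names a tuple value, which is what sepXY returns; each apply call here returns a single scalar instead. The same idea can be wrapped in a small helper, sketched below. The names x_of, y_of and add_xy_columns are made up for illustration. The "-> float" return hints are an assumption on my side: as far as I know, pandas-on-Spark can read a return type hint on the function passed to Series.apply instead of inferring the type from a sample, but check the documentation for your version.

```python
# Sketch only: works on any DataFrame-like object whose columns support
# .apply, such as the pandas-on-Spark DataFrame `complete` built above.

def x_of(xy) -> float:
    # First list element; the return hint declares the result as a float.
    return float(xy[0])


def y_of(xy) -> float:
    # Second list element.
    return float(xy[1])


def add_xy_columns(df):
    # One scalar-returning apply per new column, so no tuple is produced.
    df["x"] = df["xy"].apply(x_of)
    df["y"] = df["xy"].apply(y_of)
    return df
```

Usage would be complete = add_xy_columns(complete), after which complete should have the extra x and y columns like in the output above.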