Unable to retrieve an item from a tuple/struct type value in pandas dataframe

Time:07-18

I am unable to retrieve a particular item from a tuple/struct type value in a pandas dataframe, although I can accomplish the same thing with a pyspark dataframe.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Each record: (name struct, dob, gender, salary)
dataStruct = [
    (("James", "", "Smith"), "36636", "M", "3000"),
    (("Michael", "Rose", ""), "40288", "M", "4000"),
    (("Robert", "", "Williams"), "42114", "M", "4000"),
    (("Maria", "Anne", "Jones"), "39192", "F", "4000"),
    (("Jen", "Mary", "Brown"), "", "F", "-1"),
]
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
schemaStruct = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ])),
    StructField('dob', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', StringType(), True),
])
df = spark.createDataFrame(data=dataStruct, schema=schemaStruct)
# df.printSchema()
df.select("name.firstname").show()

pandasDF2 = df.toPandas()
display(pandasDF2["name['firstname']"])

Error output: (the screenshot of the error was not preserved)

I am able to select "firstname" from the struct using the pyspark dataframe.

But after converting it into a pandas dataframe, I am also unable to retrieve the same field with the following command:

display(pandasDF2["name.firstname"])

It would be helpful if I could accomplish the same in pandas, that is, retrieve an item from a tuple/struct value in a pandas dataframe.

CodePudding user response:

I'm not sure you can get a tuple/list item within a pandas Series. Have you tried parsing the tuple? display(tuple(pandasDF2['name']))

Also, before trying to manipulate the data, you should know what kind of value it is: display(type(pandasDF2['name']))
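
As a side note on the type check above: type(pandasDF2['name']) reports the type of the column object itself, which is a pandas Series, not the type of the values stored in it. To see what each cell holds, it helps to inspect a single element as well. The following is a minimal sketch that runs without Spark; the name pandas_df and the sample values are assumptions for illustration, and plain dicts stand in for the struct values that toPandas() would produce.

```python
import pandas as pd

# Stand-in data (assumption): plain dicts play the role of the struct values
# that a real df.toPandas() call would place in the 'name' column.
pandas_df = pd.DataFrame({
    "name": [
        {"firstname": "James", "middlename": "", "lastname": "Smith"},
        {"firstname": "Maria", "middlename": "Anne", "lastname": "Jones"},
    ]
})

column_type = type(pandas_df["name"])        # type of the column container
cell_type = type(pandas_df["name"].iloc[0])  # type of one stored value
column_dtype = pandas_df["name"].dtype       # dtype pandas assigned to the column

print(column_type.__name__, cell_type.__name__, column_dtype)
```

On the dataframe from the question, the same .iloc[0] check would show which Python type the struct values were converted to.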

CodePudding user response:

Converting a pyspark dataframe into a pandas dataframe translates each StructType value into a Row object (pyspark.sql.types.Row). To access the fields of a Row you have to use a proper lookup method, which can be done with a simple lambda function or with itemgetter:

from operator import itemgetter

pandasDF2['name'].map(itemgetter('firstname'))

0      James
1    Michael
2     Robert
3      Maria
4        Jen
Name: name, dtype: object
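
The answer mentions a lambda as an alternative but only shows itemgetter. Below is a rough sketch of the lambda form, plus one possible way to spread every field of the struct into its own column. It is written to run without Spark, so plain dicts stand in for the Row objects; pandas_df, name_columns and flat_df are hypothetical names, and the data is a shortened copy of the question's sample. It also shows why pandasDF2["name.firstname"] cannot work: pandas treats that string as a literal column label and does not resolve nested fields.

```python
import pandas as pd

# Stand-in data (assumption): dicts replace the pyspark Row objects so that
# the sketch runs without a Spark session.
pandas_df = pd.DataFrame({
    "name": [
        {"firstname": "James", "middlename": "", "lastname": "Smith"},
        {"firstname": "Michael", "middlename": "Rose", "lastname": ""},
    ],
    "gender": ["M", "M"],
})

# pandas looks up "name.firstname" as a literal label; no such column exists.
has_dotted_column = "name.firstname" in pandas_df.columns

# Lambda equivalent of itemgetter('firstname'): look the field up per cell.
first_names = pandas_df["name"].map(lambda row: row["firstname"])

# Expand all struct fields into separate columns, aligned on the same index.
name_columns = pd.DataFrame(pandas_df["name"].tolist(), index=pandas_df.index)
flat_df = pandas_df.drop(columns="name").join(name_columns)

print(first_names.tolist())
print(list(flat_df.columns))
```

With real Row objects the lambda should work unchanged, since Row supports lookup by field name. For the expansion step, converting each cell first with pandasDF2['name'].map(lambda row: row.asDict()) would likely be needed so that the field names become column labels.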