I am unable to retrieve a particular field from a tuple/struct-type value in a pandas DataFrame, although I can do the same thing with a PySpark DataFrame.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
# import pandas as pd
dataStruct = [
    (("James", "", "Smith"), "36636", "M", "3000"),
    (("Michael", "Rose", ""), "40288", "M", "4000"),
    (("Robert", "", "Williams"), "42114", "M", "4000"),
    (("Maria", "Anne", "Jones"), "39192", "F", "4000"),
    (("Jen", "Mary", "Brown"), "", "F", "-1"),
]
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
schemaStruct = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('dob', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', StringType(), True)
])
df = spark.createDataFrame(data=dataStruct, schema=schemaStruct)
# df.printSchema()
df.select("name.firstname").show()
pandasDF2 = df.toPandas()
display(pandasDF2["name['firstname']"])
Error output:
I am able to select "firstname" from the struct using the PySpark DataFrame.
But after converting it to a pandas DataFrame, I am unable to retrieve the same field; the following command fails as well:
display(pandasDF2["name.firstname"])
It would be helpful if I could accomplish the same thing in pandas, that is, retrieve an item from a tuple/struct value in a pandas DataFrame.
CodePudding user response:
I'm not sure you can get a tuple/list item within a Pandas Series.
Have you tried parsing the tuple?
display(tuple(pandasDF2['name']))
Also, before trying to manipulate the data, you should check what kind of value it is:
display(type(pandasDF2['name']))
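Note that the type of the column and the type of the values inside it are two different things. Here is a minimal sketch of checking both, assuming pandas is installed. The namedtuple `Name` is a hypothetical stand-in for the objects that toPandas() produces, used only so the snippet runs without a Spark session:

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the struct values; not the real pyspark Row class.
Name = namedtuple("Name", ["firstname", "middlename", "lastname"])

names = pd.Series(
    [Name("James", "", "Smith"), Name("Michael", "Rose", "")],
    name="name",
)

column_type = type(names)           # type of the container (the column)
element_type = type(names.iloc[0])  # type of what a single cell holds

print(column_type.__name__)
print(element_type.__name__)
print(isinstance(names.iloc[0], tuple))
```

On the real data, the equivalent check would be type(pandasDF2['name'].iloc[0]).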
CodePudding user response:
Converting a PySpark DataFrame to a pandas DataFrame translates the StructType column into values of type Row (pyspark.sql.types.Row). So in order to access the fields of a Row you have to use a proper lookup on each value; this can be done with a simple lambda function or with itemgetter from the operator module:
from operator import itemgetter
pandasDF2['name'].map(itemgetter('firstname'))
0      James
1    Michael
2     Robert
3      Maria
4        Jen
Name: name, dtype: object
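For completeness, here is a rough sketch of the lambda variant, plus one way to expand every field of the struct into its own column. It assumes pandas is installed. The namedtuple `Name` and the frame `pdf` are hypothetical stand-ins for the Row values and for pandasDF2, so the snippet runs without Spark; the field list is written out by hand because it is known from the schema:

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the Row values produced by toPandas().
Name = namedtuple("Name", ["firstname", "middlename", "lastname"])

pdf = pd.DataFrame({
    "name": [Name("James", "", "Smith"), Name("Maria", "Anne", "Jones")],
    "dob": ["36636", "39192"],
})

# Lambda variant: look the field up by attribute on each value.
first = pdf["name"].map(lambda r: r.firstname)

# Expand the tuple-like values into one column per field.
fields = ["firstname", "middlename", "lastname"]
expanded = pd.DataFrame(pdf["name"].tolist(), columns=fields, index=pdf.index)

# Replace the nested column with the flat ones.
flat = pdf.drop(columns="name").join(expanded)

print(first.tolist())
print(flat.columns.tolist())
```

If the nested column is not needed in pandas at all, another option is to flatten on the Spark side before converting, for example with df.select("name.*", "dob", "gender", "salary").toPandas().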