We have some data in a Delta source which has nested structures. For this example we are focusing on a particular field from the Delta named `status`, which has a number of sub-fields: `commissionDate`, `decommissionDate`, `isDeactivated`, `isPreview`, and `terminationDate`.
In our transformation we currently read the Delta file in using PySpark, convert the DataFrame to pandas using `df.toPandas()`, and operate on this pandas DataFrame using the pandas API. Once we have this pandas DataFrame we would like to access its fields without using row iteration.

The data in pandas looks like the following when queried using `inventory_df["status"][0]` (i.e. `inventory_df["status"]` is a Series of PySpark `Row` objects):

```
Row(commissionDate='2011-07-24T00:00:00+00:00', decommissionDate='2013-07-15T00:00:00+00:00', isDeactivated=True, isPreview=False, terminationDate=None)
```
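For context, a minimal sketch of the read-and-convert step (the Delta path and the `spark` session name are hypothetical, not from the question):

```python
# Hypothetical path; assumes an active SparkSession named `spark`
# with Delta Lake support configured.
df = spark.read.format("delta").load("/path/to/inventory")

# Bring the table into pandas; struct columns arrive as Row objects.
inventory_df = df.toPandas()
```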
We have found success using row iteration like:

```python
unit_df["Active"] = [
    not row["isDeactivated"] for row in inventory_df["status"]
]
```

but we have to use row iteration each time we want to access data from `inventory_df`. This is more verbose and less efficient.
We would love to be able to do something like:

```python
unit_df["Active"] = [
    not inventory_df["status.isDeactivated"]
]
```

which is similar to the Spark destructuring approach, and would allow accessing all of the rows at once, but there doesn't seem to be equivalent pandas logic.
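A close pandas equivalent does exist once the Rows are converted to plain dicts. A minimal sketch (one pass over the column to flatten it, then vectorized access; `pd.json_normalize` is standard pandas, not something from the original post):

```python
import pandas as pd

# One pass: turn each PySpark Row into a dict, then build a flat DataFrame.
status_df = pd.json_normalize([row.asDict() for row in inventory_df["status"]])

# The sub-fields are now ordinary columns, so access is vectorized.
unit_df["Active"] = ~status_df["isDeactivated"]
```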
The data within PySpark has a format like `status: struct<commissionDate:string,decommissionDate:string,isDeactivated:boolean,isPreview:boolean,terminationDate:string>` and we can use the format mentioned above, selecting a sub-column like `df.select("status.isDeactivated")`.
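If the flattening can happen before the conversion, doing it on the Spark side avoids the Row objects entirely. A sketch, assuming the PySpark DataFrame `df` from above (`"status.*"` expands every sub-field of the struct into a top-level column):

```python
# Flatten the struct in Spark, then convert; pandas never sees Row objects.
flat_df = df.select("*", "status.*").drop("status")
inventory_df = flat_df.toPandas()

# Sub-fields are now plain pandas columns.
unit_df["Active"] = ~inventory_df["isDeactivated"]
```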
How can this approach be done using pandas?
CodePudding user response:
This may get you where you want to be. The Rows live in `inventory_df["status"]`, and `pd.DataFrame` needs an explicit index when built from all-scalar values:

```python
status_df = inventory_df["status"].apply(lambda x: pd.DataFrame(x.asDict(), index=[0]))
```

From here I would do:

```python
inventory_df = pd.concat([pd.concat(status_df.tolist(), ignore_index=True), inventory_df], axis=1)
```
Which would get you a single `pd.DataFrame`, now with columns for `commissionDate`, `decommissionDate`, etc.
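With the struct flattened, the goal from the question becomes a plain vectorized expression (a usage sketch, reusing the names above):

```python
# No row iteration: negate the whole boolean column at once.
unit_df["Active"] = ~inventory_df["isDeactivated"]
```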