dynamically create a grand total row in pandas/pyspark?


I currently have a pandas dataframe that looks like this:

location     count  qty  approved_count
Phoenix         24  300              15
Dallas          18  403              14

I would like to iterate over the columns, sum each of them, and append a new row to the bottom of the dataframe with the value "Grand Total" in the 'location' column. The resulting dataset should look like this:

location     count  qty  approved_count
Phoenix         24  300              15
Dallas          18  403              14
Grand Total     42  703              29

I am currently able to get this result this way:

    df = df.append({'location': 'Grand Total',
                    'count': df['count'].sum(),
                    'qty': df['qty'].sum(),
                    'approved_count': df['approved_count'].sum()},
                   ignore_index=True)

However, I would like to dynamically iterate over the columns and sum them, excluding the 'location' column from the sum. Is this possible with pandas or PySpark?
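(For reference, a dynamic version of the append call above could look something like the sketch below. It is only a sketch: it assumes 'location' is the only non-numeric column, and it uses pd.concat because DataFrame.append has been removed in pandas 2.0.)

    import pandas as pd

    # Sum every column except 'location', then tag the row and append it.
    totals = {c: df[c].sum() for c in df.columns if c != 'location'}
    totals['location'] = 'Grand Total'
    df = pd.concat([df, pd.DataFrame([totals])], ignore_index=True)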

CodePudding user response:

Try:

df = df.set_index("location")
df.loc["Grand Total"] = df.sum()
df = df.reset_index()

>>> df
      location  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
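This works because 'location' becomes the index, so df.sum() only aggregates the remaining columns. If the frame had additional non-numeric columns, the same pattern could be kept by summing only the numeric ones; a sketch, assuming the extra columns may stay empty (NaN) in the total row:

df = df.set_index("location")
df.loc["Grand Total"] = df.sum(numeric_only=True)
df = df.reset_index()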

Or in one line using concat:

>>> pd.concat([df.set_index("location"), df.drop("location",axis=1).sum().rename("Grand Total").to_frame().T]).reset_index()
         index  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
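Note that the label column comes back as "index" in the one-liner, because the concatenated index no longer has a single name. If it should be called "location" again, a final rename does it; a sketch of the same one-liner, split over lines:

out = (
    pd.concat([
        df.set_index("location"),
        df.drop("location", axis=1).sum().rename("Grand Total").to_frame().T,
    ])
    .reset_index()
    .rename(columns={"index": "location"})
)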

CodePudding user response:

Using PySpark as below:

Input Data

df = spark.createDataFrame(
    [('Phoenix', 24, 300, 15), ('Dallas', 18, 403, 14)],
    schema=["location", "count", "qty", "approved_count"],
)
df.show()

+--------+-----+---+--------------+
|location|count|qty|approved_count|
+--------+-----+---+--------------+
| Phoenix|   24|300|            15|
|  Dallas|   18|403|            14|
+--------+-----+---+--------------+

Creating a new dataframe with the sum of each column

from pyspark.sql.functions import lit, sum

df1 = (
    df.withColumn("location", lit("Grand Total"))
      .groupBy("location")
      .agg(*[sum(c).alias(c) for c in df.columns if c != "location"])
)
df1.show()

+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+

Union the two dataframes to append the grand-total row

df.unionByName(df1).show()

+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|    Phoenix|   24|300|            15|
|     Dallas|   18|403|            14|
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+
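The same steps can be wrapped into a small helper so the label column is a parameter. This is just a sketch; the function name with_grand_total and the label_col argument are not part of the answer above:

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit, sum

def with_grand_total(df: DataFrame, label_col: str = "location") -> DataFrame:
    # Sum every column except the label column, tag the row, and append it.
    totals = (
        df.withColumn(label_col, lit("Grand Total"))
          .groupBy(label_col)
          .agg(*[sum(c).alias(c) for c in df.columns if c != label_col])
    )
    return df.unionByName(totals)

with_grand_total(df).show()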