dynamically create a grand total row in pandas/pyspark?


I currently have a pandas dataframe that looks like this:

location     count  qty  approved_count
Phoenix         24  300              15
Dallas          18  403              14

I would like to iterate over the columns, sum each of them, and append a new row to the bottom of the dataframe with the value "Grand Total" in the 'location' column. The resulting dataset should look like this:

location     count  qty  approved_count
Phoenix         24  300              15
Dallas          18  403              14
Grand Total     42  703              29

I am currently able to get this result this way:

    df = df.append({'location': 'Grand Total',
                    'count': df['count'].sum(),
                    'qty': df['qty'].sum(),
                    'approved_count': df['approved_count'].sum()},
                   ignore_index=True)

However, I would like to dynamically iterate over the columns and sum them, excluding the 'location' column from the sum. Is this possible with pandas or PySpark?
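(For reference, a dynamic version of the append call above could look something like the sketch below. It is only a sketch: it assumes 'location' is the only non-numeric column, and it uses pd.concat because DataFrame.append has been removed in pandas 2.0.)

    import pandas as pd

    # Sum every column except 'location', then tag the row and append it.
    totals = {c: df[c].sum() for c in df.columns if c != 'location'}
    totals['location'] = 'Grand Total'
    df = pd.concat([df, pd.DataFrame([totals])], ignore_index=True)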

CodePudding user response:

Try:

df = df.set_index("location")
df.loc["Grand Total"] = df.sum()
df = df.reset_index()

>>> df
      location  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
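This works because 'location' becomes the index, so df.sum() only aggregates the remaining columns. If the frame had additional non-numeric columns, the same pattern could be kept by summing only the numeric ones; a sketch, assuming the extra columns may stay empty (NaN) in the total row:

df = df.set_index("location")
df.loc["Grand Total"] = df.sum(numeric_only=True)
df = df.reset_index()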

Or in one line using concat:

>>> pd.concat([df.set_index("location"), df.drop("location",axis=1).sum().rename("Grand Total").to_frame().T]).reset_index()
         index  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
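Note that the label column comes back as "index" in the one-liner, because the concatenated index no longer has a single name. If it should be called "location" again, a final rename does it; a sketch of the same one-liner, split over lines:

out = (
    pd.concat([
        df.set_index("location"),
        df.drop("location", axis=1).sum().rename("Grand Total").to_frame().T,
    ])
    .reset_index()
    .rename(columns={"index": "location"})
)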

CodePudding user response:

Using PySpark as below:

Input Data

df = spark.createDataFrame(
    [('Phoenix', 24, 300, 15), ('Dallas', 18, 403, 14)],
    schema=["location", "count", "qty", "approved_count"],
)
df.show()

+--------+-----+---+--------------+
|location|count|qty|approved_count|
+--------+-----+---+--------------+
| Phoenix|   24|300|            15|
|  Dallas|   18|403|            14|
+--------+-----+---+--------------+

Creating a new dataframe with the sum of each column

from pyspark.sql.functions import lit, sum

df1 = (
    df.withColumn("location", lit("Grand Total"))
      .groupBy("location")
      .agg(*[sum(c).alias(c) for c in df.columns if c != "location"])
)
df1.show()

+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+

Union the two dataframes to append the grand-total row

df.unionByName(df1).show()

+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|    Phoenix|   24|300|            15|
|     Dallas|   18|403|            14|
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+
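The same steps can be wrapped into a small helper so the label column is a parameter. This is just a sketch; the function name with_grand_total and the label_col argument are not part of the answer above:

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit, sum

def with_grand_total(df: DataFrame, label_col: str = "location") -> DataFrame:
    # Sum every column except the label column, tag the row, and append it.
    totals = (
        df.withColumn(label_col, lit("Grand Total"))
          .groupBy(label_col)
          .agg(*[sum(c).alias(c) for c in df.columns if c != label_col])
    )
    return df.unionByName(totals)

with_grand_total(df).show()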