I currently have a pandas dataframe that looks like this:
| location | count | qty | approved_count |
|---|---|---|---|
| Phoenix | 24 | 300 | 15 |
| Dallas | 18 | 403 | 14 |
I would like to iterate over the columns, sum each of them, and append the totals as a new row at the bottom, with the value "Grand Total" in the 'location' column. The resulting dataframe should look like this:
| location | count | qty | approved_count |
|---|---|---|---|
| Phoenix | 24 | 300 | 15 |
| Dallas | 18 | 403 | 14 |
| Grand Total | 42 | 703 | 29 |
I am currently able to get this result as follows:
df = df.append({'location': 'Grand Total',
                'count': df['count'].sum(),
                'qty': df['qty'].sum(),
                'approved_count': df['approved_count'].sum()},
               ignore_index=True)
However, I would like to iterate over the columns dynamically and sum them, excluding the 'location' column from the summation. Is this possible with pandas or PySpark?
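In other words, I am hoping for something along these lines, where the totals dict is built programmatically instead of typed out per column (a rough sketch of the idea, still using the append call above, which I realize is deprecated in newer pandas):
# Sketch: build the totals dict from whatever columns exist, skipping 'location'
totals = {c: df[c].sum() for c in df.columns if c != 'location'}
totals['location'] = 'Grand Total'
df = df.append(totals, ignore_index=True)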
CodePudding user response:
Try:
df = df.set_index("location")
df.loc["Grand Total"] = df.sum()  # sums every remaining column
df = df.reset_index()
>>> df
      location  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
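If the frame ever contains other non-numeric columns besides location, the same pattern should still work by restricting the sum to the numeric columns (the non-numeric columns will simply be NaN in the total row):
df = df.set_index("location")
df.loc["Grand Total"] = df.sum(numeric_only=True)  # sum only the numeric columns
df = df.reset_index()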
Or in one line using concat:
>>> pd.concat([df.set_index("location"), df.drop("location",axis=1).sum().rename("Grand Total").to_frame().T]).reset_index()
         index  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
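Note that concat loses the index name here, so reset_index brings the labels back as a column called index; if you want the original header back, rename it, for example:
out = pd.concat([df.set_index("location"),
                 df.drop("location", axis=1).sum().rename("Grand Total").to_frame().T])
out = out.reset_index().rename(columns={"index": "location"})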
CodePudding user response:
Using PySpark as below.
Input Data
df = spark.createDataFrame([('Phoenix', 24, 300, 15), ('Dallas', 18, 403, 14)], schema = ["location", "count", "qty", "approved_count"])
df.show()
+--------+-----+---+--------------+
|location|count|qty|approved_count|
+--------+-----+---+--------------+
| Phoenix|   24|300|            15|
|  Dallas|   18|403|            14|
+--------+-----+---+--------------+
Creating a new dataframe with the sum of every column other than 'location':
from pyspark.sql.functions import lit, sum
df1 = df.withColumn("location", lit("Grand Total")).groupBy("location").agg(*[sum(c).alias(c) for c in df.columns if c != "location"])
df1.show()
+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+
Union the two dataframes to append the totals row:
df.unionByName(df1).show()
+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|    Phoenix|   24|300|            15|
|     Dallas|   18|403|            14|
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+
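For completeness, here is the same flow as a self-contained sketch, assuming a local Spark session is acceptable (the snippets above assume an existing spark session; the totals variable and the appName are just placeholders):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").appName("grand_total").getOrCreate()

df = spark.createDataFrame([('Phoenix', 24, 300, 15), ('Dallas', 18, 403, 14)],
                           schema=["location", "count", "qty", "approved_count"])

# One-row dataframe of column sums, labelled 'Grand Total'
totals = df.withColumn("location", F.lit("Grand Total")) \
           .groupBy("location") \
           .agg(*[F.sum(c).alias(c) for c in df.columns if c != "location"])

df.unionByName(totals).show()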