Home > database >  Iteration over df rows to sum groups of items
Iteration over df rows to sum groups of items

Time:09-24

I am new to coding. I looked for similar questions on this site, that helped me to come up with a working version of my code, but I need help with making it more professional.

I need help with iterating over rows of a data frame in pandas. What I want to do is to find same items (e.g. Groceries) in 'Description' column and total (sum) their values from 'Amount' column, and finally to write the result to a .csv file. The reason I am doing all this is to compile data for a bar graph I want to create based on those categories.

I was able to accomplish all that with the help of the following code, but that most likely is not very pythonic or efficient. What I did is that I used a print statement nested in an if statement to get category label (i) and amount to print to file. The issue is that I had to add a lot of things for the whole to work. First, I had to create an empty list to make sure that the if statement does not trigger .loc every time it sees a desired item in 'Description' column. Second, I am not sure if saving print statements is the best way to go here, as it appears very ersatz. It feels like I am a step away from using punch cards. In short, I would appreciate if someone could help me in making my code more up to standards.

'''

used_list = []
for i in df['Description']:
    if i in used_list:
        continue
    sys.stdout = open(file_to_hist, "a")
    print(i,',', df.loc[df['Description'] == i, 'Amount'].sum())
    used_list.append(i)

'''

I also tried a slightly different approach (save results directly into a df), but then I get NaN values all across the 'Amount' column and no other errors (exit code 0) to help me understand what is going on:

'''

used_list = []
df_hist_data = pd.DataFrame(columns=['Description', 'Amount'])
for i in df['Description']:
    if i in used_list:
        continue
    df_hist_data = df_hist_data.append({'Description' : i}, {'Amount' : df.loc[df['Description'] == i, 'Amount'].sum()})
    used_list.append(i)

print(df_hist_data)

'''

CodePudding user response:

You can select only rows that match a criteria with df[ a boolean matrix here ]

When doing df["a column name"]=="value" you actually get a boolean matrix where rows where "a column name" == "value" are True and other are False

To sumarize this : Dataframe[Dataframe["Description"] == "banana"] is going to give you a view to a new dataframe where only rows matching your condition are kept. (original dataframe is not altered)

If you select column "Amount" of this dataframe and .sum() it, you have what you desired, in one line. That's the typical pandadorable (equivalent of pythonic for pandas) way of doing conditionnal sums.

If require, select the rows of the dataframe where condition can take multiple values, use .isin() to get your boolean matrix

Dataframe["Description"].isin(["banana","apple"])

Then, when scanning all possible values of "Description" in your dataframe, use .unique() when generating your iterator.

And then you can finally append Series to your empty dataframe before saving it as csv.

Overal, we get the code :

import pandas as pd

Dataframe = pd.DataFrame([
    {"Description":"apple","Amount":15},
    {"Description":"banana","Amount":1},
    {"Description":"berry","Amount":155},
    {"Description":"banana","Amount":4}])

df_hist_data = pd.DataFrame(columns=['Description', 'Sum'])

for item in Dataframe["Description"].unique() :
    df_hist_data = df_hist_data.append( pd.Series( 
        { "Description" : item , 
         "Sum" : Dataframe[(Dataframe["Description"].isin([item]))]["Amount"].sum() }
        ), ignore_index=True )

OUT: 
>> 20

You can also do it even more pythonically, in one line with a list comprehension :


selector = "Description"
sum_on = "Amount"
new_df =  pd.DataFrame([  {selector : item , sum_on : df[(df[selector].isin([item]))][sum_on].sum() } for  item in df[selector].unique() ] )

  • Related