Calculating averaged data in and writing to csv from a pandas dataframe-CodePudding

I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations. The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation. Data looks like this

x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6

For each x and y pair I want to iterate over a two series of text values and do three things in the z direction.

Calculate the average of one numeric value for all the values with a third specific text value
Sum another numeric value for all the values with the same text value
Write the a resultant table of 'x, y, average, sum' to a csv.

My code does part three (albeit very slowly) but doesn't calculate 1 or 2 or at least I don't appear to get the average and sum calculations in my output.

What have I done wrong and how can I speed it up?

    for text1 in text_list1:
        for text2 in text_list2:
            # Get the data into smaller dataframe
            df = data.loc[ (data["textfield1"] == text1) & (data["textfield2"] == text2 ) ]
                
            #Get the minimum and maximum x and y 
            minXw = df['x'].min()
            maxXw = df['x'].max()
            minYw = df['y'].min()
            maxYw = df['y'].max()
            
            # dictionary for quicker printing
            dict_out  = {}
            rows_list = []
            
            # Make output filename
            filenameOut = text1 "_" text2 "_Values.csv"
            # Start looping through x values
            for x in np.arange(minXw, maxXw, x_inc):
                xcount  = 1
                # Start looping through y values
                for y in np.arange(minYw, maxYw, y_inc):
                    ycount  = 1
                    
                    # calculate average and sum
                    ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                    sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()

                    # Make Dictionary of output values
                    dict_out  = dict([('text1', text1), 
                                      ('text2', text2), 
                                      ('text3', df['text3']),
                                      ('x'    , x-x_inc),
                                      ('y'    , y-y_inc),
                                      ('ave'  , ave_val),
                                      ('sum'   , sum_val)])
                    rows_list_c.append(dict_out)

            # Write csv      
            columns = ['text1','text2','text3','x','y','ave','sum']
            with open(filenameOut, 'w') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=columns)
                writer.writeheader()
                for data in dict_out:
                    writer.writerow(data)

My resultant csv gives me:

text1,text2,text3,x,y,ave,sum

text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0

CodePudding user response：

Not really clear what you're trying to do. But here is a starting point

If you only need to process rows with a specific text3value, start by filtering out the other rows:

df = df[df.text3=="my_value"]

If at this point, you do not need text3 anymore, you can also drop it

df = df.drop(columns="text3")

Then you process several sub dataframes, and write each of them to their own csv file. groupby is the perfect tool for that:

for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
    filenameOut = text1 "_" text2 "_Values.csv"
    # Process sub df
    output_df = process(sub_df)
    # Write sub df
    output_df.to_csv(filenameOut)

Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.

Now let's have a look at the process function (Note that you dont really need to make it a separate function, you could as well dump the function body in the for loop).

At this point, if I understand correctly, you want to compute the sum and the average of every rows that have the same x and y coordinates. Here again you can use groupby and the agg function to compute the mean and the sum of the group.

def process(sub_df):
   # drop the text1 and text2 columns since they are in the filename anyway
   out = sub_df.drop(columns=["text1","text2"])
   
   # Compute mean and max
   return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))

And that's preety much it.

Bonus: 2-liner version (but don't do that...)

for (text1, text2), sub_df in df[df.text3=="my_value"].drop(columns="text3").groupby(["text1", "text2"]):
    sub_df.drop(columns=["text1","text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1 "_" text2 "_Values.csv")

CodePudding user response：

To do this in an efficient way in pandas you will need to use groupby, agg and the in-built to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:

groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])

for (text1, text2), group in groups:
    group.groupby("text3") \
        .agg({"float1": np.mean, "float2": sum}) \
        .to_csv(f"{text1}_{text2}_Values.csv")

It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.

data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])

for (text1, text2), group in groups:
    group.groupby(["text3", "x_bin", "y_bin"]) \
        .agg({"x": "first", "y": "first", "float1": np.mean, "float2": sum}) \
        .to_csv(f"{text1}_{text2}_Values.csv")