Averaging multiple columns at specific indexes in a pandas dataframe, whilst avoiding Performance Wa-CodePudding

I have a fairly large dataframe which I am trying to combine the columns of in a very specific manner. The original dataframe has 2150 columns and the final dataframe should have around 500 by taking the average of some spread of columns to produce new column. The spread changes which is why I have tried a list which has the start of each column group.

My actual code gets the desired results. However, with the warning,

"PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

df1[str(val)] = df[combine].mean(axis=1)"

I have cannot think of a smart way to use concat for one single combine at the end whilst still taking the mean of each group. I am also new to writing code and any corrections to my style would be appreciated, especially where I have to break out of the loop.

Here is my actual code.

import pandas as pd
df = pd.read_csv("some file location")

new_cols = list(range(350, 702, 3))   list(range(707, 1398, 6))   \
    list(range(1407, 2098, 10))   list(range(2112, 2488, 15))   [2501]
cols = list(map(int, list(df.columns)[1:]))
df1 = df.copy()
for i, val in enumerate(new_cols):
    if val == 2501:
        break
    combine = list(map(str, range(new_cols[i], new_cols[i 1])))
    print(combine)
    df1 = df1.drop(combine, axis=1, inplace=False)
    df1[str(val)] = df[combine].mean(axis=1)
    
df1.to_csv("data_reduced_precision.csv", index=False)

print("Finished")

Here is a minimal example which shows what I am trying to achieve. It doesn't produce the PerformanceWarning as it has only a few columns. But illustrates my method I hope.

df1 = pd.DataFrame({'1': [1, 2, 3, 4], 
                    '2': [5, 6, 7, 8], 
                    '3': [9, 10, 11, 12], 
                    '4': [13, 14, 15, 16],
                    '5': [17, 18, 19, 20], 
                    '6': [21, 22, 23, 24], 
                    '7': [25, 26, 27, 28]})
df2 = df1.copy()
# df2 should have columns 1,2,5 which are the mean of df1 columns [1],[2,3,4],[5,6,7]
new_cols = [1, 2, 5, 8]
for i, val in enumerate(new_cols):
    if val == 8:
        break
    #All the column names are integers as str
    combine = list(map(str, range(new_cols[i], new_cols[i 1])))
    df2 = df2.drop(combine, axis=1, inplace=False)
    df2[str(val)] = df1[combine].mean(axis=1)
print(df2)
     1     2     5
0  1.0   9.0  21.0
1  2.0  10.0  22.0
2  3.0  11.0  23.0
3  4.0  12.0  24.0

CodePudding user response：

I would move your dataframe operations out of your for-loop.

import pandas
df1 = pandas.DataFrame({
    '1': [1, 2, 3, 4], 
    '2': [5, 6, 7, 8], 
    '3': [9, 10, 11, 12], 
    '4': [13, 14, 15, 16],
    '5': [17, 18, 19, 20], 
    '6': [21, 22, 23, 24], 
    '7': [25, 26, 27, 28],
})

# df2 should have columns 1,2,5 which are the mean of df1 columns [1],[2,3,4],[5,6,7]

new_cols = [1, 2, 5, 8]
combos = []
for i, val in enumerate(new_cols):
    if val != 8:
        #All the column names are integers as str
        combos.append(list(map(str, range(new_cols[i], new_cols[i 1]))))

df2 = df1.assign(**{
    str(maincol): df1.loc[:, combo].mean(axis="columns")
    for maincol, combo in zip(new_cols, combos)
}).loc[:, map(str, new_cols[:-1])]

Unless I'm mistaken, this will pass around references to the original df1 instead of making a bunch of copies (i.e., df2 = df2.drop(...).

Printing out df1, I get:

     1     2     5
0  1.0   9.0  21.0
1  2.0  10.0  22.0
2  3.0  11.0  23.0
3  4.0  12.0  24.0

If I scale this up to a 500,000 x 20 dataframe, it completes seemingly instantly without warning on my machine:

import numpy
dfbig = pandas.DataFrame(
    data=numpy.random.normal(size=(500_000, 20)),
    columns=list(map(str, range(1, 21)))
)

new_cols = [1, 2, 5, 8, 12, 13, 16, 17, 19]
combos = []
for i, val in enumerate(new_cols[:-1]):
    combos.append(list(map(str, range(new_cols[i], new_cols[i 1]))))

dfbig2 = dfbig.assign(**{
    str(maincol): dfbig.loc[:, combo].mean(axis="columns")
    for maincol, combo in zip(new_cols, combos)
}).loc[:, map(str, new_cols[:-1])]