How can I condense this Python code into a few lines?

I want to shorten this code. For each row, it subtracts from every value the mean of the other four values in that row and writes the result back into the array. How can I exploit the pattern and reduce this to a few lines?

import pandas as pd
import numpy as np

df = pd.read_csv('Dataset2.csv')
df = df.to_numpy()

for i in range(len(df)):
    # Mean of the other four values in row i, leaving out one column at a time
    mean_1 = df[i, 1:5].sum() / 4
    mean_2 = (df[i, 0:1].sum() + df[i, 2:5].sum()) / 4
    mean_3 = (df[i, 0:2].sum() + df[i, 3:5].sum()) / 4
    mean_4 = (df[i, 0:3].sum() + df[i, 4:5].sum()) / 4
    mean_5 = df[i, 0:4].sum() / 4

    # Subtract each leave-one-out mean from the value it excludes
    df[i, 0] = df[i, 0] - mean_1
    df[i, 1] = df[i, 1] - mean_2
    df[i, 2] = df[i, 2] - mean_3
    df[i, 3] = df[i, 3] - mean_4
    df[i, 4] = df[i, 4] - mean_5

CodePudding user response：

My interpretation of what you are trying to do is:

given an array df, create a new array where the element in row i, column j, is the mean of all values in row i except the one in column j.

If this is correct, then the following will be much quicker. It assumes df is the NumPy array from your code (that is, after df.to_numpy()), since a pandas Series has no reshape method, and that it contains only the five columns needed for this calculation. If there are extra columns, you will need to adjust the solution with indexing.

means = (df.sum(axis=1).reshape((len(df),1)) - df)/4

means will be a NumPy array, so wrap it in a pandas DataFrame if that's what you need. To get the values your loop writes back, subtract it from the original array: df - means.
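As a minimal sketch of the same idea, here is a version that works for any number of columns and uses keepdims instead of reshape. The function name subtract_leave_one_out_mean and the sample values are hypothetical, chosen only for illustration; they are not from the question's dataset.

```python
import numpy as np

def subtract_leave_one_out_mean(arr):
    """Return arr minus, for each element, the mean of the other values in its row."""
    arr = np.asarray(arr, dtype=float)          # float copy, so integer input is safe
    n_cols = arr.shape[1]
    row_sums = arr.sum(axis=1, keepdims=True)   # shape (n_rows, 1), broadcasts over columns
    loo_means = (row_sums - arr) / (n_cols - 1) # mean of the other columns in each row
    return arr - loo_means

# Hypothetical sample data standing in for Dataset2.csv
sample = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                   [10.0, 20.0, 30.0, 40.0, 50.0]])
result = subtract_leave_one_out_mean(sample)
print(result)
```

Because the means are all computed from the original values before any subtraction, this should match the row-by-row loop in the question.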

CodePudding user response：

You could try

import pandas as pd

df = pd.read_csv('Dataset2.csv')
for col in range(df.shape[1]):
    df.iloc[:, col] -= df.iloc[:, [c for c in range(df.shape[1]) if c != col]].mean(axis="columns")

without converting to a np.ndarray.
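One thing to watch: the loop above overwrites each column before the next column's mean is computed, so later columns see already-adjusted values. The loop in the question computes all five means before assigning anything, so the two can give different results. If the question's behaviour is what you want, a possible sketch that stays in pandas and computes every leave-one-out mean from the original values is shown below. The sample data is hypothetical and stands in for Dataset2.csv.

```python
import pandas as pd

# Hypothetical sample data standing in for Dataset2.csv
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]],
    columns=[f"col_{i}" for i in range(5)],
)

n_cols = df.shape[1]
row_sums = df.sum(axis="columns")                            # one sum per row
loo_means = df.rsub(row_sums, axis="index") / (n_cols - 1)   # (row sum - value) / 4
result = df - loo_means                                      # df itself is left unchanged
print(result)
```

Here rsub computes row_sums - df with the Series aligned on the row index, so no explicit loop over columns is needed.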

Timing with 10 million rows:

from random import random
from time import perf_counter

import pandas as pd

# Sample dataframe
num_rows = 10_000_000
df = pd.DataFrame(
    {f"col_{i}": [random() for _ in range(num_rows)] for i in range(5)}
)

# Timing
start = perf_counter()
for col in range(df.shape[1]):
    df.iloc[:, col] -= df.iloc[:, [c for c in range(df.shape[1]) if c != col]].mean(axis="columns")
end = perf_counter()
print(f"Duration: {end - start:.2f} seconds")

Result: Duration: 2.64 seconds on a mediocre machine.

In NumPy:

import pandas as pd

df = pd.read_csv('Dataset2.csv')
arr = df.to_numpy(dtype=float)  # float dtype, so the in-place subtraction also works for integer data
for col in range(arr.shape[1]):
    arr[:, col] -= arr[:, [c for c in range(arr.shape[1]) if c != col]].mean(axis=1)