I want to shorten this code, which computes means and uses them to update the data frame. How can I spot the pattern and reduce it to a few lines?
import pandas as pd
import numpy as np
df = pd.read_csv('Dataset2.csv')
df = df.to_numpy()
for i in range(len(df)):
    # Mean of the other four values in row i, one per column
    mean_1 = df[i,1:5].sum() / 4
    mean_2 = (df[i,0:1].sum() + df[i,2:5].sum()) / 4
    mean_3 = (df[i,0:2].sum() + df[i,3:5].sum()) / 4
    mean_4 = (df[i,0:3].sum() + df[i,4:5].sum()) / 4
    mean_5 = df[i,0:4].sum() / 4
    # Subtract each mean from the value it excludes
    df[i,0] = df[i,0] - mean_1
    df[i,1] = df[i,1] - mean_2
    df[i,2] = df[i,2] - mean_3
    df[i,3] = df[i,3] - mean_4
    df[i,4] = df[i,4] - mean_5
CodePudding user response:
My interpretation of what you are trying to do is:
given a dataframe df, create a new dataframe where the value in row i, column j is the mean of all values in row i except the one in column j.
If this is correct, the following will be much quicker. It assumes df has already been converted to a NumPy array, as in your code, and contains only the five columns needed for this calculation. If there are extra columns, you will need to adjust the solution with indexing.
means = (df.sum(axis=1).reshape((len(df),1)) - df)/4
means will be a NumPy array, so wrap it in a pandas DataFrame if that is what you need. To reproduce the loop in the question, subtract it from the original values: df - means.
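The idea above can be sketched as follows, assuming the data is already a NumPy array of floats. The function name subtract_mean_of_others and the sample values are illustrative and not from the question. For each row, the mean of the other entries is (row sum - entry) / (n - 1), and the question's loop then subtracts that mean from the entry:

```python
import numpy as np

def subtract_mean_of_others(arr):
    """Return arr minus, per element, the mean of the other entries in its row."""
    arr = np.asarray(arr, dtype=float)
    n_cols = arr.shape[1]
    # Row sums kept as a column vector so they broadcast against arr
    row_sums = arr.sum(axis=1, keepdims=True)
    # Mean of the other n_cols - 1 entries in the same row
    means_of_others = (row_sums - arr) / (n_cols - 1)
    return arr - means_of_others

# Made-up sample row with five columns, as in the question
sample = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
result = subtract_mean_of_others(sample)
print(result)
```

Using keepdims=True avoids the explicit reshape, and dividing by n_cols - 1 instead of a hard-coded 4 lets the same function work for other column counts.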
CodePudding user response:
You could try
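import pandas as pd
df = pd.read_csv('Dataset2.csv')
# Unchanged copy, so every mean uses the original values as in the question's loop
orig = df.copy()
for col in range(df.shape[1]):
    others = [c for c in range(df.shape[1]) if c != col]
    df.iloc[:, col] = orig.iloc[:, col] - orig.iloc[:, others].mean(axis="columns")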
without converting to a np.ndarray.
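One detail worth checking in any column-by-column version is whether each mean is taken from the original values, as in the question's loop (which computes all five means before changing anything), or from columns that have already been updated. A small sketch comparing the two; the function names and sample values are made up for illustration:

```python
import pandas as pd

def update_in_place(df):
    """Column-by-column update; later columns see earlier columns already changed."""
    out = df.copy()
    for col in range(out.shape[1]):
        others = [c for c in range(out.shape[1]) if c != col]
        out.iloc[:, col] = out.iloc[:, col] - out.iloc[:, others].mean(axis="columns")
    return out

def update_from_original(df):
    """Column-by-column update; every mean is taken from the untouched input."""
    out = df.copy()
    for col in range(df.shape[1]):
        others = [c for c in range(df.shape[1]) if c != col]
        out.iloc[:, col] = df.iloc[:, col] - df.iloc[:, others].mean(axis="columns")
    return out

# Made-up sample row with five float columns
sample = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0]], columns=list("abcde"))
print(update_in_place(sample))
print(update_from_original(sample))
```

The two results agree only in the first column. The second function matches the question's loop, because it never reads a value that has already been overwritten.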
Timing with 10 million rows:
from random import random
from time import perf_counter
import pandas as pd
# Sample dataframe
num_rows = 10_000_000
df = pd.DataFrame(
    {f"col_{i}": [random() for _ in range(num_rows)] for i in range(5)}
)
# Timing
start = perf_counter()
# Loop as timed: columns are updated in place, one after another
for col in range(df.shape[1]):
    df.iloc[:, col] -= df.iloc[:, [c for c in range(df.shape[1]) if c != col]].mean(axis="columns")
end = perf_counter()
print(f"Duration: {end - start:.2f} seconds")
Result: Duration: 2.64 seconds on a mediocre machine.
In NumPy:
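import pandas as pd
df = pd.read_csv('Dataset2.csv')
# Float dtype so the results can be stored without truncation
arr = df.to_numpy(dtype=float)
# Unchanged copy, so every mean uses the original values as in the question's loop
orig = arr.copy()
for col in range(arr.shape[1]):
    others = [c for c in range(arr.shape[1]) if c != col]
    arr[:, col] = orig[:, col] - orig[:, others].mean(axis=1)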
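If a loop-free version that stays in pandas is preferred, the row-sum idea can also be written with DataFrame.rsub. This is a minimal sketch with a hypothetical function name and made-up sample values; it assumes all columns are numeric and has not been timed against the loops above:

```python
import pandas as pd

def subtract_mean_of_others_df(df):
    """Loop-free version that keeps the DataFrame's index and column labels."""
    n_cols = df.shape[1]
    row_sums = df.sum(axis="columns")
    # rsub computes row_sums - df, aligning row_sums with the row index
    means_of_others = df.rsub(row_sums, axis="index") / (n_cols - 1)
    return df - means_of_others

# Made-up sample data with five float columns
sample = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 0.0, 0.0, 0.0, 0.0]],
    columns=list("abcde"),
)
print(subtract_mean_of_others_df(sample))
```

Because nothing is overwritten while the means are computed, the result matches the question's loop row by row.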