I have a big dataframe with 1 million rows of time series data. I want to slice it into smaller chunks of 1000 rows each, which would give me 1000 chunks, and I need every chunk to be copied into a column of a new dataframe.
CodePudding user response:
I am now doing this, which does the job but might be inefficient. I would still be happy if people could help:
import numpy as np
import pandas as pd

df_all = pd.DataFrame()
for i, df_split in enumerate(np.array_split(df, len(df) // chunk_size)):
    df_split = df_split.reset_index(drop=True)
    df_split = df_split.rename({'random_nos': 'String' + str(i)}, axis=1)
    df_all = pd.concat([df_all, df_split], axis=1)
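If every chunk has exactly chunk_size rows (as with 10**6 rows and chunks of 1000), a single numpy reshape should avoid the per-chunk concat entirely. A minimal sketch, assuming the column is named random_nos and len(df) is divisible by chunk_size:

import numpy as np
import pandas as pd

# Toy stand-ins for the real data (assumed column name and sizes)
chunk_size = 1000
df = pd.DataFrame({'random_nos': np.random.rand(1_000_000)})

n_chunks = len(df) // chunk_size
# Row i of the reshaped array is chunk i; transposing turns each chunk into a column
wide = pd.DataFrame(
    df['random_nos'].to_numpy().reshape(n_chunks, chunk_size).T,
    columns=[f'String{i}' for i in range(n_chunks)],
)

Because this builds the result in one reshape plus one DataFrame construction, it sidesteps the cost of repeatedly concatenating onto df_all inside the loop.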
CodePudding user response:
You could use numpy.array_split to achieve this:
import pandas as pd
import numpy as np
def slice_df_into_chunks(df_size, n_chunks):
    df = pd.DataFrame(np.random.rand(df_size), columns=['random_nos'])
    df_list = []
    # np.array_split(df, n) splits df into n roughly equal parts,
    # i.e. the second argument is the number of chunks, not the chunk size
    for i, df_split in enumerate(np.array_split(df, n_chunks)):
        df_split = df_split.rename(columns={'random_nos': f'String{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
    return pd.concat(df_list, axis=1)
slice_df_into_chunks(10**6, 10**3) # Give whatever sizes you want
Note that if df_size is not exactly divisible by n_chunks (e.g. 10 and 3), the earliest chunks get one extra row each:
slice_df_into_chunks(10, 3)
    String0   String1   String2
0  0.955620  0.543234  0.509360
1  0.755157  0.174576  0.267600
2  0.816509  0.776549  0.455464
3  0.990282       NaN       NaN
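You can check how numpy.array_split distributes the remainder directly; a quick sketch:

import numpy as np

# 10 rows into 3 chunks: the single extra row goes to the first chunk
print([len(c) for c in np.array_split(np.arange(10), 3)])  # [4, 3, 3]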