I am trying to write Python code to split an .xlsx file with 52,000 rows into 11 Excel files (10 .xlsx files with 5,000 rows each and 1 file with the remaining 2,000 rows).
I am having a hard time finding a good solution online.
I tried the solution below. It produces exactly the desired output, but it also produces 49,996 blank .xlsx files:
import pandas as pd

df = pd.read_excel(r"C:\Users\rajat.kapoor\Desktop\All Data Combined\Credit Check Data.xlsx")
n_partitions = 5000

for i in range(n_partitions):
    sub_df = df.iloc[(i * n_partitions):((i + 1) * n_partitions)]
    sub_df.to_excel(rf"C:\Users\rajat.kapoor\Desktop\All Data Combined\Credit Check Data - {i}.xlsx", sheet_name="a")
CodePudding user response:
Instead of pandas, you could also use openpyxl to split your data into separate files, like below:
import openpyxl

# Open the .xlsx file using openpyxl
wb = openpyxl.load_workbook('data.xlsx')
# Get the first sheet by name
sheet_name = wb.sheetnames[0]
sheet = wb[sheet_name]
# Set the number of rows per file
rows_per_file = 5000
# Set the starting row for the first file and the starting file number
start_row = 1
file_number = 1
# Iterate over the sheet in blocks of rows_per_file rows
while start_row <= sheet.max_row:
    # Clamp the ending row so the last file holds only the remaining rows
    end_row = min(start_row + rows_per_file - 1, sheet.max_row)
    # Create a new workbook for the current file and get its active sheet
    file_wb = openpyxl.Workbook()
    file_sheet = file_wb.active
    # Copy the cell values; append() starts at row 1 of the new sheet,
    # so later files do not begin with thousands of empty rows
    for row in sheet.iter_rows(min_row=start_row, max_row=end_row, values_only=True):
        file_sheet.append(row)
    # Save the new workbook with the current file number
    file_wb.save(f'data_{file_number}.xlsx')
    # Increment the file number and move on to the next block of rows
    file_number += 1
    start_row = end_row + 1
This code first loads the .xlsx file using openpyxl and gets its first sheet. It then sets the number of rows per file (in this case, 5000) and the starting row for the first file. Next, it loops over the sheet in blocks of rows, creating a new workbook for each block, writing the cell values into the new sheet, and saving the workbook under the current file number. The loop continues until all of the rows in the sheet have been processed.
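If the workbook is large, the same idea can be written as a small generator that groups any stream of rows into fixed-size batches. This is only a sketch: `batched_rows` is a hypothetical helper name, and the commented usage assumes the input file is called data.xlsx as in the example above.

```python
from itertools import islice


def batched_rows(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable of rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls the next batch_size items without loading everything
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage with openpyxl (file names are placeholders):
# import openpyxl
# sheet = openpyxl.load_workbook('data.xlsx', read_only=True).active
# rows = sheet.iter_rows(values_only=True)
# for number, batch in enumerate(batched_rows(rows, 5000), start=1):
#     out = openpyxl.Workbook()
#     for row in batch:
#         out.active.append(row)
#     out.save(f'data_{number}.xlsx')
```

The last batch is simply shorter than the others, so no separate step is needed for the remaining rows.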
CodePudding user response:
You can change the step of the range from 1 to 5000, so that the loop runs only 11 times and creates only 11 Excel files. Since the rest is the same, I am showing only the part that changes below.
for i in range(0, 52000, 5000):
    sub_df = df.iloc[i:(i + 5000)]
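To avoid hard-coding 52000, the slice boundaries can be derived from the length of the DataFrame. A minimal sketch, assuming `df` is the DataFrame from the question; `chunk_bounds`, `split_to_excel` and the `path_template` parameter are hypothetical names introduced only for illustration.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) pairs, 0-based and half-open, that cover n_rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Stepping by chunk_size gives one start per output file;
    # min() clamps the last stop to the end of the data.
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


def split_to_excel(df, chunk_size, path_template):
    """Write each chunk of df to path_template.format(number); return the paths."""
    paths = []
    bounds = chunk_bounds(len(df), chunk_size)
    for number, (start, stop) in enumerate(bounds, start=1):
        path = path_template.format(number)
        df.iloc[start:stop].to_excel(path, sheet_name="a", index=False)
        paths.append(path)
    return paths


# Hypothetical usage (the output name is a placeholder):
# split_to_excel(df, 5000, "Credit Check Data - {}.xlsx")
```

With 52,000 rows and a chunk size of 5,000 this yields 11 ranges, the last one covering rows 50000 to 51999, so no empty files are written.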