How to read multiple csv files with specific name from a folder and merge them?-CodePudding

I am trying to read multiple files from a folder with specific name (1.car.csv, 2.car.csv and so on) and trying to add a new label after each iteration at right most of the dataset and merge all the csv files into one csv file. As the ".car.csv" is constant, I think I can use a for loop with .format(index) function to run over the csv files. All of the csv files has got same attributes.

Kindly help me!

CodePudding user response：

This can be easily done with a CSV tool like miller:

mlr --csv cat --filename bla1.csv *.car.csv

This will concatenate the files (without repeating the header) and prepend the filename as the first column.

CodePudding user response：

You can use the pandas library this way:

import pandas as pd
import os

# path to folder where the csv files are stored
path = '/path/to/folder'

result = pd.DataFrame()

for i in range(1, n 1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    
    df['new_label'] = i
    
    result = pd.concat([result, df], ignore_index=True)

result.to_csv('final_result.csv', index=False)

The n in the code above should be replaced with the number of csv files you have in the folder.

If you need any explanation of the code (in case you're new to python or dataframes) just comment below.

CodePudding user response：

glob is used to get all files in the folder that match the pattern *.csv
pd.read_csv is used to read each file as a DataFrame
- index_col=None you are telling Pandas to not use any of the columns as the index, and instead to create a default index for the DataFrame.
- header=0 you are telling Pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame merged_df
- axis=0 means that the concatenation should happen along the rows (vertically)
- ignore_index=True the concatenation is performed such that the original indices of the individual DataFrames are discarded, and a new default index is created for the resulting DataFrame.

import glob
import pandas as pd

path = r'<path to folder containing csv files>'
all_files = glob.glob(path   "/*.csv")

lst = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)

merged_df = pd.concat(lst, axis=0, ignore_index=True)

CodePudding user response：

Using pathlib and pandas you can use .assign() to enter the new column and finally .concat() to concatenate all the files into one.

from pathlib import Path

import pandas as pd


input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"

pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path), ignore_index=True
).to_csv(f"{output_path}/final.csv", index=False)