How to save specific columns from different files into one file-CodePudding

I am new to python and I'm using Python 3.9.6. I have about 48 different files all with "Cam_Cantera_IDT_output_800K_*.csv" in the name. For example: Cam_Cantera_IDT_output_800K_7401.csv and Cam_Cantera_IDT_output_800K_8012.csv. All the files have a time column t followed by many columns named X_oh, X_ch2, X_h20 etc. I want to write a code that goes to each of these 48 files and takes only the t and X_ch2 columns and saves them in a new file. Since the time column is the same for all 48 files I just want 1 time column t followed by 48 columns of X_ch2, so 49 columns in total.

My first attempt was using append:

#Using .append, this code runs
import pandas as pd
import glob

require_cols = ['t', 'X_ch2']
all_data = pd.DataFrame()

for f in glob.glob("Cam_Cantera_IDT_output_800K_*.csv"):
    df = pd.read_csv(f, usecols = require_cols)
    all_data = all_data.append(df,ignore_index=True)

all_data.to_csv("new_combined_file.csv")

This code ran but it appended each file one underneath the other, so I had only 2 columns, one for t and one for X_ch2, but many rows. I later read that that's what append does, it adds the files one underneath the other.

I then tried using pd.concat to add the columns vertically next to each other. The code I used can be seen below. I unfortunately got the same result as with append. I got just 2 columns, one for time t and one for X_ch2 and all the files were added underneath each other.

#Attempting using pd.concat, this code runs
import glob
import pandas as pd

file_extension = 'Cam_Cantera_IDT_output_800K_*.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]

require_cols = ['t', 'X_ch2']

combined_csv_data = pd.concat([pd.read_csv(f, usecols = require_cols) for f in all_filenames], axis=0)

print(combined_csv_data)

combined_csv_data.to_csv('combined_csv_data.csv')

My last attempt was using pd.merge, I added on='t' for it to merge on the time column so that I only have 1 time column. I keep getting the error that I am missing 'right'. But when I add it to the line it tells me that right is not defined:

combined_csv_data = pd.merge(right, [pd.read_csv(f, usecols = require_cols) for f in all_filenames], on='t') => gives 'right' is not defined.

I tried using right_on = X_ch2 or right_index=True but nothing seems to work.

The original code I tried, without any 'right' is shown below.

# A merge attempt
#TypeError: merge() missing 1 required positional argument: 'right'

import glob
import pandas as pd

file_extension = 'Cam_Cantera_IDT_output_800K_*.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]

require_cols = ['t', 'X_ch2']

combined_csv_data = pd.merge([pd.read_csv(f, usecols = require_cols) for f in all_filenames], on='t')

print(combined_csv_data)

combined_csv_data.to_csv('combined_csv_data.csv')

Any help would be highly appreciated, thank you.

CodePudding user response：

Use axis=1 in concat and convert t column to index, select column X_ch2 for Series:

require_cols = ['t', 'X_ch2']
L = [pd.read_csv(f,usecols = require_cols, index_col=['t'])['X_ch2'] for f in all_filenames]
combined_csv_data = pd.concat(L, axis=1)

If need columns names rename by filenames for avoid duplicated 48 columns names:

import os

L = [pd.read_csv(f, 
                 usecols = require_cols, 
                 index_col=['t'])['X_ch2'].rename(os.path.basename(f))
        for f in all_filenames]

CodePudding user response：

If installing convtools library is an option, then:

import glob

from convtools import conversion as c
from convtools.contrib.tables import Table

required_cols = ["t", "X_ch2"]
table = None
for number, f in enumerate(glob.glob("Cam_Cantera_IDT_output_800K_*.csv")):
    table_ = (
        Table.from_csv(f, header=True)
        .take(*required_cols)
        .rename({"X_ch2": f"X_ch2__{number}"})
    )
    if table is None:
        table = table_
    else:
        table = table.join(table_, on="t", how="full")

table.into_csv("new_combined_file.csv", include_header=True)