Read single column from csv file and rename with the name of the text file


I'm using a for loop to cycle through numerous text files, selecting a single column (named ppm) from each file and appending these columns to a new data frame. I'd like each column in the new data frame to take the name of the text file it came from, but I'm not sure how to do this.

My code is:

import glob
import os
import pandas as pd

all_files = glob.glob(os.path.join(path, "*.txt"))
df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    df1 = pd.concat([df, df1], axis=1)

At the moment every column in the new dataframe is called 'ppm'.

I previously used this code:

df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0)
    df1[file_name] = df['ppm']

But when I ran it over a large number of files (hundreds), the line df1[file_name] = df['ppm'] raised the warning: 'PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()'.

CodePudding user response:

Append the DataFrames to a list inside the loop, renaming the ppm column to the file name, and call concat once outside the loop:

all_files = glob.glob(os.path.join(path, "*.txt"))

dfs = []
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    dfs.append(df.rename(columns={'ppm': file_name}))
df_big = pd.concat(dfs, axis=1)
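
For illustration, here is a minimal, self-contained sketch of the same rename-then-concat pattern, with small in-memory frames standing in for the files read from disk (the file names and values are made up):

import pandas as pd

# Stand-ins for two parsed text files (names and numbers are illustrative).
fake_files = {
    'one.txt': pd.DataFrame({'ppm': [3, 6]}),
    'two.txt': pd.DataFrame({'ppm': [9, 0]}),
}

dfs = [df.rename(columns={'ppm': name}) for name, df in fake_files.items()]
df_big = pd.concat(dfs, axis=1)
print(df_big)
#    one.txt  two.txt
# 0        3        9
# 1        6        0

Collecting the frames in a list and concatenating once avoids the repeated column insertions that trigger the PerformanceWarning from the question.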

CodePudding user response:

Assuming the index is the same across files, add all your data to a dictionary:

all_files = glob.glob(os.path.join(path, "*.txt"))
data_dict = {}
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    data_dict[file_name] = df['ppm']

df1 = pd.DataFrame(data_dict)
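
As a quick illustration of the "index is equal" assumption: building a DataFrame from a dict of Series aligns on the index and fills any gaps with NaN, so files of different lengths still combine, just with missing values (the numbers below are illustrative):

import pandas as pd

data_dict = {
    'one.txt': pd.Series([3, 6, 1]),
    'two.txt': pd.Series([9, 0]),   # shorter file: the missing row becomes NaN
}
print(pd.DataFrame(data_dict))
#    one.txt  two.txt
# 0        3      9.0
# 1        6      0.0
# 2        1      NaN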

CodePudding user response:

Use df.rename() to rename the ppm column of each dataframe before concatenating:

import pandas

# all_files as defined in the question
df1 = pandas.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    print(file_name)
    df = pandas.read_csv(file, index_col=None, sep=',', header=0, usecols=['ppm'])
    df.rename(columns={'ppm': file_name}, inplace=True)
    df1 = pandas.concat([df, df1], axis=1)

Output:

   two.txt  one.txt
0        9        3
1        0        6
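
Note the column order: because each iteration runs pandas.concat([df, df1], axis=1), the file just read is placed in front of the columns collected so far, so the last file read (two.txt here) ends up leftmost. Use pandas.concat([df1, df], axis=1) if you want the columns in the order the files are processed.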

CodePudding user response:

Rather than concatenating and appending dataframes as you iterate over your list of files, you could consider building a dictionary of the relevant data and then constructing your dataframe just once, like this:

import csv
import pandas as pd
import glob
import os

PATH = ''
COL = 'ppm'
FILENAME = 'filename'
D = {COL: [], FILENAME: []}
for file in glob.glob(os.path.join(PATH, '*.csv')):
    with open(file, newline='') as infile:
        for row in csv.DictReader(infile):
            if COL in row:
                D[COL].append(row[COL])
                D[FILENAME].append(file)

df = pd.DataFrame(D)
print(df)
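
Note that this builds a long-format frame: one ppm column plus a filename column, rather than one column per file as asked in the question. If you want the wide layout, one option, sketched here under the assumption that every file contributes the same number of rows, is to pivot on the filename:

# Hypothetical reshaping of the long frame above into one column per file.
wide = (df.assign(row=df.groupby(FILENAME).cumcount())
          .pivot(index='row', columns=FILENAME, values=COL))
print(wide)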