Home > Enterprise >  How to create a pandas data frame with specific columns from multiple text files?
How to create a pandas data frame with specific columns from multiple text files?

Time:12-07

I have 100s of text files that have a header row and then a series of numbers in each column (example below):

Textfile 1

x y
0.001 5.45
0.23 6.83
0.03 4.86

Textfile 2

x y
0.003 7.34
0.89 3.89
0.45 5.66

I would like to create a new pandas data frame (example below) that contains the values in column 2 of each file as a new column that has the name of the text file.

Textfile 1 Textfile 2
5.45 7.34
6.83 3.89
4.86 5.66

I was thinking along the lines of a for loop that loops through all the text files in the folder and at each iteration adds a column to the new data frame (something like the code below)

all_files=glob.glob(os.path.join(path,"*.txt"))

df_ppm = []

for file in all_files:
    file_name = os.path.split(file)[1]
    df_ppm.append(pd.read_csv(file_name))

CodePudding user response:

This should do what you need :)

import os
import pandas as pd
import glob

path = r'C:\Users\dane\Documents\pythonPlayground\csv_files' # use your path
all_files = glob.glob(path   "/*.csv")  # this is a file filter so you only read csv files

new_df = pd.DataFrame()  # create a new empty dataframe 

for file_path in all_files:  # loop through each file 
    file_name = os.path.basename(file_path)  # get the file name 
    df = pd.read_csv(file_path, index_col=None, header=0)  # read a file and add it to a dataframe 
    new_df[file_name] = df['y']  # assuming the column is always called 'y' you can just add it to the new DF this way, otherwise, use df.iloc[1]

print(new_df)

CodePudding user response:

You can read only the columns of files using usecols in read_csv since the column name('y') is the same in all the .txt files

all_files=glob.glob(os.path.join(path,"*.txt"))

df_ppm = {}

for file in all_files:
    file_name = os.path.split(file)[1]
    df_ppm[file_name[:-3]] = pd.read_csv(file_name, usecols=['y'])['y'].to_list()

df_ppm = pd.DataFrame(df_ppm)
  • Related