Data processing python for DL (AE): read multiple .txt files with coordinates, to array or dataframe


I have multiple .txt files of different lengths:

X,Y
145.33334350585938,596.6666870117188
147.3572998046875,591.2614135742188
149.28125,586.875
151.3013153076172,581.3974609375

X,Y
146.55398559570312,609.2018432617188
146.55398559570312,607.8530883789062
146.55398559570312,605.5582275390625
146.55398559570312,603.2935180664062
147.29171752929688,601.7035522460938
148.74122619628906,600.2540283203125
150.29244995117188,598.7027587890625

I have to load them somehow to do some preprocessing, like bringing them all to the same shape and normalizing them, so I can use them as input to an anomaly detection model based on an autoencoder.

This is what I have so far, but I am sure there are better ways.

import pandas as pd
import numpy as np
import os

df = pd.DataFrame(columns=range(500))  # 500 is more than the largest file's row count
for filename in os.listdir(path):
    with open(os.path.join(path, filename)) as f:
        df_temp = pd.read_csv(f, sep=',', usecols=['X', 'Y'])
        # pack each (X, Y) pair into a tuple so one file becomes one row
        df_temp['right'] = list(zip(df_temp['X'], df_temp['Y']))
        s = df_temp['right'].to_frame().T
        df = pd.concat([df, s])

df.dropna(axis=1, how='all', inplace=True)
df.fillna(0, inplace=True)

I also experimented with numpy arrays directly:

c = np.empty((0, 2))  # accumulator for all (X, Y) rows
for filename in os.listdir(path):
    with open(os.path.join(path, filename)) as f:
        # X and Y are the first two columns; skip the header row
        a = np.loadtxt(f, skiprows=1, usecols=(0, 1), delimiter=',')
        c = np.concatenate((c, a))

That gives me the desired values, but the individual matrices are all of different lengths. How can I bring them all to the same shape?

Here's what the output array should look like.

array([[[145.33334350585938, 596.6666870117188],
        [147.3572998046875, 591.2614135742188],
        [149.28125, 586.875],
        [151.3013153076172, 581.3974609375],
        [0, 0],
        [0, 0],
        [0, 0]],

       [[146.55398559570312, 609.2018432617188],
        [146.55398559570312, 607.8530883789062],
        [146.55398559570312, 605.5582275390625],
        [146.55398559570312, 603.2935180664062],
        [147.29171752929688, 601.7035522460938],
        [148.74122619628906, 600.2540283203125],
        [150.29244995117188, 598.7027587890625]]])

I would be happy about any hints for improvement.

CodePudding user response:

IIUC, try using a list comprehension to iterate over the files found with glob:

import pandas as pd
import glob

# glob will just be the file path - e.g., 'some/file/path/*.txt'
# *.txt will return all text files in the folder
df = pd.DataFrame([list(pd.read_csv(file).itertuples(index=False, name=None))
                   for file in glob.glob('*.txt')])
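
If you then need the zero-padded 3-D array from the question rather than a DataFrame of tuples, a minimal follow-up sketch (my addition, assuming the df built above) is to replace the NaN cells left by the shorter files with (0.0, 0.0) and stack the rows:

import numpy as np

# Assumed follow-up: turn the DataFrame of (X, Y) tuples into the
# zero-padded array of shape (n_files, max_rows, 2) from the question.
filled = df.apply(lambda col: col.map(lambda t: t if isinstance(t, tuple) else (0.0, 0.0)))
arr = np.array([list(row) for row in filled.itertuples(index=False, name=None)])
print(arr.shape)  # e.g. (2, 7, 2) for the two sample files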

CodePudding user response:

You can use np.pad() to expand each array of data to match the largest array that you have.

I would also avoid guessing how large your largest dataframe will be, and just figure it out from the data.

Example:

import glob

import numpy as np
import pandas as pd

dataframes = []
path = 'test219/*.txt'

# read every matching .txt file into its own dataframe
for filename in sorted(glob.glob(path)):
    dataframes.append(pd.read_csv(filename))

max_rows = max(df.shape[0] for df in dataframes)    
max_cols = max(df.shape[1] for df in dataframes)

padded_data = []

# pad each dataframe on the bottom and right with zeros
# until it matches the largest shape
for df in dataframes:
    row_pad, col_pad = max_rows - df.shape[0], max_cols - df.shape[1]
    df_padded = np.pad(df, ((0, row_pad), (0, col_pad)))
    padded_data.append(df_padded)
padded_data = np.array(padded_data)
print(padded_data)

Output:

[[[145.33334351 596.66668701]
  [147.3572998  591.26141357]
  [149.28125    586.875     ]
  [151.30131531 581.39746094]
  [  0.           0.        ]
  [  0.           0.        ]
  [  0.           0.        ]]

 [[146.5539856  609.20184326]
  [146.5539856  607.85308838]
  [146.5539856  605.55822754]
  [146.5539856  603.29351807]
  [147.29171753 601.70355225]
  [148.7412262  600.25402832]
  [150.29244995 598.70275879]]]

Explanation of how it works:

  1. Use the glob module to search for .txt files in the directory test219, and load each of them with Pandas. (Note: this assumes that every data file has X and Y in the same order.)
  2. Find the largest number of rows and the largest number of columns across all of the dataframes.
  3. For each dataframe, pad it on the right and bottom with zeroes until it matches the largest one.
  4. Now that each array has the same shape, the list of arrays can be converted with np.array() into a single numpy array with shape (2, 7, 2). The first index represents the file. The second index represents the row within that file. The third index represents the column.
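
Since the question also mentions normalizing the data for the autoencoder, here is a minimal follow-up sketch (my addition; the question doesn't name a scheme, so min-max scaling to [0, 1] is an assumption) that uses only the real, unpadded points:

import numpy as np

# Assumed min-max normalization: scale X and Y to [0, 1] using only the
# real points, so the zero padding stays at exactly 0.
# (Assumes hi > lo for both coordinates.)
mask = (padded_data != 0).any(axis=2)            # True where a real (x, y) point exists
points = padded_data[mask]                       # all real points, shape (n_points, 2)
lo, hi = points.min(axis=0), points.max(axis=0)  # per-coordinate min and max
normalized = np.where(mask[..., None], (padded_data - lo) / (hi - lo), 0.0)
print(normalized.shape)  # same shape as padded_data, e.g. (2, 7, 2)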