Rearranging column data in a loop

I have a data.csv file that contains concatenated data, as shown below. The > lines are the separators between the concatenated files.

>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839

I want to rearrange the column data side by side and save the result to new text files, as shown below. For this demo example, three files would be created. In addition, an extra 0 and 5 should be appended to the end of each row.

cat file1.txt
1.094  5.131  1.895 0 5
3.064  3.65   5.307 0 5

cat file2.txt
1.128  5.463  1.954 0 5
3.227  3.947  5.589 0 5

cat file3.txt
1.439  5.584  2.492 0 5
3.371  4.135  5.839 0 5

My trial code

import pandas as pd
df = pd.read_csv('data.csv', sep='\t')
for columns in df:
    data=df.iloc[:,columns]
data.concat['data']
data.to_csv('file1.txt')

error: ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

I am not getting the expected output. I hope the experts can help me. Thanks.

CodePudding user response：

Assumptions:

each > delimited block has 2 data rows
the number of columns per data row is not fixed in advance (3 in the provided sample input)
all data rows have the same number of columns (3 in the provided sample input)
output file names are of the form fileI.txt where I ranges from 1 to the number of columns in an input data row (3 in the provided sample data)
OP's host has enough RAM to hold the entire input file in memory (via awk arrays)

One awk idea:

awk '
/^>/   { next }
       { if (! colcnt) colcnt=NF                         # make note of number of columns; used to keep track of number of output files
         for (i=1;i<=colcnt;i++)
             row1[i]=row1[i] (row1[i] ? OFS : "") $i
         getline
         for (i=1;i<=colcnt;i++)
             row2[i]=row2[i] (row2[i] ? OFS : "") $i
       }
END    { for (i=1;i<=colcnt;i++) {
             print row1[i],0,5 > ("file" i ".txt")
             print row2[i],0,5 > ("file" i ".txt")
         }
       }
' data.csv

NOTE: OP's sample code implies tab (\t) delimited input, but additional comments from OP seem to indicate the data is delimited by a variable number of spaces. The input/output delimiters can be changed if/when OP provides an updated requirement.

This generates:

$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5

==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5

==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
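
For readers who would rather stay in Python, the same per-column accumulation can be sketched with the standard library only. This is a rough equivalent of the awk idea above, not a drop-in replacement: the names rearrange, write_tables and SAMPLE are made up for illustration, and the sketch assumes whitespace-delimited rows and the same number of rows and columns in every > block.

```python
from pathlib import Path

# Sample input copied from the question (whitespace-delimited, ">" separators).
SAMPLE = """>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839
"""


def rearrange(lines, extra=("0", "5")):
    """Return one table (list of rows) per input column.

    Assumes every ">" block has the same number of rows and columns.
    """
    blocks, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A separator closes the block collected so far (if any).
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(line.split())
    if current:
        blocks.append(current)
    if not blocks:
        return []

    n_rows, n_cols = len(blocks[0]), len(blocks[0][0])
    tables = []
    for col in range(n_cols):
        # Row r of output table `col` holds column `col` of row r from each block,
        # followed by the extra values.
        table = [[block[row][col] for block in blocks] + list(extra)
                 for row in range(n_rows)]
        tables.append(table)
    return tables


def write_tables(tables, out_dir=".", sep=" "):
    """Write table i to <out_dir>/file<i>.txt (hypothetical naming, as in the question)."""
    for idx, table in enumerate(tables, start=1):
        text = "".join(sep.join(row) + "\n" for row in table)
        Path(out_dir, f"file{idx}.txt").write_text(text)


tables = rearrange(SAMPLE.splitlines())
# tables[0] -> [['1.094', '5.131', '1.895', '0', '5'], ['3.064', '3.65', '5.307', '0', '5']]
```

To process the real file, pass an open file handle instead of the sample lines, for example write_tables(rearrange(open("data.csv"))). Values are kept as strings, so 3.65 is written back exactly as it appeared in the input.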

CodePudding user response：

To do this in Python, use pandas together with numpy. I think the following code should work regardless of how many columns there are in the original file (3 in your example). It does assume blocks of 2 rows.

import pandas as pd
import numpy as np

fname = "data.csv"

df = pd.read_csv(fname, sep="\t", header=None)

# get rid of rows with ">" separators (rest of col values will be NA)
df.dropna(inplace=True)

# change dtype (first col will be dtype "object" due to ">" separator)
df = df.astype(float)

col_len = len(df.columns)

# add some data
df2 = pd.DataFrame(np.array([[0]*(col_len)*2,[5]*(col_len)*2]).reshape(4,col_len))

# concat orig data + added data
df_col = pd.concat([df, df2], ignore_index=True)

# convert to numpy array, and reshape 
arr = df_col.to_numpy().reshape(int(df_col.shape[0]/2),2,col_len).T

# split up again
tup = np.split(arr,col_len)

# loop through the list returned by np.split and write away the files
for idx, elem in enumerate(tup):
    # numpy arr will be nested, so get elem[0]:
    np.savetxt(f'file{idx + 1}.txt', X=elem[0], fmt='%1.3f', delimiter='\t')

Result of print(elem[0]) in the last loop:

[[1.094 5.131 1.895 0.    5.   ]
 [3.064 3.65  5.307 0.    5.   ]]
[[1.128 5.463 1.954 0.    5.   ]
 [3.227 3.947 5.589 0.    5.   ]]
[[1.439 5.584 2.492 0.    5.   ]
 [3.371 4.135 5.839 0.    5.   ]]
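
If the file turns out to be space delimited rather than tab delimited, a numpy-only variant may be easier to reason about. The sketch below is an alternative under stated assumptions, not the code above: it lets np.loadtxt skip the > lines by treating them as comments, and it spells out the reshape/transpose step with explicit axes. The names columns_side_by_side, save_tables and SAMPLE are hypothetical, and every block is assumed to have rows_per_block rows.

```python
import io
import os

import numpy as np

# Sample input copied from the question.
SAMPLE = """>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839
"""


def columns_side_by_side(source, rows_per_block=2, extra=(0.0, 5.0)):
    """Return an array of shape (n_cols, rows_per_block, n_blocks + len(extra))."""
    # Lines starting with ">" are treated as comments, so only numeric rows remain.
    data = np.loadtxt(source, comments=">", ndmin=2)
    n_cols = data.shape[1]
    # Axes go from (block, row, column) to (column, row, block).
    cube = data.reshape(-1, rows_per_block, n_cols).transpose(2, 1, 0)
    # One copy of the extra values for every row of every output table.
    pad = np.broadcast_to(np.asarray(extra, dtype=float),
                          (n_cols, rows_per_block, len(extra)))
    return np.concatenate([cube, pad], axis=2)


def save_tables(tables, out_dir="."):
    """Write table i to <out_dir>/file<i>.txt, space delimited."""
    for idx, table in enumerate(tables, start=1):
        np.savetxt(os.path.join(out_dir, f"file{idx}.txt"), table,
                   fmt="%g", delimiter=" ")


tables = columns_side_by_side(io.StringIO(SAMPLE))
print(tables.shape)  # (3, 2, 5)
```

Calling save_tables(columns_side_by_side("data.csv")) would then write one file per column. The "%g" format keeps 3.65 and the trailing 0 and 5 without padding zeros; switch to "%1.3f" if fixed-width decimals are preferred.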