Rearranging column data in a loop


I have a data.csv file that contains concatenated data, as shown below. Lines containing > separate the concatenated files.

>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839

I want to rearrange the column data side by side and save the result to new text files, as shown below. For this demo example, three files are created. In addition, an extra 0 and 5 should be appended at the end of each row.

cat file1.txt
1.094  5.131  1.895 0 5
3.064  3.65   5.307 0 5

cat file2.txt
1.128  5.463  1.954 0 5
3.227  3.947  5.589 0 5

cat file3.txt
1.439  5.584  2.492 0 5
3.371  4.135  5.839 0 5

My trial code

import pandas as pd
df = pd.read_csv('data.csv', sep='\t')
for columns in df:

error: ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

I am not getting the expected output. I hope the experts can help me. Thanks.

CodePudding user response:


Assumptions:

  • each > delimited block has 2 data rows
  • data rows can contain a variable number of columns (3 in the provided sample input)
  • all data rows have the same number of columns (3 in the provided sample input)
  • output file names are of the form fileI.txt where I ranges from 1 to the number of columns in an input data row (3 in the provided sample data)
  • OP's host has enough RAM to hold the entire input file in memory (via awk arrays)

One awk idea:

awk '
/^>/   { next }
       { if (! colcnt) colcnt=NF                         # make note of number of columns; used to keep track of number of output files
         for (i=1;i<=colcnt;i++)                         # 1st data row of the block
             row1[i]=row1[i] (row1[i] ? OFS : "") $i
         getline                                         # read the 2nd data row of the block
         for (i=1;i<=colcnt;i++)
             row2[i]=row2[i] (row2[i] ? OFS : "") $i
       }
END    { for (i=1;i<=colcnt;i++) {
             print row1[i],0,5 > ("file" i ".txt")
             print row2[i],0,5 > ("file" i ".txt")
         }
       }
' data.csv

NOTE: OP's sample code implies tab (\t) delimited input but additional comments from OP seem to indicate data is (variable) space delimited; input/output delimiters can be changed if/when OP provides an updated requirement for input/output delimiters

This generates:

$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5

==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5

==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
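
For readers who prefer Python, the same line-by-line accumulation can be sketched roughly as follows. This is only an illustration under the assumptions listed above (two data rows per > block, whitespace-separated fields); the function name rearrange and the dictionary it returns are hypothetical and not part of the awk answer.

```python
def rearrange(text, extra=("0", "5")):
    """Collect the 1st and 2nd data row of every block, column by column."""
    row1, row2 = [], []   # row1[i] / row2[i] hold the values for output file i+1
    first = True          # True when the next data line is the 1st row of a block
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):   # block separator: the next data line is a 1st row
            first = True
            continue
        fields = line.split()
        if not row1:               # first data line: one accumulator per column
            row1 = [[] for _ in fields]
            row2 = [[] for _ in fields]
        target = row1 if first else row2
        for i, value in enumerate(fields):
            target[i].append(value)
        first = not first
    # Map each (assumed) output file name to its two space-joined lines.
    return {
        f"file{i + 1}.txt": [" ".join(r1 + list(extra)), " ".join(r2 + list(extra))]
        for i, (r1, r2) in enumerate(zip(row1, row2))
    }


sample = """>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839
"""

files = rearrange(sample)
for name, lines in files.items():
    print(name, lines)
```

Returning the lines instead of writing them directly keeps the sketch easy to check; each entry could then be written out with open(name, "w") if files are wanted.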

CodePudding user response:

Another solution using jq.
Assumptions: Unix line endings, data.csv starts with line containing only ">" and ends with an empty line.

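for column in 1 2 3; do jq -Rsr --argjson column "$column" '
    (
         # reconstructed step: split the input into blocks, then rows, then fields
         split(">\n")[1:]
         | map(split("\n") | map(select(length > 0) | [splits(" +")]))
    ) as $arr
    |    [ $arr[][0][$column-1] ], [ $arr[][1][$column-1] ]
         | . + ["0","5"]
         | @tsv
' data.csv > file$column.txt; done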


$ head file*.txt
==> file1.txt <==
1.094   5.131   1.895   0       5
3.064   3.65    5.307   0       5

==> file2.txt <==
1.128   5.463   1.954   0       5
3.227   3.947   5.589   0       5

==> file3.txt <==
1.439   5.584   2.492   0       5
3.371   4.135   5.839   0       5

CodePudding user response:

To do this in Python, use numpy. The following code should work regardless of how many columns the original file has (3 in your example), but it does assume blocks of 2 rows. The code has been updated to account for the fact that the original file is separated by spaces, not by tabs as initially suggested.

import pandas as pd
import numpy as np

fname = "data.txt" # file is apparently only separated with spaces, and then
# (one assumes) only for those lines that include data (not the lines with ">")
# some minor adjustments:

df = pd.read_csv(fname, header=None)

# get rid of rows with ">" separator
df = df[~df[0].str.contains('>')]

# now split all remaining rows
df = df[0].str.split(expand=True)

# change dtype (the split fields are still strings at this point)
df = df.astype(float)

col_len = len(df.columns)

# add some data
df2 = pd.DataFrame(np.array([[0]*(col_len)*2,[5]*(col_len)*2]).reshape(4,col_len))

# concat orig data + added data
df_col = pd.concat([df, df2], ignore_index=True)

# convert to numpy array, and reshape 
arr = df_col.to_numpy().reshape(int(df_col.shape[0]/2),2,col_len).T

# split up again
tup = np.split(arr,col_len)

# loop through the list of sub-arrays and write the files
for idx, elem in enumerate(tup):
    # numpy arr will be nested, so get elem[0]:
    np.savetxt(f'file{idx+1}.txt', X=elem[0], fmt='%1.3f', delimiter='\t')

Result of print(elem[0]) in the last loop:

[[1.094 5.131 1.895 0.    5.   ]
 [3.064 3.65  5.307 0.    5.   ]]
[[1.128 5.463 1.954 0.    5.   ]
 [3.227 3.947 5.589 0.    5.   ]]
[[1.439 5.584 2.492 0.    5.   ]
 [3.371 4.135 5.839 0.    5.   ]]
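
The reshape-and-transpose step can be hard to follow, so here is a rough standard-library sketch of the same idea: treat the data as cube[block][row][column], append one pseudo-block per extra value, and regroup by column. It assumes > separator lines and equally sized blocks; the names blocks and per_column are hypothetical helpers, not part of pandas or numpy.

```python
from itertools import groupby


def blocks(lines):
    """Yield each '>'-delimited block as a list of rows (lists of strings)."""
    data = (ln.split() for ln in lines if ln.strip())
    for is_sep, group in groupby(data, key=lambda f: f[0].startswith(">")):
        if not is_sep:
            yield list(group)


def per_column(lines, extra=(0.0, 5.0)):
    """Return one table per input column, with one table row per block row."""
    cube = [[[float(v) for v in row] for row in blk] for blk in blocks(lines)]
    n_rows, n_cols = len(cube[0]), len(cube[0][0])
    # Append one pseudo-block per extra value (same role as df2 above).
    cube += [[[x] * n_cols for _ in range(n_rows)] for x in extra]
    # Regroup: table[c][r] lists cube[b][r][c] over all blocks b.
    return [[[blk[r][c] for blk in cube] for r in range(n_rows)]
            for c in range(n_cols)]


sample = """>
1.094   1.128   1.439
3.064   3.227   3.371
>
5.131   5.463   5.584
3.65    3.947   4.135
>
1.895   1.954   2.492
5.307   5.589   5.839
"""

tables = per_column(sample.splitlines())
# Format each table the way the savetxt call does: 3 decimals, tab-separated.
formatted = [["\t".join(f"{v:.3f}" for v in row) for row in table]
             for table in tables]
for idx, rows in enumerate(formatted, start=1):
    print(f"file{idx}.txt", rows)
```

Unlike the reshape above, this version takes the number of rows per block from the first block instead of hard-coding 2, although it still assumes every block has the same shape.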
