I have a data.csv file that contains concatenated data as given below; the ">" lines are the separators between the concatenated files:
>
1.094 1.128 1.439
3.064 3.227 3.371
>
5.131 5.463 5.584
3.65 3.947 4.135
>
1.895 1.954 2.492
5.307 5.589 5.839
I want to rearrange the column data side by side and save the result to new text files, as depicted below. For this demo example we would create three files. Moreover, an extra 0 and 5 should be appended to each row.
cat file1.txt
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5
cat file2.txt
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5
cat file3.txt
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
My trial code:

import pandas as pd
df = pd.read_csv('data.csv', sep='\t')
for columns in df:
    data = df.iloc[:, columns]
    data.concat['data']
    data.to_csv('file1.txt')
error: ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
I am not getting the expected output. I hope the experts here can help me. Thanks.
CodePudding user response:
Assumptions:
- each ">" delimited block has 2 data rows
- data rows can contain a variable number of columns (3 in the provided sample input)
- all data rows have the same number of columns (3 in the provided sample input)
- output file names are of the form fileI.txt, where I ranges from 1 to the number of columns in an input data row (3 in the provided sample data)
- OP's host has enough RAM to hold the entire input file in memory (via awk arrays)
One awk idea:
awk '
/^>/ { next }
     { if (! colcnt) colcnt=NF      # make note of number of columns; used to keep track of number of output files
       for (i=1; i<=colcnt; i++)
           row1[i] = row1[i] (row1[i] ? OFS : "") $i
       getline
       for (i=1; i<=colcnt; i++)
           row2[i] = row2[i] (row2[i] ? OFS : "") $i
     }
END  { for (i=1; i<=colcnt; i++) {
           print row1[i], 0, 5 > ("file" i ".txt")
           print row2[i], 0, 5 > ("file" i ".txt")
       }
     }
' data.csv
NOTE: OP's sample code implies tab (\t) delimited input, but additional comments from OP seem to indicate the data is (variable) space delimited; the input/output delimiters can be changed if/when OP provides an updated requirement for them.
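For example, assuming tab-delimited input and output were wanted, the script body above could be saved to a (hypothetical) file rearrange.awk and run with both field separators overridden on the command line:

awk -F'\t' -v OFS='\t' -f rearrange.awk data.csv   # -F sets the input field separator, -v OFS the output one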
This generates:
$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5
==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5
==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
CodePudding user response:
Another solution using jq.
Assumptions: Unix line endings; data.csv starts with a line containing only ">" and ends with an empty line.
for column in 1 2 3; do jq -Rsr --argjson column $column '
    split(">\n")[1:]
    | map(
        split("\n")[:-1] | map(split("\t"))
      ) as $arr
    | [
        [ $arr[][0][$column-1] ], [ $arr[][1][$column-1] ]
        | . + ["0","5"]
        | join("\t")
      ]
    | join("\n")
' data.csv > file$column.txt; done
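If the number of columns is not known in advance, the hardcoded "1 2 3" could be derived from the first data row instead, e.g. (a sketch, assuming the filter above is saved to a hypothetical file rearrange.jq, and that line 1 of data.csv is the leading ">" so line 2 is the first data row):

ncols=$(awk 'NR==2{print NF; exit}' data.csv)   # field count of the first data row
for column in $(seq 1 "$ncols"); do
    jq -Rsr --argjson column "$column" -f rearrange.jq data.csv > "file$column.txt"
done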
Result:
$ head file*.txt
==> file1.txt <==
1.094 5.131 1.895 0 5
3.064 3.65 5.307 0 5
==> file2.txt <==
1.128 5.463 1.954 0 5
3.227 3.947 5.589 0 5
==> file3.txt <==
1.439 5.584 2.492 0 5
3.371 4.135 5.839 0 5
CodePudding user response:
To do this in python, use numpy. The following code should work, I think, regardless of how many columns there are in the original file (3 in your example). It does assume blocks of 2 rows. The code below has been updated to take into account that the original file was in fact not separated by tabs, as initially suggested.
import pandas as pd
import numpy as np

fname = "data.txt"  # file is apparently only separated with spaces, and then
                    # (one assumes) only for those lines that include data (not the lines with ">")

# some minor adjustments:
df = pd.read_csv(fname, header=None)
# get rid of rows with ">" separator
df = df[~df[0].str.contains('>')]
# now split all remaining rows
df = df[0].str.split(expand=True)
# change dtype (first col will be dtype "object" due to ">" separator)
df = df.astype(float)
col_len = len(df.columns)
# add some data
df2 = pd.DataFrame(np.array([[0]*(col_len)*2, [5]*(col_len)*2]).reshape(4, col_len))
# concat orig data + added data
df_col = pd.concat([df, df2], ignore_index=True)
# convert to numpy array, and reshape
arr = df_col.to_numpy().reshape(int(df_col.shape[0]/2), 2, col_len).T
# split up again
tup = np.split(arr, col_len)
# loop through tuple and write away the files
for idx, elem in enumerate(tup):
    # numpy arr will be nested, so get elem[0]:
    np.savetxt(f'file{idx+1}.txt', X=elem[0], fmt='%1.3f', delimiter='\t')
Result of print(elem[0]) in the last loop:
[[1.094 5.131 1.895 0. 5. ]
[3.064 3.65 5.307 0. 5. ]]
[[1.128 5.463 1.954 0. 5. ]
[3.227 3.947 5.589 0. 5. ]]
[[1.439 5.584 2.492 0. 5. ]
[3.371 4.135 5.839 0. 5. ]]
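As an aside, the reshape(...).T call is the step that does the interleaving: it turns the stacked 2-row blocks into one 2-row slab per original column. A minimal sketch with labelled integers (value = 3*row + col, so each entry reveals where it came from) illustrates this:

import numpy as np

# stand-in for the 10x3 concatenated frame (6 data rows + the appended 0/5 rows)
toy = np.arange(30).reshape(10, 3)
slabs = toy.reshape(5, 2, 3).T   # shape (3, 2, 5): one 2x5 slab per original column
print(slabs[0])                  # column 0 of rows 0,2,4,6,8 and of rows 1,3,5,7,9
# [[ 0  6 12 18 24]
#  [ 3  9 15 21 27]]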