I have a large number of txt files with a single column of data. There are no headers in the files.
The data are an email address followed by a :
and then a string of varchar, which sometimes includes :
s.
My goal is to convert the following
[email protected]:v@rch:r$tR:ng
[email protected]::multipleTypes
[email protected]:&ofTxtGoAfT3rThe:
To a tsv with headers.
column1 column2
[email protected] v@rch:r$tR:ng
[email protected] :multipleTypes
[email protected] &ofTxtGoAfT3rThe:
These files will then be uploaded into a postgres database.
Any insight/advice is greatly appreciated.
CodePudding user response:
sed
can do that. Without /g
, its substitution replaces the first occurrence on each line:
sed -e 's/:/\t/' *.txt > file.tsv
CodePudding user response:
In postgres you can split the text at the appropriate position and export the hole with COPY
to an TSV
this would look like
CREATE TABLE mytab(mytext text)
INSERT INTO mytab VALUES('[email protected]:v@rch:r$tR:ng'),('[email protected]::multipleTypes'),('[email protected]:&ofTxtGoAfT3rThe:')
SELECT 'column1', 'column2' UNION ALL SELECT substr(mytext,1, strpos(mytext, ':') -1) col1, substr(mytext,strpos(mytext, ':') 1) col2 FROM mytab
?column? | ?column? :-------------- | :---------------- column1 | column2 [email protected] | v@rch:r$tR:ng [email protected] | :multipleTypes [email protected] | &ofTxtGoAfT3rThe:
COPY (SELECT 'column1', 'column2' UNION ALL SELECT substr(mytext,1, strpos(mytext, ':') -1) col1, substr(mytext,strpos(mytext, ':') 1) col2 FROM mytab) TO '/tmp/ind.tsv' CSV HEADER DELIMITER E'\t';
SELECT substr('[email protected]::multipleTypes',1, strpos('[email protected]::multipleTypes', ':') -1) col1, substr('[email protected]::multipleTypes',strpos('[email protected]::multipleTypes', ':') 1) col2
col1 | col2 :-------------- | :------------- [email protected] | :multipleTypes
SELECT substr('[email protected]:&ofTxtGoAfT3rThe:',1, strpos('[email protected]:&ofTxtGoAfT3rThe:', ':') -1) col1, substr('[email protected]:&ofTxtGoAfT3rThe:',strpos('[email protected]:&ofTxtGoAfT3rThe:', ':') 1) col2
col1 | col2 :-------------- | :---------------- [email protected] | &ofTxtGoAfT3rThe:
db<>fiddle here
CodePudding user response:
@choroba is correct with Sed. This can also be done in python.
x = '[email protected]::multipleTypes'
x = x.split(':', 1)
print(x)
This splits the text at the first :