I want to create a CSV file from a bunch of text files in a directory, each with the following structure, so that I can import them into a database later.
Title:
Article title
Word Count:
100
Summary:
Article summary.
Can consist of multiple lines.
Keywords:
keyword1, keyword2, keyword3
Article Body:
The rest of the article body.
Till the end of the file.
So the desired result is a CSV file with the section names as headers and each file's contents as a row, like this:
Title | Word Count | Summary | Keywords | Article Body |
Article title | 100 | Article summary.\nCan... | keyword1, keyword2, keyword3 | ... |
Article2 title | 110 | Article summary.\nCan... | keyword1, keyword2, keyword3 | ... |
I've tried a few approaches with awk and shell scripting but couldn't succeed so far. Any ideas?
CodePudding user response:
According to the documentation of COPY, PostgreSQL fully supports the CSV format, and also a TEXT format which is by default compatible with the lossless TSV format.
Because I'm using awk, I chose to output the TSV format. The reason is that there are newlines in the data and POSIX awk doesn't allow storing literal newlines in a user-defined variable. The TSV format doesn't have this problem because you replace the newlines with \n in the data.
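To illustrate the escaping, here's a minimal sketch (the sample text is made up): in paragraph mode (RS='') the whole block is read as one record, and the literal newlines inside it are rewritten as the two characters \n:
printf 'Article summary.\nCan consist of multiple lines.\n' |
awk -v RS='' '{ gsub(/\\/,"\\\\"); gsub(/\n/,"\\n"); print }'
# prints: Article summary.\nCan consist of multiple lines.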
Also, I modified the input format to make it easier to parse. The rule is that one or more empty lines delimit the blocks, meaning that you can't have empty lines in the content of Summary or Article Body anymore (I added a line containing a single space in the example).
Input Example:
Title:
Article title

Word Count:
100

Summary:
Article summary.
 
Can consist of multiple lines.

Keywords:
keyword1, keyword2, keyword3

Article Body:
The rest of the article body.
 
Till the end of the file.
And here's the awk command (which is able to process multiple files in one go):
edit: the field names in the header are now converted to snake case.
awk -v RS='' -v FS='^$' -v OFS='\t' '
FNR == 1 { fn++ }                              # new file => new record index
/^[^:\n]+:/ {
    fieldName = substr($0,1,index($0,":")-1)   # text before the first colon
    sub("^[^:]+:[^\n]*\n","")                  # strip the "Field:" line
    gsub(/\\/,"\\\\")                          # escape backslashes first
    gsub(/\n/,"\\n")
    gsub(/\r/,"\\r")
    gsub(/\t/,"\\t")
    header[fieldName]                          # register the field name
    record[fn,fieldName] = $0
}
END {
    out = sep = ""
    for (fieldName in header) {
        gsub(/ /,"_",fieldName)                # "Word Count" -> "word_count"
        out = out sep tolower(fieldName)
        sep = OFS
    }
    print out
    for (i = 1; i <= fn; i++) {
        out = sep = ""
        for (fieldName in header) {
            out = out sep record[i,fieldName]
            sep = OFS
        }
        print out
    }
}
' *.txt
The output:
summary article_body word_count title keywords
Article summary.\n \nCan consist of multiple lines. The rest of the article body.\n \nTill the end of the file. 100 Article title keyword1, keyword2, keyword3
PS: Once you have a valid TSV file, you can use another tool like mlr to convert it to CSV, JSON, etc., but for the purpose of importing it into PostgreSQL, that isn't required.
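For example, a conversion with mlr could look like this (untested; the file names are placeholders):
mlr --itsv --ocsv cat articles.tsv > articles.csv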
Here would be the SQL command (untested):
COPY tablename FROM '/path/file.tsv' HEADER;
note: It seems like you don't need to specify the FORMAT nor the DELIMITER, because the defaults are text and \t.
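Spelled out with explicit options it would look something like this (still untested; the table name, column list and file path are assumptions, and the column list has to follow the field order of the generated header, which depends on awk's array traversal order). Note that HEADER with the text format is only accepted by recent PostgreSQL versions.
COPY article (summary, article_body, word_count, title, keywords)
FROM '/path/file.tsv'
WITH (FORMAT text, HEADER true);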
CodePudding user response:
I made some changes to @Fravadona's script and created an insert statement instead. That seemed more practical to me and it works. The answer was really helpful; I'm just adding this here as a reference, as it might be useful for other people.
awk -v RS='' -v FS='^$' -v OFS='\t' '
FNR == 1 { fn++ }
/^[^:]+:/ {
    fieldName = substr($0,1,index($0,":")-1)
    sub("^[^:]+:[^\n]*\n","")
    gsub(/\\/,"\\\\")
    gsub(/\n/,"\\n")
    gsub(/\r/,"\\r")
    gsub(/\t/,"\\t")
    header[fieldName]
    record[fn,fieldName] = $0
}
END {
    ORS = ""
    print "insert into article(summary, content, word_count, title, keywords) values(E\047"
    for (i = 1; i <= fn; i++) {
        sep = "\047,\047"                      # separator: quote, comma, quote (\047 = single quote)
        out = ""
        for (fieldName in header) {
            out = out record[i,fieldName] sep
        }
        # drop the trailing comma and quote, keep the closing quote of the last value, then close the parenthesis
        print substr(out,1,length(out)-2) ")"
    }
}
' x.txt
Result:
insert into article(summary, content, word_count, title, keywords) values(E'Article summary.\n \nCan consist of multiple lines.','The rest of the article body.\n \nTill the end of the file.','100','Article title','keyword1, keyword2, keyword3')
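If it helps, the generated statement can be piped straight into psql (an untested sketch; make_insert.awk would hold the program above and mydb is a placeholder database name):
awk -v RS='' -v FS='^$' -f make_insert.awk x.txt | psql -d mydb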