I want to create a CSV file from a bunch of text files in a directory, each with the following structure, so that I can import them into a database later.
Title:
Article title
Word Count:
100
Summary:
Article summary.
Can consist of multiple lines.
Keywords:
keyword1, keyword2, keyword3
Article Body:
The rest of the article body.
Till the end of the file.
So the desired result is a CSV file with the section names as headers and each file's contents as a row, like this:
Title | Word Count | Summary | Keywords | Article Body |
Article title | 100 | Article summary.\nCan... | keyword1, keyword2, keyword3 | ... |
Article2 title | 110 | Article summary.\nCan... | keyword1, keyword2, keyword3 | ... |
I've tried a few approaches with awk and shell scripting but couldn't succeed so far. Any ideas?
CodePudding user response:
According to the documentation of COPY, PostgreSQL fully supports the CSV format, and also a TEXT format which is by default compatible with the lossless TSV format.
Because I'm using awk, I chose to output the TSV format. The reason is that there are newlines in the data and POSIX awk doesn't allow storing literal newlines in a user-defined variable. The TSV format doesn't have this problem because you replace the newlines with \n in the data.
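To illustrate the escaping, here's a minimal sketch (the sample text is made up): in paragraph mode (RS='') the whole block is read as one record, and the literal newlines inside it are rewritten as the two characters \n:
printf 'Article summary.\nCan consist of multiple lines.\n' |
awk -v RS='' '{ gsub(/\\/,"\\\\"); gsub(/\n/,"\\n"); print }'
# prints: Article summary.\nCan consist of multiple lines.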
Also, I modified the input format to make it easier to parse. The rule is that one or more empty lines delimit the blocks, meaning that you can't have empty lines in the content of Summary or Article Body anymore (I added a line containing a single space in the example).
Input Example:
Title:
Article title

Word Count:
100

Summary:
Article summary.
 
Can consist of multiple lines.

Keywords:
keyword1, keyword2, keyword3

Article Body:
The rest of the article body.
 
Till the end of the file.
And here's the awk command (which is able to process multiple files in one go):
edit: the field names in the header are now converted to snake case.
awk -v RS='' -v FS='^$' -v OFS='\t' '
FNR == 1 { fn++ }                              # new file => new record index
/^[^:\n]+:/ {
    fieldName = substr($0,1,index($0,":")-1)   # text before the first colon
    sub("^[^:]+:[^\n]*\n","")                  # strip the "Field:" line
    gsub(/\\/,"\\\\")                          # escape backslashes first
    gsub(/\n/,"\\n")
    gsub(/\r/,"\\r")
    gsub(/\t/,"\\t")
    header[fieldName]                          # register the field name
    record[fn,fieldName] = $0
}
END {
    out = sep = ""
    for (fieldName in header) {
        gsub(/ /,"_",fieldName)                # "Word Count" -> "word_count"
        out = out sep tolower(fieldName)
        sep = OFS
    }
    print out
    for (i = 1; i <= fn; i++) {
        out = sep = ""
        for (fieldName in header) {
            out = out sep record[i,fieldName]
            sep = OFS
        }
        print out
    }
}
' *.txt
The output:
summary article_body word_count title keywords
Article summary.\n \nCan consist of multiple lines. The rest of the article body.\n \nTill the end of the file. 100 Article title keyword1, keyword2, keyword3
PS: Once you have a valid TSV file, you can use another tool like mlr to convert it to CSV, JSON, etc., but for the purpose of importing it into PostgreSQL, that isn't required.
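For example, a conversion with mlr could look like this (untested; the file names are placeholders):
mlr --itsv --ocsv cat articles.tsv > articles.csv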
Here would be the SQL command (untested):
COPY tablename FROM '/path/file.tsv' HEADER;
note: It seems like you don't need to specify the FORMAT nor the DELIMITER, because the defaults are text and \t.
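Spelled out with explicit options it would look something like this (still untested; the table name, column list and file path are assumptions, and the column list has to follow the field order of the generated header, which depends on awk's array traversal order). Note that HEADER with the text format is only accepted by recent PostgreSQL versions.
COPY article (summary, article_body, word_count, title, keywords)
FROM '/path/file.tsv'
WITH (FORMAT text, HEADER true);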
CodePudding user response:
I made some changes to @Fravadona's script and created an insert statement instead. That seemed more practical to me and it works. The answer was really helpful; I'm just adding this here as a reference, as it might be useful for other people.
awk -v RS='' -v FS='^$' -v OFS='\t' '
FNR == 1 { fn++ }
/^[^:]+:/ {
    fieldName = substr($0,1,index($0,":")-1)
    sub("^[^:]+:[^\n]*\n","")
    gsub(/\\/,"\\\\")
    gsub(/\n/,"\\n")
    gsub(/\r/,"\\r")
    gsub(/\t/,"\\t")
    header[fieldName]
    record[fn,fieldName] = $0
}
END {
    ORS = ""
    print "insert into article(summary, content, word_count, title, keywords) values(E\047"
    for (i = 1; i <= fn; i++) {
        sep = "\047,\047"                      # separator: quote, comma, quote (\047 = single quote)
        out = ""
        for (fieldName in header) {
            out = out record[i,fieldName] sep
        }
        # drop the trailing comma and quote, keep the closing quote of the last value, then close the parenthesis
        print substr(out,1,length(out)-2) ")"
    }
}
' x.txt
Result:
insert into article(summary, content, word_count, title, keywords) values(E'Article summary.\n \nCan consist of multiple lines.','The rest of the article body.\n \nTill the end of the file.','100','Article title','keyword1, keyword2, keyword3')
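If it helps, the generated statement can be piped straight into psql (an untested sketch; make_insert.awk would hold the program above and mydb is a placeholder database name):
awk -v RS='' -v FS='^$' -f make_insert.awk x.txt | psql -d mydb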