HD|20211210
DT|D-|12/22/2017|12/22/2017 09:41:45.828000|11/01/2017|01/29/2018 14:46:10.666000|1.2|1.2|ABC|ABC|123|123|4554|023|11/01/2017|ACDF|First|0012345||f|ABCD|ABCDEFGH|ABCDEFGH||||
DT|D-|12/25/2017|12/25/2017 09:24:20.202000|12/13/2017|01/29/2018 07:52:23.607000|6.4|6.4|ABC|ABC|123|123|4540|002|12/13/2017|ACDF|First|0012345||f|ABC|ABCDEF|ABCDEFGH||||
TR|0000000002
The file name is Datafile.Dat, and I am using Scala 2.11.
I need to create a header DataFrame from the first line, excluding the "HD|" prefix; a trailer DataFrame from the last line, excluding the "TR|" prefix; and the actual DataFrame from the remaining lines, skipping the first and last lines and excluding the "DT|" prefix from each line.
Please help me with this.
CodePudding user response:
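I see you have a defined schema for your DataFrame (except for the first and last rows). You can read the file with '|' as the separator and enable "DROPMALFORMED" mode, which drops rows that do not fit the schema, such as the trailer line. With the header option set to true, the first line is skipped as well. Note that this only gives you the detail rows, and the schema has to include a leading column for the "DT" marker, which you can drop afterwards.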
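// Placeholder: replace with the real schema (a StructType, or a DDL string on Spark 2.3+)
val schema = "define your schema here"
// header=true skips the "HD|" line; DROPMALFORMED drops rows that do not match the schema
val df = spark.read.option("mode", "DROPMALFORMED").option("delimiter", "|").option("header", "true").schema(schema).csv("Datafile.Dat")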
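Another way is to use zipWithIndex on the lines of the file, so that the first and last lines can be picked out by their index and the rest treated as detail rows.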
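A minimal sketch of the zipWithIndex approach, under a few assumptions: Spark 2.2 or later (for `spark.read.csv` on a `Dataset[String]`), a file small enough that counting the lines once is acceptable, and no schema applied to the detail rows (columns come out as `_c0`, `_c1`, ... strings). The object name `DatafileSplitter`, the method `split`, and the column names `header` and `trailer` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark: splits a pipe-delimited file
// into (header, detail, trailer) DataFrames by line position.
object DatafileSplitter {

  def split(spark: SparkSession, path: String): (DataFrame, DataFrame, DataFrame) = {
    import spark.implicits._

    // Pair every non-blank line with its 0-based position in the file.
    val indexed = spark.sparkContext
      .textFile(path)
      .filter(_.trim.nonEmpty)
      .zipWithIndex()
      .cache()
    val lastIndex = indexed.count() - 1

    // First line, without the "HD|" prefix.
    val headerDf = indexed
      .filter { case (_, i) => i == 0 }
      .map { case (line, _) => line.stripPrefix("HD|") }
      .toDF("header")

    // Last line, without the "TR|" prefix.
    val trailerDf = indexed
      .filter { case (_, i) => i == lastIndex }
      .map { case (line, _) => line.stripPrefix("TR|") }
      .toDF("trailer")

    // Everything in between, without the "DT|" prefix, parsed as '|'-separated CSV.
    val detailLines = indexed
      .filter { case (_, i) => i > 0 && i < lastIndex }
      .map { case (line, _) => line.stripPrefix("DT|") }
    val detailDf = spark.read.option("delimiter", "|").csv(detailLines.toDS())

    (headerDf, detailDf, trailerDf)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatafileSplitter")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()

    val (headerDf, detailDf, trailerDf) = split(spark, "Datafile.Dat")
    headerDf.show(false)
    detailDf.show(false)
    trailerDf.show(false)

    spark.stop()
  }
}
```

If you already have a schema for the detail rows, you can add `.schema(yourSchema)` before `.csv(...)` to get typed, named columns instead of the default string columns.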