I want to skip the first 36 lines of a file in HDFS and copy the rest to another location in HDFS. Is there any command similar to head/tail for this?
CodePudding user response:
Quite simply, no, there is no one-liner to do this. Files in Hadoop can be massive, so there are no CLI tools for basic content manipulation; the computation engines are decoupled from HDFS. Your best bet, depending on how your cluster is set up, is either a simple MapReduce job (look at Python word-count examples) or a Spark job.
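That said, if the file is small enough to stream through the client machine, the standard HDFS shell can be combined with `tail` in a pipe. This is a client-side workaround, not an HDFS-side operation, and the paths below are placeholders:

```shell
# The key trick: `tail -n +N` prints from line N onward, so
# `tail -n +37` drops the first 36 lines.
#
# Against HDFS the pipe would look like (placeholder paths;
# everything streams through the client, so only practical
# for modestly sized files):
#   hdfs dfs -cat /path/to/input.txt | tail -n +37 | hdfs dfs -put -f - /path/to/output.txt
#
# Local demonstration of the tail step:
seq 1 100 > /tmp/input.txt
tail -n +37 /tmp/input.txt | head -n 1   # prints 37: lines 1-36 were skipped
```

For anything large, stick with a distributed job as described above, since this pipe funnels the whole file through one machine.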
CodePudding user response:
You can implement a workaround in Spark:
Read file by file:
val df = spark.read.csv("file1.csv")
Include the line number for each row:
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
val dfWithId = df.withColumn("row_id", monotonically_increasing_id())
Filter out the first 36 lines and write the result to another location:
dfWithId.filter(col("row_id") >= 36).drop("row_id").write.save("destination-path")
Note that monotonically_increasing_id() starts at 0 and is only guaranteed to be increasing, not consecutive, across partitions, so this filter is reliable only when the first 36 rows all land in the first partition.
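If the file's exact line order must be respected regardless of partitioning, the RDD API's zipWithIndex assigns consecutive 0-based indices in input order, which avoids the caveat above. A sketch along those lines, with placeholder paths and app name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skip-lines").getOrCreate()

// zipWithIndex assigns consecutive 0-based indices in input order,
// unlike monotonically_increasing_id, which only guarantees
// monotonically increasing (not consecutive) values across partitions.
val lines = spark.sparkContext
  .textFile("hdfs:///path/to/input.txt")
  .zipWithIndex()
  .filter { case (_, idx) => idx >= 36 } // drop the first 36 lines
  .map { case (line, _) => line }

lines.saveAsTextFile("hdfs:///path/to/output")
```

zipWithIndex triggers an extra Spark job to compute per-partition counts, so it costs a pass over the data, but the resulting indices match the file's line numbers exactly.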
CodePudding user response:
As suggested by Ben, I was able to do it using a Spark job, excluding those lines based on a row-index flag.