Read a specific file from nested sub-folders


I'm reading a single file from a subfolder and it works fine:

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("SparkAndHive")
      .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse2")
      .enableHiveSupport()
      .getOrCreate()

    GeoSparkSQLRegistrator.registerAll(spark.sqlContext)

    // one read per folder; the same pattern is repeated for the rails folder
    val roadRDD = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/IND_rds")
    val railRDD = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/IND_rrd")

    val rawSpatialDf = Adapter.toDf(roadRDD, spark)
    rawSpatialDf.createOrReplaceTempView("rawSpatialDf")

    spark.sql("select * from rawSpatialDf").show
  

The problem is that my current folder structure is as below:

.
├── ind
│   ├── IND_rds
│   │   ├── IND_roads.dbf
│   │   ├── IND_roads.prj
│   │   ├── IND_roads.shp
│   │   └── IND_roads.shx
│   └── IND_rrd
│       ├── IND_rails.dbf
│       ├── IND_rails.prj
│       ├── IND_rails.shp
│       └── IND_rails.shx
├── nep
│   ├── NPL_rds
│   │   ├── NPL_roads.dbf
│   │   ├── NPL_roads.prj
│   │   ├── NPL_roads.shp
│   │   └── NPL_roads.shx
│   └── NPL_rrd
│       ├── NPL_rails.dbf
│       ├── NPL_rails.prj
│       ├── NPL_rails.shp
│       └── NPL_rails.shx

  • Currently I have two countries, and each country has two subfolders: one with road details and one with rail details. I want to create just two RDDs: the first loading every country's road data and the second loading every country's rail data.

  • With my current approach I would be creating too many RDDs and would have to manually give the path of every road and rail directory.

  • In other words, I would be creating 4 RDDs and hardcoding the paths of those folders manually, roughly as sketched below.
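For reference, a minimal sketch of that hard-coded version, assuming the tree above lives under src/main/resources (the variable names are just illustrative):

    // one read per country/type, with every path written out by hand
    val indRoads = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/ind/IND_rds")
    val indRails = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/ind/IND_rrd")
    val nepRoads = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/nep/NPL_rds")
    val nepRails = ShapefileReader.readToGeometryRDD(spark.sparkContext, "src/main/resources/nep/NPL_rrd")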

Is there an alternative approach to pick up every country and its respective road and rail directories dynamically from the nested folders?

CodePudding user response:

You haven't shared the complete code, but based on my understanding, you could do this better with a partitioned table and the Spark SQL module.

Loading data into DataFrames is generally better than loading it into RDDs.

In your table, you could use two partition columns, country and rail. When you read the data, you then only specify the root directory instead of the full root/country_name/rail_name path, and the schema of the DataFrame you obtain would contain all the columns in your files plus country_name and rail_name.

However, you first need to rename your directories like this:

.
├── country_name=ind
│   ├── rail_name=IND_rds
│   │   ├── IND_roads.dbf
│   │   ├── IND_roads.prj
│   │   ├── IND_roads.shp
│   │   └── IND_roads.shx
│   └── rail_name=IND_rrd
│       ├── IND_rails.dbf
│       ├── IND_rails.prj
│       ├── IND_rails.shp
│       └── IND_rails.shx
├── country_name=nep
│   ├── rail_name=NPL_rds
│   │   ├── NPL_roads.dbf
│   │   ├── NPL_roads.prj
│   │   ├── NPL_roads.shp
│   │   └── NPL_roads.shx
│   └── rail_name=NPL_rrd
│       ├── NPL_rails.dbf
│       ├── NPL_rails.prj
│       ├── NPL_rails.shp
│       └── NPL_rails.shx

then:

 val df = spark.read.load("the root path")
 val df_ind = df.filter(col("country_name") === "ind")

For more info, you can refer to the partition discovery section in this doc: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
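As a rough sketch of how the two datasets the question asks for could then be obtained: this assumes the data has been written out under the renamed layout in a format spark.read supports out of the box (e.g. Parquet); basePath and Column.endsWith are standard Spark SQL, but the root path and column values are assumptions based on the tree above.

    import org.apache.spark.sql.functions.col

    val df = spark.read
      .option("basePath", "src/main/resources")               // hypothetical root; adjust to your layout
      .load("src/main/resources")
    val roadsDf = df.filter(col("rail_name").endsWith("_rds")) // every country's road data
    val railsDf = df.filter(col("rail_name").endsWith("_rrd")) // every country's rail data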

Alternatively, if you cannot rename the folders to this format, refer to this link: How to make Spark session read all the files recursively? In short, if you are on Spark 3 or later you can use recursiveFileLookup; otherwise you have to work with the HDFS listFiles API.
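A sketch of both options; the format and root path below are placeholders, recursiveFileLookup is a Spark 3.x option for the built-in file data sources, and the directory listing uses the standard Hadoop FileSystem API:

    // Spark 3.x: recursive lookup for the built-in file data sources
    val df = spark.read
      .option("recursiveFileLookup", "true")
      .format("csv")                                  // placeholder format
      .load("src/main/resources")

    // Before Spark 3: collect the leaf directories yourself and hand each one to your reader
    import org.apache.hadoop.fs.{FileSystem, Path}
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val roadDirs = fs.listStatus(new Path("src/main/resources"))
      .filter(_.isDirectory)
      .flatMap(country => fs.listStatus(country.getPath))
      .filter(_.getPath.getName.endsWith("_rds"))
      .map(_.getPath.toString)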

CodePudding user response:

Try using wildcard characters in the input path:

 inputpath = "src/main/resources/[A-Za-z]*/*_rds"
 inputpath = "src/main/resources/[A-Za-z]*/*_rrd"
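I'm not certain ShapefileReader expands glob patterns on its own; if it doesn't, one option is to expand them yourself with Hadoop's globStatus and union the per-directory results into the two datasets the question asks for. A sketch, reusing the ShapefileReader and Adapter calls from the question and assuming the shapefiles of one type share a schema:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    def readDirs(pattern: String) = {
      // expand the wildcard into concrete directory paths
      val dirs = fs.globStatus(new Path(pattern)).map(_.getPath.toString)
      dirs.map { dir =>
        val rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext, dir)
        Adapter.toDf(rdd, spark)
      }.reduce(_ union _)                             // one DataFrame per type, all countries combined
    }

    val roadsDf = readDirs("src/main/resources/[A-Za-z]*/*_rds")
    val railsDf = readDirs("src/main/resources/[A-Za-z]*/*_rrd")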
