How can I return a record from a CSV file using the byte position of line?

Time:11-16

I have a 172 MB assets.csv file with a million rows and 16 columns. I would like to read a record from it using a stored offset (byte, line, or record position). In the code below, I am using the byte offset.

I have stored the required positions (the byte() value of each record's position()) in assets_index.csv, and I would like to read a particular line of assets.csv using the saved offset.

I am able to get an output, but I feel there must be a better way to read from a CSV file based on byte position.

Please advise. I am new to programming and to Rust, and I have learned a lot from the tutorials.

The assets.csv is of this format:

asset_id,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation,year,depreciation
1000001,2015,10000,2016,10000,2017,10000,2018,10000,2019,10000,2020,10000,2021,10000,2022,10000,2023,10000,2024,10000,2025,10000,2026,10000,2027,10000,2028,10000,2029,10000

I used another function to get the Position { byte: 172999933, line: 1000000, record: 999999 }.
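The function that produced this position is not shown. As a rough sketch of how such byte offsets could be collected, the following uses only the standard library rather than the csv crate. The name `build_index` is hypothetical, and the code assumes the first line is a header, the file is valid UTF-8, and no field contains an embedded newline (quoted multi-line fields would need a real CSV parser).

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Scan the file once and return (asset_id, byte_offset) for each data row.
/// Assumption: one record per line, header on the first line.
fn build_index(path: &str) -> io::Result<Vec<(String, u64)>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut index = Vec::new();
    let mut line = String::new();
    let mut offset: u64 = 0; // byte offset of the line about to be read
    let mut is_header = true;

    loop {
        line.clear();
        // read_line returns the number of bytes consumed, newline included.
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // end of file
        }
        if !is_header {
            // The first field is taken to be the asset id.
            let id = line.split(',').next().unwrap_or("").trim();
            if !id.is_empty() {
                index.push((id.to_string(), offset));
            }
        }
        is_header = false;
        offset += n as u64;
    }
    Ok(index)
}
```

Each pair can then be written out as one `asset_id,offset_inbytes` row.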

The assets_index.csv is of this format:

asset_id,offset_inbytes
1999999,172999933
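
To look up an offset by asset id, the index file could be loaded into a map first. This is only a sketch under the assumption that the index has exactly the two columns shown above; `load_index` is a hypothetical helper, not part of the csv crate, and it silently skips rows it cannot parse.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Parse `asset_id,offset_inbytes` rows into a map from id to byte offset.
fn load_index(path: &str) -> io::Result<HashMap<String, u64>> {
    let reader = BufReader::new(File::open(path)?);
    let mut index = HashMap::new();

    // skip(1) drops the header row.
    for line in reader.lines().skip(1) {
        let line = line?;
        if let Some((id, off)) = line.split_once(',') {
            // Rows whose offset is not a valid u64 are ignored.
            if let Ok(off) = off.trim().parse::<u64>() {
                index.insert(id.trim().to_string(), off);
            }
        }
    }
    Ok(index)
}
```

With the sample row above, `load_index("assets_index.csv")?.get("1999999")` would yield the offset 172999933.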

use std::{error::Error, io};

fn read_from_position() -> Result<(), Box<dyn Error>> {
    let asset_pos: u64 = 172999933; // byte offset saved in assets_index.csv

    let file_path = "assets.csv";

    let mut rdr = csv::ReaderBuilder::new()
        .flexible(true)
        .from_path(file_path)?;

    let mut wtr = csv::Writer::from_writer(io::stdout());

    let mut record = csv::ByteRecord::new();

    // Scan from the start of the file until a record begins at the saved offset.
    while rdr.read_byte_record(&mut record)? {
        let pos = record.position().expect("position of record");

        if pos.byte() == asset_pos {
            wtr.write_record(&record)?;
            break;
        }
    }

    wtr.flush()?;

    Ok(())
}

$ time ./target/release/testcsv
1999999,2015,10000,2016,10000,2017,10000,2018,10000,2019,10000,2020,10000,2021,10000,2022,10000,2023,10000,2024,10000,2025,10000,2026,10000,2027,10000,2028,10000,2029,10000

Time elapsed in readcsv() is: 239.290125ms

./target/release/testcsv  0.22s user 0.02s system 99% cpu 0.245 total

CodePudding user response:

Instead of using from_path, you can use from_reader with a File, and seek in that file before creating the csv::Reader:

use std::{error::Error, fs, io::{self, Seek}};

fn read_from_position() -> Result<(), Box<dyn Error>> {
    let asset_pos = 0x116 as u64; // offset to only record in example
    let file_path = "assets.csv";

    let mut f = fs::File::open(file_path)?;
    f.seek(io::SeekFrom::Start(asset_pos))?;
    let mut rdr = csv::ReaderBuilder::new()
        .flexible(true)
        // Edit: as noted by @BurntSushi5, headers must be disabled here;
        // otherwise the record at the seek position is consumed as the header row.
        .has_headers(false)
        .from_reader(f);

    let mut wtr = csv::Writer::from_writer(io::stdout());
    let mut record = csv::ByteRecord::new();

    // read_byte_record returns false if there is no record at the offset.
    if rdr.read_byte_record(&mut record)? {
        wtr.write_record(&record)?;
    }
    wtr.flush()?;
    Ok(())
}

Then the first record read will be the one you're looking for.
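
For comparison, the same seek-then-read idea can be sketched with only the standard library. This is not the csv crate's API: `read_fields_at` is a hypothetical helper that splits on plain commas, so it assumes the offset points at the start of a line and that no field is quoted or contains a comma or newline.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Seek, SeekFrom};

/// Jump to `offset` and return the fields of the line that starts there.
/// Returns an empty Vec if the offset is at or past the end of the file.
fn read_fields_at(path: &str, offset: u64) -> io::Result<Vec<String>> {
    let mut file = File::open(path)?;
    // Move the file cursor directly to the record; nothing before it is read.
    file.seek(SeekFrom::Start(offset))?;

    let mut reader = BufReader::new(file);
    let mut line = String::new();
    if reader.read_line(&mut line)? == 0 {
        return Ok(Vec::new());
    }

    // Strip the trailing line ending, then split naively on commas.
    let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
    Ok(line.split(',').map(str::to_string).collect())
}
```

Either way, the work done no longer depends on how far into the file the record sits, which is the point of storing byte offsets. The csv crate also documents a `Reader::seek` method taking a `csv::Position`, which may be worth a look if the reader is reused for several lookups.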
