Sort large CSV file by specific column using csvhelper


I have a big CSV file containing about 20 million records (about 4 GB in size).

My goal is to read the data from this file and sort the records by specific columns (multiple columns, not just one). I'm currently using the CsvHelper library. My idea was to load the data into a List so I could use LINQ to order it as expected.

csvReader.GetRecords<T>() returns a lazily evaluated IEnumerable<T>, but as soon as I append .ToList() it throws a System.OutOfMemoryException.

var records = csvReader.GetRecords<T>();

I tried another way: building the list manually with a foreach loop that adds each record to an empty List<T>

List<T> lst = new List<T>();
foreach (var item in records)
{
  lst.Add(item);
}

but it still throws a System.OutOfMemoryException.

Is there any way to load the data from the CSV file into a list and order it by specific columns? My PC has 16 GB of RAM and the file is about 4 GB.

CodePudding user response:

Chop it into subsets, sort each subset, then merge. Here is an example of an extension I wrote:

public static class MergeSortEnumerableExtensions
{
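    // External merge sort: sorts the source in fixed-size chunks, spills each
    // sorted chunk to the supplied persistent storage as a "run", then merges
    // the sorted runs pairwise until a single fully sorted run remains.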
    public static IEnumerable<TElement> MergeOrderBy<TElement, TKey>(this IEnumerable<TElement> sourceEnumerable, Func<TElement, TKey> keyProvider,
        IEnumerableStorage<TElement> storage, IComparer<TKey> comparer = null, int minimalChunkSize = 1024*1024)
    {
        if(sourceEnumerable == null)
            throw new ArgumentNullException(nameof(sourceEnumerable));
        if (storage == null)
            throw new ArgumentNullException(nameof(storage));
        if (keyProvider == null)
            throw new ArgumentNullException(nameof(keyProvider));

        comparer = comparer ?? Comparer<TKey>.Default;
        storage.Clear();
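        // Phase 1: sort each chunk in memory and spill it to storage as a sorted run.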
        foreach (var chunk in sourceEnumerable.Chunk(minimalChunkSize))
        {
            storage.Push(chunk.OrderBy(keyProvider, comparer));
        }

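        // Phase 2: keep merging pairs of sorted runs until at most one remains.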
        while (storage.Count / 2 > 0)
        {
            var first = storage.Pop();
            var second = storage.Pop();

            storage.Push(MergeSorted(first, second, keyProvider, comparer));
        }

        return storage.Count > 0 ? storage.Pop() : Enumerable.Empty<TElement>();
    }

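    // Classic two-way merge of two already-sorted sequences; streams both
    // inputs lazily and never materializes them in memory.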
    private static IEnumerable<TElement> MergeSorted<TElement, TKey>(IEnumerable<TElement> a, IEnumerable<TElement> b, Func<TElement, TKey> keyProvider, IComparer<TKey> comparer)
    {
        using var aiter = a.GetEnumerator();
        using var biter = b.GetEnumerator();

        var amoved = aiter.MoveNext();
        var bmoved = biter.MoveNext();
        while (amoved || bmoved)
        {
            var cmp = amoved && bmoved ? comparer.Compare(keyProvider(aiter.Current), keyProvider(biter.Current)) : (amoved ? -1 : 1);
            if (cmp <= 0)
            {
                yield return aiter.Current;
                amoved = aiter.MoveNext();
            }
            else
            {
                yield return biter.Current;
                bmoved = biter.MoveNext();
            }
        }
    }

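    // Lazily splits a sequence into chunks of at most chunkSize elements.
    // Each chunk must be fully consumed before the next chunk is requested.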
    private static IEnumerable<IEnumerable<TValue>> Chunk<TValue>(
        this IEnumerable<TValue> values,
        int chunkSize)
    {
        using var enumerator = values.GetEnumerator();
        while (enumerator.MoveNext())
        {
            yield return GetChunk(enumerator, chunkSize);
        }
    }

    private static IEnumerable<T> GetChunk<T>(
        IEnumerator<T> enumerator,
        int chunkSize)
    {
        do
        {
            yield return enumerator.Current;
        } while (--chunkSize > 0 && enumerator.MoveNext());
    }
}

Usage:

    [TestCase(new int[0])]
    [TestCase(new[] { 1 })]
    [TestCase(new[] { 1, 2, 3, 4 })]
    [TestCase(new[] { 4, 3, 2, 1 })]
    [TestCase(new[] { 1, 2, 3, 4, 5 })]
    [TestCase(new[] { 5, 4, 3, 2, 1 })]
    [TestCase(new[] { 1, 1, 1 })]
    public void CheckPersistent(int[] data)
    {
        using var storage = new JsonEnumerableStorage<int>();
        var expected = data.OrderBy(x => x).ToList();
        var actual = data.MergeOrderBy(x => x, storage, minimalChunkSize: 2).ToList();
        CollectionAssert.AreEqual(expected, actual, storage.TempFolder);
    }

You basically write a streamed persistent storage (in CSV, JSON, any other format, or straight into SQL) with Push/Pop methods and plug it into the extension; a sketch follows below. The memory consumption can be tuned by choosing the minimal chunk size, depending on the size and number of entities you want to sort at a time.
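For reference, here is a minimal sketch of what such a storage might look like. The shape of IEnumerableStorage<T> is inferred from how the extension above uses it, and this JSON-lines, temp-file-backed JsonLinesStorage<T> is an illustrative assumption, not the implementation from the repo linked below:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Contract the MergeOrderBy extension expects (shape inferred from its usage).
public interface IEnumerableStorage<T>
{
    int Count { get; }
    void Clear();
    void Push(IEnumerable<T> items); // persist one sorted run
    IEnumerable<T> Pop();            // stream one run back, removing it
}

// Hypothetical implementation: one JSON-lines temp file per sorted run.
public sealed class JsonLinesStorage<T> : IEnumerableStorage<T>, IDisposable
{
    private readonly Stack<string> _files = new();

    public int Count => _files.Count;

    public void Clear()
    {
        while (_files.Count > 0)
            File.Delete(_files.Pop());
    }

    public void Push(IEnumerable<T> items)
    {
        var path = Path.GetTempFileName();
        using (var writer = new StreamWriter(path))
        {
            foreach (var item in items)
                writer.WriteLine(JsonSerializer.Serialize(item)); // one record per line
        }
        _files.Push(path);
    }

    public IEnumerable<T> Pop()
    {
        // Pop the path eagerly so two consecutive Pop() calls take distinct files.
        var path = _files.Pop();
        return ReadAndDelete(path);
    }

    private static IEnumerable<T> ReadAndDelete(string path)
    {
        foreach (var line in File.ReadLines(path))
            yield return JsonSerializer.Deserialize<T>(line)!;
        File.Delete(path);
    }

    public void Dispose() => Clear();
}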

PS

Link to my repo with tests (and storage examples):

https://github.com/eocron/Algorithm/tree/master/Algorithm/Sorted

PS2

With a little bit of time you can make it fully async or parallel.
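For instance (an assumption on my part, not code from the repo), PLINQ can sort each chunk across cores before it is spilled to storage; only the chunk-sorting loop inside MergeOrderBy changes:

// Phase 1 with parallel in-memory sorting of each chunk.
// chunk.ToList() materializes the chunk so PLINQ can partition it.
foreach (var chunk in sourceEnumerable.Chunk(minimalChunkSize))
{
    storage.Push(chunk.ToList().AsParallel().OrderBy(keyProvider, comparer));
}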

CodePudding user response:

I just loaded 20 million records of a simple Foo class with Id and Name properties. The file was 0.4 GB, but the loaded records took close to 2 GB of memory. If we assume your class will also need about 5 times its file size in memory, you would need 20 GB of memory for your 4 GB file. And even if you do manage to load all the records, you will need still more memory to do the sorting.
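If you want to reproduce that measurement for your own class, here is a rough sketch (GC.GetTotalMemory is only an approximation; the path and the Foo class are the ones from the example below):

var before = GC.GetTotalMemory(forceFullCollection: true);

List<Foo> records;
using (var reader = new StreamReader(@"C:\Temp\VeryLargeFoo.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    // Materializing the full list is exactly what consumes the memory.
    records = csv.GetRecords<Foo>().ToList();
}

var after = GC.GetTotalMemory(forceFullCollection: true);
Console.WriteLine($"{records.Count:N0} records ≈ {(after - before) / (1024.0 * 1024 * 1024):F2} GB");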

If you don't want to use the suggestion from Ňɏssa Pøngjǣrdenlarp to upload the file to a database, you might be able to use a smaller SortingClass that only has the columns you want to sort on, plus a Row property that lets you find the records again in the file. After sorting, if you only need a small subset of the records, use the row numbers to retrieve the complete records from the file.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;

void Main()
{
    List<SortingClass> sortedRecords;

    using (var reader = new StreamReader(@"C:\Temp\VeryLargeFoo.csv"))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        csv.Context.RegisterClassMap<SortingClassMap>();
        sortedRecords = csv.GetRecords<SortingClass>().OrderByDescending(f => f.Name).ThenBy(f => f.Id).ToList();
    }
    
    // A HashSet makes the per-row membership test O(1); scanning an array
    // with Contains for each of 20 million rows would be noticeably slower.
    var selectedRows = sortedRecords.Take(100).Select(r => r.Row).ToHashSet();
    var selectedFoo = new List<Foo>();

    using (var reader = new StreamReader(@"C:\Temp\VeryLargeFoo.csv"))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        csv.Read();
        csv.ReadHeader();
        while (csv.Read())
        {
            if (selectedRows.Contains(csv.Parser.Row)){
                selectedFoo.Add(csv.GetRecord<Foo>());
            }
        }
    }
}

public class SortingClass
{
    public int Row { get; set; }
    public int Id { get; set; }
    public string Name { get; set; }
}

public class SortingClassMap : ClassMap<SortingClass>
{
    public SortingClassMap()
    {
        Map(r => r.Row).Convert(args => args.Row.Context.Parser.Row);
        Map(r => r.Id);
        Map(r => r.Name);
    }
}

public class Foo
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Column3 { get; set; }
    public string Column4 { get; set; }
    public string Column5 { get; set; }
    public string Column6 { get; set; }
}
