How to optimize the search in an array with more than 50 thousand elements?-CodePudding

I have an array of strings with approximately 50,000 elements.

export const companies = [
  "000014",
  "000016",
  "000017",
  "000019",
  "000020",
  "000021",
  "000023",
  "000025",
  ...
]

These are the names of companies for which I display certain pages. I made a middleware in which I run a loop and walk through this large array.

import { NextResponse, type NextRequest } from "next/server";
import { companies } from "./assets/companies";

export async function middleware(req: NextRequest) {
  const { pathname } = req.nextUrl;

  // cycle for comparing current URL with companies
  await for (let i = 0; i < companies.length; i  ) {
    if (pathname.startsWith(`/${companies[i]}`))
        return NextResponse.redirect(new URL("/", req.url)); // redirect to main page if companies compare with current pathname
  }
}

It takes some time, how can it be optimized? There was an idea to divide the array into chunks, but this is also not a very good option.

CodePudding user response：

Assuming pathname looks something like /000014/abc/xyz then you can get rid of the array iteration entirely. Something like this:

import { NextResponse, type NextRequest } from "next/server";
import { companies } from "./assets/companies";

const companiesSet = new Set(companies);

export async function middleware(req: NextRequest) {
  const { pathname } = req.nextUrl;
  
  // Ideally you could get this from req.params.companyId instead, but whether that exists or not would depend on the routing code, which you haven't shown.
  const companyId = req.pathname.match(/\/([0-9] )\//)?.[1];

  if (companiesSet.has(companyId)) {
    return NextResponse.redirect(new URL("/", req.url)); // redirect to main page if companies compare with current pathname
  }
}

That being said, 50,000 elements isn't really that large, all things considered, and that code shouldn't have been particularly slow. The unnecessary await and the string building inside of the loop may have been an issue.

CodePudding user response：

If for some reason you can't just do map / object lookups, like @Aurast suggested (for example, you need to do free text searching to find closest matching name), then one very simple approach is to group companies so you can divide the list into smaller chunks, and search only one chunk. For example, divide by initial letter:

export const companies = {
    "a": ["aardvarks, inc", "AppendeciteCorp", ],
    "b": ["boulevard bricklayers", "bandits and bakers", "brompton fan club"],
}

Then you can look up by just first letter, in order to get a much smaller array. This would be very simple to implement.

The other approach is to take this to its logical conclusion, and recursively group by first letter in a data structure called a trie or prefix tree. This allows you to very quickly search for stuff even if it's not an exact match.

CodePudding user response：

There are a few ways to improve performance

Algorithm: you can use a faster search algorithm such as binary search or hash table lookup. These algorithms have a lower time complexity and can search through the array more efficiently.

here is an example using Binary Search

Binary Search Algorithm

The array should be sorted in order to use Binary Search

import { NextResponse, type NextRequest } from "next/server";
import { companies } from "./assets/companies";

function binarySearch(array, target) {
  let left = 0;
  let right = array.length - 1;

  while (left <= right) {
    const middle = Math.floor((left   right) / 2);
    const value = array[middle];

    if (value === target) {
      return middle;
    } else if (value < target) {
      left = middle   1;
    } else {
      right = middle - 1;
    }
  }

  return -1;
}

export async function middleware(req: NextRequest) {
  const { pathname } = req.nextUrl;
  const target = pathname.substring(1);
  // Use binary search to find company name in array
  const index = binarySearch(companies, target);
  if (index !== -1) {
    return NextResponse.redirect(new URL("/", req.url));
  }
}

Hash table

import { NextResponse, type NextRequest } from "next/server";
import { companies } from "./assets/companies";

const hashTable = {};
for (const company of companies) {
  hashTable[company] = true;
}

export async function middleware(req: NextRequest) {
  const { pathname } = req.nextUrl;
  const target = pathname.substring(1);

  // Use hash table to find company name in array
  if (hashTable[target]) {
    return NextResponse.redirect(new URL("/", req.url));
  }
}

cache: If the search results are not changing frequently, you can use a cache to store the search results and avoid the need to search through the entire array every time. This can greatly reduce the time it takes to search through the array, especially if the same searches are being performed repeatedly.

Caching

import { NextResponse, type NextRequest } from "next/server";
import { companies } from "./assets/companies";

const cache = new Map(); // Create a new cache

function linearSearch(array, target) {
  // Check if search result is in cache
  if (cache.has(target)) {
    return cache.get(target);
  }

  // Search through the array
  for (let i = 0; i < array.length; i  ) {
    if (array[i] === target) {
      cache.set(target, i); // Add search result to cache
      return i;
    }
  }

  return -1;
}

export async function middleware(req: NextRequest) {
  const { pathname } = req.nextUrl;
  const target = pathname.substring(1);

  // Use linear search to find company name in array
  const index = linearSearch(companies, target);
  if (index !== -1) {
    return NextResponse.redirect(new URL("/", req.url));
  }
}

Alternatively, You can use the Redis database for caching.

Time Complexity

Linear search: O(n), where n is the size of the array. This means that the search time increases linearly with the size of the array. Linear search is the most basic search algorithm and is relatively easy to implement, but it can be slow for large arrays.
Binary search: O(log n), where n is the size of the array. This means that the search time increases logarithmically with the size of the array. Binary search is more efficient than linear search, but it requires that the array be sorted.
Hash table: O(1), where n is the size of the array. This means that the search time is constant and does not depend on the size of the array. Hash tables are very efficient for search and insertion operations, but they can take up a lot of memory.
Cache: O(1), where n is the size of the cache. This means that the search time is constant and does not depend on the size of the array. Caches can be used to store the results of expensive searches and avoid the need to search through the entire array every time. The size of the cache is usually much smaller than the size of the array, so the search time is generally very fast. However, the cache can fill up over time and may need to be periodically cleared to free up memory.