Tuesday, May 3, 2011

What's the most efficient way of filtering an array based on the contents of another array?

Say I have two arrays, items and removeItems and I wanted any values found in removeItems to be removed from items.

The brute force mechanism would probably be:

var animals = ["cow","dog","frog","cat","whale","salmon","zebra","tuna"];
var nonMammals = ["salmon","frog","tuna","spider"];
var mammals = [];
var isMammal;

for(var i=0;i<animals.length;i++){
   isMammal = true;
   for(var j=0;j<nonMammals;j++){
     if(nonMammals[j] === animals[i]){
       isMammal = false;
       break;
     }
   }
   if(isMammal){
     mammals.push(animals[i]);
   }
}

This is what? O(N^2)? Is there a more efficient way?

From stackoverflow
  • That's actually O(M * N).

    Probably you could do better sorting the animals array first, then doing a binary search. You'll be able to reduce to O(N * log N) - well, that's if log N < M anyway.

    Anyway, if you're working with JS and that runs client-side, then just try to keep amount of data at minimum, or their browsers will yell at you with every request.

    Stuart Branham : I was just writing something about sorting the lists first as well. You end up with n*lg(n) + m*lg(n) to sort the main list and then search main list m times, which is effectively m*lg(n) - much better than n^2
  • With jQuery it's pretty easy:

    function is_mammal(animal)
    {
        return $.inArray(animal, nonMammals) == -1;
    }
    
    mammals = $.grep(animals, is_mammal);
    

    See docs for $.grep and $.inArray.

    Seb : That's easy, but doesn't make it any better than O(N^2). Just because you "hide" the loops doesn't mean $.inArray() and $.grep() don't have them.
  • Basically what you want to do is efficiently compute the set difference S \ T. The (asymptotically) fastest way I know is to put T into a hashmap (that makes |T| steps) and go over each s in S (that makes |S| steps) checking wether s in T (which is O(1)). So you get to O(|T| + |S|) steps.

    Seb : While this _is_ faster than my answer, it requires more memory for the hash table. There're pros and cons everywhere, right? :) Anyway, I believe this is the best solution, so there goes my +1.
    bayer : Yes, that's why I wrote the asymptotically there. ;) Personally I'd guess that for small inputs even the squared solution suffices.

0 comments:

Post a Comment

Note: Only a member of this blog may post a comment.