Neat Way Of Removing Duplicates From Array

Hello and thanks for all of the amazing work,

I was wondering what would be considered a neat way to pass an array to something and have its duplicate values removed?

Thanks again!

It might be overkill, depending on need, but hashtables in Neko work out fairly well; not a bad read

Still studying Neko, so I haven’t played with the higher-level Haxe hash routines, but I’m assuming it’s nearly a one-liner with a comprehension.

Or, eat the O(n) scan per insert and check for existing members before inserting? Again, I’m assuming that’s a one-block expression in Haxe.
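For illustration, here’s that scan-before-insert in plain JavaScript (a sketch only — the Haxe one-liner is left to someone who knows the comprehension syntax better than I do):

```javascript
// Deduplicate by scanning the result for membership before each insert.
// indexOf is O(n) per lookup, so the whole thing is O(n^2) --
// fine for small arrays, painful for big ones.
function dedup(arr) {
  var result = [];
  for (var i = 0; i < arr.length; i++) {
    if (result.indexOf(arr[i]) === -1) {
      result.push(arr[i]);
    }
  }
  return result;
}

console.log(dedup([1, 2, 2, 3, 1])); // [ 1, 2, 3 ]
```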

Or, someone more experienced will chime in. :slight_smile:

Have good, dedup well

For future reference, Mark shared some code on Haxe’s Gitter :slight_smile:

Depending on how large your array is and how many duplicate elements it has, you could sort it and use a haxe.ds.Vector to allocate the unique elements while keeping track of the last element added: Try Haxe!

This might work better for a large array with a smaller range of elements. I’m not sure what effect it would have on dynamic targets that don’t support fixed-size arrays, though.
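The Try Haxe link isn’t reproduced here, but a JavaScript sketch of the same idea (a plain array standing in for haxe.ds.Vector) would be:

```javascript
// Sort a copy first, then make one pass copying each element
// that differs from the last one kept: O(n log n) + O(n).
function dedupSorted(arr) {
  var sorted = arr.slice().sort(function (a, b) { return a - b; });
  var out = [];
  var last;
  for (var i = 0; i < sorted.length; i++) {
    if (i === 0 || sorted[i] !== last) {
      out.push(sorted[i]);
      last = sorted[i];
    }
  }
  return out;
}

console.log(dedupSorted([3, 1, 3, 2, 1])); // [ 1, 2, 3 ]
```

Note the result comes back sorted, which may or may not be what you want.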

Probably not what you want, but just a note that if you’re targeting JS, you can use something like lodash’s uniq directly. That’s what I currently do.

@myclick: I would say it depends on the target language. What I would try to avoid is creating a result array and checking membership with indexOf, as in the worst case that scans the entire result for every value in the original array.

What you might want to do instead is trade memory for speed and create a kind of hash object. For JavaScript this is just a generic object. Finding a key in a hash map is usually a lot faster than finding an element in an array.

A quick sample can be found here: Try Haxe!
Couple of things to note:

  • Because the original array contains integers, you have to convert these to strings in order to use them as hash keys.
  • I store a simple Bool in the hash as the value, as this is the smallest value available.
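In case the Try Haxe link goes stale: the approach described above, sketched in plain JavaScript (not the exact snippet — I use a bare object via Object.create(null) to avoid prototype key collisions):

```javascript
// Deduplicate via a hash object: keys are the stringified values,
// and the stored value is just `true`, a cheap boolean marker.
function dedupHash(arr) {
  var seen = Object.create(null); // bare object: no inherited keys
  var result = [];
  for (var i = 0; i < arr.length; i++) {
    var key = String(arr[i]); // integers must become string keys
    if (!seen[key]) {
      seen[key] = true;
      result.push(arr[i]);
    }
  }
  return result;
}

console.log(dedupHash([5, 5, 7, 5, 9])); // [ 5, 7, 9 ]
```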

Good luck :slight_smile:

There may be another way.

If the code that originally builds or acquires the array is part of the code you’re talking about here:
Modify that building / acquisition code to remove duplicates as early as possible.
Also, if any sorting is needed downstream, I strongly suggest you sort as part of building / acquisition too (the performance difference is amazing).
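A rough JavaScript sketch of that idea — insertUnique is a made-up name — keeping the array sorted and duplicate-free as each item arrives, rather than cleaning up afterwards:

```javascript
// Binary-search the insert position; skip the insert if the value
// is already there. The array stays sorted and duplicate-free.
function insertUnique(sorted, v) {
  var lo = 0, hi = sorted.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (sorted[mid] < v) lo = mid + 1; else hi = mid;
  }
  if (sorted[lo] !== v) sorted.splice(lo, 0, v);
  return sorted;
}

var built = [];
[4, 1, 4, 3].forEach(function (v) { insertUnique(built, v); });
console.log(built); // [ 1, 3, 4 ]
```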

This is similar to what I discovered about Sorting and wrote up:

When Sorting Needs To Be Faster - CodeProject

Have Fun!

Categorically speaking, instead of a true “array” you probably want to use – or build – some kind of “container class” that is based either on a Tree or a hash-table, according to your needs. (Treat both of them as a “black box.”)

A “Tree” based container will naturally provide for ordering; a “Hash” based container will not. Performance can also vary on insertions and deletions since a Tree will attempt to “re-balance” itself to maintain steady performance.

The “details of how this works, from the point-of-view of your application,” should probably be concealed within another Class of your own making, so that any code which is dependent upon your particular design-choice occurs in only one place and none of the rest of the system has to care exactly how you chose to do it.
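A rough JavaScript sketch of that concealment — UniqueBag is a made-up name, and the backing store here happens to be a hash, but it could just as well be a tree without any caller noticing:

```javascript
// Container that hides its backing store; callers only see
// add / has / values, so the hash-vs-tree choice lives in one place.
function UniqueBag() {
  this._store = Object.create(null); // implementation detail
}
UniqueBag.prototype.add = function (v) { this._store[String(v)] = v; };
UniqueBag.prototype.has = function (v) { return String(v) in this._store; };
UniqueBag.prototype.values = function () {
  var out = [];
  for (var k in this._store) out.push(this._store[k]);
  return out;
};

var bag = new UniqueBag();
[1, 1, 2].forEach(function (v) { bag.add(v); });
console.log(bag.values().length); // 2
```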

P.S.: “What if you really do have ‘an array,’ and you really do need to efficiently remove duplicates, and you promise not to have to do it very often?” How about borrowing an old COBOL / punched-cards trick … sorting.

You can apply an efficient sorting algorithm – such as HeapSort – even to a very(!) large amount of data, very efficiently, and obtain two immediate benefits when you then pass through the sorted file sequentially:

(1) Any duplicate values will be physically adjacent: you can find them by comparing “this record” to the one immediately ahead of it or behind it.

(2) If any gap occurs in the sorted sequence, you know positively that no record occurs anywhere in the file that would fall into that gap.
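Both points sketched in JavaScript, assuming the records are sorted integer keys — one sequential pass surfaces the duplicates (equal neighbours) and the gaps (jumps of more than one):

```javascript
// One pass over a sorted integer sequence: equal neighbours are
// duplicates; a jump greater than 1 is a gap nothing can fall into.
function scanSorted(keys) {
  var dups = [], gaps = [];
  for (var i = 1; i < keys.length; i++) {
    if (keys[i] === keys[i - 1]) {
      dups.push(keys[i]);
    } else if (keys[i] > keys[i - 1] + 1) {
      gaps.push([keys[i - 1] + 1, keys[i] - 1]);
    }
  }
  return { dups: dups, gaps: gaps };
}

console.log(scanSorted([1, 2, 2, 5])); // { dups: [ 2 ], gaps: [ [ 3, 4 ] ] }
```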

Of course this was absolutely necessary when “all we had to work with was cards and tapes.” :slight_smile:

(koff, koff …)

Even today(!) it is sometimes significantly more efficient to sort a data-set and then to apply this sort of logic to it, than to use “indexed” processes. Furthermore, when doing “bulk” inserts into any sort of indexed structure – including a database table – significant benefits can be obtained by sorting the incoming record-stream first.