Haskell is a purely functional language and efficiency is quite often a major concern. A well designed Haskell program performs on par with Java, but not as fast as C.
The nub function (defined in Data.List) eliminates duplicate values from a list. It is an oft-used function for Haskell programmers. However it has a worst case complexity of O(n²), which means it is almost useless for very large lists. This blog post presents alternative definitions for reducing complexity under different circumstances.
First, let’s look at (one of) the standard definition of the nub function:
nub :: (Eq a) => [a] -> [a] nub [] = [] nub (x:xs) = x : nub (filter (\y -> x /= y) xs)
This definition compares each element to the (rest of) the list and eliminates the duplicates. The complexity is O(n²), or to be more precise ½O(n²).
The simplest way to reduce the worst case complexity from O(n²) to O(n . log n) is to convert it to a set and then back to a list.
import qualified Data.Set as S nub' :: (Ord a) -> [a] -> [a] nub' = Set.toList . Set.fromList
By definition a set contains only unique values, so this will obviously eliminate all duplicates from the list. However this requires the elements of a list to be Orderable (members of the Ord class), as Eq class is not sufficient.
But the above definition has a side effect: it does not produce the same result as the original definition. Lets compare them:
Prelude> let list = [1,5,3,9,3,9,7,10,1,6,5] Prelude> nub list [1,5,3,9,7,10,6] Prelude> nub' list [1,3,5,6,7,9,10]
So the new definition not only removes the duplicates, it sorts the remaining unique elements of the list. Wow!
If that is what you need, or if you don’t really care about the order, this new method should be perfect. However, very often the order matters, and that is what the nub function should do: preserve the original order of the elements.
Since we know that a set keeps only unique elements, we can use it to define an efficient nub that also preserves the order of the list elements:
import qualified Data.Set as Set nub2 :: (Ord a) => [a] -> [a] nub2 = func Set.empty where func _ [] = [] func set (x:xs) | Set.member x set = func set xs | otherwise = x : func (Set.insert x set) xs
This works perfectly! Now consider a situation where you want to remove duplicates by comparing not the actual items, but some other property of the items. This is not technically the same as nubBy although can be achieved using nubBy. But we want a more efficient version.
An example is a list of pairs and you want to get pairs where the first element is unique. The above definition can easily be modified to do this:
import qualified Data.Set as Set nubWith :: (Ord b) => (a -> b) -> [a] -> [a] nubWith = func Set.empty where func _ _ [] = [] func set f (x:xs) | Set.member x' set = func set f xs | otherwise = x : func (Set.insert x' set) f xs where x' = f x
Another interesting problem is to find unique items in a list, i.e. items that appear only once are returned and items that appear more than once are removed altogether. For example:
Prelude> let list = [1,5,3,9,3,9,7,10,1,6,5] Prelude> nub list [7,10,6]
This is very similar to nub and can be solved in O(n log(n)) worst case complexity (requiring the elements to be members of Ord class).
unique :: (Ord a) => [a] -> [a] unique x = concat . filter (\a -> length a == 1) . group . sort $ x
This definition uses sort and group functions to find and remove any values that appear more than once. The filter function takes items that only appear once. However, the length function is slower for longer lists. To check whether a list contains exactly one item, the filter line in the above definition can be replaced with:
filter (\a -> not (null a) && null (tail a)) .
Since we are using sort here, the result will not preserve the original order of elements in the list (items will be sorted). You can practice Haskell by defining this function such that it preserves the original order.
Do you want to test your Haskell skills? Then write an efficient definition of the nubBy function which should reduce the complexity to O(n log n).