Programs written in C# often need to clean up data as they read it. A string may appear twice, which is not helpful for other methods. When a List has duplicate elements, we can call Distinct() to eliminate them, and then use ToList() to go from the resulting IEnumerable back to a List.
This program uses the System.Linq namespace. It invokes the Distinct() method to remove duplicates; this is the simplest way.

Step 1: A List with seven int elements is created. The List contains duplicate elements for the values 3 and 4.

Step 2: We call the Distinct() extension method on the List. Duplicates are removed. With ToList(), we convert back to the original List type.
using System;
using System.Collections.Generic;
using System.Linq;

// Step 1: create List with duplicate elements.
List<int> list = new List<int>();
list.Add(1);
list.Add(2);
list.Add(3);
list.Add(3);
list.Add(4);
list.Add(4);
list.Add(4);
foreach (int value in list)
{
    Console.WriteLine("Before: {0}", value);
}

// Step 2: get distinct elements and convert into a list again.
List<int> distinct = list.Distinct().ToList();
foreach (int value in distinct)
{
    Console.WriteLine("After: {0}", value);
}

Output:
Before: 1
Before: 2
Before: 3
Before: 3
Before: 4
Before: 4
Before: 4
After: 1
After: 2
After: 3
After: 4
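The HashSet constructor offers a one-step alternative to Distinct(): passing the List to new HashSet<int>() drops duplicates while copying. This is a minimal sketch, not part of the original example; note that a HashSet, unlike Distinct(), does not guarantee the original element order.

using System;
using System.Collections.Generic;
using System.Linq;

List<int> list = new List<int>() { 1, 2, 3, 3, 4, 4, 4 };

// The HashSet constructor removes duplicates while copying the elements.
// Unlike Distinct(), a HashSet does not guarantee the original element order.
List<int> distinct = new HashSet<int>(list).ToList();
Console.WriteLine(string.Join(",", distinct));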
Suppose we have a List of objects (class instances), and we want to find all objects that are unique by a single field (like the Key field). We can call Distinct() with a special IEqualityComparer that treats all objects with an equal field as duplicates of one another. This System.Linq extension method can be invoked on arrays, Lists, or any IEnumerable type.
using System;
using System.Collections.Generic;
using System.Linq;

class Item
{
    public int Key;
    public string Value;
}

class ItemEqualityComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        // Two items are equal if their keys are equal.
        return x.Key == y.Key;
    }

    public int GetHashCode(Item obj)
    {
        return obj.Key.GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        var list = new List<Item>();
        list.Add(new Item() { Key = 5, Value = "Bird" });
        list.Add(new Item() { Key = 5, Value = "Frog" });
        list.Add(new Item() { Key = 10, Value = "Bird" });
        list.Add(new Item() { Key = 10, Value = "Frog" });

        // Get distinct items using IEqualityComparer.
        // ... The Key field is used to see if two items are equal.
        var result = list.Distinct(new ItemEqualityComparer());
        foreach (var item in result)
        {
            Console.WriteLine($"DISTINCT ITEM = {item.Key} {item.Value}");
        }
    }
}

Output:
DISTINCT ITEM = 5 Bird
DISTINCT ITEM = 10 Bird
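On .NET 6 and later, the DistinctBy() extension in System.Linq gives the same result without writing a comparer class. A minimal sketch with the same Item data (assuming .NET 6+):

using System;
using System.Collections.Generic;
using System.Linq;

class Item
{
    public int Key;
    public string Value;
}

class Program
{
    static void Main()
    {
        var list = new List<Item>()
        {
            new Item() { Key = 5, Value = "Bird" },
            new Item() { Key = 5, Value = "Frog" },
            new Item() { Key = 10, Value = "Bird" },
            new Item() { Key = 10, Value = "Frog" }
        };

        // DistinctBy (.NET 6+) treats elements with equal keys as duplicates.
        var result = list.DistinctBy(item => item.Key);
        foreach (var item in result)
        {
            Console.WriteLine($"DISTINCT ITEM = {item.Key} {item.Value}");
        }
    }
}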
Sometimes, using nested loops is a good solution: it allocates less memory, as no HashSet is needed. Here we loop through the list, and for each element, loop through all previous elements to check for a duplicate.
using System;
using System.Collections.Generic;

class Program
{
    public static List<string> RemoveDuplicatesIterative(List<string> items)
    {
        List<string> result = new List<string>();
        for (int i = 0; i < items.Count; i++)
        {
            // Assume not duplicate.
            bool duplicate = false;
            for (int z = 0; z < i; z++)
            {
                if (items[z] == items[i])
                {
                    // This is a duplicate.
                    duplicate = true;
                    break;
                }
            }
            // If not duplicate, add to result.
            if (!duplicate)
            {
                result.Add(items[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        // Call method with this input.
        List<string> input = new List<string>() { "x", "x", "y", "y", "y", "z" };
        List<string> output = RemoveDuplicatesIterative(input);
        Console.WriteLine("Input: " + string.Join(",", input));
        Console.WriteLine("Output: " + string.Join(",", output));
    }
}

Output:
Input: x,x,y,y,y,z
Output: x,y,z
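If mutating the input list is acceptable, a similar nested loop can remove duplicates in place with RemoveAt, so no result list is allocated. This variant is a sketch, not from the original code; the inner loop runs backward so removals do not shift the indexes still to be visited.

using System;
using System.Collections.Generic;

class Program
{
    public static void RemoveDuplicatesInPlace(List<string> items)
    {
        for (int i = 0; i < items.Count; i++)
        {
            // Scan backward so RemoveAt does not shift unvisited indexes.
            for (int z = items.Count - 1; z > i; z--)
            {
                if (items[z] == items[i])
                {
                    items.RemoveAt(z);
                }
            }
        }
    }

    static void Main()
    {
        var items = new List<string>() { "x", "x", "y", "y", "y", "z" };
        RemoveDuplicatesInPlace(items);
        Console.WriteLine(string.Join(",", items));
    }
}

Output:
x,y,z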
HashSet method

This method provides predictable, good performance. It uses a HashSet to store all encountered elements, so at any point we know whether the current element is a duplicate.
using System;
using System.Collections.Generic;

class Program
{
    public static List<string> RemoveDuplicatesSet(List<string> items)
    {
        // Use HashSet to maintain table of duplicates encountered.
        var result = new List<string>();
        var set = new HashSet<string>();
        for (int i = 0; i < items.Count; i++)
        {
            // If not duplicate, add to result.
            if (!set.Contains(items[i]))
            {
                result.Add(items[i]);
                // Record as a future duplicate.
                set.Add(items[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        var input = new List<string>() { "a", "b", "a", "b", "b", "b", "c" };
        var output = RemoveDuplicatesSet(input);
        Console.WriteLine("Input: " + string.Join(",", input));
        Console.WriteLine("Output: " + string.Join(",", output));
    }
}

Output:
Input: a,b,a,b,b,b,c
Output: a,b,c
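A small refinement, not used in the listing above: HashSet<T>.Add returns false when the element is already present, so the separate Contains check can be dropped and one call both tests for and records each element.

using System;
using System.Collections.Generic;

class Program
{
    public static List<string> RemoveDuplicatesSet(List<string> items)
    {
        var result = new List<string>();
        var set = new HashSet<string>();
        foreach (string item in items)
        {
            // Add returns true only for elements not yet in the set.
            if (set.Add(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    static void Main()
    {
        var output = RemoveDuplicatesSet(new List<string>() { "a", "b", "a", "c" });
        Console.WriteLine(string.Join(",", output));
    }
}

Output:
a,b,c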
For performance testing, we develop a method that generates a list of strings with some duplicates. We specify the unique string count and the duplicate count.
using System;
using System.Collections.Generic;

class Program
{
    public static List<string> GetListWithDuplicates(int length, int duplicatesLength)
    {
        const string duplicateString = "bird";
        List<string> result = new List<string>();
        // Add all required strings.
        for (int i = 0; i < length; i++)
        {
            result.Add("cat" + i);
            // Add duplicate strings if required.
            if (duplicatesLength > 0)
            {
                result.Add(duplicateString);
                duplicatesLength--;
            }
        }
        // Add all remaining duplicates.
        for (int i = 0; i < duplicatesLength; i++)
        {
            result.Add(duplicateString);
        }
        // Return result.
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", GetListWithDuplicates(5, 2)));
        Console.WriteLine(string.Join(",", GetListWithDuplicates(1, 0)));
        Console.WriteLine(string.Join(",", GetListWithDuplicates(4, 4)));
    }
}

Output:
cat0,bird,cat1,bird,cat2,cat3,cat4
cat0
cat0,bird,cat1,bird,cat2,bird,cat3,bird
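To sanity-check the generator, we can count distinct elements with Distinct() from System.Linq. This fragment reuses the GetListWithDuplicates method above: a list built from 5 unique strings and 2 duplicates should hold 7 elements, 6 of them distinct (cat0 through cat4, plus bird).

// Reuses GetListWithDuplicates from the program above.
var data = GetListWithDuplicates(5, 2);
Console.WriteLine(data.Count);              // 7 total elements
Console.WriteLine(data.Distinct().Count()); // 6 distinct elements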
The HashSet method is the fastest way to remove duplicates from large lists (as the benchmark below shows). We can write the method generically to avoid creating a new version for each element type.
using System;
using System.Collections.Generic;

class Program
{
    public static List<T> RemoveDuplicatesSet<T>(List<T> items)
    {
        // Use HashSet to remember items seen.
        var result = new List<T>();
        var set = new HashSet<T>();
        for (int i = 0; i < items.Count; i++)
        {
            // Add if needed.
            if (!set.Contains(items[i]))
            {
                result.Add(items[i]);
                set.Add(items[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        var input = new List<string>() { "j", "x", "j", "x", "y" };
        var output = RemoveDuplicatesSet(input);
        Console.WriteLine("Input: " + string.Join(",", input));
        Console.WriteLine("Output: " + string.Join(",", output));
    }
}

Output:
Input: j,x,j,x,y
Output: j,x,y
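Because the method is generic, the same code handles any element type that has default equality, such as int. A short fragment reusing RemoveDuplicatesSet<T> from above:

// Type inference picks T = int here; no new method version is needed.
var ints = RemoveDuplicatesSet(new List<int>() { 3, 3, 4, 4, 4 });
Console.WriteLine(string.Join(",", ints)); // 3,4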
Here we test the performance of these methods on lists of size 3, 300 and 3000 elements. The duplicates are about one-third of the elements in the 2 larger lists.

Version 1 uses the Distinct() extension method. Version 2 uses a HashSet to remove duplicate elements: we build up a new List by checking the HashSet. Version 3 uses a nested for-loop to build up a List containing all the non-duplicated elements.

Result: On small lists, the nested for-loop (version 3) is fastest. On larger lists, the HashSet method (version 2) is fastest. So for short lists, prefer a for-loop; for long lists, prefer a HashSet. A branch could test the list length and choose the optimal dedupe strategy (see the sketch after the program below).
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        for (int test = 0; test < 3; test++)
        {
            // Get test data.
            var testData = GetTestData(test);
            var max = testData.Item4;
            var s1 = Stopwatch.StartNew();
            for (int i = 0; i < max; i++)
            {
                // Version 1: use Distinct.
                var unique = testData.Item2.Distinct().ToList();
                if (unique.Count != testData.Item3)
                {
                    return;
                }
            }
            s1.Stop();
            var s2 = Stopwatch.StartNew();
            for (int i = 0; i < max; i++)
            {
                // Version 2: use HashSet.
                var unique = RemoveDuplicatesSet(testData.Item2);
                if (unique.Count != testData.Item3)
                {
                    return;
                }
            }
            s2.Stop();
            var s3 = Stopwatch.StartNew();
            for (int i = 0; i < max; i++)
            {
                // Version 3: use nested for-loop.
                var unique = RemoveDuplicatesIterative(testData.Item2);
                if (unique.Count != testData.Item3)
                {
                    return;
                }
            }
            s3.Stop();
            // Write description.
            Console.WriteLine(testData.Item1);
            // Write timings.
            Console.WriteLine(s1.Elapsed.TotalMilliseconds + " ms");
            Console.WriteLine(s2.Elapsed.TotalMilliseconds + " ms");
            Console.WriteLine(s3.Elapsed.TotalMilliseconds + " ms");
        }
    }

    static Tuple<string, List<string>, int, int> GetTestData(int test)
    {
        // Tuple contains description string, list, the unique element count, and iterations for test.
        switch (test)
        {
            default:
            case 0:
                return new Tuple<string, List<string>, int, int>("3 ELEMENT LIST, 0 DUPLICATES",
                    GetListWithDuplicates(3, 0), 3, 100000);
            case 1:
                return new Tuple<string, List<string>, int, int>("300 ELEMENT LIST, 100 DUPLICATES",
                    GetListWithDuplicates(200, 100), 201, 1000);
            case 2:
                return new Tuple<string, List<string>, int, int>("3000 ELEMENT LIST, 1000 DUPLICATES",
                    GetListWithDuplicates(2000, 1000), 2001, 100);
        }
    }

    public static List<string> GetListWithDuplicates(int length, int duplicatesLength)
    {
        const string duplicateString = "bird";
        List<string> result = new List<string>();
        for (int i = 0; i < length; i++)
        {
            result.Add("cat" + i);
            if (duplicatesLength > 0)
            {
                result.Add(duplicateString);
                duplicatesLength--;
            }
        }
        for (int i = 0; i < duplicatesLength; i++)
        {
            result.Add(duplicateString);
        }
        return result;
    }

    public static List<string> RemoveDuplicatesSet(List<string> items)
    {
        var result = new List<string>();
        var set = new HashSet<string>();
        for (int i = 0; i < items.Count; i++)
        {
            if (!set.Contains(items[i]))
            {
                result.Add(items[i]);
                set.Add(items[i]);
            }
        }
        return result;
    }

    public static List<string> RemoveDuplicatesIterative(List<string> items)
    {
        List<string> result = new List<string>();
        for (int i = 0; i < items.Count; i++)
        {
            bool duplicate = false;
            for (int z = 0; z < i; z++)
            {
                if (items[z] == items[i])
                {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate)
            {
                result.Add(items[i]);
            }
        }
        return result;
    }
}

Output:
3 ELEMENT LIST, 0 DUPLICATES
37.9524 ms
19.9176 ms
8.0359 ms
300 ELEMENT LIST, 100 DUPLICATES
23.1117 ms
20.499 ms
188.2431 ms
3000 ELEMENT LIST, 1000 DUPLICATES
22.7601 ms
21.4638 ms
1765.6447 ms
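Such a branch might look like the following sketch, which reuses the RemoveDuplicatesIterative and RemoveDuplicatesSet methods from the benchmark program; the threshold of 10 elements is an assumption and should be tuned by measurement.

// Hypothetical dispatcher: picks a strategy based on list length.
// The threshold is an assumption; tune it with your own measurements.
public static List<string> RemoveDuplicates(List<string> items)
{
    const int threshold = 10;
    return items.Count < threshold
        ? RemoveDuplicatesIterative(items)
        : RemoveDuplicatesSet(items);
}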
LINQ extension methods are useful on List collections. Developers often favor the List over the array for its resizing behavior and its user-friendliness. For erasing duplicate elements in a List, making every element unique, HashSet and Distinct() are good options; a for-loop is faster on tiny lists.