Programs written in C# often need to clean up data as they read it. A string may appear twice, which is not helpful for other methods. When a List has duplicate elements, we can call Distinct() to eliminate them, and then use ToList() to go from the resulting IEnumerable back to a List.
This program uses the System.Linq namespace. It invokes the Distinct() method to remove duplicates; this is the simplest way.

Step 1: A List with seven int elements is created. The List contains duplicate elements for the values 3 and 4.

Step 2: We call the Distinct() extension method on the List. Duplicates are removed. With ToList(), we convert back to the original List type.
using System;
using System.Collections.Generic;
using System.Linq;

// Step 1: create List with duplicate elements.
List<int> list = new List<int>();
list.Add(1);
list.Add(2);
list.Add(3);
list.Add(3);
list.Add(4);
list.Add(4);
list.Add(4);
foreach (int value in list)
{
    Console.WriteLine("Before: {0}", value);
}

// Step 2: get distinct elements and convert into a list again.
List<int> distinct = list.Distinct().ToList();
foreach (int value in distinct)
{
    Console.WriteLine("After: {0}", value);
}

Output:
Before: 1
Before: 2
Before: 3
Before: 3
Before: 4
Before: 4
Before: 4
After: 1
After: 2
After: 3
After: 4
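The HashSet constructor offers a one-step alternative to Distinct(): passing the List to new HashSet<int>() drops duplicates while copying. This is a minimal sketch, not part of the original example; note that a HashSet, unlike Distinct(), does not guarantee the original element order.

using System;
using System.Collections.Generic;
using System.Linq;

List<int> list = new List<int>() { 1, 2, 3, 3, 4, 4, 4 };

// The HashSet constructor removes duplicates while copying the elements.
// Unlike Distinct(), a HashSet does not guarantee the original element order.
List<int> distinct = new HashSet<int>(list).ToList();
Console.WriteLine(string.Join(",", distinct));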
Suppose we have a List of objects (class instances), and we want to find all objects that are unique by a single field (like the Key field). We can call Distinct() with a special IEqualityComparer that treats all objects with an equal field as duplicates of one another. This System.Linq extension method can be invoked on arrays, Lists, or any IEnumerable type.
using System;
using System.Collections.Generic;
using System.Linq;

class Item
{
    public int Key;
    public string Value;
}

class ItemEqualityComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        // Two items are equal if their keys are equal.
        return x.Key == y.Key;
    }

    public int GetHashCode(Item obj)
    {
        return obj.Key.GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        var list = new List<Item>();
        list.Add(new Item() { Key = 5, Value = "Bird" });
        list.Add(new Item() { Key = 5, Value = "Frog" });
        list.Add(new Item() { Key = 10, Value = "Bird" });
        list.Add(new Item() { Key = 10, Value = "Frog" });

        // Get distinct items using IEqualityComparer.
        // ... The Key field is used to see if two items are equal.
        var result = list.Distinct(new ItemEqualityComparer());
        foreach (var item in result)
        {
            Console.WriteLine($"DISTINCT ITEM = {item.Key} {item.Value}");
        }
    }
}

Output:
DISTINCT ITEM = 5 Bird
DISTINCT ITEM = 10 Bird
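On .NET 6 and later, the DistinctBy() extension in System.Linq gives the same result without writing a comparer class. A minimal sketch with the same Item data (assuming .NET 6+):

using System;
using System.Collections.Generic;
using System.Linq;

class Item
{
    public int Key;
    public string Value;
}

class Program
{
    static void Main()
    {
        var list = new List<Item>()
        {
            new Item() { Key = 5, Value = "Bird" },
            new Item() { Key = 5, Value = "Frog" },
            new Item() { Key = 10, Value = "Bird" },
            new Item() { Key = 10, Value = "Frog" }
        };

        // DistinctBy (.NET 6+) treats elements with equal keys as duplicates.
        var result = list.DistinctBy(item => item.Key);
        foreach (var item in result)
        {
            Console.WriteLine($"DISTINCT ITEM = {item.Key} {item.Value}");
        }
    }
}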
Sometimes, using nested loops is a good solution: it allocates less memory, as no HashSet is needed. Here we loop through the list, and for each element, loop through all previous elements to check for a duplicate.
using System;
using System.Collections.Generic;

class Program
{
    public static List<string> RemoveDuplicatesIterative(List<string> items)
    {
        List<string> result = new List<string>();
        for (int i = 0; i < items.Count; i++)
        {
            // Assume not duplicate.
            bool duplicate = false;
            for (int z = 0; z < i; z++)
            {
                if (items[z] == items[i])
                {
                    // This is a duplicate.
                    duplicate = true;
                    break;
                }
            }
            // If not duplicate, add to result.
            if (!duplicate)
            {
                result.Add(items[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        // Call method with this input.
        List<string> input = new List<string>() { "x", "x", "y", "y", "y", "z" };
        List<string> output = RemoveDuplicatesIterative(input);
        Console.WriteLine("Input: " + string.Join(",", input));
        Console.WriteLine("Output: " + string.Join(",", output));
    }
}

Output:
Input: x,x,y,y,y,z
Output: x,y,z
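If mutating the input list is acceptable, a similar nested loop can remove duplicates in place with RemoveAt, so no result list is allocated. This variant is a sketch, not from the original code; the inner loop runs backward so removals do not shift the indexes still to be visited.

using System;
using System.Collections.Generic;

class Program
{
    public static void RemoveDuplicatesInPlace(List<string> items)
    {
        for (int i = 0; i < items.Count; i++)
        {
            // Scan backward so RemoveAt does not shift unvisited indexes.
            for (int z = items.Count - 1; z > i; z--)
            {
                if (items[z] == items[i])
                {
                    items.RemoveAt(z);
                }
            }
        }
    }

    static void Main()
    {
        var items = new List<string>() { "x", "x", "y", "y", "y", "z" };
        RemoveDuplicatesInPlace(items);
        Console.WriteLine(string.Join(",", items));
    }
}

Output:
x,y,z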
HashSet method

This method provides predictable, good performance. It uses a HashSet to store all encountered elements, so at any point we know whether the current element is a duplicate.
using System;
using System.Collections.Generic;

class Program
{
    public static List<string> RemoveDuplicatesSet(List<string> items)
    {
        // Use HashSet to maintain table of duplicates encountered.
        var result = new List<string>();
        var set = new HashSet<string>();
        for (int i = 0; i < items.Count; i++)
        {
            // If not duplicate, add to result.
            if (!set.Contains(items[i]))
            {
                result.Add(items[i]);
                // Record as a future duplicate.
                set.Add(items[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        var input = new List<string>() { "a", "b", "a", "b", "b", "b", "c" };
        var output = RemoveDuplicatesSet(input);
        Console.WriteLine("Input: " + string.Join(",", input));
        Console.WriteLine("Output: " + string.Join(",", output));
    }
}

Output:
Input: a,b,a,b,b,b,c
Output: a,b,c
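A small refinement, not used in the listing above: HashSet<T>.Add returns false when the element is already present, so the separate Contains check can be dropped and one call both tests for and records each element.

using System;
using System.Collections.Generic;

class Program
{
    public static List<string> RemoveDuplicatesSet(List<string> items)
    {
        var result = new List<string>();
        var set = new HashSet<string>();
        foreach (string item in items)
        {
            // Add returns true only for elements not yet in the set.
            if (set.Add(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    static void Main()
    {
        var output = RemoveDuplicatesSet(new List<string>() { "a", "b", "a", "c" });
        Console.WriteLine(string.Join(",", output));
    }
}

Output:
a,b,c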
For performance testing, we develop a method that generates a list of strings with some duplicates. We specify the unique string count and the duplicate count.
using System;
using System.Collections.Generic;

class Program
{
    public static List<string> GetListWithDuplicates(int length, int duplicatesLength)
    {
        const string duplicateString = "bird";
        List<string> result = new List<string>();
        // Add all required strings.
        for (int i = 0; i < length; i++)
        {
            result.Add("cat" + i);
            // Add duplicate strings if required.
            if (duplicatesLength > 0)
            {
                result.Add(duplicateString);
                duplicatesLength--;
            }
        }
        // Add all remaining duplicates.
        for (int i = 0; i < duplicatesLength; i++)
        {
            result.Add(duplicateString);
        }
        // Return result.
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", GetListWithDuplicates(5, 2)));
        Console.WriteLine(string.Join(",", GetListWithDuplicates(1, 0)));
        Console.WriteLine(string.Join(",", GetListWithDuplicates(4, 4)));
    }
}

Output:
cat0,bird,cat1,bird,cat2,cat3,cat4
cat0
cat0,bird,cat1,bird,cat2,bird,cat3,bird
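To sanity-check the generator, we can count distinct elements with Distinct() from System.Linq. This fragment reuses the GetListWithDuplicates method above: a list built from 5 unique strings and 2 duplicates should hold 7 elements, 6 of them distinct (cat0 through cat4, plus bird).

// Reuses GetListWithDuplicates from the program above.
var data = GetListWithDuplicates(5, 2);
Console.WriteLine(data.Count);              // 7 total elements
Console.WriteLine(data.Distinct().Count()); // 6 distinct elements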
The HashSet method is the fastest way to remove duplicates from large lists (as the benchmark below shows). We can write the method generically to avoid creating a new version for each element type.
using System;
using System.Collections.Generic;

class Program
{
    public static List<T> RemoveDuplicatesSet<T>(List<T> items)
    {
        // Use HashSet to remember items seen.
        var result = new List<T>();
        var set = new HashSet<T>();
        for (int i = 0; i < items.Count; i++)
        {
            // Add if needed.
            if (!set.Contains(items[i]))
            {
                result.Add(items[i]);
                set.Add(items[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        var input = new List<string>() { "j", "x", "j", "x", "y" };
        var output = RemoveDuplicatesSet(input);
        Console.WriteLine("Input: " + string.Join(",", input));
        Console.WriteLine("Output: " + string.Join(",", output));
    }
}

Output:
Input: j,x,j,x,y
Output: j,x,y
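Because the method is generic, the same code handles any element type that has default equality, such as int. A short fragment reusing RemoveDuplicatesSet<T> from above:

// Type inference picks T = int here; no new method version is needed.
var ints = RemoveDuplicatesSet(new List<int>() { 3, 3, 4, 4, 4 });
Console.WriteLine(string.Join(",", ints)); // 3,4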
Here we test the performance of these methods on lists of size 3, 300 and 3000 elements. The duplicates are about one-third of the elements in the 2 larger lists.

Version 1 uses the Distinct() extension method. Version 2 uses a HashSet to remove duplicate elements: we build up a new List by checking the HashSet. Version 3 uses a nested for-loop to build up a List containing all the non-duplicated elements.

Result: On small lists, the nested for-loop (version 3) is fastest. On larger lists, the HashSet method (version 2) is fastest. So for short lists, prefer a for-loop; for long lists, prefer a HashSet. A branch could test the list length and choose the optimal dedupe strategy (see the sketch after the program below).
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        for (int test = 0; test < 3; test++)
        {
            // Get test data.
            var testData = GetTestData(test);
            var max = testData.Item4;
            var s1 = Stopwatch.StartNew();
            for (int i = 0; i < max; i++)
            {
                // Version 1: use Distinct.
                var unique = testData.Item2.Distinct().ToList();
                if (unique.Count != testData.Item3)
                {
                    return;
                }
            }
            s1.Stop();
            var s2 = Stopwatch.StartNew();
            for (int i = 0; i < max; i++)
            {
                // Version 2: use HashSet.
                var unique = RemoveDuplicatesSet(testData.Item2);
                if (unique.Count != testData.Item3)
                {
                    return;
                }
            }
            s2.Stop();
            var s3 = Stopwatch.StartNew();
            for (int i = 0; i < max; i++)
            {
                // Version 3: use nested for-loop.
                var unique = RemoveDuplicatesIterative(testData.Item2);
                if (unique.Count != testData.Item3)
                {
                    return;
                }
            }
            s3.Stop();
            // Write description.
            Console.WriteLine(testData.Item1);
            // Write timings.
            Console.WriteLine(s1.Elapsed.TotalMilliseconds + " ms");
            Console.WriteLine(s2.Elapsed.TotalMilliseconds + " ms");
            Console.WriteLine(s3.Elapsed.TotalMilliseconds + " ms");
        }
    }

    static Tuple<string, List<string>, int, int> GetTestData(int test)
    {
        // Tuple contains description string, list, the unique element count, and iterations for test.
        switch (test)
        {
            default:
            case 0:
                return new Tuple<string, List<string>, int, int>("3 ELEMENT LIST, 0 DUPLICATES",
                    GetListWithDuplicates(3, 0), 3, 100000);
            case 1:
                return new Tuple<string, List<string>, int, int>("300 ELEMENT LIST, 100 DUPLICATES",
                    GetListWithDuplicates(200, 100), 201, 1000);
            case 2:
                return new Tuple<string, List<string>, int, int>("3000 ELEMENT LIST, 1000 DUPLICATES",
                    GetListWithDuplicates(2000, 1000), 2001, 100);
        }
    }

    public static List<string> GetListWithDuplicates(int length, int duplicatesLength)
    {
        const string duplicateString = "bird";
        List<string> result = new List<string>();
        for (int i = 0; i < length; i++)
        {
            result.Add("cat" + i);
            if (duplicatesLength > 0)
            {
                result.Add(duplicateString);
                duplicatesLength--;
            }
        }
        for (int i = 0; i < duplicatesLength; i++)
        {
            result.Add(duplicateString);
        }
        return result;
    }

    public static List<string> RemoveDuplicatesSet(List<string> items)
    {
        var result = new List<string>();
        var set = new HashSet<string>();
        for (int i = 0; i < items.Count; i++)
        {
            if (!set.Contains(items[i]))
            {
                result.Add(items[i]);
                set.Add(items[i]);
            }
        }
        return result;
    }

    public static List<string> RemoveDuplicatesIterative(List<string> items)
    {
        List<string> result = new List<string>();
        for (int i = 0; i < items.Count; i++)
        {
            bool duplicate = false;
            for (int z = 0; z < i; z++)
            {
                if (items[z] == items[i])
                {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate)
            {
                result.Add(items[i]);
            }
        }
        return result;
    }
}

Output:
3 ELEMENT LIST, 0 DUPLICATES
37.9524 ms
19.9176 ms
8.0359 ms
300 ELEMENT LIST, 100 DUPLICATES
23.1117 ms
20.499 ms
188.2431 ms
3000 ELEMENT LIST, 1000 DUPLICATES
22.7601 ms
21.4638 ms
1765.6447 ms
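Such a branch might look like the following sketch, which reuses the RemoveDuplicatesIterative and RemoveDuplicatesSet methods from the benchmark program; the threshold of 10 elements is an assumption and should be tuned by measurement.

// Hypothetical dispatcher: picks a strategy based on list length.
// The threshold is an assumption; tune it with your own measurements.
public static List<string> RemoveDuplicates(List<string> items)
{
    const int threshold = 10;
    return items.Count < threshold
        ? RemoveDuplicatesIterative(items)
        : RemoveDuplicatesSet(items);
}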
LINQ extension methods are useful on List collections. Developers often favor the List over the array for its resizing behavior and its user-friendliness. For erasing duplicate elements in a List, making every element unique, HashSet and Distinct() are good options; a for-loop is faster on tiny lists.