char Array Duplicate Words List Remove Duplicates String IndexOf

Duplicate chars

Duplicate chars occur in strings in C# programs, and we can remove them. The string may have 2 A chars—we want it to have only one.

Some implementations, such as those that allocate fewer strings, are faster. But this often depends on the data as well. Typically, using a char array is the best approach.

Example

The essential logic here is to track all the chars that have been encountered. Then we avoid adding further characters that have been encountered.

Here The IndexOf call in the RemoveDuplicateChars method is notable. If the specific character is not found, it returns -1.

Tip Only 1 string is needed in the RemoveDuplicateChars method—we can use IndexOf on the string's own characters.

using System;

class Program
{
    static void Main()
    {
        string result = RemoveDuplicateChars("AABBCCCD");
        Console.WriteLine(result);
    }

    static string RemoveDuplicateChars(string key)
    {
        // Remove duplicate chars using string concats.
        // ... Store encountered letters in this string.
        string result = "";

        foreach (char value in key)
        {
            // See if character is in the result already.
            if (result.IndexOf(value) == -1)
            {
                // Append to the result.
                result += value;
            }
        }
        return result;
    }
}
ABCD

Bitwise

Bitwise operators can be used to detect duplicate letters. The lowercase letters are converted to integers 0 to 25 by subtracting 97 from their ASCII representation.

Then As each letter is encountered, its bit is tested. If it was already encountered, its bit will be set to 1.

Next We record the letter in the mask. The next time it is encountered we will recognize it.

using System;

class Program
{
    static void Main()
    {
        Console.WriteLine("perls");
        ReportDuplicateCharacters("perls");

        Console.WriteLine("massive");
        ReportDuplicateCharacters("massive");

        Console.WriteLine("mississippi");
        ReportDuplicateCharacters("mississippi");
    }

    static void ReportDuplicateCharacters(string value)
    {
        int mask = 0;
        for (int i = 0; i < value.Length; i++)
        {
            int index = value[i] - 97;
            if ((mask & (1 << index)) != 0)
            {
                Console.WriteLine("Duplicate: {0}", value[i]);
            }
            else
            {
                mask |= (1 << index);
            }
        }
        // To zero a bit: mask &= ~(1 << index);
    }
}perls
massive
Duplicate: s
mississippi
Duplicate: s
Duplicate: i
Duplicate: s
Duplicate: s
Duplicate: i
Duplicate: p
Duplicate: i

Benchmark

We test performance on a string that is 16 characters long. Both methods will degrade with longer strings, particularly those with more duplicated characters.

Version 1 This is the original method. It removes duplicate chars, but does end up copying strings many times.

Version 2 This version of the method avoids lots of string copying—it uses character array buffers.

Result The version that uses char arrays (RemoveDuplicateCharsFast) is many times faster than the string version.

using System;
using System.Diagnostics;

class Program
{
    const int _max = 1000000;
    static void Main()
    {
        var s1 = Stopwatch.StartNew();
        // Version 1: use original method.
        for (int i = 0; i < _max; i++)
        {
            string value = RemoveDuplicateChars("AABBCCCD");
            if (value == null)
            {
                return;
            }
        }
        s1.Stop();
        var s2 = Stopwatch.StartNew();
        // Version 2: use optimized method.
        for (int i = 0; i < _max; i++)
        {
            string value = RemoveDuplicateCharsFast("AABBCCCD");
            if (value == null)
            {
                return;
            }
        }
        s2.Stop();
        Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
        Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
    }

    static string RemoveDuplicateChars(string key)
    {
        // See comments in first example.
        string result = "";

        foreach (char value in key)
        {
            if (result.IndexOf(value) == -1)
            {
                result += value;
            }
        }
        return result;
    }

    static string RemoveDuplicateCharsFast(string key)
    {
        // Remove duplicate chars using char arrays.
        int keyLength = key.Length;

        // Store encountered letters in this array.
        char[] table = new char[keyLength];
        int tableLength = 0;

        // Store the result in this array.
        char[] result = new char[keyLength];
        int resultLength = 0;

        foreach (char value in key)
        {
            // Scan the table to see if the letter is in it.
            bool exists = false;
            for (int i = 0; i < tableLength; i++)
            {
                if (value == table[i])
                {
                    exists = true;
                    break;
                }
            }
            // If the letter is new, add to the table and the result.
            if (!exists)
            {
                table[tableLength] = value;
                tableLength++;

                result[resultLength] = value;
                resultLength++;
            }
        }
        // Return the string at this range.
        return new string(result, 0, resultLength);
    }
}177.22 ns    RemoveDuplicateChars
78.34 ns     RemoveDuplicateCharsFast (char[] arrays)

For applications where performance is not critical, testing a string as we go along is adequate—and easy to understand. But the char array approach is often faster.