
You want to remove duplicate English words from a string. Natural language is repetitive, and duplicate words are often not useful. Removing them can simplify your algorithm or improve performance.
This C# program removes duplicate words from a string. It uses Dictionary.
Required output for duplicate words
Input: Do or do not there is no try
Output: Do or not there is no try
Note: [The second 'do' was removed.]
First, we want to use a data structure such as Dictionary, which provides constant-time lookups for keys. We will be processing words in a loop, and we need to check each word against all words already encountered. Using two lists would turn each check into a linear scan, so the whole method would take quadratic time, which can make your program too slow to be useful on large inputs.
Example program that removes duplicate words [C#]
using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static void Main()
    {
        string s = "Do or do not there is no try.";
        Console.WriteLine(s);
        Console.WriteLine(RemoveDuplicateWords(s));

        s = "Dot Net Perls is a web site about the Dot Net Framework.";
        Console.WriteLine(s);
        Console.WriteLine(RemoveDuplicateWords(s));
    }

    static public string RemoveDuplicateWords(string v)
    {
        // 1
        // Keep track of words found in this Dictionary.
        var d = new Dictionary<string, bool>();

        // 2
        // Build up string into this StringBuilder.
        StringBuilder b = new StringBuilder();

        // 3
        // Split the input and handle spaces and punctuation.
        string[] a = v.Split(new char[] { ' ', ',', ';', '.' },
            StringSplitOptions.RemoveEmptyEntries);

        // 4
        // Loop over each word
        foreach (string current in a)
        {
            // 5
            // Lowercase each word
            string lower = current.ToLower();

            // 6
            // If we haven't already encountered the word,
            // append it to the result.
            if (!d.ContainsKey(lower))
            {
                b.Append(current).Append(' ');
                d.Add(lower, true);
            }
        }

        // 7
        // Return the duplicate words removed
        return b.ToString().Trim();
    }
}
Output
Do or do not there is no try.
Do or not there is no try
Dot Net Perls is a web site about the Dot Net Framework.
Dot Net Perls is a web site about the Framework
Overview. The listing above contains the body of RemoveDuplicateWords, as well as the console program source that verifies its correctness. You can paste this code into a new C# program. It requires .NET 3.5.
Method contents. The method calls Split with a char array of delimiters (space, comma, semicolon and period), so punctuation is handled properly. Otherwise, punctuation would be considered part of the word. For performance, it builds up the result string with StringBuilder appends.
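To make the effect of the delimiter array concrete, here is a small sketch that splits the same sentence two ways. The class and method names (SplitDemo, SplitOnSpaces, SplitOnSpacesAndPunctuation) and the sample sentence with a comma are made up for this illustration; only the Split call with the four delimiters comes from the program above.

```csharp
using System;

class SplitDemo
{
    // Splits on spaces only, so punctuation stays attached to the words.
    public static string[] SplitOnSpaces(string text)
    {
        return text.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    // Splits on spaces and the same punctuation used in RemoveDuplicateWords.
    public static string[] SplitOnSpacesAndPunctuation(string text)
    {
        return text.Split(new char[] { ' ', ',', ';', '.' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string text = "Do or do not, there is no try.";

        Console.WriteLine(string.Join("|", SplitOnSpaces(text)));
        // Do|or|do|not,|there|is|no|try.

        Console.WriteLine(string.Join("|", SplitOnSpacesAndPunctuation(text)));
        // Do|or|do|not|there|is|no|try
    }
}
```

With spaces as the only delimiter, "not," and "try." would be stored as distinct words and would never match "not" or "try". RemoveEmptyEntries drops the empty strings that appear where two delimiters are adjacent, such as the comma followed by a space.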
Dictionary usage. We use Dictionary to store the words already encountered. The var keyword lets the compiler infer the Dictionary type from the right-hand side of the declaration. As noted above, the Dictionary improves the asymptotic efficiency.
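Since the bool values in the Dictionary are never read, a HashSet is one possible alternative. The sketch below is a variant, not the method from this article: it assumes a HashSet<string> (also available from .NET 3.5) with StringComparer.OrdinalIgnoreCase in place of the ToLower call, which can behave differently from culture-sensitive lowercasing for some non-English text. The class name HashSetVariant is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class HashSetVariant
{
    public static string RemoveDuplicateWords(string input)
    {
        // Case-insensitive set of the words seen so far.
        // This stands in for the Dictionary<string, bool> in the article.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new StringBuilder();

        string[] words = input.Split(new char[] { ' ', ',', ';', '.' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Add returns false when the word is already in the set,
            // so one call does both the lookup and the insert.
            if (seen.Add(word))
            {
                result.Append(word).Append(' ');
            }
        }
        return result.ToString().Trim();
    }

    static void Main()
    {
        Console.WriteLine(RemoveDuplicateWords("Do or do not there is no try."));
        // Do or not there is no try
    }
}
```

The lookup cost is the same kind of hash lookup as with Dictionary, so the choice is mostly about stating the intent (a set of seen words) more directly.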
I used this code, and also a variant that removes stopwords, to implement a full-text-search feature in a Windows Forms program. The optimization provided by this code can be seen in my FTS3 article.
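The stopword variant is not shown in this article, so the following is only a guess at what such a method could look like. The stopword list here is a tiny hypothetical one chosen for the example sentence, and the names StopwordVariant and RemoveDuplicatesAndStopwords are invented; a real full-text index would use a much longer list.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class StopwordVariant
{
    // Hypothetical stopword list, for illustration only.
    static readonly HashSet<string> Stopwords = new HashSet<string>(
        new string[] { "a", "about", "is", "the" },
        StringComparer.OrdinalIgnoreCase);

    public static string RemoveDuplicatesAndStopwords(string input)
    {
        var seen = new Dictionary<string, bool>();
        var result = new StringBuilder();

        string[] words = input.Split(new char[] { ' ', ',', ';', '.' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Skip stopwords before doing the duplicate check.
            if (Stopwords.Contains(word))
            {
                continue;
            }

            // Same duplicate check as RemoveDuplicateWords in the article.
            string lower = word.ToLower();
            if (!seen.ContainsKey(lower))
            {
                result.Append(word).Append(' ');
                seen.Add(lower, true);
            }
        }
        return result.ToString().Trim();
    }

    static void Main()
    {
        string s = "Dot Net Perls is a web site about the Dot Net Framework.";
        Console.WriteLine(RemoveDuplicatesAndStopwords(s));
        // Dot Net Perls web site Framework
    }
}
```

The stopword test is also a hash lookup, so adding it does not change the overall complexity of the method.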
Using with databases. This style of code in my full-text database helped reduce the file size of the resulting database by 50%. Searches were also much faster. My warning here is to avoid any methods that use nested loops. When I last researched this, one such method was available, and it would destroy performance with certain input.
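To show what the nested-loop warning means in practice, here is a sketch of the two approaches side by side. The list-based method is an assumption about what such a slow method might look like, not the specific one referred to above; the names NestedLoopComparison, DistinctWithList and DistinctWithDictionary are invented for the example.

```csharp
using System;
using System.Collections.Generic;

class NestedLoopComparison
{
    static readonly char[] Delimiters = { ' ', ',', ';', '.' };

    // Quadratic in the worst case: List.Contains scans every word
    // kept so far, which is a hidden inner loop.
    public static List<string> DistinctWithList(string input)
    {
        var kept = new List<string>();
        var keptLower = new List<string>();
        foreach (string word in input.Split(Delimiters,
            StringSplitOptions.RemoveEmptyEntries))
        {
            string lower = word.ToLower();
            if (!keptLower.Contains(lower))
            {
                kept.Add(word);
                keptLower.Add(lower);
            }
        }
        return kept;
    }

    // Roughly linear: each ContainsKey call is a hash lookup, not a scan.
    public static List<string> DistinctWithDictionary(string input)
    {
        var kept = new List<string>();
        var seen = new Dictionary<string, bool>();
        foreach (string word in input.Split(Delimiters,
            StringSplitOptions.RemoveEmptyEntries))
        {
            string lower = word.ToLower();
            if (!seen.ContainsKey(lower))
            {
                kept.Add(word);
                seen.Add(lower, true);
            }
        }
        return kept;
    }

    static void Main()
    {
        string text = "Do or do not there is no try.";
        Console.WriteLine(string.Join(" ", DistinctWithList(text).ToArray()));
        Console.WriteLine(string.Join(" ", DistinctWithDictionary(text).ToArray()));
        // Both lines print: Do or not there is no try
    }
}
```

Both methods return the same words. The difference only shows up on long inputs with many distinct words, where the list version repeats a longer and longer scan for every word.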
Here we saw how you can combine Dictionary with StringBuilder to develop a method that removes duplicate English words efficiently, using the C# programming language. The code does one lookup on each word as it encounters it.