Syntax highlighter

Recently I became interested in syntax highlighting, and found that many examples rely on regular expressions. Unfortunately, a Regex-based approach can be slow and hard to maintain.

Instead, a Python syntax highlighter written in C# can be built as a simple tokenizer that looks up each word in an array (or Dictionary) of keywords. Methods on the char type, like char.IsDigit, can classify the other characters.
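
As a minimal sketch of the Dictionary variant, the following maps each keyword to a color code, so one lookup both confirms the keyword and selects its color. The names KeywordTable, GetColorOrEmpty and ReadWord, and the particular color assignments, are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

static class KeywordTable
{
    // Hypothetical mapping from keyword to ANSI color code (assumed values).
    static readonly Dictionary<string, string> Colors = new Dictionary<string, string>
    {
        { "if", "\u001b[34m" },    // Blue
        { "else", "\u001b[34m" },  // Blue
        { "class", "\u001b[35m" }, // Magenta
        { "def", "\u001b[35m" },   // Magenta
    };

    // Returns the color code for a keyword, or an empty string for any other word.
    public static string GetColorOrEmpty(string word)
    {
        return Colors.TryGetValue(word, out var color) ? color : "";
    }

    // Reads consecutive letters beginning at index start and returns them as one word.
    public static string ReadWord(string text, int start)
    {
        int end = start;
        while (end < text.Length && char.IsLetter(text[end]))
        {
            end++;
        }
        return text.Substring(start, end - start);
    }
}
```

Here KeywordTable.ReadWord("def f():", 0) yields "def", which GetColorOrEmpty maps to a color code, while a non-keyword such as "print" yields an empty string.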

Example

Here is a class called PythonSyntaxHighlighter that includes a static array containing keywords. It emits ANSI escape codes, which a terminal renders as colors.

Info In HighlightPythonSyntax we iterate over every character in the Python source code.

Next We test the character and the previous character. If the current character is a letter that follows whitespace, for example, we read the rest of the word and check whether it is a keyword.

And If we successfully match a keyword, we call Append() on our StringBuilder with the color codes necessary to render the keyword as blue.

using System;
using System.IO;
using System.Linq;
using System.Text;

class PythonSyntaxHighlighter
{
    // Support any number of keywords.
    static string[] Keywords = ["if", "else", "class", "def"];

    // Use colors based on type of data.
    const string _keywordStart = "\u001b[34m"; // Blue
    const string _stringStart = "\u001b[32m"; // Green
    const string _commentStart = "\u001b[33m"; // Yellow
    const string _numberStart = "\u001b[31m"; // Red
    const string _reset = "\u001b[0m";

    static string HighlightPythonSyntax(string pythonCode)
    {
        // Build up the formatted code.
        var builder = new StringBuilder(pythonCode.Length * 2);
        var temp = new StringBuilder();
        for (int i = 0; i < pythonCode.Length; i++) 
        {
            var previousByte = ' ';
            if (i >= 1)
            {
                previousByte = pythonCode[i - 1];
            }
            var byteHere = pythonCode[i];
            if (char.IsWhiteSpace(previousByte) && char.IsLetter(byteHere))
            {
                // Handle keywords.
                temp.Clear();
                for ( ; i < pythonCode.Length; i++)
                {
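                    // Digits and underscores are part of an identifier, so "if_x" is not "if".
                    if (!char.IsLetterOrDigit(pythonCode[i]) && pythonCode[i] != '_')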
                    {
                        break;
                    }
                    temp.Append(pythonCode[i]);
                }
                var keywordHere = temp.ToString();
                if (Keywords.Contains(keywordHere))
                {
                    builder.Append(_keywordStart);
                    builder.Append(keywordHere);
                    builder.Append(_reset);
                }
                else
                {
                    builder.Append(keywordHere);
                }
            }
            else if (byteHere == '"')
            {
                // Handle string literals: collect everything through the closing quote.
                temp.Clear();
                temp.Append('"');
                i += 1;
                for ( ; i < pythonCode.Length; i++)
                {
                    temp.Append(pythonCode[i]);
                    if (pythonCode[i] == '"')
                    {
                        break;
                    }
                }
                builder.Append(_stringStart);
                builder.Append(temp);
                builder.Append(_reset);
                // The closing quote is already colored, so skip the final append.
                continue;
            }
            else if (byteHere == '#' && char.IsWhiteSpace(previousByte))
            {
                // Handle comments.
                temp.Clear();
                for ( ; i < pythonCode.Length; i++)
                {
                    if (pythonCode[i] == '\n')
                    {
                        break;
                    }
                    temp.Append(pythonCode[i]);
                }
                builder.Append(_commentStart);
                builder.Append(temp);
                builder.Append(_reset);
            }
            else if (char.IsDigit(byteHere) && !char.IsLetter(previousByte) && previousByte != '_')
            {
                // Handle numbers.
                temp.Clear();
                for ( ; i < pythonCode.Length; i++)
                {
                    if (!char.IsDigit(pythonCode[i]) && pythonCode[i] != '.')
                    {
                        break;
                    }
                    temp.Append(pythonCode[i]);
                }
                builder.Append(_numberStart);
                builder.Append(temp);
                builder.Append(_reset);
            }
            // Append the character that ended the token, unless a token ran to the end of the input.
            if (i < pythonCode.Length)
            {
                builder.Append(pythonCode[i]);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        string pythonCode = File.ReadAllText("program.py");
        string highlightedCode = HighlightPythonSyntax(pythonCode);
        Console.WriteLine(highlightedCode);
    }
}

Input file (program.py)

def my_function(x):
    if x > 2.5:
        print("x is greater than 2.5")
    else:
        print("x is not greater than 2.5")

    # This is a comment
    y = 10 + 5

    class MyClass:
        def __init__(self, name):
            self.name = name

Notes

For string literals, numbers, and comments, we use the same logic as for keywords. We determine how long the token is, and then emit it with surrounding color codes.
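
That shared pattern can be sketched as one helper that measures a token and another that wraps it in color codes. TokenHelper, ScanWhile and Colorize are hypothetical names, not part of the class above; the reset code is the same ANSI value used in the example.

```csharp
using System;

static class TokenHelper
{
    const string Reset = "\u001b[0m";

    // Starting at index start, returns the index of the first character
    // that fails the predicate (or text.Length if none fails).
    public static int ScanWhile(string text, int start, Func<char, bool> predicate)
    {
        int end = start;
        while (end < text.Length && predicate(text[end]))
        {
            end++;
        }
        return end;
    }

    // Surrounds the token with a color code and the reset code.
    public static string Colorize(string token, string colorCode)
    {
        return colorCode + token + Reset;
    }
}
```

For a number beginning at index start, TokenHelper.ScanWhile(code, start, c => char.IsDigit(c) || c == '.') gives the end of the token, and TokenHelper.Colorize(code.Substring(start, end - start), "\u001b[31m") renders it in red.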

Important We avoid Regex operations entirely. Because the text is scanned in a single pass, our method should run faster than one that applies many Regex tests to the same text.

While it is possible to manipulate text with Regex calls, the result can end up slow and hard to maintain. A simple loop that tests each char as it proceeds is a better long-term choice.
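
To compare the two approaches on your own input, a sketch like the following counts whole-word occurrences of a keyword both ways. KeywordCounter and its methods are hypothetical names, and the two counts are only expected to agree on plain ASCII text, since Regex word boundaries follow Unicode rules.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeywordCounter
{
    // Counts whole-word occurrences of the keyword using a Regex word-boundary pattern.
    public static int CountWithRegex(string text, string keyword)
    {
        return Regex.Matches(text, @"\b" + Regex.Escape(keyword) + @"\b").Count;
    }

    // Counts the same occurrences with a single pass over the characters.
    public static int CountWithLoop(string text, string keyword)
    {
        int count = 0;
        int i = 0;
        while (i < text.Length)
        {
            if (!IsWordChar(text[i]))
            {
                i++;
                continue;
            }
            // Measure the word that starts here.
            int start = i;
            while (i < text.Length && IsWordChar(text[i]))
            {
                i++;
            }
            // Compare the word with the keyword without allocating a substring.
            if (i - start == keyword.Length &&
                string.CompareOrdinal(text, start, keyword, 0, keyword.Length) == 0)
            {
                count++;
            }
        }
        return count;
    }

    // Letters, digits and underscores count as part of a word.
    static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }
}
```

Wrapping each call in a System.Diagnostics.Stopwatch and repeating it many times gives a rough timing comparison; no measurements are claimed here.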