Recently I became interested in syntax highlighting, and discovered that many examples use regular expressions. Unfortunately the Regex approach tends to be slow and hard to maintain.
Instead, a Python syntax highlighter can be written in C# as a simple tokenizer that looks up each word in an array (or Dictionary) of keywords. Methods on the char type, like char.IsDigit, can also be used to classify characters.
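As a rough sketch of the lookup idea, the keyword test can be a Dictionary probe rather than an array scan. The names KeywordLookup, TokenKind, and Classify are hypothetical and used only for illustration; the sketch assumes the word has already been extracted from the source text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token categories, for illustration only.
enum TokenKind { Keyword, Number, Other }

static class KeywordLookup
{
    // Dictionary lookups are O(1) on average, unlike a linear array scan.
    static readonly Dictionary<string, TokenKind> Known = new Dictionary<string, TokenKind>
    {
        ["if"] = TokenKind.Keyword,
        ["else"] = TokenKind.Keyword,
        ["class"] = TokenKind.Keyword,
        ["def"] = TokenKind.Keyword,
    };

    // Classify one already-extracted word.
    public static TokenKind Classify(string word)
    {
        if (Known.TryGetValue(word, out var kind))
        {
            return kind;
        }
        // Checking only the first character with char.IsDigit is a simplification.
        if (word.Length > 0 && char.IsDigit(word[0]))
        {
            return TokenKind.Number;
        }
        return TokenKind.Other;
    }
}
```

With only four keywords an array scan is just as practical; the Dictionary starts to pay off as the keyword list grows.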
Here is a class called PythonSyntaxHighlighter that includes a static array containing keywords. It uses ANSI color codes that will render in the terminal.
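To make the color codes concrete, here is a minimal sketch of wrapping text in an ANSI escape sequence. The AnsiColor class and its Wrap method are hypothetical helpers, not part of the highlighter; the numeric codes are the standard SGR foreground values that the class itself uses (31 red, 32 green, 33 yellow, 34 blue, 0 reset). Whether the colors display depends on the terminal supporting ANSI sequences.

```csharp
using System;

static class AnsiColor
{
    // SGR foreground codes: 31 red, 32 green, 33 yellow, 34 blue.
    public const int Red = 31;
    public const int Green = 32;
    public const int Yellow = 33;
    public const int Blue = 34;

    // Wraps text in "ESC[<code>m" ... "ESC[0m" so only that text is colored.
    public static string Wrap(string text, int colorCode)
    {
        return "\u001b[" + colorCode + "m" + text + "\u001b[0m";
    }
}
```

For example, Console.WriteLine(AnsiColor.Wrap("def", AnsiColor.Blue) + " my_function(x):") would print the keyword in blue and leave the rest of the line in the default color.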
In HighlightPythonSyntax we iterate over every character in the Python source code. When a keyword is found, we call Append() on our StringBuilder with the color codes necessary to render the keyword as blue.

using System;
using System.IO;
using System.Linq;
using System.Text;

class PythonSyntaxHighlighter
{
    // Support any number of keywords.
    static string[] Keywords = ["if", "else", "class", "def"];

    // Use colors based on type of data.
    const string _keywordStart = "\u001b[34m"; // Blue
    const string _stringStart = "\u001b[32m"; // Green
    const string _commentStart = "\u001b[33m"; // Yellow
    const string _numberStart = "\u001b[31m"; // Red
    const string _reset = "\u001b[0m";

    static string HighlightPythonSyntax(string pythonCode)
    {
        // Build up the formatted code.
        var builder = new StringBuilder(pythonCode.Length * 2);
        var temp = new StringBuilder();
        for (int i = 0; i < pythonCode.Length; i++)
        {
            var previousByte = ' ';
            if (i >= 1)
            {
                previousByte = pythonCode[i - 1];
            }
            var byteHere = pythonCode[i];
            if (char.IsWhiteSpace(previousByte) && char.IsLetter(byteHere))
            {
                // Handle keywords: read the whole word, then look it up.
                temp.Clear();
                for (; i < pythonCode.Length; i++)
                {
                    if (!char.IsLetter(pythonCode[i]))
                    {
                        break;
                    }
                    temp.Append(pythonCode[i]);
                }
                var keywordHere = temp.ToString();
                if (Keywords.Contains(keywordHere))
                {
                    builder.Append(_keywordStart);
                    builder.Append(keywordHere);
                    builder.Append(_reset);
                }
                else
                {
                    builder.Append(keywordHere);
                }
            }
            else if (byteHere == '"')
            {
                // Handle string literals.
                temp.Clear();
                temp.Append('"');
                i += 1;
                for (; i < pythonCode.Length; i++)
                {
                    temp.Append(pythonCode[i]);
                    if (pythonCode[i] == '"')
                    {
                        // Include the closing quote in the colored span.
                        i += 1;
                        break;
                    }
                }
                builder.Append(_stringStart);
                builder.Append(temp);
                builder.Append(_reset);
            }
            else if (byteHere == '#' && char.IsWhiteSpace(previousByte))
            {
                // Handle comments: read up to the end of the line.
                temp.Clear();
                for (; i < pythonCode.Length; i++)
                {
                    if (pythonCode[i] == '\n')
                    {
                        break;
                    }
                    temp.Append(pythonCode[i]);
                }
                builder.Append(_commentStart);
                builder.Append(temp);
                builder.Append(_reset);
            }
            else if (char.IsDigit(byteHere))
            {
                // Handle numbers: digits and decimal points.
                temp.Clear();
                for (; i < pythonCode.Length; i++)
                {
                    if (!char.IsDigit(pythonCode[i]) && pythonCode[i] != '.')
                    {
                        break;
                    }
                    temp.Append(pythonCode[i]);
                }
                builder.Append(_numberStart);
                builder.Append(temp);
                builder.Append(_reset);
            }
            // Append the following byte, unless a token ran to the end of the input.
            if (i < pythonCode.Length)
            {
                builder.Append(pythonCode[i]);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        string pythonCode = File.ReadAllText("program.py");
        string highlightedCode = HighlightPythonSyntax(pythonCode);
        Console.WriteLine(highlightedCode);
    }
}

Sample Python code:

def my_function(x):
    if x > 2.5:
        print("x is greater than 2.5")
    else:
        print("x is not greater than 2.5")

# This is a comment
y = 10 + 5

class MyClass:
    def __init__(self, name):
        self.name = name

For string literals, numbers, and comments, we use logic similar to that for keywords. We determine how long our token is, and then render it with surrounding color codes.
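The "find the token length, then wrap it" step can be sketched as a pair of small helpers. TokenScanner, ScanWhile, and Colorize are hypothetical names introduced here for illustration; the highlighter class inlines this logic in its loops instead.

```csharp
using System;

static class TokenScanner
{
    // Returns the index one past the last character, starting at 'start',
    // for which 'belongs' returns true.
    public static int ScanWhile(string text, int start, Func<char, bool> belongs)
    {
        int end = start;
        while (end < text.Length && belongs(text[end]))
        {
            end++;
        }
        return end;
    }

    // Surrounds text[start..end) with a color code and the reset code.
    public static string Colorize(string text, int start, int end, string colorStart)
    {
        return colorStart + text.Substring(start, end - start) + "\u001b[0m";
    }
}
```

A number token, for instance, could be measured with ScanWhile(code, i, c => char.IsDigit(c) || c == '.') and then passed to Colorize with the red color code.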
The highlighter avoids Regex operations entirely, and overall our method should perform much faster than one that uses many Regex tests.
While it is possible to manipulate text with Regex calls, the result can end up slow and hard to maintain. A simple loop that tests each char as it proceeds is a better long-term choice.
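For contrast, here is a minimal sketch of what a Regex-based version might look like for keywords alone. The class name RegexKeywordHighlighter and its Highlight method are hypothetical. Because the pattern knows nothing about strings or comments, it also colors a keyword that appears inside a string literal, which is the kind of interaction that tends to need extra patterns or passes.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexKeywordHighlighter
{
    const string Blue = "\u001b[34m";
    const string Reset = "\u001b[0m";

    // One pattern for the same four keywords; \b matches a word boundary.
    static readonly Regex KeywordPattern = new Regex(@"\b(if|else|class|def)\b");

    // Colors every keyword match, regardless of where it appears.
    public static string Highlight(string pythonCode)
    {
        return KeywordPattern.Replace(pythonCode, m => Blue + m.Value + Reset);
    }
}
```

Calling Highlight on print("if") colors the word if even though it sits inside a string, whereas the character loop handles the string literal as a single token.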