A small compiler - Wilco Syntax Highlighting

Although I am quite busy these days with the projects at hand, I still try to find time for what I really want to do: digging into compilers, operating systems, and the CLR framework.

Yesterday I spent hours analyzing a popular syntax highlighting tool, Wilco SyntaxHighlighter, because it is, to some degree, a small compiler. :)

The highlighter first scans the source string (the code to be highlighted) for tokens: comments, strings, and keywords. Each token records its position and length in the source string and is associated with the style data for its token type. The parser then reads the source a second time to merge the tokens back in: string segments covered by a token are emitted with that token's style data applied, and all other segments are left as-is.
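To make the merge step concrete, here is a minimal sketch in Python (not the tool's own language). The Token fields, the merge function, and the choice of HTML span tags as the "style data" are my own assumptions for illustration, not Wilco's actual API.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Token:
    start: int   # offset of the token in the source string
    length: int  # number of characters the token covers
    style: str   # hypothetical style name, e.g. "comment"


def merge(source, tokens):
    """Walk the source once and apply style data to the token segments."""
    out = []
    pos = 0
    for tok in sorted(tokens, key=lambda t: t.start):
        # Text between tokens is copied through unchanged (only HTML-escaped).
        out.append(escape(source[pos:tok.start]))
        segment = source[tok.start:tok.start + tok.length]
        out.append('<span class="%s">%s</span>' % (tok.style, escape(segment)))
        pos = tok.start + tok.length
    out.append(escape(source[pos:]))  # trailing text after the last token
    return "".join(out)


if __name__ == "__main__":
    src = "int x; // hi"
    print(merge(src, [Token(0, 3, "keyword"), Token(7, 5, "comment")]))
```

The tokens only carry offsets, so the source string itself is never modified during scanning; all the styling happens in this second pass.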

The nicest feature of the parser is that it builds a scanner chain. For example, to parse C# code, these scanners are used in order: CommentBlockScanner (/* ... */) -> CommentLineScanner (//) -> StringBlockScanner (@"...") -> StringLineScanner ("...") -> WordScanner. When the current and following characters match CommentBlockScanner, that scanner keeps reading to the end of the comment block and takes the whole block as a Comment token. If the current character does not match CommentBlockScanner, it is offered to the next scanner in the chain, and so on. If no scanner matches, the character does not belong to any token and is ignored.
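The chain idea can be sketched roughly as follows, again in Python. The class names mirror the scanners listed above, but the scan method, the tuple layout of a token, and the keyword list are assumptions of mine; StringBlockScanner is left out to keep the sketch short.

```python
class Scanner:
    kind = ""

    def scan(self, src, pos):
        """Return the end offset of a token starting at pos, or None."""
        raise NotImplementedError


class CommentBlockScanner(Scanner):
    kind = "comment"

    def scan(self, src, pos):
        if not src.startswith("/*", pos):
            return None
        end = src.find("*/", pos + 2)
        return len(src) if end == -1 else end + 2


class CommentLineScanner(Scanner):
    kind = "comment"

    def scan(self, src, pos):
        if not src.startswith("//", pos):
            return None
        end = src.find("\n", pos)
        return len(src) if end == -1 else end


class StringLineScanner(Scanner):
    kind = "string"

    def scan(self, src, pos):
        if src[pos] != '"':
            return None
        i = pos + 1
        while i < len(src) and src[i] not in '"\n':
            i += 2 if src[i] == "\\" else 1  # skip escaped characters
        if i < len(src) and src[i] == '"':
            return i + 1                     # include the closing quote
        return min(i, len(src))              # unterminated string


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


class WordScanner(Scanner):
    kind = "keyword"

    def __init__(self, keywords):
        self.keywords = set(keywords)

    def scan(self, src, pos):
        # Only start at a word boundary, so "int" inside "print" is skipped.
        if not (src[pos].isalpha() or src[pos] == "_"):
            return None
        if pos > 0 and is_word_char(src[pos - 1]):
            return None
        end = pos
        while end < len(src) and is_word_char(src[end]):
            end += 1
        return end if src[pos:end] in self.keywords else None


def tokenize(src, chain):
    """Offer each position to the scanners in order; first match wins."""
    tokens, pos = [], 0
    while pos < len(src):
        for scanner in chain:
            end = scanner.scan(src, pos)
            if end is not None:
                tokens.append((pos, end - pos, scanner.kind))
                pos = end
                break
        else:
            pos += 1  # no scanner matched: not part of any token
    return tokens


# Hypothetical chain for C#-like code, with a tiny placeholder keyword list.
CSHARP_CHAIN = [CommentBlockScanner(), CommentLineScanner(),
                StringLineScanner(), WordScanner(["int", "class", "return"])]
```

The order of the chain matters: because the comment scanners come first, a keyword or a quote inside a comment is never seen by the later scanners.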

For other languages (e.g. Java, CSS), we can build and use a different scanner chain, but the basic parsing logic stays the same.
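As a rough illustration of swapping chains while keeping one loop, here is a compact Python sketch. It uses regular expressions in place of scanner classes, which is my own shortcut and not how the tool is built; the CHAINS table and the keyword lists are placeholders.

```python
import re

# Hypothetical per-language chains: ordered (kind, pattern) pairs.
CHAINS = {
    "java": [
        ("comment", re.compile(r"/\*.*?\*/", re.S)),
        ("comment", re.compile(r"//[^\n]*")),
        ("string", re.compile(r'"(?:\\.|[^"\\\n])*"')),
        ("keyword", re.compile(r"\b(?:class|int|return|void)\b")),
    ],
    "css": [
        ("comment", re.compile(r"/\*.*?\*/", re.S)),
        ("string", re.compile(r'"[^"\n]*"')),
    ],
}


def tokenize(src, language):
    """Same loop for every language; only the chain differs."""
    chain = CHAINS[language]
    tokens, pos = [], 0
    while pos < len(src):
        for kind, pattern in chain:
            m = pattern.match(src, pos)  # anchored at pos
            if m:
                tokens.append((pos, m.end() - pos, kind))
                pos = m.end()
                break
        else:
            pos += 1  # no pattern matched at this position
    return tokens
```

Running the same input through both chains shows the difference: "int x; // int" yields a keyword and a comment token under the "java" chain, and no tokens at all under the "css" chain.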

Compiler concepts are very useful when generating code dynamically. When the theory of the "Software Factory" becomes reality, code generation tools will be a fundamental part of the system.
