9.2. Lexical Analysis

Posted on 2024-04-15, updated on 2024-06-03. Category: The Elements of Computer System, Module 9.Syntax Analyze

9.2. Lexical Analysis

1. Tokenizing

In the previous unit, we observed that the syntax analyzer consists of two modules: a tokenizer and a parser.

The tokenizer is an application of a more general area of theory known as lexical analysis. Here's an example of tokenizing in action:

As far as lexical analysis is concerned, the input is nothing more than a stream of characters.
Tokenizing refers to grouping the raw stream of characters into a stream of meaningful tokens. It is a very simple yet important preliminary processing step applied to the file.
A token is a string of characters that has a meaning.
- Different programming languages define their tokens differently. For example, x++ makes a lot of sense in C, but not in Jack, because Jack has no ++ operator.
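To make the x++ point concrete, here is a minimal Python sketch that groups the same characters into tokens under two different token definitions. The two patterns (`C_LIKE` and `JACK_LIKE`) and the `tokenize` helper are illustrative assumptions, not the full lexical rules of either language.

```python
import re

# Hypothetical, simplified token definitions (assumptions for illustration):
# an integer, a word, optionally the two-character "++", or any single
# non-space character.
C_LIKE = re.compile(r"[0-9]+|[A-Za-z_][A-Za-z0-9_]*|\+\+|\S")
JACK_LIKE = re.compile(r"[0-9]+|[A-Za-z_][A-Za-z0-9_]*|\S")


def tokenize(source, pattern):
    """Group a raw character stream into a list of token strings."""
    return pattern.findall(source)


print(tokenize("x++;", C_LIKE))     # ['x', '++', ';']
print(tokenize("x++;", JACK_LIKE))  # ['x', '+', '+', ';']
```

With the C-like definition, ++ comes out as one token; with the Jack-like definition, the same characters are just two separate + symbols.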

Once we have this stream of tokens, we can hand it over to the compiler and completely forget the original input file.

2. Jack tokenizer

In Jack, we have 5 kinds of tokens: keywords, symbols, integer constants, string constants, and identifiers.
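As a rough sketch, the five token kinds of the Jack grammar (keyword, symbol, integerConstant, stringConstant, identifier) can be told apart with a few simple checks. The `classify` function below is a hypothetical helper written for illustration; it assumes the token has already been cut out of the input as a single string.

```python
import re

KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return",
}
SYMBOLS = set("{}()[].,;+-*/&|<>=~")


def classify(token):
    """Return the Jack token kind for an already-isolated token string."""
    if token in KEYWORDS:
        return "keyword"
    if token in SYMBOLS:
        return "symbol"
    # Integer constants are decimal numbers in the range 0..32767.
    if re.fullmatch(r"[0-9]+", token) and int(token) <= 32767:
        return "integerConstant"
    # String constants are double-quoted, with no quote or newline inside.
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        inner = token[1:-1]
        if '"' not in inner and "\n" not in inner:
            return "stringConstant"
    # Identifiers: letters, digits, underscore, not starting with a digit.
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", token):
        return "identifier"
    raise ValueError(f"not a valid Jack token: {token!r}")


for tok in ["if", "(", "x", "<", "153", '"negative"']:
    print(tok, "->", classify(tok))
```

Keywords are checked before identifiers because every keyword would otherwise also match the identifier pattern.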

Our tokenizer will provide these useful services:

Handle the compiler's input: it allows us to view the input as a stream of tokens.
Allow advancing the input: it lets us find out:
- Whether there are more tokens to process.
- What the next token is.
- The value and type of that token.
Supply the value and type of the current token.
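The services listed above could be packaged roughly as follows. This is a minimal Python sketch under several assumptions: the method names (`has_more_tokens`, `advance`, `token_type`, `token_value`) are snake_case stand-ins chosen here, the constructor takes the source text rather than a file name, the input contains no comments, and characters that match no token pattern are silently skipped. `has_more_tokens()` is assumed to report whether any token is still unread.

```python
import re

KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return",
}

# One named alternative per lexical category; words are split into
# keywords and identifiers later, in token_type().
TOKEN_RE = re.compile(r'''
    (?P<integerConstant>[0-9]+)
  | (?P<stringConstant>"[^"\n]*")
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<symbol>[{}()\[\].,;+\-*/&|<>=~])
''', re.VERBOSE)


class JackTokenizer:
    def __init__(self, source):
        # finditer skips anything that matches no alternative (whitespace,
        # and in this sketch also illegal characters).
        self._matches = list(TOKEN_RE.finditer(source))
        self._pos = -1  # no current token until advance() is called

    def has_more_tokens(self):
        return self._pos + 1 < len(self._matches)

    def advance(self):
        if not self.has_more_tokens():
            raise IndexError("no more tokens")
        self._pos += 1

    def _current(self):
        if self._pos < 0:
            raise RuntimeError("call advance() before reading a token")
        return self._matches[self._pos]

    def token_type(self):
        match = self._current()
        if match.lastgroup == "word":
            return "keyword" if match.group() in KEYWORDS else "identifier"
        return match.lastgroup

    def token_value(self):
        match = self._current()
        text = match.group()
        # Drop the surrounding quotes of a string constant.
        return text[1:-1] if match.lastgroup == "stringConstant" else text


tknzr = JackTokenizer('let s = "hi";')
while tknzr.has_more_tokens():
    tknzr.advance()
    print(tknzr.token_type(), tknzr.token_value())
```

Scanning the whole input up front keeps the sketch short; a tokenizer for large files might instead read and match the input lazily.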

Here's an example of a program that uses these services:

TokenizerTest is a program that constructs the tokenizer object. Here's what it does:

It goes through the input and, for every token, writes the token to the output file.
In the first line of the output, the token if is surrounded by two tags that describe the type of the token.

To implement this functionality, we can write the following pseudocode:

tknzr = new JackTokenizer("Prog.jack")
tknzr.advance();    // the first token becomes the current token
while tknzr.hasMoreTokens() {
	tokenClassification = current token classification
	print "<" + tokenClassification + ">"     // opening tag
	print the current token value
	print "</" + tokenClassification + ">"    // closing tag
	print newLine
	tknzr.advance();    // move on to the next token
}

We get the first token, and it becomes the current token.
We enter a loop that runs as long as there are more tokens to process.
For each token, we output one line in which the token's classification appears twice: once in the opening tag and once in the closing tag.
We:
1. Print the opening tag with the token classification, then
2. Print the value of the current token, that is, the token itself,
3. Finally, print the closing tag with the token classification.
We then print a newline and call advance() to get the next token from the tokenizer, until we have consumed all the tokens in the input file.
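The loop described above can be tried out with a short Python sketch. Everything named here is an assumption made for illustration: `ListTokenizer` is a hypothetical stand-in that serves pre-classified tokens instead of reading Prog.jack, and `tokens_to_xml` is a helper invented for this example. This variant assumes `has_more_tokens()` reports whether any token is still unread, so it calls `advance()` at the top of the loop rather than before it. It also escapes characters such as < and & in token values, a step the pseudocode does not show, so that the output stays well-formed.

```python
from xml.sax.saxutils import escape


class ListTokenizer:
    """Hypothetical stand-in tokenizer over (type, value) pairs."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = -1  # no current token yet

    def has_more_tokens(self):
        return self._pos + 1 < len(self._tokens)

    def advance(self):
        self._pos += 1

    def token_type(self):
        return self._tokens[self._pos][0]

    def token_value(self):
        return self._tokens[self._pos][1]


def tokens_to_xml(tknzr):
    """Return one '<type>value</type>' line per token."""
    lines = []
    while tknzr.has_more_tokens():
        tknzr.advance()
        kind = tknzr.token_type()
        # Assumption: escape &, < and > so the tags remain valid markup.
        value = escape(tknzr.token_value())
        lines.append(f"<{kind}>{value}</{kind}>")
    return lines


demo = ListTokenizer([
    ("keyword", "if"),
    ("symbol", "("),
    ("identifier", "x"),
    ("symbol", "<"),
    ("integerConstant", "0"),
    ("symbol", ")"),
])
for line in tokens_to_xml(demo):
    print(line)
```

The first printed line is `<keyword>if</keyword>`, and the < symbol comes out as `<symbol>&lt;</symbol>`.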