@@ -17,94 +17,122 @@ So first, let's look at what the compiler does to your code. For now, we will
1717avoid mentioning how the compiler implements these steps except as needed;
1818we'll talk about that later.
1919
20- -  The compile process begins when a user writes a Rust source program in text
21-   and invokes the ` rustc `  compiler on it. The work that the compiler needs to
22-   perform is defined by command-line options. For example, it is possible to
23-   enable nightly features (` -Z `  flags), perform ` check ` -only builds, or emit
24-   LLVM-IR rather than executable machine code. The ` rustc `  executable call may
25-   be indirect through the use of ` cargo ` .
26- -  Command line argument parsing occurs in the [ ` rustc_driver ` ] . This crate
27-   defines the compile configuration that is requested by the user and passes it
28-   to the rest of the compilation process as a [ ` rustc_interface::Config ` ] .
29- -  The raw Rust source text is analyzed by a low-level lexer located in
30-   [ ` rustc_lexer ` ] . At this stage, the source text is turned into a stream of
31-   atomic source code units known as _ tokens_ .  The lexer supports the
32-   Unicode character encoding.
33- -  The token stream passes through a higher-level lexer located in
34-   [ ` rustc_parse ` ]  to prepare for the next stage of the compile process. The
35-   [ ` StringReader ` ]  struct is used at this stage to perform a set of validations
36-   and turn strings into interned symbols (_ interning_  is discussed later).
37-   [ String interning]  is a way of storing only one immutable
38-   copy of each distinct string value.
39- 
40- -  The lexer has a small interface and doesn't depend directly on the
41-   diagnostic infrastructure in ` rustc ` . Instead it provides diagnostics as plain
42-   data which are emitted in ` rustc_parse::lexer::mod `  as real diagnostics.
43- -  The lexer preserves full fidelity information for both IDEs and proc macros.
44- -  The parser [ translates the token stream from the lexer into an Abstract Syntax
45-   Tree (AST)] [ parser ] . It uses a recursive descent (top-down) approach to syntax
46-   analysis. The crate entry points for the parser are the
47-   [ ` Parser::parse_crate_mod() ` ] [ parse_crate_mod ]  and [ ` Parser::parse_mod() ` ] [ parse_mod ] 
48-   methods found in [ ` rustc_parse::parser::Parser ` ] . The external module parsing
49-   entry point is [ ` rustc_expand::module::parse_external_mod ` ] [ parse_external_mod ] .
50-   And the macro parser entry point is [ ` Parser::parse_nonterminal() ` ] [ parse_nonterminal ] .
51- -  Parsing is performed with a set of ` Parser `  utility methods including ` fn bump ` ,
52-   ` fn check ` , ` fn eat ` , ` fn expect ` , ` fn look_ahead ` .
53- -  Parsing is organized by the semantic construct that is being parsed. Separate
54-   ` parse_* `  methods can be found in [ ` rustc_parse `  ` parser ` ] [ rustc_parse_parser_dir ] 
55-   directory. The source file name follows the construct name. For example, the
56-   following files are found in the parser:
57-     -  ` expr.rs ` 
58-     -  ` pat.rs ` 
59-     -  ` ty.rs ` 
60-     -  ` stmt.rs ` 
61- -  This naming scheme is used across many compiler stages. You will find
62-   either a file or directory with the same name across the parsing, lowering,
63-   type checking, THIR lowering, and MIR building sources.
64- -  Macro expansion, AST validation, name resolution, and early linting takes place
65-   during this stage of the compile process.
66- -  The parser uses the standard ` DiagnosticBuilder `  API for error handling, but we
67-   try to recover, parsing a superset of Rust's grammar, while also emitting an error.
68- -  ` rustc_ast::ast::{Crate, Mod, Expr, Pat, ...} `  AST nodes are returned from the parser.
69- -  We then take the AST and [ convert it to High-Level Intermediate
70-   Representation (HIR)] [ hir ] . This is a compiler-friendly representation of the
71-   AST.  This involves a lot of desugaring of things like loops and ` async fn ` .
72- -  We use the HIR to do [ type inference]  (the process of automatic
73-   detection of the type of an expression), [ trait solving]  (the process
74-   of pairing up an impl with each reference to a trait), and [ type
75-   checking]  (the process of converting the types found in the HIR
76-   (` hir::Ty ` ), which represent the syntactic things that the user wrote,
77-   into the internal representation used by the compiler (` Ty<'tcx> ` ),
78-   and using that information to verify the type safety, correctness and
79-   coherence of the types used in the program).
80- -  The HIR is then [ lowered to Mid-Level Intermediate Representation (MIR)] [ mir ] .
81-   -  Along the way, we construct the THIR, which is an even more desugared HIR.
82-     THIR is used for pattern and exhaustiveness checking. It is also more
83-     convenient to convert into MIR than HIR is.
84- -  The MIR is used for [ borrow checking] .
85- -  We (want to) do [ many optimizations on the MIR] [ mir-opt ]  because it is still
86-   generic and that improves the code we generate later, improving compilation
87-   speed too.
88-   -  MIR is a higher level (and generic) representation, so it is easier to do
89-     some optimizations at MIR level than at LLVM-IR level. For example LLVM
90-     doesn't seem to be able to optimize the pattern the [ ` simplify_try ` ]  mir
91-     opt looks for.
92- -  Rust code is _ monomorphized_ , which means making copies of all the generic
93-   code with the type parameters replaced by concrete types. To do
94-   this, we need to collect a list of what concrete types to generate code for.
95-   This is called _ monomorphization collection_ .
96- -  We then begin what is vaguely called _ code generation_  or _ codegen_ .
97-   -  The [ code generation stage (codegen)] [ codegen ]  is when higher level
98-     representations of source are turned into an executable binary. ` rustc ` 
99-     uses LLVM for code generation. The first step is to convert the MIR
100-     to LLVM Intermediate Representation (LLVM IR). This is where the MIR
101-     is actually monomorphized, according to the list we created in the
102-     previous step.
103-   -  The LLVM IR is passed to LLVM, which does a lot more optimizations on it.
104-     It then emits machine code. It is basically assembly code with additional
105-     low-level types and annotations added. (e.g. an ELF object or wasm).
106-   -  The different libraries/binaries are linked together to produce the final
107-     binary.
20+ ### Invocation  
21+ 
22+ Compilation begins when a user writes a Rust source program in text
23+ and invokes the ` rustc `  compiler on it. The work that the compiler needs to
24+ perform is defined by command-line options. For example, it is possible to
25+ enable nightly features (` -Z `  flags), perform ` check ` -only builds, or emit
26+ LLVM-IR rather than executable machine code. The ` rustc `  executable call may
27+ be indirect through the use of ` cargo ` .
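
For example, a few representative invocations look like this (file names are placeholders, and nightly-only `-Z` flags require a nightly toolchain):

```console
$ rustc main.rs                          # ordinary build to an executable
$ cargo check                            # check-only build: run analyses, skip codegen
$ rustc --emit=llvm-ir main.rs           # emit LLVM-IR instead of machine code
$ rustc +nightly -Zunpretty=hir main.rs  # a nightly-only -Z flag, via the rustup proxy
```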
28+ 
29+ Command line argument parsing occurs in the [ ` rustc_driver ` ] . This crate
30+ defines the compile configuration that is requested by the user and passes it
31+ to the rest of the compilation process as a [ ` rustc_interface::Config ` ] .
32+ 
33+ ### Lexing and parsing  
34+ 
35+ The raw Rust source text is analyzed by a low-level * lexer*  located in
36+ [ ` rustc_lexer ` ] . At this stage, the source text is turned into a stream of
37+ atomic source code units known as _ tokens_ .  The lexer supports the
38+ Unicode character encoding.
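
As a rough illustration of "source text in, tokens out", the low-level lexer can be driven as an ordinary library. This sketch assumes the standalone `rustc_lexer` crate as published on crates.io; the exact `Token` fields and types can differ between versions:

```rust
// Sketch only: print the atomic tokens of a small source string.
// Assumes `rustc_lexer` as published on crates.io (`Token { kind, len }`);
// field types may differ in other versions.
fn main() {
    let src = "fn add(a: u32, b: u32) -> u32 { a + b }";
    for token in rustc_lexer::tokenize(src) {
        // A token is just a kind plus a length; it carries no string data,
        // which helps keep this layer small and reusable.
        println!("{:?} (len {})", token.kind, token.len);
    }
}
```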
39+ 
40+ The token stream passes through a higher-level lexer located in
41+ [ ` rustc_parse ` ]  to prepare for the next stage of the compile process. The
42+ [ ` StringReader ` ]  struct is used at this stage to perform a set of validations
43+ and turn strings into interned symbols (_ interning_  is discussed later).
44+ [ String interning]  is a way of storing only one immutable
45+ copy of each distinct string value.
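
To make the idea concrete, here is a toy interner; rustc's real interner lives in `rustc_span` and hands out `Symbol`s, but the principle is the same:

```rust
use std::collections::HashMap;

/// A toy string interner: each distinct string is stored exactly once and
/// referred to by a small integer id. Illustration only, not rustc's code.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id; // already stored; hand back the existing id
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.ids.insert(s.to_string(), id);
        id
    }

    fn get(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("foo");
    let b = interner.intern("foo");
    assert_eq!(a, b); // equal strings intern to the same id
    println!("{}", interner.get(a));
}
```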
46+ 
47+ The lexer has a small interface and doesn't depend directly on the
48+ diagnostic infrastructure in ` rustc ` . Instead it provides diagnostics as plain
49+ data which are emitted in ` rustc_parse::lexer `  as real diagnostics.
50+ The lexer preserves full fidelity information for both IDEs and proc macros.
51+ 
52+ The * parser*  [ translates the token stream from the lexer into an Abstract Syntax
53+ Tree (AST)] [ parser ] . It uses a recursive descent (top-down) approach to syntax
54+ analysis. The crate entry points for the parser are the
55+ [ ` Parser::parse_crate_mod() ` ] [ parse_crate_mod ]  and [ ` Parser::parse_mod() ` ] [ parse_mod ] 
56+ methods found in [ ` rustc_parse::parser::Parser ` ] . The external module parsing
57+ entry point is [ ` rustc_expand::module::parse_external_mod ` ] [ parse_external_mod ] .
58+ And the macro parser entry point is [ ` Parser::parse_nonterminal() ` ] [ parse_nonterminal ] .
59+ 
60+ Parsing is performed with a set of `Parser` utility methods including `bump`,
61+ `check`, `eat`, `expect`, and `look_ahead`.
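
To show the flavor of that style, here is a toy recursive-descent parser using `bump`/`check`/`eat`-style helpers. It parses a much simpler grammar than Rust and is not rustc's actual `Parser`:

```rust
// Toy recursive-descent parser for `+`-separated integers, written in the
// same helper style (`bump`, `check`, `eat`) as rustc's Parser. Sketch only.
struct Parser {
    tokens: Vec<String>,
    pos: usize,
}

impl Parser {
    /// Advance to the next token.
    fn bump(&mut self) {
        self.pos += 1;
    }

    /// Is the current token equal to `tok`?
    fn check(&self, tok: &str) -> bool {
        self.tokens.get(self.pos).map(|t| t == tok).unwrap_or(false)
    }

    /// If the current token is `tok`, consume it and report success.
    fn eat(&mut self, tok: &str) -> bool {
        if self.check(tok) {
            self.bump();
            true
        } else {
            false
        }
    }

    /// expr := number ("+" number)*
    fn parse_expr(&mut self) -> Result<i64, String> {
        let mut sum = self.parse_number()?;
        while self.eat("+") {
            sum += self.parse_number()?;
        }
        Ok(sum)
    }

    fn parse_number(&mut self) -> Result<i64, String> {
        let tok = self.tokens.get(self.pos).ok_or("unexpected end of input")?;
        let n = tok
            .parse()
            .map_err(|_| format!("expected a number, found `{tok}`"))?;
        self.bump();
        Ok(n)
    }
}

fn main() {
    let mut parser = Parser {
        tokens: ["1", "+", "2", "+", "3"].iter().map(|s| s.to_string()).collect(),
        pos: 0,
    };
    println!("{:?}", parser.parse_expr()); // Ok(6)
}
```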
62+ 
63+ Parsing is organized by semantic construct. Separate
64+ `parse_*` methods can be found in the [`rustc_parse::parser`][rustc_parse_parser_dir]
65+ directory. The source file name follows the construct name. For example, the
66+ following files are found in the parser:
67+ 
68+ -  ` expr.rs ` 
69+ -  ` pat.rs ` 
70+ -  ` ty.rs ` 
71+ -  ` stmt.rs ` 
72+ 
73+ This naming scheme is used across many compiler stages. You will find
74+ either a file or directory with the same name across the parsing, lowering,
75+ type checking, THIR lowering, and MIR building sources.
76+ 
77+ Macro expansion, AST validation, name resolution, and early linting also take place
78+ during this stage.
79+ 
80+ The parser uses the standard `DiagnosticBuilder` API for error handling, but we
81+ try to recover from errors by parsing a superset of Rust's grammar, while still emitting an error.
82+ The parser returns `rustc_ast::ast::{Crate, Mod, Expr, Pat, ...}` AST nodes.
83+ 
84+ ### HIR lowering  
85+ 
86+ We next take the AST and convert it to [ High-Level Intermediate
87+ Representation (HIR)] [ hir ] , a more compiler-friendly representation of the
88+ AST. This process is called "lowering". It involves a lot of desugaring of things
89+ like loops and ` async fn ` .
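
For instance, a `for` loop is desugared into a plain `loop` over an iterator. Roughly (simplified; the real lowering uses language items and compiler-internal constructs rather than these paths):

```rust
fn main() {
    // What the user writes:
    for x in 0..3 {
        println!("{x}");
    }

    // Roughly what lowering produces (simplified):
    {
        let mut iter = IntoIterator::into_iter(0..3);
        loop {
            match Iterator::next(&mut iter) {
                Some(x) => {
                    println!("{x}");
                }
                None => break,
            }
        }
    }
}
```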
90+ 
91+ We then use the HIR to do [ * type inference* ]  (the process of automatic
92+ detection of the type of an expression), [ * trait solving* ]  (the process
93+ of pairing up an impl with each reference to a trait), and [ * type
94+ checking* ] . Type checking is the process of converting the types found in the HIR
95+ ([ ` hir::Ty ` ] ), which represent what the user wrote,
96+ into the internal representation used by the compiler ([ ` Ty<'tcx> ` ] ).
97+ That information is used to verify the type safety, correctness, and
98+ coherence of the types used in the program.
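
A small example of what these analyses determine:

```rust
fn main() {
    // Type inference: the element type of `v` is never written down; it is
    // inferred to be `u32` from the `push` call below.
    let mut v = Vec::new();
    v.push(1u32);

    // Type checking verifies every use against the inferred or declared types.
    // Uncommenting the next line is rejected: expected `u32`, found `&str`.
    // v.push("not a number");

    println!("{:?}", v);
}
```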
99+ 
100+ ### MIR lowering  
101+ 
102+ The HIR is then [ lowered to Mid-level Intermediate Representation (MIR)] [ mir ] ,
103+ which is used for [ borrow checking] .
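
For example, borrow checking is what rejects programs like this one:

```rust
fn main() {
    let mut s = String::from("hello");
    let r = &s; // a shared borrow of `s` starts here

    // Uncommenting the next line fails borrow checking: `s` cannot be
    // mutated while the shared borrow `r` is still live.
    // s.push('!');

    println!("{r}"); // the shared borrow is used here, so it is live above
}
```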
104+ 
105+ Along the way, we also construct the THIR, which is an even more desugared HIR.
106+ THIR is used for pattern and exhaustiveness checking. It is also more
107+ convenient to convert into MIR than HIR is.
108+ 
109+ We do [many optimizations on the MIR][mir-opt] because it is still
110+ generic, which improves the code we generate later and also improves
111+ compilation speed.
112+ Because MIR is a higher-level (and generic) representation, it is easier to do
113+ some optimizations at the MIR level than at the LLVM-IR level. For example, LLVM
114+ does not seem to be able to optimize the pattern that the [`simplify_try`] MIR
115+ optimization looks for.
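
Roughly, the pattern in question is the kind of "identity" `match` that the `?` operator expands to, where each arm just rewraps the value it destructured; ideally this compiles to a plain move rather than taking the enum apart and rebuilding it. A sketch of the shape of such code (not the pass's exact definition):

```rust
// Sketch of the shape of code such a pass targets: an identity match,
// as produced by the `?` operator, that should become a simple move.
fn forward(r: Result<u32, String>) -> Result<u32, String> {
    match r {
        Ok(v) => Ok(v),
        Err(e) => Err(e),
    }
}

fn main() {
    println!("{:?}", forward(Ok(3)));
}
```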
116+ 
117+ Rust code is _ monomorphized_ , which means making copies of all the generic
118+ code with the type parameters replaced by concrete types. To do
119+ this, we need to collect a list of what concrete types to generate code for.
120+ This is called _ monomorphization collection_  and it happens at the MIR level.
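
For example:

```rust
fn largest<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

fn main() {
    // Monomorphization collection records that two concrete copies of
    // `largest` are needed: `largest::<u32>` and `largest::<f64>`.
    // Codegen later emits separate machine code for each copy.
    let x = largest(1u32, 2u32);
    let y = largest(0.5f64, 1.5f64);
    println!("{x} {y}");
}
```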
121+ 
122+ ### Code generation  
123+ 
124+ We then begin what is vaguely called _ code generation_  or _ codegen_ .
125+ The [ code generation stage] [ codegen ]  is when higher level
126+ representations of the source are turned into an executable binary. `rustc`
127+ uses LLVM for code generation. The first step is to convert the MIR
128+ to LLVM Intermediate Representation (LLVM IR). This is where the MIR
129+ is actually monomorphized, according to the list we created in the
130+ previous step.
131+ The LLVM IR is passed to LLVM, which does a lot more optimizations on it
132+ and then emits machine code. The machine code is basically assembly code with
133+ additional low-level types and annotations added (e.g. an ELF object or WASM).
134+ The different libraries/binaries are then linked together to produce the final
135+ binary.
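
If you want to look at these intermediate forms yourself, `rustc` can dump several of them (output file names below assume a crate named `main`):

```console
$ rustc --emit=mir main.rs        # dump the MIR to main.mir
$ rustc --emit=llvm-ir main.rs    # dump the LLVM IR to main.ll
$ rustc --emit=asm main.rs        # dump the target assembly to main.s
```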
108136
109137[ String interning ] : https://en.wikipedia.org/wiki/String_interning 
110138[ `rustc_lexer` ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html 
@@ -115,9 +143,9 @@ we'll talk about that later.
115143[ `rustc_parse` ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html 
116144[ parser ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html 
117145[ hir ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/index.html 
118- [ type inference ] : https://rustc-dev-guide.rust-lang.org/type-inference.html 
119- [ trait solving ] : https://rustc-dev-guide.rust-lang.org/traits/resolution.html 
120- [ type checking ] : https://rustc-dev-guide.rust-lang.org/type-checking.html 
146+ [ * type inference* ] : https://rustc-dev-guide.rust-lang.org/type-inference.html 
147+ [ * trait solving* ] : https://rustc-dev-guide.rust-lang.org/traits/resolution.html 
148+ [ * type checking* ] : https://rustc-dev-guide.rust-lang.org/type-checking.html 
121149[ mir ] : https://rustc-dev-guide.rust-lang.org/mir/index.html 
122150[ borrow checking ] : https://rustc-dev-guide.rust-lang.org/borrow_check.html 
123151[ mir-opt ] : https://rustc-dev-guide.rust-lang.org/mir/optimizations.html 
@@ -129,6 +157,8 @@ we'll talk about that later.
129157[ `rustc_parse::parser::Parser` ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html 
130158[ parse_external_mod ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_expand/module/fn.parse_external_mod.html 
131159[ rustc_parse_parser_dir ] : https://github.com/rust-lang/rust/tree/master/compiler/rustc_parse/src/parser 
160+ [ `hir::Ty` ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/hir/struct.Ty.html 
161+ [ `Ty<'tcx>` ] : https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Ty.html 
132162
133163## How it does it  
134164
@@ -323,6 +353,7 @@ For more details on bootstrapping, see
323353[ _bootstrapping_ ] : https://en.wikipedia.org/wiki/Bootstrapping_(compilers) 
324354[ rustc-bootstrap ] : building/bootstrapping.md 
325355
356+ <!-- 
326357# Unresolved Questions 
327358
328359- Does LLVM ever do optimizations in debug builds? 
@@ -332,7 +363,8 @@ For more details on bootstrapping, see
332363- What is the main source entry point for `X`? 
333364- Where do phases diverge for cross-compilation to machine code across 
334365  different platforms? 
335- 
366+ --> 
367+   
336368# References  
337369
338370-  Command line parsing