3. Kaleidoscope: Generating LLVM IR
This chapter focuses on the basics of transforming the ANTLR parse tree into LLVM IR. The general goal is to parse Kaleidoscope source code to generate a BitcodeModule representing the source as LLVM IR.
Basic code flow
The Main function starts out by calling WaitForDebugger(). This is a useful utility that doesn't do anything in a release build, but in debug builds will check for an attached debugger and, if none is found, it will wait for one. This works around a missing feature of the .NET Standard C# project system that does not support launching mixed native+managed debugging. When you need to go all the way into debugging the LLVM code, you can launch the debug version of the app without debugging, then attach to it and select native and managed debugging. (Hopefully this feature will be restored to these projects in the future so this rather hacky trick isn't needed...)
UPDATE: As of VS2019 this hack is no longer needed as it is now possible to set an SDK project to allow native debugging directly from the project's debugging settings page. (Yeah! 😤)
Initializing Llvm.NET
The underlying LLVM library requires initialization for it's internal data, furthermore Llvm.NET must load the actual underlying library specific to the current system architecture. Thus, the Llvm.NET as a whole requires initialization.
using static Llvm.NET.Interop.Library;
// [...]
using( InitializeLLVM() )
{
// [...]
}
The initialization returns an IDisposable so that the calling application can shutdown/cleanup resources
and potentially re-initialize for a different target, if desired. This application only needs to generate
one module and exit so it just applies a standard C# using
scope to ensure proper cleanup.
Initializing Targets
LLVM supports a number of target architectures, however for the Kaleidoscope tutorials the only supported target is the one the host application is running on. So, only the native target is registered.
RegisterNative();
Generator and REPL loop
This chapter supports the simple expressions of the language that are parsed and generated to an LLVM Value. This forms the foundation of the Kaleidoscope samples outer generation loop. Subsequent, chapters will focus on additional functionality including JIT compilation, Debugging information, and Native Module generation.
var parser = new Parser( LanguageFeatureLevel );
using var generator = new CodeGenerator( parser.GlobalState );
var readyState = new ReadyStateManager( );
// Create Observable chain to provide the REPL implementation
var replSeq = parser.Parse( Console.In.ToObservableStatements( ShowPrompt ), ShowCodeGenError )
.GenerateResults( generator, ShowCodeGenError );
// Run the sequence
using( replSeq.Subscribe( ShowResults ) )
{
}
Console.WriteLine( generator.Module.WriteToString( ) );
This adds the GenerateResults
operator to the sequence from Chapter 2 to generate the LLVM IR.
GenerateResults Rx.NET operator
The GenerateResults operator is responsible for transforming the input sequence of AST nodes into a sequence of Llvm.NET.Values. The implementation of the operator is common to all result types and is provided in the Kaleidoscope.Runtime assembly.
public static IObservable<T> GenerateResults<T>( this IObservable<IAstNode> nodes
, IKaleidoscopeCodeGenerator<T> generator
, Action<CodeGeneratorException> codeGeneratorExceptionHandler
)
{
return from n in nodes
let v = generator.Generate( n, codeGeneratorExceptionHandler )
where !EqualityComparer<T>.Default.Equals(v, default) // v != default( T )
select v;
}
Handling errors in code generation
In many cases successfully parsing the input code isn't sufficient to determine correctness of the code in a given context. In particular attempting to re-define a function already defined in the current module is a problem. (Later, chapters deal with re-definition by using a new module for each function, but that is more a side-effect of working with the JIT) To handle errors in the generation, the REPL loop will catch any CodeGenerationException and call the error handler callback provided by the application. The application handles the error by indicating the error to the user. This allows the application to continue processing input while still informing the user that what they tried to do didn't work.
private static void ShowCodeGenError( CodeGeneratorException ex )
{
var color = Console.ForegroundColor;
Console.ForegroundColor = ConsoleColor.Red;
Console.Error.WriteLine( ex.Message );
Console.ForegroundColor = color;
}
Results processing
Processing the results for this chapter, is pretty simple, it just prints out a textual form of the generated LLVM IR.
private static void ShowResults( Value resultValue )
{
switch( resultValue )
{
case IrFunction function:
Console.WriteLine( "Defined function: {0}", function.Name );
Console.WriteLine( function );
break;
default:
throw new InvalidOperationException( );
}
}
Code generation
Initialization
The code generation maintains state for the transformation as private members.
private readonly DynamicRuntimeState RuntimeState;
private readonly Context Context;
private readonly InstructionBuilder InstructionBuilder;
private readonly IDictionary<string, Value> NamedValues = new Dictionary<string, Value>( );
These are initialized in the constructor
public CodeGenerator( DynamicRuntimeState globalState )
: base(null)
{
globalState.ValidateNotNull( nameof( globalState ) );
if( globalState.LanguageLevel > LanguageLevel.SimpleExpressions )
{
throw new ArgumentException( "Language features not supported by this generator", nameof(globalState) );
}
RuntimeState = globalState;
Context = new Context( );
Module = Context.CreateBitcodeModule( "Kaleidoscope" );
InstructionBuilder = new InstructionBuilder( Context );
}
The exact set of members varies for each chapter but the basic ideas remain across each chapter.
Name | Description |
---|---|
RuntimeState | Contains information about the language and dynamic runtime state needed for resolving operator precedence |
Context | Current Context for LLVM generation |
Module | Current BitcodeModule to generate LLVM IR in |
InstructionBuilder | Current InstructionBuilder used to generate LLVM IR instructions |
NamedValues | Contains a mapping of named variables to the generated Value |
Generate Method
The Generate method is used by the REPL loop to generate the final output from a parse tree. The common implementation simply passes the tree to the AST generating parse tree visitor to generate the AST and process the AST nodes from that. Due to the simplicity of the Kaleidoscope language the AST is more of a List than a tree. In fact, the AstBuilder creates an enumerable sequence of nodes that are either a function declaration or a function definition. For the interactive mode only a single element is parsed at a time. However, when doing Ahead of Time (AOT) compilation in Chapter 8 this sequence can contain many declarations and definitions in any order. To handle the different node types the generate method simply uses pattern matching to detect the type of node to dispatch to a visitor function for that kind of node.
public Value Generate( IAstNode ast, Action<CodeGeneratorException> codeGenerationErroHandler )
{
try
{
// Prototypes, including extern are ignored as AST generation
// adds them to the RuntimeState so that already has the declarations
return ( ast is FunctionDefinition ) ? ast.Accept( this ) : null;
}
catch( CodeGeneratorException ex ) when( codeGenerationErroHandler != null )
{
codeGenerationErroHandler( ex );
return null;
}
}
Function Declarations
Function declarations don't actually generate any code. Instead they are captured and added to a collection of declarations used in validating subsequent function calls when generating the AST for function definitions.
// -----------------------------------------------------------------------
// <copyright file="ProtoType.cs" company="Ubiquity.NET Contributors">
// Copyright (c) Ubiquity.NET Contributors. All rights reserved.
// </copyright>
// -----------------------------------------------------------------------
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;
using System.Text;
using Ubiquity.ArgValidators;
using static Kaleidoscope.Grammar.KaleidoscopeParser;
namespace Kaleidoscope.Grammar.AST
{
/// <summary>Encapsulates data describing a function signature</summary>
/// <remarks>
/// This is used to enable consistent representation when the prototype
/// is synthesized during code generation (i.e. Anonymous expressions)
/// </remarks>
public class Prototype
: IAstNode
{
/// <summary>Initializes a new instance of the <see cref="Prototype"/> class.</summary>
/// <param name="location">Source location covering the complete signature</param>
/// <param name="name"><see cref="Name"/> containing the name of the function</param>
/// <param name="parameters">Collection of <see cref="Name"/>s for the names of each parameter</param>
public Prototype( SourceSpan location, string name, IEnumerable<ParameterDeclaration> parameters )
{
Location = location;
Name = name;
Parameters = parameters.ToImmutableArray();
}
/// <summary>Initializes a new instance of the <see cref="Prototype"/> class.</summary>
/// <param name="location">Source location covering the complete signature</param>
/// <param name="name">name of the function</param>
/// <param name="isCompilerGenerated">Indicates if this is a compiler generated prototype</param>
/// <param name="parameters">names of each parameter</param>
public Prototype( SourceSpan location, string name, bool isCompilerGenerated, params ParameterDeclaration[ ] parameters )
: this( location, name, ( IEnumerable<ParameterDeclaration> )parameters )
{
IsCompilerGenerated = isCompilerGenerated;
}
/// <summary>Initializes a new instance of the <see cref="Prototype"/> class.</summary>
/// <param name="location">Source location covering the complete signature</param>
/// <param name="name">name of the function</param>
/// <param name="parameters">names of each parameter</param>
public Prototype( SourceSpan location, string name, params ParameterDeclaration[ ] parameters )
: this( location, name, (IEnumerable<ParameterDeclaration>) parameters )
{
}
/// <summary>Initializes a new instance of the <see cref="Prototype"/> class.</summary>
/// <param name="ctx"><see cref="PrototypeContext"/> to extract parameter and source location information from</param>
public Prototype( PrototypeContext ctx )
: this( ctx.GetSourceSpan(), ctx.Name, ctx.Parameters.Select( p => new ParameterDeclaration(p.Span, p.Name, p.Index ) ) )
{
IsExtern = ctx.Parent is ExternalDeclarationContext;
}
/// <summary>Initializes a new instance of the <see cref="Prototype"/> class.</summary>
/// <param name="name">Name of the function</param>
/// <param name="parameters">Names of each parameter</param>
/// <remarks>
/// This version of the constructor is used to create synthetic prototypes that don't
/// exist within the original source.
/// </remarks>
public Prototype( string name, params string[] parameters)
: this(default, name, parameters.Select( (n,i)=>new ParameterDeclaration(default, name, i)))
{
}
/// <inheritdoc/>
public SourceSpan Location { get; }
/// <summary>Gets the name of the function</summary>
public string Name { get; }
/// <summary>Gets a value indicating whether the function prototype is an extern declaration</summary>
public bool IsExtern { get; }
/// <summary>Gets a value indicating whether the function prototype is generated internally by compiler</summary>
public bool IsCompilerGenerated { get; }
/// <summary>Gets the parameters for the function</summary>
public IReadOnlyList<ParameterDeclaration> Parameters { get; }
/// <inheritdoc/>
public TResult Accept<TResult>( IAstVisitor<TResult> visitor ) => visitor.ValidateNotNull( nameof( visitor ) ).Visit( this );
/// <inheritdoc/>
public IEnumerable<IAstNode> Children => Parameters;
/// <inheritdoc/>
public override string ToString( )
{
var bldr = new StringBuilder( );
if(IsExtern)
{
bldr.Append( "[extern]" );
}
if(IsCompilerGenerated)
{
bldr.Append( "[CompilerGenerated]" );
}
bldr.Append( Name );
bldr.Append( '(' );
if(Parameters.Count > 0)
{
bldr.Append( string.Join( ", ", Parameters.Select( p => p.ToString( ) ) ) );
}
bldr.Append( ')' );
return bldr.ToString( );
}
}
}
Function Definition
Functions with bodies (e.g. not just a declaration to a function defined elsewhere) are handled via the VisitFunctionDefinition() Method.
public override Value Visit( FunctionDefinition definition )
{
definition.ValidateNotNull( nameof( definition ) );
var function = GetOrDeclareFunction( definition.Signature );
if( !function.IsDeclaration )
{
throw new CodeGeneratorException( $"Function {function.Name} cannot be redefined in the same module" );
}
try
{
var entryBlock = function.AppendBasicBlock( "entry" );
InstructionBuilder.PositionAtEnd( entryBlock );
NamedValues.Clear( );
foreach( var param in definition.Signature.Parameters )
{
NamedValues[ param.Name ] = function.Parameters[ param.Index ];
}
var funcReturn = definition.Body.Accept( this );
InstructionBuilder.Return( funcReturn );
function.Verify( );
return function;
}
catch( CodeGeneratorException )
{
function.EraseFromParent( );
throw;
}
}
VisitFunctionDefinition() simply extracts the function prototype from the AST node. A private utility method GetOrDeclareFunction() is used to get an existing function or declare a new one.
// Retrieves a Function for a prototype from the current module if it exists,
// otherwise declares the function and returns the newly declared function.
private IrFunction GetOrDeclareFunction( Prototype prototype )
{
var function = Module.GetFunction( prototype.Name );
if( function != null )
{
return function;
}
var llvmSignature = Context.GetFunctionType( Context.DoubleType, prototype.Parameters.Select( _ => Context.DoubleType ) );
var retVal = Module.AddFunction( prototype.Name, llvmSignature );
int index = 0;
foreach( var argId in prototype.Parameters )
{
retVal.Parameters[ index ].Name = argId.Name;
++index;
}
return retVal;
}
GetOrDeclareFunction() will first attempt to get an existing function and if found returns that function. Otherwise it creates a function signature type then adds a function to the module with the given name and signature and adds the parameter names to the function. In LLVM the signature only contains type information and no names, allowing for sharing the same signature for completely different functions.
The function and the expression representing the body of the function is then used to emit IR for the function.
The generation verifies that the function is a declaration (e.g. does not have a body) as Kaleidoscope doesn't support any sort of overloaded functions.
The generation of a function starts by constructing a basic block for the entry point of the function and attaches the InstructionBuilder to the end of that block. (It's empty so it is technically at the beginning but placing it at the end it will track the end position as new instructions are added so that each instruction added will go on the end of the block). At this point there will only be the one block as the language doesn't yet have support for control flow. (That is introduced in Chapter 5)
The NamedValues map is cleared and each of the parameters is mapped in the NamedValues map to its argument value in IR. The body of the function is visited to produce an LLVM Value. The visiting will, in turn add instructions, and possibly new blocks, as needed to represent the body expression in proper execution order.
If generating the body results in an error, then the function is removed from the parent and the exception propagates up. This allows the user to define the function again, if appropriate.
Finally, a return instruction is applied to return the result of the expression followed by a verification of the function to ensure internal consistency. (Generally the verify is not used in production releases as it is an expensive operation to perform on every function. But when building up a language generator it is quite useful to detect errors early.)
Top Level Expression
Top level expressions in Kaleidoscope are transformed into an anonymous function definition by the AstBuilder. Since this chapter is focused on generating the IR module there isn't any special handling needed for a top level expression - they are simply just another function definition. (JIT execution of the top level expression comes in the next chapter)
Constant expression
In Kaleidoscope all values are floating point and constants are represented in LLVM IR as ConstantFP
The AST provides the value of the constant as a C# double
.
// -----------------------------------------------------------------------
// <copyright file="ConstantExpression.cs" company="Ubiquity.NET Contributors">
// Copyright (c) Ubiquity.NET Contributors. All rights reserved.
// </copyright>
// -----------------------------------------------------------------------
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using Ubiquity.ArgValidators;
namespace Kaleidoscope.Grammar.AST
{
public class ConstantExpression
: IExpression
{
public ConstantExpression( SourceSpan location, double value )
{
Value = value;
Location = location;
}
public double Value { get; }
public SourceSpan Location { get; }
public TResult Accept<TResult>( IAstVisitor<TResult> visitor ) => visitor.ValidateNotNull(nameof(visitor)).Visit( this );
public IEnumerable<IAstNode> Children => Enumerable.Empty<IAstNode>( );
public override string ToString( )
{
return Value.ToString( CultureInfo.CurrentCulture );
}
}
}
Generation of the LLVM IR for a constant is quite simple.
public override Value Visit( ConstantExpression constant )
{
constant.ValidateNotNull( nameof( constant ) );
return Context.CreateConstant( constant.Value );
}
Note
The constant value is uniqued in LLVM so that multiple calls given the same input value will produce the same LLVM Value. LLvm.NET honors this and is implemented in a way to ensure that reference equality reflects the identity of the uniqued values correctly.
Variable reference expression
References to variables in Kaleidoscope, like most other languages, use a name. In this chapter the support of variables is rather simple. The Variable expression generator assumes the variable is declared somewhere else already and simply looks up the value from the private map. At this stage of the development of Kaleidoscope the only place where the named values are generated are function arguments, later chapters will introduce loop induction variables and variable assignment. The implementation uses a standard TryGet pattern to retrieve the value or throw an exception if the variable doesn't exist.
public override Value Visit( VariableReferenceExpression reference )
{
reference.ValidateNotNull( nameof( reference ) );
if( !NamedValues.TryGetValue( reference.Name, out Value value ) )
{
// Source input is validated by the parser and AstBuilder, therefore
// this is the result of an internal error in the generator rather
// then some sort of user error.
throw new CodeGeneratorException( $"ICE: Unknown variable name: {reference.Name}" );
}
return value;
}
Binary Operator Expression
Things start to get a good bit more interesting with binary operators. The AST node for an expression is a simple empty "tagging" interface. Since the interface also requires the IAstNode interface it contains support for walking the chain of operators that form an expression in left to right order, accounting for precedence.
// -----------------------------------------------------------------------
// <copyright file="IExpression.cs" company="Ubiquity.NET Contributors">
// Copyright (c) Ubiquity.NET Contributors. All rights reserved.
// </copyright>
// -----------------------------------------------------------------------
namespace Kaleidoscope.Grammar.AST
{
/// <summary>This is a grouping interface for all AST nodes that are valid expressions</summary>
/// <remarks>
/// This is a grouping interface to allow other parts of the system to distinguish between
/// an arbitrary node and one that is ultimately an expression. This helps to ensure correctness
/// (e.g. a function declaration is not valid as an argument to an operator, only an expression is.
/// </remarks>
public interface IExpression
: IAstNode
{
}
}
Generation of an expression consists a simple visitor method to emit the code for the operands and the actual operator.
public override Value Visit( BinaryOperatorExpression binaryOperator )
{
binaryOperator.ValidateNotNull( nameof( binaryOperator ) );
switch( binaryOperator.Op )
{
case BuiltInOperatorKind.Less:
{
var tmp = InstructionBuilder.Compare( RealPredicate.UnorderedOrLessThan
, binaryOperator.Left.Accept( this )
, binaryOperator.Right.Accept( this )
).RegisterName( "cmptmp" );
return InstructionBuilder.UIToFPCast( tmp, InstructionBuilder.Context.DoubleType )
.RegisterName( "booltmp" );
}
case BuiltInOperatorKind.Pow:
{
var pow = GetOrDeclareFunction( new Prototype( "llvm.pow.f64", "value", "power" ) );
return InstructionBuilder.Call( pow
, binaryOperator.Left.Accept( this )
, binaryOperator.Right.Accept( this )
).RegisterName( "powtmp" );
}
case BuiltInOperatorKind.Add:
return InstructionBuilder.FAdd( binaryOperator.Left.Accept( this )
, binaryOperator.Right.Accept( this )
).RegisterName( "addtmp" );
case BuiltInOperatorKind.Subtract:
return InstructionBuilder.FSub( binaryOperator.Left.Accept( this )
, binaryOperator.Right.Accept( this )
).RegisterName( "subtmp" );
case BuiltInOperatorKind.Multiply:
return InstructionBuilder.FMul( binaryOperator.Left.Accept( this )
, binaryOperator.Right.Accept( this )
).RegisterName( "multmp" );
case BuiltInOperatorKind.Divide:
return InstructionBuilder.FDiv( binaryOperator.Left.Accept( this )
, binaryOperator.Right.Accept( this )
).RegisterName( "divtmp" );
default:
throw new CodeGeneratorException( $"ICE: Invalid binary operator {binaryOperator.Op}" );
}
}
The process of transforming the operator starts by generating an LLVM IR Value from the right-hand side expression. A simple switch statement based on the token type of the operator is used to generate the actual LLVM IR instruction(s) for the operator.
LLVM has strict rules on the operators and their values for the IR, in particular the types of the operands must be identical and, usually must also match the type of the result. For the Kaleidoscope language that's easy to manage as it only supports one data type. Other languages might need to insert additional conversion logic as part of emitting the operators. (Kaleidoscope does this for boolean values when supporting conditional control flow in Chapter 5)
The Generation of the IR instructions uses the current InstructionBuilder and the RegisterName extension method to provide a name for the result in LLVM IR. The name helps with readability of the IR when generated in the textual form of LLVM IR assembly. A nice feature of LLVM is that it will automatically handle duplicate names by appending a value to the name automatically so that generators don't need to keep track of the names to ensure uniqueness.
The Less
operator uses a floating point Unordered less than IR instruction followed by an integer to
float cast to translate the LLVM IR i1 result into a floating point value needed by Kaleidoscope.
The ^
operator for exponentiation uses the llvm.pow.f64
intrinsic to perform the exponentiation as
efficiently as the back-end generator can.
Examples
Llvm.NET Kaleidoscope Interpreter - SimpleExpressions
Ready># simple top level expression
>4+5;
Defined function: __anon_expr$0
define double @"__anon_expr$0"() {
entry:
ret double 9.000000e+00
}
Ready>
Ready># function definitions
>def foo(a b) a*a + 2*a*b + b*b;
Defined function: foo
define double @foo(double %a, double %b) {
entry:
%multmp = fmul double %a, %a
%multmp1 = fmul double 2.000000e+00, %a
%multmp2 = fmul double %multmp1, %b
%addtmp = fadd double %multmp, %multmp2
%multmp3 = fmul double %b, %b
%addtmp4 = fadd double %addtmp, %multmp3
ret double %addtmp4
}
Ready>
Ready>def bar(a) foo(a, 4.0) + bar(31337);
Defined function: bar
define double @bar(double %a) {
entry:
%calltmp = call double @foo(double %a, double 4.000000e+00)
%calltmp1 = call double @bar(double 3.133700e+04)
%addtmp = fadd double %calltmp, %calltmp1
ret double %addtmp
}
Ready>
Ready># external declaration
>extern cos(x);
Defined function: cos
declare double @cos(double)
Ready>
Ready># calling external function
>cos(1.234);
Defined function: __anon_expr$1
define double @"__anon_expr$1"() {
entry:
%calltmp = call double @cos(double 1.234000e+00)
ret double %calltmp
}
Ready>