9. Kaleidoscope: Adding Debug Information
So far in the Kaleidoscope tutorials we've covered the basics of the language as a JIT engine and even added ahead-of-time compilation into the mix, so it is a full statically compiled language. But what happens if something goes wrong in one of the programs written in Kaleidoscope? How can a developer debug applications written in this wonderful new language? Up until now the answer is: you can't. This chapter adds debugging information to the generated object file so that it is available to debuggers.
Source-level debugging uses formatted data bound into the output binaries that helps the debugger map the state of the application to the original source code that created it. The exact format of the data depends on the target platform, but the general idea holds for all of them. To isolate front-end developers from the actual format, LLVM uses an abstract form of debug data that is based on the common DWARF debugging format. Internally, the LLVM target transforms the abstract representation into the actual target binary form.
Note
Debugging JIT code is rather complex, as it requires awareness of the runtime within the debugger to manage the execution, the runtime state, and so on. Such functionality is beyond the scope of this tutorial.
Why is it a hard problem?
Debugging is a tough problem for a number of reasons, mostly revolving around optimized code. Optimizations make keeping source-level information more difficult. In LLVM the original source location information is attached to each LLVM IR instruction. Optimization passes should keep the source location for any new instructions they create, but merged instructions only get to keep a single source location. This is generally the cause of the observed "jumping around" when debugging optimized code. Additionally, optimizations can change where variables live: a variable may be optimized out entirely, share memory with another variable, be kept only in registers, or otherwise become difficult to track. Thus, for the purposes of this tutorial we'll skip optimizations.
Setup for emitting debug information
Debug information in Ubiquity.NET.Llvm is created with the DIBuilder, which is similar to the InstructionBuilder. Using the DIBuilder requires a bit more knowledge of the general concepts of the DWARF debugging format and, in particular, of the debugging metadata in LLVM. In Ubiquity.NET.Llvm you need to create an instance of the DIBuilder class bound to a particular module. Such a builder is disposable and therefore requires a call to Dispose(). Normally this is handled with a using declaration or statement.
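The disposal pattern described above can be sketched without the library. The following is a minimal model only: `ScopedBuilder` and `ScopedBuilderDemo` are hypothetical names invented for illustration, not Ubiquity.NET.Llvm types. It assumes C# 8 or later, where a `ref struct` with a public `Dispose()` method can be used in a `using` declaration.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a builder that owns a native resource.
// It is NOT the library's DIBuilder; it only records what happens to it.
public readonly ref struct ScopedBuilder
{
    private readonly List<string> log;

    public ScopedBuilder(List<string> log)
    {
        this.log = log;
        log.Add("created");
    }

    public void Emit(string what) => log.Add("emit " + what);

    // A 'using' declaration binds to this public Dispose() method by pattern.
    public void Dispose() => log.Add("disposed");
}

public static class ScopedBuilderDemo
{
    public static List<string> Run()
    {
        var log = new List<string>();
        {
            using var builder = new ScopedBuilder(log);
            builder.Emit("compile unit");
        } // Dispose() runs here, when the enclosing block ends
        log.Add("after scope");
        return log;
    }
}
```

The point of the sketch is the ordering: the builder is released deterministically at the end of the block that declared it, before any later code runs.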
Another important item for debug information is the Compilation Unit. In Ubiquity.NET.Llvm that is the DICompileUnit. The compile unit is the top-level scope for storing debug information; generally it represents the full source file that was used to create the module (though with IR linking it is plausible for a module to have multiple compile units associated with it). Unlike a builder, it can't be constructed without more information. Therefore, Ubiquity.NET.Llvm provides overloads for the creation of a module that include the additional data needed to create the DICompileUnit for you. It is important to note that a DIBuilder may have ONLY one DICompileUnit, and that unit is used for all of the debug nodes it builds. It must be set when finalizing the debug information in order to properly resolve items to the compilation unit.
TODO: Discuss DIBuilder as a ref struct and that it must be passed through as part of the "visitor"
Another point to note is that the module ID is derived from the source file path, and the source file path is provided so that it becomes the root compile unit.
Important
It is important to note that a DIBuilder must be "finalized" in order to resolve internal forward references in the debug metadata. The exact details of this aren't generally relevant; just remember to call the Finish method somewhere after generating all code and debug information. (In Ubiquity.NET.Llvm this method is called Finish() to avoid conflicts with the .NET runtime-defined Finalize() and to avoid confusion over the term, as "finalization" has a very different meaning in .NET than what applies to the DIBuilder.)
The tutorial takes care of finishing the debug information in the generator's Generate method after completing code generation for the module.
public Module? Generate( IAstNode ast )
{
    ArgumentNullException.ThrowIfNull( ast );
    using var diBuilder = new DIBuilder(Module);
    CurrentDIBuilder = diBuilder.AsAlias(); // This gets the underlying unowned resource...
    var cu = diBuilder.CreateCompileUnit(SourceLanguage.C, SourcePath, "Kaleidoscope Compiler");
    Debug.Assert( cu != null, "Expected non null compile unit" );
    Debug.Assert( cu.File != null, "Expected non-null file for compile unit" );
    DoubleType = new DebugBasicType( Context.DoubleType, diBuilder, "double", DiTypeKind.Float );
    // use this instance and the DIBuilder to visit the AST
    ast.Accept( this );
    if(AnonymousFunctions.Count > 0)
    {
        var mainFunction = Module.CreateFunction( "main", Context.GetFunctionType( Context.VoidType ) );
        var block = mainFunction.AppendBasicBlock( "entry" );
        using var irBuilder = new InstructionBuilder( block );
        var printdFunc = Module.CreateFunction( "printd", Context.GetFunctionType( Context.DoubleType, Context.DoubleType ) );
        foreach(var anonFunc in AnonymousFunctions)
        {
            var value = irBuilder.Call( anonFunc );
            irBuilder.Call( printdFunc, value );
        }
        irBuilder.Return();
    }
    return Module;
}
Functions
With the basics of the DIBuilder and DICompileUnit set up for the module, it is time to focus on providing debug information for functions. This requires adding a few extra calls to build the context of the debug information for the function. The DWARF debug format that LLVM's debug metadata is based on calls these "subprograms". The new code builds a representation of the file the code is contained in as a new DIFile. In this case that is a bit redundant, as all the code comes from a single file and the compile unit already has the file info. However, that's not always true for all languages (e.g. those with some sort of include mechanism), so the file is created. It's not a problem, as LLVM will intern the file definition so that it won't actually end up with duplicates.
// Retrieves a Function for a prototype from the current module if it exists,
// otherwise declares the function and returns the newly declared function.
private Function GetOrDeclareFunction( Prototype prototype )
{
    Debug.Assert( CurrentDIBuilder is not null, "Internal error CurrentDIBuilder should be set in Generate already" );
    if(Module is null)
    {
        throw new InvalidOperationException( "ICE: Can't get or declare a function without an active module" );
    }
    if(Module.TryGetFunction( prototype.Name, out Function? function ))
    {
        return function;
    }
    // extern declarations don't get debug information
    Function retVal;
    if(prototype.IsExtern)
    {
        var llvmSignature = Context.GetFunctionType( Context.DoubleType, prototype.Parameters.Select( _ => Context.DoubleType ) );
        retVal = Module.CreateFunction( prototype.Name, llvmSignature );
    }
    else
    {
        var parameters = prototype.Parameters;
        // DICompileUnit and File are checked for null in constructor
        var debugFile = CurrentDIBuilder.CreateFile( CurrentDIBuilder.CompileUnit!.File!.FileName, CurrentDIBuilder.CompileUnit!.File.Directory );
        var signature = Context.CreateFunctionType(CurrentDIBuilder, DoubleType!, prototype.Parameters.Select( _ => DoubleType! ) );
        var lastParamLocation = parameters.Count > 0 ? parameters[ parameters.Count - 1 ].Location : prototype.Location;
        retVal = Module.CreateFunction( CurrentDIBuilder
                                      , scope: CurrentDIBuilder.CompileUnit
                                      , name: prototype.Name
                                      , linkageName: null
                                      , file: debugFile
                                      , line: (uint)prototype.Location.StartLine
                                      , signature
                                      , isLocalToUnit: false
                                      , isDefinition: true
                                      , scopeLine: (uint)lastParamLocation.EndLine
                                      , debugFlags: prototype.IsCompilerGenerated ? DebugInfoFlags.Artificial : DebugInfoFlags.Prototyped
                                      , isOptimized: false
                                      );
    }
    int index = 0;
    foreach(var argId in prototype.Parameters)
    {
        retVal.Parameters[ index ].Name = argId.Name;
        ++index;
    }
    return retVal;
}
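The remark that re-creating the file is harmless because the definition is interned can be pictured with a small model. This is a sketch of the general interning technique only, under the assumption that a file node is identified by its name and directory. `ToyFile` and `ToyFileInterner` are hypothetical names, and nothing here reflects how LLVM stores metadata internally.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical stand-in for a file descriptor node; not a library type.
public sealed record ToyFile( string FileName, string Directory );

public sealed class ToyFileInterner
{
    // One node per distinct (file name, directory) pair.
    private readonly Dictionary<(string, string), ToyFile> cache = new();

    public int Count => cache.Count;

    // Returns the existing node for an equal request, otherwise creates one.
    public ToyFile GetOrCreate( string fileName, string directory )
    {
        var key = (fileName, directory);
        if(!cache.TryGetValue( key, out ToyFile? file ))
        {
            file = new ToyFile( fileName, directory );
            cache.Add( key, file );
        }
        return file;
    }
}
```

Asking the model twice for the same name and directory hands back the same object, so creating a file node per function does not multiply nodes.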
Debug Locations
The AST contains full location information for each parsed node from the parse tree. This allows building debug location information for each node fairly easily. The general idea is to set the location in the InstructionBuilder so that it is applied to all instructions emitted until it is changed. This saves on manually adding the location on every instruction.
private void EmitLocation( IAstNode? node )
{
    DILocalScope? scope = null;
    if(LexicalBlocks.Count > 0)
    {
        scope = LexicalBlocks.Peek();
    }
    else if(InstructionBuilder.InsertFunction != null && InstructionBuilder.InsertFunction.DISubProgram != null)
    {
        scope = InstructionBuilder.InsertFunction.DISubProgram;
    }
    DILocation? loc = null;
    if(scope != null)
    {
        loc = new DILocation( InstructionBuilder.Context
                            , (uint)(node?.Location.StartLine ?? 0)
                            , (uint)(node?.Location.StartColumn ?? 0)
                            , scope
                            );
    }
    InstructionBuilder.SetDebugLocation( loc );
}
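The "set once, applied until changed" behavior described in this section can be modeled in a few lines. The types below (`ToyLocation`, `ToyInstructionBuilder`, `StickyLocationDemo`) are hypothetical and exist only for this sketch; they are not the library's InstructionBuilder. Only the method name `SetDebugLocation` is borrowed from the listing above.

```csharp
using System.Collections.Generic;

// Hypothetical source position; not a library type.
public readonly record struct ToyLocation( uint Line, uint Column );

// Toy builder: every emitted instruction is stamped with the current location.
public sealed class ToyInstructionBuilder
{
    private ToyLocation? currentLocation;

    public List<(string OpCode, ToyLocation? Location)> Emitted { get; } = new();

    // Passing null clears the location, as is done for prologue instructions.
    public void SetDebugLocation( ToyLocation? location ) => currentLocation = location;

    public void Emit( string opCode ) => Emitted.Add( (opCode, currentLocation) );
}

public static class StickyLocationDemo
{
    public static ToyInstructionBuilder Run()
    {
        var builder = new ToyInstructionBuilder();
        builder.SetDebugLocation( null );                     // prologue: no location
        builder.Emit( "alloca" );
        builder.SetDebugLocation( new ToyLocation( 3, 5 ) );  // set once...
        builder.Emit( "fadd" );                               // ...applies here
        builder.Emit( "fmul" );                               // ...and here
        builder.SetDebugLocation( new ToyLocation( 4, 1 ) );
        builder.Emit( "ret" );
        return builder;
    }
}
```

In this model the two arithmetic instructions share one location without the caller repeating it, which is the saving the section describes.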
Function Definition
The next step is to update the function definition with attached debug information. The definition starts by pushing a new lexical scope, the function's declaration (its DISubProgram), which serves as the parent scope for all the debug information generated for the function's implementation. The debug location is cleared from the builder while setting up the parameter variables with alloca, as before. Additionally, the debug information for each parameter is constructed. After the function is fully generated, the debug information for the function is finalized; this is needed to allow any optimizations to occur at the function level.
public override Value? Visit( FunctionDefinition definition )
{
    ArgumentNullException.ThrowIfNull( definition );
    Debug.Assert( InstructionBuilder is not null, "Internal error Instruction builder should be set in Generate already" );
    Debug.Assert( CurrentDIBuilder is not null, "Internal error CurrentDIBuilder should be set in Generate already" );
    var function = GetOrDeclareFunction( definition.Signature );
    if(!function.IsDeclaration)
    {
        throw new CodeGeneratorException( $"Function {function.Name} cannot be redefined in the same module" );
    }
    Debug.Assert( function.DISubProgram != null, "Expected function with non-null DISubProgram" );
    LexicalBlocks.Push( function.DISubProgram );
    try
    {
        var entryBlock = function.AppendBasicBlock( "entry" );
        InstructionBuilder.PositionAtEnd( entryBlock );
        // Unset the location for the prologue emission (leading instructions with no
        // location in a function are considered part of the prologue and the debugger
        // will run past them when breaking on a function)
        EmitLocation( null );
        using(NamedValues.EnterScope())
        {
            foreach(var param in definition.Signature.Parameters)
            {
                var argSlot = InstructionBuilder.Alloca( function.Context.DoubleType )
                                                .RegisterName( param.Name );
                AddDebugInfoForAlloca( argSlot, function, param );
                InstructionBuilder.Store( function.Parameters[ param.Index ], argSlot );
                NamedValues[ param.Name ] = argSlot;
            }
            foreach(LocalVariableDeclaration local in definition.LocalVariables)
            {
                var localSlot = InstructionBuilder.Alloca( function.Context.DoubleType )
                                                  .RegisterName( local.Name );
                AddDebugInfoForAlloca( localSlot, function, local );
                NamedValues[ local.Name ] = localSlot;
            }
            EmitBranchToNewBlock( "body" );
            var funcReturn = definition.Body.Accept( this ) ?? throw new CodeGeneratorException( ExpectValidFunc );
            InstructionBuilder.Return( funcReturn );
            CurrentDIBuilder.Finish( function.DISubProgram );
            function.Verify();
            if(definition.IsAnonymous)
            {
                function.AddAttribute( FunctionAttributeIndex.Function, "alwaysinline" )
                        .Linkage( Linkage.Private );
                AnonymousFunctions.Add( function );
            }
            return function;
        }
    }
    catch(CodeGeneratorException)
    {
        function.EraseFromParent();
        throw;
    }
}
Debug info for Parameters and Local Variables
Debug information for parameters and local variables is similar but not quite identical. Thus, two new overloaded helper methods named AddDebugInfoForAlloca handle attaching the correct debug information for parameters and local variables.
private void AddDebugInfoForAlloca( Alloca argSlot, Function function, ParameterDeclaration param)
{
    Debug.Assert( CurrentDIBuilder is not null, "Internal error CurrentDIBuilder should be set in Generate already" );
    uint line = ( uint )param.Location.StartLine;
    uint col = ( uint )param.Location.StartColumn;
    // Keep compiler happy on null checks by asserting on expectations
    // The items were created in this file with all necessary info so
    // these properties should never be null.
    Debug.Assert( function.DISubProgram != null, "expected function with non-null DISubProgram" );
    Debug.Assert( function.DISubProgram.File != null, "expected function with a non-null DISubProgram.File" );
    Debug.Assert( InstructionBuilder.InsertBlock != null, "expected Instruction builder with non-null insertion block" );
    DILocalVariable debugVar = CurrentDIBuilder.CreateArgument( scope: function.DISubProgram
                                                              , name: param.Name
                                                              , file: function.DISubProgram.File
                                                              , line
                                                              , type: DoubleType!
                                                              , alwaysPreserve: true
                                                              , debugFlags: DebugInfoFlags.None
                                                              , argNo: checked(( ushort )( param.Index + 1 )) // Debug index starts at 1!
                                                              );
    CurrentDIBuilder.InsertDeclare( storage: argSlot
                                  , varInfo: debugVar
                                  , location: new DILocation( Context, line, col, function.DISubProgram )
                                  , insertAtEnd: InstructionBuilder.InsertBlock
                                  );
}
private void AddDebugInfoForAlloca( Alloca argSlot, Function function, LocalVariableDeclaration localVar )
{
    Debug.Assert( CurrentDIBuilder is not null, "Internal error CurrentDIBuilder should be set in Generate already" );
    uint line = ( uint )localVar.Location.StartLine;
    uint col = ( uint )localVar.Location.StartColumn;
    // Keep compiler happy on null checks by asserting on expectations
    // The items were created in this file with all necessary info so
    // these properties should never be null.
    Debug.Assert( function.DISubProgram != null, "expected function with non-null DISubProgram" );
    Debug.Assert( function.DISubProgram.File != null, "expected function with non-null DISubProgram.File" );
    Debug.Assert( InstructionBuilder.InsertBlock != null, "expected Instruction builder with non-null insertion block" );
    DILocalVariable debugVar = CurrentDIBuilder.CreateLocalVariable( scope: function.DISubProgram
                                                                   , name: localVar.Name
                                                                   , file: function.DISubProgram.File
                                                                   , line
                                                                   , type: DoubleType!
                                                                   , alwaysPreserve: false
                                                                   , debugFlags: DebugInfoFlags.None
                                                                   );
    CurrentDIBuilder.InsertDeclare( storage: argSlot
                                  , varInfo: debugVar
                                  , location: new DILocation( Context, line, col, function.DISubProgram )
                                  , insertAtEnd: InstructionBuilder.InsertBlock
                                  );
}
Conclusion
Adding debugging information in LLVM IR is rather straightforward. The bulk of the problem is in tracking the source location information in the parser. Fortunately for the Ubiquity.NET.Llvm version of Kaleidoscope, the ANTLR4-generated parsers do this for us already! Thus, combining the parser with Ubiquity.NET.Llvm makes building a full compiler for custom languages, including debug support, a lot easier. The most "complex" part is handling the correct ownership semantics for a DIBuilder, but that is generally enforced by the compiler as it is a ref struct type.