9. Kaleidoscope: Adding Debug Information
So far, the Kaleidoscope tutorials have covered the basics of the language as a JIT engine and even added ahead-of-time compilation to the mix, making it a fully statically compiled language. But what happens if something goes wrong in a program written in Kaleidoscope? How can a developer debug applications written in this wonderful new language? Up until now, the answer has been: you can't. This chapter adds debugging information to the generated object file so that it is available to debuggers.
Source-level debugging uses formatted data bound into the output binaries that helps the debugger map the state of the application back to the original source code that created it. The exact format of the data depends on the target platform, but the general idea holds for all of them. To isolate front-end developers from the actual format, LLVM uses an abstract form of debug data that is based on the common DWARF debugging format. Internally, the LLVM target transforms the abstract representation into the actual target binary form.
Note
Debugging JIT code is rather complex, as it requires the debugger to be aware of the runtime in order to manage execution, runtime state, and so on. Such functionality is beyond the scope of this tutorial.
Why is it a hard problem?
Debugging is a tough problem for a number of reasons, mostly revolving around optimized code. Optimizations make it harder to preserve source-level information. In LLVM, the original source location information is attached to each LLVM IR instruction. Optimization passes should keep the source location for any new instructions they create, but merged instructions only get to keep a single source location. This is generally the cause of the observed "jumping around" when stepping through optimized code. Additionally, optimizations can change where variables live: a variable may be optimized out entirely, share memory with another variable, be held in a register, or otherwise become difficult to track. Thus, for the purposes of this tutorial we'll disable optimizations. (The DisableOptimizations property of the CodeGenerator was added previously to aid in observing the effects of optimizations and serves to disable the optimizations for debugging in this chapter.)
Setup for emitting debug information
Debug information in Ubiquity.NET.Llvm is created with the DebugInfoBuilder. This is similar to the InstructionBuilder. Using the DebugInfoBuilder requires a bit more knowledge of the general concepts of the DWARF debugging format, and in particular of the debugging metadata in LLVM. In Ubiquity.NET.Llvm you don't need to, and in fact can't, create an instance of the DebugInfoBuilder class. Instead, it is lazily constructed internally by a BitcodeModule and is accessible through the DIBuilder property. This simplifies creating the builder, since it is bound to the module.
Another important item for debug information is the Compilation Unit. In Ubiquity.NET.Llvm that is the DICompileUnit. The compile unit is the top-level scope for storing debug information. There is only ever one per module, and it generally represents the full source file that was used to create the module. Since the compile unit, like the builder, is tied to the module, it is exposed as the DICompileUnit property. However, unlike the builder, it isn't something that a module can construct automatically without more information. Therefore, Ubiquity.NET.Llvm provides overloads for creating a module that take the additional data needed to create the DICompileUnit for you.
The body of the updated InitializeModuleAndPassManager() function looks like this:
Module = Context.CreateBitcodeModule( Path.GetFileName( sourcePath ), SourceLanguage.C, sourcePath, "Kaleidoscope Compiler" );
Debug.Assert( Module.DICompileUnit != null, "Expected non null compile unit" );
Debug.Assert( Module.DICompileUnit.File != null, "Expected non-null file for compile unit" );
Module.TargetTriple = machine.Triple;
Module.Layout = TargetMachine.TargetData;
DoubleType = new DebugBasicType( Context.DoubleType, Module, "double", DiTypeKind.Float );
FunctionPassManager = new FunctionPassManager( Module );
FunctionPassManager.AddPromoteMemoryToRegisterPass( );
if( !DisableOptimizations )
{
    FunctionPassManager.AddInstructionCombiningPass( )
                       .AddReassociatePass( )
                       .AddGVNPass( )
                       .AddCFGSimplificationPass( );
}
FunctionPassManager.Initialize( );
There are a few points of interest here. First, the compile unit is created for the Kaleidoscope language, yet it uses the SourceLanguage.C value. This is because a debugger is unlikely to understand the Kaleidoscope language, runtime, or calling conventions. (We just invented it and are only now setting up debugger support, after all!) The good news is that the generated code follows the C language ABI (generally a good idea unless you have a really good reason not to). Therefore, describing the code as C is fairly accurate, and it allows the debugger to call functions and execute them.
Another point to note is that the module ID is derived from the file name portion of the source file path, and the full source file path is also provided so that the file becomes the root compile unit.
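As a small illustration of how the module ID relates to the source path, the sketch below uses the standard .NET Path.GetFileName method, the same call that appears in the module creation code above. The ModuleNaming class and the sample path in the usage note are made up for this example and are not part of the tutorial or of Ubiquity.NET.Llvm.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of Ubiquity.NET.Llvm.
public static class ModuleNaming
{
    // Returns the file name portion of a source path, which is the value the
    // tutorial passes as the module ID when creating the module.
    public static string ModuleIdFor( string sourcePath )
    {
        if( sourcePath == null )
        {
            throw new ArgumentNullException( nameof( sourcePath ) );
        }

        return Path.GetFileName( sourcePath );
    }
}
```

For a made-up path such as samples/fibi.kls, ModuleNaming.ModuleIdFor returns the file name fibi.kls, while the full path is what would be handed to the compile unit.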
Important
It is important to note that the DIBuilder must be "finalized" in order to resolve internal forward references in the debug metadata. The exact details of this aren't generally relevant; just remember to call the Finish method somewhere after generating all code and debug information. (In Ubiquity.NET.Llvm this method is called Finish() to avoid conflicts with the .NET runtime-defined Finalize() and to avoid confusion over the term, as "finalization" has a very different meaning in .NET than the one that applies to the DIBuilder.)
The tutorial takes care of finishing the debug information in the generator's Generate method after completing code generation for the module.
public OptionalValue<BitcodeModule> Generate( IAstNode ast )
{
    ast.ValidateNotNull( nameof( ast ) );
    ast.Accept( this );
    if( AnonymousFunctions.Count > 0 )
    {
        var mainFunction = Module.CreateFunction( "main", Context.GetFunctionType( Context.VoidType ) );
        var block = mainFunction.AppendBasicBlock( "entry" );
        var irBuilder = new InstructionBuilder( block );
        var printdFunc = Module.CreateFunction( "printd", Context.GetFunctionType( Context.DoubleType, Context.DoubleType ) );
        foreach( var anonFunc in AnonymousFunctions )
        {
            var value = irBuilder.Call( anonFunc );
            irBuilder.Call( printdFunc, value );
        }
        irBuilder.Return( );
        // Use always inline and Dead Code Elimination module passes to inline all of the
        // anonymous functions. This effectively strips all the calls just generated for main()
        // and inlines each of the anonymous functions directly into main, dropping the now
        // unused original anonymous functions all while retaining all of the original source
        // debug information locations.
        var mpm = new ModulePassManager( );
        mpm.AddAlwaysInlinerPass( )
           .AddGlobalDCEPass( )
           .Run( Module );
        Module.DIBuilder.Finish( );
    }
    return OptionalValue.Create( Module );
}
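The need to finish the builder can be pictured with a toy model. The sketch below is not the Ubiquity.NET.Llvm API and does not reflect how LLVM stores metadata internally; ToyNode and ToyDebugBuilder are hypothetical types that only illustrate the general idea of nodes that refer to other nodes by name before those nodes exist, with a final Finish step that resolves every pending reference.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Toy model only: hypothetical types, NOT the Ubiquity.NET.Llvm API.
public sealed class ToyNode
{
    public ToyNode( string name ) { Name = name; }

    public string Name { get; }

    // Name of a node this one refers to; cleared once the reference is resolved.
    public string? PendingTargetName { get; set; }

    // Null until Finish() resolves the reference.
    public ToyNode? Target { get; set; }
}

public sealed class ToyDebugBuilder
{
    private readonly Dictionary<string, ToyNode> nodes = new Dictionary<string, ToyNode>( );

    public bool IsFinished { get; private set; }

    // Creates a node that may refer to another node that has not been created yet.
    public ToyNode Create( string name, string? refersTo = null )
    {
        if( IsFinished )
        {
            throw new InvalidOperationException( "builder already finished" );
        }

        var node = new ToyNode( name ) { PendingTargetName = refersTo };
        nodes.Add( name, node );
        return node;
    }

    // Resolves every pending forward reference; fails if a target was never created.
    public void Finish( )
    {
        foreach( ToyNode node in nodes.Values )
        {
            if( node.PendingTargetName == null )
            {
                continue;
            }

            if( !nodes.TryGetValue( node.PendingTargetName, out ToyNode? target ) )
            {
                throw new InvalidOperationException( $"unresolved reference to '{node.PendingTargetName}'" );
            }

            node.Target = target;
            node.PendingTargetName = null;
        }

        IsFinished = true;
    }
}
```

In this model a variable node can be created before the type node it refers to; its Target stays null until Finish() runs. That is the reason the call has to come after all code and debug information has been generated.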
Functions
With the basics of the DIBuilder and DICompileUnit set up for the module, it is time to focus on providing debug information for functions. This requires adding a few extra calls to build the context of the debug information for the function. The DWARF debug format, which LLVM's debug metadata is based on, calls these "SubPrograms". The new code builds a representation of the file that contains the code as a new DIFile. In this case that is a bit redundant, as all the code comes from a single file and the compile unit already has the file info. However, that's not true for all languages (e.g. those with some sort of include mechanism), so the file is created. This is not a problem, as LLVM interns the file definition so that it won't actually end up with duplicates.
// Retrieves a Function for a prototype from the current module if it exists,
// otherwise declares the function and returns the newly declared function.
private IrFunction GetOrDeclareFunction( Prototype prototype )
{
    if( Module.TryGetFunction( prototype.Name, out IrFunction? function ) )
    {
        return function;
    }
    // extern declarations don't get debug information
    IrFunction retVal;
    if( prototype.IsExtern )
    {
        var llvmSignature = Context.GetFunctionType( Context.DoubleType, prototype.Parameters.Select( _ => Context.DoubleType ) );
        retVal = Module.CreateFunction( prototype.Name, llvmSignature );
    }
    else
    {
        var parameters = prototype.Parameters;
        // DICompileUnit and File are checked for null in constructor
        var debugFile = Module.DIBuilder.CreateFile( Module.DICompileUnit!.File!.FileName, Module.DICompileUnit!.File.Directory );
        var signature = Context.CreateFunctionType( Module.DIBuilder, DoubleType, prototype.Parameters.Select( _ => DoubleType ) );
        var lastParamLocation = parameters.Count > 0 ? parameters[ parameters.Count - 1 ].Location : prototype.Location;
        retVal = Module.CreateFunction( scope: Module.DICompileUnit
                                      , name: prototype.Name
                                      , linkageName: null
                                      , file: debugFile
                                      , line: ( uint )prototype.Location.StartLine
                                      , signature
                                      , isLocalToUnit: false
                                      , isDefinition: true
                                      , scopeLine: ( uint )lastParamLocation.EndLine
                                      , debugFlags: prototype.IsCompilerGenerated ? DebugInfoFlags.Artificial : DebugInfoFlags.Prototyped
                                      , isOptimized: false
                                      );
    }
    int index = 0;
    foreach( var argId in prototype.Parameters )
    {
        retVal.Parameters[ index ].Name = argId.Name;
        ++index;
    }
    return retVal;
}
Debug Locations
The AST contains full location information for each parsed node from the parse tree. This allows building debug location information for each node fairly easily. The general idea is to set the location in the InstructionBuilder so that it is applied to all instructions emitted until it is changed. This saves on manually adding the location on every instruction.
private void EmitLocation( IAstNode? node )
{
    DILocalScope? scope = null;
    if( LexicalBlocks.Count > 0 )
    {
        scope = LexicalBlocks.Peek( );
    }
    else if( InstructionBuilder.InsertFunction != null && InstructionBuilder.InsertFunction.DISubProgram != null )
    {
        scope = InstructionBuilder.InsertFunction.DISubProgram;
    }
    DILocation? loc = null;
    if( scope != null )
    {
        loc = new DILocation( InstructionBuilder.Context
                            , ( uint )( node?.Location.StartLine ?? 0 )
                            , ( uint )( node?.Location.StartColumn ?? 0 )
                            , scope
                            );
    }
    InstructionBuilder.SetDebugLocation( loc );
}
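The "set once, applies until changed" behavior described above can be illustrated with a toy builder. This is a minimal sketch under stated assumptions: ToyLocation, ToyInstruction, and ToyBuilder are hypothetical types invented for this example, not the Ubiquity.NET.Llvm InstructionBuilder, and they only model how a current location is stamped onto each instruction emitted afterwards.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical types, NOT the Ubiquity.NET.Llvm API.
public readonly struct ToyLocation
{
    public ToyLocation( uint line, uint column )
    {
        Line = line;
        Column = column;
    }

    public uint Line { get; }

    public uint Column { get; }
}

public sealed class ToyInstruction
{
    public ToyInstruction( string text, ToyLocation? location )
    {
        Text = text;
        Location = location;
    }

    public string Text { get; }

    // Null means the instruction carries no debug location.
    public ToyLocation? Location { get; }
}

public sealed class ToyBuilder
{
    private ToyLocation? current;

    public List<ToyInstruction> Emitted { get; } = new List<ToyInstruction>( );

    // Sets (or clears, when null) the location applied to subsequently emitted instructions.
    public void SetDebugLocation( ToyLocation? location )
    {
        current = location;
    }

    // Every emitted instruction picks up whatever location is current at the time.
    public ToyInstruction Emit( string text )
    {
        var instruction = new ToyInstruction( text, current );
        Emitted.Add( instruction );
        return instruction;
    }
}
```

With this model, calling SetDebugLocation once before visiting a node is enough for every instruction emitted for that node to carry the node's line and column, and passing null clears the location again. This is the pattern EmitLocation follows.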
Function Definition
The next step is to update the function definition with attached debug information. The definition starts by pushing a new lexical scope, the function's DISubProgram. This serves as the parent scope for all the debug information generated for the function's implementation. The debug location is then cleared from the builder while the parameter variables are set up with alloca, as before. Additionally, the debug information for each parameter is constructed. After the function is fully generated, the debug information for the function is finalized; this is needed to allow any optimizations to occur at the function level.
public override Value? Visit( FunctionDefinition definition )
{
    definition.ValidateNotNull( nameof( definition ) );
    var function = GetOrDeclareFunction( definition.Signature );
    if( !function.IsDeclaration )
    {
        throw new CodeGeneratorException( $"Function {function.Name} cannot be redefined in the same module" );
    }
    Debug.Assert( function.DISubProgram != null, "Expected function with non-null DISubProgram" );
    LexicalBlocks.Push( function.DISubProgram );
    try
    {
        var entryBlock = function.AppendBasicBlock( "entry" );
        InstructionBuilder.PositionAtEnd( entryBlock );
        // Unset the location for the prologue emission (leading instructions with no
        // location in a function are considered part of the prologue and the debugger
        // will run past them when breaking on a function)
        EmitLocation( null );
        using( NamedValues.EnterScope( ) )
        {
            foreach( var param in definition.Signature.Parameters )
            {
                var argSlot = InstructionBuilder.Alloca( function.Context.DoubleType )
                                                .RegisterName( param.Name );
                AddDebugInfoForAlloca( argSlot, function, param );
                InstructionBuilder.Store( function.Parameters[ param.Index ], argSlot );
                NamedValues[ param.Name ] = argSlot;
            }
            foreach( LocalVariableDeclaration local in definition.LocalVariables )
            {
                var localSlot = InstructionBuilder.Alloca( function.Context.DoubleType )
                                                  .RegisterName( local.Name );
                AddDebugInfoForAlloca( localSlot, function, local );
                NamedValues[ local.Name ] = localSlot;
            }
            EmitBranchToNewBlock( "body" );
            var funcReturn = definition.Body.Accept( this ) ?? throw new CodeGeneratorException( ExpectValidFunc );
            InstructionBuilder.Return( funcReturn );
            Module.DIBuilder.Finish( function.DISubProgram );
            function.Verify( );
            FunctionPassManager.Run( function );
            if( definition.IsAnonymous )
            {
                function.AddAttribute( FunctionAttributeIndex.Function, AttributeKind.AlwaysInline )
                        .Linkage( Linkage.Private );
                AnonymousFunctions.Add( function );
            }
            return function;
        }
    }
    catch( CodeGeneratorException )
    {
        function.EraseFromParent( );
        throw;
    }
}
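The prologue rule mentioned in the code comment above (leading instructions without a location are treated as prologue) can be sketched as a simple search. The PrologueSketch class below is a hypothetical helper written for this illustration. It only mirrors the rule as stated in the comment and is not how a real debugger or the LLVM backend computes the end of the prologue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Ubiquity.NET.Llvm.
public static class PrologueSketch
{
    // Takes the source line of each instruction in emission order, where null means
    // the instruction has no debug location. Returns the index of the first
    // instruction that has a location, or -1 when no instruction has one.
    public static int FirstIndexWithLocation( IReadOnlyList<uint?> lines )
    {
        if( lines == null )
        {
            throw new ArgumentNullException( nameof( lines ) );
        }

        for( int i = 0; i < lines.Count; ++i )
        {
            if( lines[ i ].HasValue )
            {
                return i;
            }
        }

        return -1;
    }
}
```

In terms of the function definition above, the alloca and store instructions emitted after EmitLocation( null ) would be the null entries at the start of the list, and the first instruction emitted for the function body would be the one a breakpoint on the function is expected to stop at.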
Debug info for Parameters and Local Variables
Debug information for parameters and local variables is similar but not quite identical. Thus, two new overloaded helper methods named AddDebugInfoForAlloca handle attaching the correct debug information for parameters and local variables.
private void AddDebugInfoForAlloca( Alloca argSlot, IrFunction function, ParameterDeclaration param )
{
    uint line = ( uint )param.Location.StartLine;
    uint col = ( uint )param.Location.StartColumn;
    // Keep compiler happy on null checks by asserting on expectations
    // The items were created in this file with all necessary info so
    // these properties should never be null.
    Debug.Assert( function.DISubProgram != null, "expected function with non-null DISubProgram" );
    Debug.Assert( function.DISubProgram.File != null, "expected function with a non-null DISubProgram.File" );
    Debug.Assert( InstructionBuilder.InsertBlock != null, "expected Instruction builder with non-null insertion block" );
    DILocalVariable debugVar = Module.DIBuilder.CreateArgument( scope: function.DISubProgram
                                                              , name: param.Name
                                                              , file: function.DISubProgram.File
                                                              , line
                                                              , type: DoubleType
                                                              , alwaysPreserve: true
                                                              , debugFlags: DebugInfoFlags.None
                                                              , argNo: checked(( ushort )( param.Index + 1 )) // Debug index starts at 1!
                                                              );
    Module.DIBuilder.InsertDeclare( storage: argSlot
                                  , varInfo: debugVar
                                  , location: new DILocation( Context, line, col, function.DISubProgram )
                                  , insertAtEnd: InstructionBuilder.InsertBlock
                                  );
}
private void AddDebugInfoForAlloca( Alloca argSlot, IrFunction function, LocalVariableDeclaration localVar )
{
    uint line = ( uint )localVar.Location.StartLine;
    uint col = ( uint )localVar.Location.StartColumn;
    // Keep compiler happy on null checks by asserting on expectations
    // The items were created in this file with all necessary info so
    // these properties should never be null.
    Debug.Assert( function.DISubProgram != null, "expected function with non-null DISubProgram" );
    Debug.Assert( function.DISubProgram.File != null, "expected function with non-null DISubProgram.File" );
    Debug.Assert( InstructionBuilder.InsertBlock != null, "expected Instruction builder with non-null insertion block" );
    DILocalVariable debugVar = Module.DIBuilder.CreateLocalVariable( scope: function.DISubProgram
                                                                   , name: localVar.Name
                                                                   , file: function.DISubProgram.File
                                                                   , line
                                                                   , type: DoubleType
                                                                   , alwaysPreserve: false
                                                                   , debugFlags: DebugInfoFlags.None
                                                                   );
    Module.DIBuilder.InsertDeclare( storage: argSlot
                                  , varInfo: debugVar
                                  , location: new DILocation( Context, line, col, function.DISubProgram )
                                  , insertAtEnd: InstructionBuilder.InsertBlock
                                  );
}
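One detail in the parameter overload worth isolating is the argNo expression: the parameter index is zero-based, the debug argument number is one-based, and the conversion to ushort is wrapped in checked. The sketch below pulls that expression into a hypothetical helper (ArgumentNumbering.ToArgNo is a name invented for this example, not an API of the library) to show what the checked context does.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Ubiquity.NET.Llvm.
public static class ArgumentNumbering
{
    // Converts a zero-based parameter index into the one-based argument number.
    // The checked context makes a result that does not fit in a ushort throw
    // OverflowException instead of silently wrapping around.
    public static ushort ToArgNo( int zeroBasedIndex )
    {
        return checked( ( ushort )( zeroBasedIndex + 1 ) );
    }
}
```

Here the first parameter (index 0) maps to argument number 1. An index that would produce a number larger than ushort.MaxValue raises an OverflowException rather than yielding a wrapped and misleading argument number.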
Conclusion
Adding debugging information in LLVM IR is rather straightforward. The bulk of the problem is in tracking the source location information in the parser. Fortunately for the Ubiquity.NET.Llvm version of Kaleidoscope, the ANTLR4-generated parsers do this for us already! Thus, combining the parser with Ubiquity.NET.Llvm makes building a full compiler for custom languages, including debug support, a lot easier.