Performance is a Feature!

Analysing .NET start-up time with Flamegraphs

2020-03-03T00:00:00+00:00

Recently I gave a talk at the NYAN Conference called “From ‘dotnet run’ to ‘hello world’”:

In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending its time during start-up:

From 'dotnet run' to 'hello world' from Matt Warren

This post is a step-by-step guide to that demo.

Code Sample

For this exercise I deliberately only look at what the .NET Runtime is doing during program start-up, so I ensure the minimum amount of user code is running, hence the following ‘Hello World’:

using System;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            Console.WriteLine("Press <ENTER> to exit");
            Console.ReadLine();
        }
    }
}

The Console.ReadLine() call is added because I want to ensure the process doesn’t exit whilst PerfView is still collecting data.

Data Collection

PerfView is a very powerful program, but not the most user-friendly of tools, so I’ve put together a step-by-step guide:

Download and run a recent version of ‘PerfView.exe’
Click ‘Run a command’ (or ‘Alt-R’) and “collect data while the command is running”
Ensure that you’ve entered values for:
1. “Command”
2. “Current Dir”
Tick ‘Cpu Samples’ if it isn’t already selected
Set ‘Max Collect Sec’ to 15 seconds (because our ‘HelloWorld’ app never exits, we need to ensure PerfView stops collecting data at some point)
Ensure that ‘.NET Symbol Collection’ is selected
Hit ‘Run Command’

If you then inspect the log you can see that it’s collecting data, obtaining symbols and then finally writing everything out to a .zip file. Once the process is complete you should see the newly created file in the left-hand pane of the main UI; in this case it’s called ‘PerfViewData.etl.zip’.

Data Processing

Once you have your ‘.etl.zip’ file, double-click on it and you will see a tree-view with all the available data. Now, select ‘CPU Stacks’ and you’ll be presented with a view like this:

Notice there are a lot of ‘?’ characters in the list. This means that PerfView is not able to work out the method names, as it hasn’t resolved the necessary symbols for the Runtime dlls. Let’s fix that:

Open ‘CPU Stacks’
In the list, select the ‘HelloWorld’ process (PerfView collects data machine-wide)
In the ‘GroupPats’ drop-down, select ‘[no grouping]’
Optionally, change the ‘Symbol Path’ from the default to something else
In the ‘By name’ tab, hit ‘Ctrl+A’ to select all the rows
Right-click and select ‘Lookup Symbols’ (or just hit ‘Alt+S’)

Now the ‘CPU Stacks’ view should look something like this:

Finally, we can get the data we want:

Select the ‘Flame Graph’ tab
Change ‘GroupPats’ to one of the following for a better flame graph:
1. [group module entries] {%}!=>module $1
2. [group class entries] {%!*}.%(=>class $1;{%!*}::=>class $1
Change ‘Fold%’ to a higher number, maybe 3%, to get rid of any thin bars (any higher and you start to lose information)
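
To make the first ‘GroupPats’ pattern a little more concrete, here is a rough sketch of what {%}!=>module $1 does to a frame name, approximated with a .NET regular expression. This is only an illustration: the GroupByModule helper is hypothetical, PerfView uses its own pattern syntax rather than Regex, and the sketch ignores the entry-point tracking that ‘=>’ adds on top of a plain rename.

```csharp
using System;
using System.Text.RegularExpressions;

static class Program
{
    // Approximation of PerfView's "{%}!=>module $1":
    // capture everything before the first '!' (the module name)
    // and replace the whole frame name with "module <name>".
    private static readonly Regex ModulePattern = new Regex(@"^(.*?)!.*$");

    // Hypothetical helper, not part of PerfView.
    public static string GroupByModule(string frame) =>
        ModulePattern.Replace(frame, "module $1");

    static void Main()
    {
        string[] frames =
        {
            "hostpolicy!create_hostpolicy_context",
            "coreclr!CorHost2::Start",
            "coreclr!CorHost2::CreateAppDomain",
        };

        foreach (string frame in frames)
        {
            Console.WriteLine($"{frame} -> {GroupByModule(frame)}");
        }
    }
}
```

With this kind of grouping every frame from the same dll collapses into one bar, which is why the flame graph becomes so much easier to read.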

Now, at this point I actually recommend exporting the PerfView data into a format that can be loaded into https://speedscope.app/ as it gives you a much better experience. To do this click File -> Save View As and then in the ‘Save as type’ box select Speed Scope Format. Once that’s done you can ‘browse’ that file at speedscope.app, or if you want you can just take a look at one I’ve already created.

Note: If you’ve never encountered ‘flamegraphs’ before, I really recommend reading this excellent explanation by Julia Evans:

perf & flamegraphs pic.twitter.com/duzWs2hoLT
— 🔎Julia Evans🔍 (@b0rk) December 26, 2017

Analysis of .NET Runtime Startup

Finally, we can answer our original question:

Where does the .NET Runtime spend time during start-up?

Here’s the data from the flamegraph summarised as text, with links to the corresponding functions in the ‘.NET Core Runtime’ source code:

Entire Application - 100% - 233.28ms
Everything except helloworld!wmain - 21%
helloworld!wmain - 79% - 184.57ms
1. hostpolicy!create_hostpolicy_context - 30% - 70.92ms here
2. hostpolicy!create_coreclr - 22% - 50.51ms here
  1. coreclr!CorHost2::Start - 9% - 20.98ms here
  2. coreclr!CorHost2::CreateAppDomain - 10% - 23.52ms here
3. hostpolicy!runapp - 20% - 46.20ms here, ends up calling into Assembly::ExecuteMainMethod here
  1. coreclr!RunMain - 9.9% - 23.12ms here
  2. coreclr!RunStartupHooks - 8.1% - 19.00ms here
4. hostfxr!resolve_frameworks_for_app - 3.4% - 7.89ms here
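
The percentages in the list are simply each entry’s time divided by the 233.28ms total. As a quick sanity check, the sketch below recomputes a few of them from the millisecond figures; the PercentOfTotal helper is a name made up for this example and the list above rounds more coarsely than the one decimal place used here.

```csharp
using System;

static class Program
{
    // Hypothetical helper: share of the total, as a percentage rounded to 1 decimal place.
    public static double PercentOfTotal(double ms, double totalMs) =>
        Math.Round(ms / totalMs * 100.0, 1);

    static void Main()
    {
        const double totalMs = 233.28; // 'Entire Application' from the list above

        var entries = new[]
        {
            ("helloworld!wmain", 184.57),
            ("hostpolicy!create_hostpolicy_context", 70.92),
            ("hostpolicy!create_coreclr", 50.51),
            ("hostpolicy!runapp", 46.20),
            ("hostfxr!resolve_frameworks_for_app", 7.89),
        };

        foreach (var (name, ms) in entries)
        {
            Console.WriteLine($"{name} - {PercentOfTotal(ms, totalMs)}% - {ms}ms");
        }
    }
}
```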

So, the main places that the runtime spends time are:

30% of total time is spent launching the runtime, controlled via the ‘host policy’, which mostly takes place in hostpolicy!create_hostpolicy_context
22% of time is spent on initialisation of the runtime itself and the initial (and only) AppDomain it creates; this can be seen in CorHost2::Start (native) and CorHost2::CreateAppDomain (managed). For more info on this see The 68 things the CLR does before executing a single line of your code
20% was used JITting and executing the Main method in our ‘Hello World’ code sample, this started in Assembly::ExecuteMainMethod above.

To confirm the last point, we can return to PerfView and take a look at the ‘JIT Stats Summary’ it produces. From the main menu, under ‘Advanced Group’ -> ‘JIT Stats’ we see that 23.1 ms or 9.1% of the total CPU time was spent JITing:

Under the hood of "Default Interface Methods"

2020-02-19T00:00:00+00:00

Background

‘Default Interface Methods’ (DIM), sometimes referred to as ‘Default Implementations in Interfaces’, appeared in C# 8. In case you’ve never heard of the feature, here are some links to get you started:

Default implementations in interfaces (official announcement)
Default Interface Methods (C# Language Proposal), here’s some notable sections:
Champion “default interface methods” (including links for ‘Language Design Meeting’ notes)
Tutorial: Update interfaces with default interface methods in C# 8.0

Also, there are quite a few other blog posts discussing this feature, but as you can see opinion is split on whether it’s useful or not:

But this post isn’t about what they are, how you can use them or if they’re useful or not. Instead we will be exploring how ‘Default Interface Methods’ work under-the-hood, looking at what the .NET Core Runtime has to do to make them work and how the feature was developed.

Table of Contents

Background
Development Timeline and PRs
Default Interface Methods ‘in action’
Enabling Methods on an Interface
Resolving the Method Dispatch
Analysis of FindDefaultInterfaceImplementation(..)
Diamond Inheritance Problem
Summary

Development Timeline and PRs

First of all, there are a few places you can go to get a ‘high-level’ understanding of what was done:

GitHub Project for Default Interface Methods
List of all the PRs done during the Project
To see which parts of the runtime are affected, you can search for ‘FEATURE_DEFAULT_INTERFACES’ in the .NET (Core) Runtime source code as the entire feature is behind a #define.
In addition, you can see the corresponding work being done in Mono, Epic: Default Interface Implementation #6961 and Update default interfaces support #11267

Initial work, Prototype and Timeline

The entire prototype is split across several PRs, running from March - July 2017:
All the initial work was merged into master in December 2017 in Merge dev/defaultintf to master #15370
The entire feature was turned on by default in March 2019 in Enable FeatureDefaultInterfaces unconditionally #23225
It was then announced/released in May 2019.

Interesting PRs done after the prototype (newest -> oldest)

Once the prototype was merged in, there was additional feature work done to ensure that DIMs worked across different scenarios:

Bug fixes done since the Prototype (newest -> oldest)

In addition, there were various bug fixes done to ensure that existing parts of the CLR played nicely with DIMs:

Possible future work

Finally, there’s no guarantee if or when this will be done, but here are the remaining issues associated with the project:

Default Interface Methods ‘in action’

Now that we’ve seen what was done, let’s look at what that all means, starting with this code that simply demonstrates ‘Default Interface Methods’ in action:

interface INormal {
    void Normal();
}

interface IDefaultMethod {
    void Default() => WriteLine("IDefaultMethod.Default");
}

class CNormal : INormal {
    public void Normal() => WriteLine("CNormal.Normal");
}

class CDefault : IDefaultMethod {
    // Nothing to do here!
}

class CDefaultOwnImpl : IDefaultMethod {
    void IDefaultMethod.Default() => WriteLine("CDefaultOwnImpl.IDefaultMethod.Default");
}

// Test out the Normal/DefaultMethod Interfaces
INormal iNormal = new CNormal();
iNormal.Normal(); // prints "CNormal.Normal"

IDefaultMethod iDefault = new CDefault();
iDefault.Default(); // prints "IDefaultMethod.Default"

IDefaultMethod iDefaultOwnImpl = new CDefaultOwnImpl();
iDefaultOwnImpl.Default(); // prints "CDefaultOwnImpl.IDefaultMethod.Default"
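
One detail that’s easy to miss in the sample above is that every call goes through an interface-typed variable. A default method is not inherited as a member of the implementing class, so it can only be reached via the interface. Here’s a minimal sketch of that, assuming C# 8 or later; the interface is redeclared with a string return value (rather than void) purely so the example is self-contained and its result is easy to inspect.

```csharp
using System;

interface IDefaultMethod
{
    // Redeclared for this sketch: returns a string instead of writing to the console.
    string Default() => "IDefaultMethod.Default";
}

class CDefault : IDefaultMethod
{
    // Nothing to do here!
}

static class Program
{
    public static string CallViaInterface(CDefault instance)
    {
        // instance.Default();  // does not compile: 'CDefault' has no member 'Default'

        // The default implementation is only visible through the interface type.
        IDefaultMethod viaInterface = instance;
        return viaInterface.Default();
    }

    static void Main()
    {
        Console.WriteLine(CallViaInterface(new CDefault()));
    }
}
```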

The first way we can understand how they are implemented is by using Type.GetInterfaceMap(Type) (which actually had to be fixed to work with DIMs), this can be done with code like this:

private static void ShowInterfaceMapping(Type @implementation, Type @interface) {
    InterfaceMapping map = @implementation.GetInterfaceMap(@interface);
    Console.WriteLine($"{map.TargetType}: GetInterfaceMap({map.InterfaceType})");
    for (int counter = 0; counter < map.InterfaceMethods.Length; counter++) {
        MethodInfo im = map.InterfaceMethods[counter];
        MethodInfo tm = map.TargetMethods[counter];
        Console.WriteLine($"   {im.DeclaringType}::{im.Name} --> {tm.DeclaringType}::{tm.Name} ({(im == tm ? "same" : "different")})");
        Console.WriteLine("       MethodHandle 0x{0:X} --> MethodHandle 0x{1:X}",
            im.MethodHandle.Value.ToInt64(), tm.MethodHandle.Value.ToInt64());
        Console.WriteLine("       FunctionPtr  0x{0:X} --> FunctionPtr  0x{1:X}",
            im.MethodHandle.GetFunctionPointer().ToInt64(), tm.MethodHandle.GetFunctionPointer().ToInt64());
    }
    Console.WriteLine();
}

Which gives the following output:

//ShowInterfaceMapping(typeof(CNormal), @interface: typeof(INormal));
//ShowInterfaceMapping(typeof(CDefault), @interface: typeof(IDefaultMethod));
//ShowInterfaceMapping(typeof(CDefaultOwnImpl), @interface: typeof(IDefaultMethod));

TestApp.CNormal: GetInterfaceMap(TestApp.INormal)
   TestApp.INormal::Normal --> TestApp.CNormal::Normal (different)
       MethodHandle 0x7FF993916A80 --> MethodHandle 0x7FF993916B10
       FunctionPtr  0x7FF99385FC50 --> FunctionPtr  0x7FF993861880

TestApp.CDefault: GetInterfaceMap(TestApp.IDefaultMethod)
   TestApp.IDefaultMethod::Default --> TestApp.IDefaultMethod::Default (same)
       MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916BD8
       FunctionPtr  0x7FF99385FC78 --> FunctionPtr  0x7FF99385FC78

TestApp.CDefaultOwnImpl: GetInterfaceMap(TestApp.IDefaultMethod)
   TestApp.IDefaultMethod::Default --> TestApp.CDefaultOwnImpl::TestApp.IDefaultMethod.Default (different)
       MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916D10
       FunctionPtr  0x7FF99385FC78 --> FunctionPtr  0x7FF9938663A0

So here we can see that in the case of IDefaultMethod interface on the CDefault class the interface and method implementations are the same. As you can see, in the other scenarios the interface method maps to a different method implementation.
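
The same ‘same vs. different’ observation can be turned into a small check. The sketch below is a hypothetical helper (UsesDefaultImplementation is not a framework API) that reports whether a class relies on the interface’s own implementation, by testing if every target method is still declared on the interface. It assumes a runtime where GetInterfaceMap understands DIMs (.NET Core 3.0 or later).

```csharp
using System;
using System.Linq;
using System.Reflection;

interface IDefaultMethod
{
    void Default() => Console.WriteLine("IDefaultMethod.Default");
}

class CDefault : IDefaultMethod { }

class CDefaultOwnImpl : IDefaultMethod
{
    void IDefaultMethod.Default() => Console.WriteLine("CDefaultOwnImpl.IDefaultMethod.Default");
}

static class Program
{
    // Hypothetical helper: true when every interface method maps back to a
    // method declared on the interface itself, i.e. the class supplies nothing.
    public static bool UsesDefaultImplementation(Type implementation, Type @interface)
    {
        InterfaceMapping map = implementation.GetInterfaceMap(@interface);
        return map.TargetMethods.All(target => target?.DeclaringType == @interface);
    }

    static void Main()
    {
        Console.WriteLine(UsesDefaultImplementation(typeof(CDefault), typeof(IDefaultMethod)));
        Console.WriteLine(UsesDefaultImplementation(typeof(CDefaultOwnImpl), typeof(IDefaultMethod)));
    }
}
```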

But let’s look a bit lower, making use of WinDBG and the SOS extension to get a peek into the internal ‘data structures’ that the runtime uses.

First, let’s take a look at the MethodTable (dumpmt) for the INormal interface:

> dumpmt -md 00007ff8bcc31dd8
EEClass:         00007FF8BCC2C420
Module:          00007FF8BCC0F788
Name:            TestApp.INormal
mdToken:         0000000002000002
File:            C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize:        0x0
ComponentSize:   0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
           Entry       MethodDesc    JIT Name
00007FF8BCB70580 00007FF8BCC31DC8   NONE TestApp.INormal.Normal()

So we can see that the interface has an entry for the Normal() method, as expected, but let’s look in more detail at the MethodDesc (dumpmd):

> dumpmd 00007FF8BCC31DC8                                    
Method Name:          TestApp.INormal.Normal()               
Class:                00007ff8bcc2c420                       
MethodTable:          00007ff8bcc31dd8                       
mdToken:              0000000006000001                       
Module:               00007ff8bcc0f788                       
IsJitted:             no                                     
Current CodeAddr:     ffffffffffffffff                       
Version History:                                             
  ILCodeVersion:      0000000000000000                       
  ReJIT ID:           0                                      
  IL Addr:            0000000000000000                       
     CodeAddr:           0000000000000000  (MinOptJitted)    
     NativeCodeVersion:  0000000000000000 

So whilst the method exists in the interface definition, it’s clear that the method has not been jitted (IsJitted: no) and in fact it never will be, as it can never be executed.

Now let’s compare that output with the one for the IDefaultMethod interface, again the MethodTable (dumpmt) and the MethodDesc (dumpmd):

> dumpmt -md 00007ff8bcc31e68
EEClass:         00007FF8BCC2C498
Module:          00007FF8BCC0F788
Name:            TestApp.IDefaultMethod
mdToken:         0000000002000003
File:            C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize:        0x0
ComponentSize:   0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
           Entry       MethodDesc    JIT Name
00007FF8BCB70590 00007FF8BCC31E58    JIT TestApp.IDefaultMethod.Default()

> dumpmd 00007FF8BCC31E58
Method Name:          TestApp.IDefaultMethod.Default()
Class:                00007ff8bcc2c498
MethodTable:          00007ff8bcc31e68
mdToken:              0000000006000002
Module:               00007ff8bcc0f788
IsJitted:             yes
Current CodeAddr:     00007ff8bcb765c0
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            0000000000000000
     CodeAddr:           00007ff8bcb765c0  (MinOptJitted)
     NativeCodeVersion:  0000000000000000

Here we see something very different, the MethodDesc entry in the MethodTable actually has jitted, executable code associated with it.
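
If you don’t have WinDBG to hand, a rough approximation of the same distinction is available through reflection. This sketch doesn’t show whether a method has been jitted, it only shows whether the interface method is abstract or carries an IL body that could be jitted. The HasImplementation helper is a name invented for this example.

```csharp
using System;
using System.Reflection;

interface INormal
{
    void Normal();
}

interface IDefaultMethod
{
    void Default() => Console.WriteLine("IDefaultMethod.Default");
}

static class Program
{
    // Hypothetical helper: an interface method with an implementation is not abstract.
    public static bool HasImplementation(Type interfaceType, string methodName)
    {
        MethodInfo method = interfaceType.GetMethod(methodName)
            ?? throw new ArgumentException($"No method '{methodName}' on {interfaceType}");
        return !method.IsAbstract;
    }

    static void Main()
    {
        foreach (var (type, name) in new[] { (typeof(INormal), "Normal"), (typeof(IDefaultMethod), "Default") })
        {
            MethodInfo method = type.GetMethod(name);
            // GetMethodBody() returns null for abstract methods, as there is no IL to inspect.
            int? ilBytes = method.GetMethodBody()?.GetILAsByteArray()?.Length;
            Console.WriteLine($"{type.Name}.{name}: IsAbstract={method.IsAbstract}, IL bytes={ilBytes?.ToString() ?? "none"}");
        }
    }
}
```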

Enabling Methods on an Interface

So we’ve seen that ‘default interface methods’ are wired up by the runtime, but how does that happen?

Firstly, it’s very illuminating to look at the initial prototype of the feature in CoreCLR PR #10505, because we can understand at the lowest level what the feature is actually enabling, from /src/vm/classcompat.cpp:

Here we see why DIM didn’t require any changes to the .NET ‘Intermediate Language’ (IL) op-codes, instead they are enabled by relaxing a previous restriction. Before this change, you weren’t able to add ‘virtual, non-abstract’ or ‘non-virtual’ methods to an interface:

“Virtual Non-Abstract Interface Method.” (BFA_VIRTUAL_NONAB_INT_METHOD)
“Nonvirtual Instance Interface Method.” (BFA_NONVIRT_INST_INT_METHOD)

This ties in with the proposed changes to the ECMA-335 specification, from the ‘Default interface methods’ design doc:

The major changes are:

Interfaces are now allowed to have instance methods (both virtual and non-virtual). Previously we only allowed abstract virtual methods.

Interfaces obviously still can’t have instance fields.

Interface methods are allowed to MethodImpl other interface methods the interface requires (but we require the MethodImpls to be final to keep things simple) - i.e. an interface is allowed to provide (or override) an implementation of another interface’s method

However, just allowing ‘virtual, non-abstract’ or ‘non-virtual’ methods to exist on an interface is only the start, the runtime then needs to allow code to call those methods and that is far harder!
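
To see what those relaxed rules look like from the C# side, here’s a small sketch (assuming C# 8 or later, with made-up type names). It shows an interface providing an implementation of another interface’s method, plus a ‘sealed’ member, which is how C# spells a non-virtual instance method on an interface.

```csharp
using System;

interface IA
{
    // Abstract virtual method: the only kind allowed before this feature.
    string M();

    // 'sealed' makes this a non-virtual instance method; implementers can't replace it.
    sealed string Twice() => M() + M();
}

interface IB : IA
{
    // An interface providing an implementation of another interface's method.
    string IA.M() => "IB";
}

class C : IB
{
    // No members needed, IB supplies IA.M.
}

static class Program
{
    public static string CallM()
    {
        IA a = new C();
        return a.M();
    }

    public static string CallTwice()
    {
        IA a = new C();
        return a.Twice();
    }

    static void Main()
    {
        Console.WriteLine(CallM());
        Console.WriteLine(CallTwice());
    }
}
```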

Resolving the Method Dispatch

In .NET, since version 2.0, all interface methods calls have taken place via a mechanism known as Virtual Stub Dispatch:

Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.

Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.

For more information I recommend reading the section on C#’s slotmaps in the excellent article on ‘Interface Dispatch’ by Lukas Atkinson.
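
As a very loose mental model of why this helps, the sketch below shows the ‘resolve slowly once, then reuse the answer’ idea behind stub dispatch. It is a toy written for this post, not how the runtime does it: the real mechanism patches call-sites with small assembly stubs, whereas ToyDispatchCache (a made-up class) just uses a dictionary keyed on the concrete type and the interface method.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: caches the result of an expensive lookup per (type, method) pair.
sealed class ToyDispatchCache
{
    private readonly Dictionary<(Type, string), Func<string>> _cache =
        new Dictionary<(Type, string), Func<string>>();
    private readonly Func<Type, string, Func<string>> _slowResolver;

    public ToyDispatchCache(Func<Type, string, Func<string>> slowResolver) =>
        _slowResolver = slowResolver;

    // How many times the slow path had to run.
    public int SlowResolves { get; private set; }

    public string Invoke(Type concreteType, string interfaceMethod)
    {
        var key = (concreteType, interfaceMethod);
        if (!_cache.TryGetValue(key, out Func<string> target))
        {
            // Slow path: work out the target once and remember it.
            SlowResolves++;
            target = _slowResolver(concreteType, interfaceMethod);
            _cache[key] = target;
        }
        return target();
    }
}

static class Program
{
    public static int CountSlowResolves(int calls)
    {
        var cache = new ToyDispatchCache((type, method) => () => $"{type.Name}.{method}");
        for (int i = 0; i < calls; i++)
        {
            cache.Invoke(typeof(string), "IComparable.CompareTo");
        }
        return cache.SlowResolves;
    }

    static void Main()
    {
        Console.WriteLine(CountSlowResolves(1000));
    }
}
```

However many calls are made for the same type and method, the slow resolver in this toy only runs the first time.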

So, to make DIM work, the runtime has to wire up any ‘default methods’, so that they integrate with the ‘virtual stub dispatch’ mechanism. We can see this in action by looking at the call stack from the hand-crafted assembly stub (ResolveWorkerAsmStub) all the way down to FindDefaultInterfaceImplementation(..) which finds the correct method, given an interface (pInterfaceMT) and the default method to call (pInterfaceMD):

- coreclr.dll!MethodTable::FindDefaultInterfaceImplementation(MethodDesc *pInterfaceMD, MethodTable *pInterfaceMT, MethodDesc **ppDefaultMethod, int allowVariance, int throwOnConflict) Line 6985	C++
- coreclr.dll!MethodTable::FindDispatchImpl(unsigned int typeID, unsigned int slotNumber, DispatchSlot *pImplSlot, int throwOnConflict) Line 6851	C++
- coreclr.dll!MethodTable::FindDispatchSlot(unsigned int typeID, unsigned int slotNumber, int throwOnConflict) Line 7251	C++
- coreclr.dll!VirtualCallStubManager::Resolver(MethodTable *pMT, DispatchToken token, OBJECTREF *protectedObj, unsigned __int64 *ppTarget, int throwOnConflict) Line 2208	C++
- coreclr.dll!VirtualCallStubManager::ResolveWorker(StubCallSite *pCallSite, OBJECTREF *protectedObj, DispatchToken token, VirtualCallStubManager::StubKind stubKind) Line 1874	C++
- coreclr.dll!VSD_ResolveWorker(TransitionBlock *pTransitionBlock, unsigned __int64 siteAddrForRegisterIndirect, unsigned __int64 token, unsigned __int64 flags) Line 1683	C++
- coreclr.dll!ResolveWorkerAsmStub() Line 42	Unknown

If you want to explore the call-stack in more detail, you can follow the links below:

ResolveWorkerAsmStub here
- This is the ‘Generic Resolver’ phase of ‘Virtual Stub Dispatch’.
VSD_ResolveWorker(..) here
VirtualCallStubManager::ResolveWorker(..) here
VirtualCallStubManager::Resolver(..) here
MethodTable::FindDispatchSlot(..) here
MethodTable::FindDispatchImpl(..) here or here
Finally ending up in MethodTable::FindDefaultInterfaceImplementation(..) here

Analysis of FindDefaultInterfaceImplementation(..)

So the code in FindDefaultInterfaceImplementation(..) is at the heart of the feature, but what does it need to do and how does it do it? This list from Finalize override lookup algorithm #12753 gives us some idea of the complexity:

properly detect diamond shape positive case (where I4 overrides both I2/I3 which both overrides I1) by keep tracking of a current list of best candidates. I went for the simplest algorithm and didn’t build any complex graph / DFS since the majority case the list of interfaces would be small, and interface dispatch cache would ensure majority of cases we don’t need to redo the (slow) dispatch. If needed we can revisit this to make it a proper topological sort.

VerifyVirtualMethodsImplemented now properly validates default interface scenarios - it is happy if there is at least one implementation and early returns. It doesn’t worry about conflicting overrides, for performance reasons.

NotSupportedException thrown in conflicting override scenario now has a proper error message

properly supports GVM when detecting method impl overrides

Revisited code that adds method impl for interfaces. added proper methodimpl validation and ensure methodimpl are virtual and final (and throw exception if it is not final)

Added test scenario with method that has multiple method impl. found and fixed a bug where the slot array is not big enough when building method impls for interfaces.

In addition, the ‘two-pass’ algorithm was implemented in Implement two pass algorithm for variant interface dispatch #21355, which contains an interesting discussion of the edge-cases that need to be handled.

So onto the code, this is the high-level view of the algorithm:

Which actually starts in MethodTable::FindDispatchImpl(..) here, where FindDefaultInterfaceImplementation can be called twice:
1. First time to try and find an ‘exact match’ (allowVariance=false)
2. Then if that fails, it’s called again to try and find a ‘variant match’ (allowVariance=true)
The entire FindDefaultInterfaceImplementation method is here, it’s fairly straight-forward and relatively easy to understand, plus there’s only ~270 LOC and they’re all very well commented. The high-level algorithm is the following:
1. Walk interface from derived class to parent class here, this is a straight-forward implementation that may be revisited if it doesn’t scale well
2. Then scan through each class looking for a match:
  1. an ‘exact match’
  2. a ‘generic variance match’, i.e. the interfaces match via ‘casting’, but ultimately have the same TypeDef
  3. a ‘more specific interface’ that matches, this match is made more complicated by the fact that ‘generic instantiations’ are involved
  4. a ‘more specific interface’ matches, but without generics involved, so much simpler to calculate
3. If the previous step produced a match, double-check that it is the most specific interface match seen so far, by keeping a ‘candidates list’ and classifying each scenario as:
  1. a ‘tie’ which is ignored, i.e. a ‘variant match’ on the same type
  2. a ‘more specific’ match, which is used to update the ‘candidates list’
  3. a ‘less-specific’ match, so no need to carry on with this candidate
4. Finally, a scan is done to see if there are any conflicts here, which is acceptable when allowVariance=true, but otherwise throws an exception
5. That’s it, the ‘best-candidate’ is then returned to the caller (assuming there is one)
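
The ‘candidates list’ part of the algorithm can be sketched in a few lines. What follows is a heavily simplified model, not the runtime code: interfaces are just names, the parents dictionary says which interfaces each one extends, and the implementers list is assumed to contain only interfaces that provide an implementation of the method. Generics and variance are left out entirely, and all the names are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ToyDefaultLookup
{
    // True if 'derived' extends 'baseName', directly or transitively.
    public static bool Extends(Dictionary<string, string[]> parents, string derived, string baseName)
    {
        foreach (string parent in parents[derived])
        {
            if (parent == baseName || Extends(parents, parent, baseName))
                return true;
        }
        return false;
    }

    public static string FindMostSpecific(Dictionary<string, string[]> parents, IEnumerable<string> implementers)
    {
        var candidates = new List<string>();
        foreach (string current in implementers)
        {
            // A tie or a 'less specific' match: an existing candidate already covers it.
            if (candidates.Any(c => c == current || Extends(parents, c, current)))
                continue;

            // A 'more specific' match: it replaces every candidate it extends.
            candidates.RemoveAll(c => Extends(parents, current, c));
            candidates.Add(current);
        }

        // More than one unrelated candidate left means a conflict.
        if (candidates.Count > 1)
            throw new NotSupportedException("Conflicting overrides: " + string.Join(", ", candidates));

        return candidates.SingleOrDefault();
    }
}

static class Program
{
    static void Main()
    {
        // Diamond shape: I4 overrides both I2 and I3, which both override I1.
        var parents = new Dictionary<string, string[]>
        {
            ["I1"] = new string[0],
            ["I2"] = new[] { "I1" },
            ["I3"] = new[] { "I1" },
            ["I4"] = new[] { "I2", "I3" },
        };

        Console.WriteLine(ToyDefaultLookup.FindMostSpecific(parents, new[] { "I1", "I2", "I3", "I4" }));
    }
}
```

In this model the diamond resolves to I4 because it extends every other candidate. If I4 is left out, I2 and I3 are both left in the list and the lookup reports a conflict instead.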

Diamond Inheritance Problem

Finally, the ‘diamond inheritance problem’ was mentioned in a few of the PRs/Issues related to the feature, but what is it?

A good place to start is one of the test cases, diamondshape.cs. However there’s a more concise example in the C#8 Language Proposal:

interface IA
{
    void M();
}
interface IB : IA
{
    override void M() { WriteLine("IB"); }
}
class Base : IA
{
    void IA.M() { WriteLine("Base"); }
}
class Derived : Base, IB // allowed?
{
    static void Main()
    {
        IA a = new Derived();
        a.M();           // what does it do?
    }
}

So the issue is which of the matching interface methods should be used, in this case IB.M() or Base.IA.M()? The resolution, as outlined in the C#8 language proposal was to use the most specific override:

Closed Issue: Confirm the draft spec, above, for most specific override as it applies to mixed classes and interfaces (a class takes priority over an interface). See https://github.com/dotnet/csharplang/blob/master/meetings/2017/LDM-2017-04-19.md#diamonds-with-classes.

Which ties in with the ‘more-specific’ and ‘less-specific’ steps we saw in the outline of FindDefaultInterfaceImplementation above.
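
For anyone who wants to try this out, here’s a sketch of the proposal’s example rewritten in the syntax that I believe shipped with C# 8 (an explicit ‘IA.M’ implementation in IB rather than ‘override’). The methods return strings instead of calling WriteLine so the result is easy to check. Going by the ‘a class takes priority over an interface’ rule quoted above, the implementation in Base should be the one selected.

```csharp
using System;

interface IA
{
    string M();
}

interface IB : IA
{
    // The interface's (default) implementation of IA.M
    string IA.M() => "IB";
}

class Base : IA
{
    // The class's explicit implementation of IA.M
    string IA.M() => "Base";
}

class Derived : Base, IB
{
}

static class Program
{
    public static string Resolve()
    {
        IA a = new Derived();
        return a.M();
    }

    static void Main()
    {
        Console.WriteLine(Resolve());
    }
}
```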

Summary

So there you have it, an entire feature delivered end-to-end, yay for .NET (Core) being open source! Thanks to the runtime engineers for making their Issues and PRs easy to follow and for adding such great comments to their code! Also kudos to the language designers for making their proposals and meeting notes available for all to see (e.g. LDM-2017-04-19).

Whether you think they are useful or not, it’s hard to argue that ‘Default Interface Methods’ aren’t well designed and well implemented.

But what makes it an even more unique feature is that it required the compiler and runtime teams to work together to make it possible!

Research based on the .NET Runtime

2019-10-25T00:00:00+00:00

Over the last few years, I’ve come across more and more research papers based, in some way, on the ‘Common Language Runtime’ (CLR).

So armed with Google Scholar and ably assisted by Semantic Scholar, I put together the list below.

Note: I put the papers into the following categories to make them easier to navigate (papers in each category are sorted by date, newest -> oldest):

Using the .NET Runtime as a case-study
- to prove its correctness, study how it works or analyse its behaviour
Research carried out by Microsoft Research, the research subsidiary of Microsoft.
- “It was formed in 1991, with the intent to advance state-of-the-art computing and solve difficult world problems through technological innovation in collaboration with academic, government, and industry researchers” (according to Wikipedia)
Papers based on the Mono Runtime
- a ‘Cross-Platform, open-source .NET framework’
Using ‘Rotor’, real name ‘Shared Source CLI (SSCLI)’
- from Wikipedia “Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use”

Any papers I’ve missed? If so, please let me know in the comments or on Twitter

.NET Runtime as a Case-Study
Microsoft Research
Mono Runtime
Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’

.NET Runtime as a Case-Study

Pitfalls of C# Generics and Their Solution Using Concepts (Belyakova & Mikhalkovich, 2015)

Abstract

In comparison with Haskell type classes and C++ concepts, such object-oriented languages as C# and Java provide much limited mechanisms of generic programming based on F-bounded polymorphism. Main pitfalls of C# generics are considered in this paper. Extending C# language with concepts which can be simultaneously used with interfaces is proposed to solve the problems of generics; a design and translation of concepts are outlined.

Efficient Compilation of .NET Programs for Embedded Systems (Sallenave & Ducournau, 2011)

Abstract

Compiling under the closed-world assumption (CWA) has been shown to be an appropriate way for implementing object-oriented languages such as Java on low-end embedded systems. In this paper, we explore the implications of using whole program optimizations such as Rapid Type Analysis (RTA) and coloring on programs targeting the .NET infrastructure. We extended RTA so that it takes into account .NET specific features such as (i) array covariance, a language feature also supported in Java, (ii) generics, whose specifications in .Net impacts type analysis and (iii) delegates, which encapsulate methods within objects. We also use an intraprocedural control flow analysis in addition to RTA. We evaluated the optimizations that we implemented on programs written in C#. Preliminary results show a noticeable reduction of the code size, class hierarchy and polymorphism of the programs we optimize. Array covariance is safe in almost all cases, and some delegate calls can be implemented as direct calls.

Type safety of C# and .Net CLR (Fruja, 2007)

Abstract

Type safety plays a crucial role in the security enforcement of any typed programming language. This thesis presents a formal proof of C#’s type safety. For this purpose, we develop an abstract framework for C#, comprising formal specifications of the language’s grammar, of the statically correct programs, and of the static and operational semantics. Using this framework, we prove that C# is type-safe, by showing that the execution of statically correct C# programs does not lead to type errors.

Modeling the .NET CLR Exception Handling Mechanism for a Mathematical Analysis (Fruja & Börger, 2006)

Abstract

This work is part of a larger project which aims at establishing some important properties of C# and CLR by mathematical proofs. Examples are the correctness of the bytecode verifier of CLR, the type safety (along the lines of the first author’s correctness proof for the definite assignment rules) of C#, the correctness of a general compilation scheme.

Analysis of the .NET CLR Exception Handling Mechanism (Fruja & Börger, 2005)

Abstract

We provide a complete mathematical model for the exception handling mechanism of the Common Language Runtime (CLR), the virtual machine underlying the interpretation of .NET programs. The goal is to use this rigorous model in the corresponding part of the still-to-be-developed soundness proof for the CLR bytecode verifier.

A Modular Design for the Common Language Runtime (CLR) Architecture (Fruja, 2005)

Abstract

This paper provides a modular high-level design of the Common Language Runtime (CLR) architecture. Our design is given in terms of Abstract State Machines (ASMs) and takes the form of an interpreter. We describe the CLR as a hierarchy of eight submachines, which correspond to eight submodules into which the Common Intermediate Language (CIL) instruction set can be decomposed.

Cross-language Program Slicing in the .NET Framework (Pócza, Biczó & Porkoláb, 2005)

Abstract

Dynamic program slicing methods are very attractive for debugging because many statements can be ignored in the process of localizing a bug. Although language interoperability is a key concept in modern development platforms, current slicing techniques are still restricted to a single language. In this paper a cross-language dynamic program slicing technique is introduced for the .NET environment. The method is utilizing the CLR Debugging Services API, hence it can be applied to large multi-language applications.

Design and Implementation of a high-level multi-language .NET Debugger (Strein, 2005)

Abstract

The Microsoft .NET Common Language Runtime (CLR) provides a low-level debugging application programmers interface (API), which can be used to implement traditional source code debuggers but can also be useful to implement other dynamic program introspection tools. This paper describes our experience in using this API for the implementation of a high-level debugger. The API is difficult to use from a technical point of view because it is implemented as a set of Component Object Model (COM) interfaces instead of a managed .NET API. Nevertheless, it is possible to implement a debugger in managed C# code using COM-interop. We describe our experience in taking this approach. We define a high-level debugging API and implement it in the C# language using COM-interop to access the low-level debugging API. Furthermore, we describe the integration of this high-level API in the multi-language development environment X-develop to enable source code debugging of .NET languages. This paper can be useful for anybody who wants to take the same approach to implement debuggers or other tools for dynamic program introspection.

A High-Level Modular Definition of the Semantics of C# (Börger, Fruja, Gervasi & Stärk, 2004)

Abstract

We propose a structured mathematical definition of the semantics of C# programs to provide a platform-independent interpreter view of the language for the C# programmer, which can also be used for a precise analysis of the ECMA standard of the language and as a reference model for teaching. The definition takes care to reflect directly and faithfully—as much as possible without becoming inconsistent or incomplete—the descriptions in the C# standard to become comparable with the corresponding models for Java in Stärk et al. (Java and Java Virtual Machine—Definition, Verification, Validation, Springer, Berlin, 2001) and to provide for implementors the possibility to check their basic design decisions against an accurate high-level model. The model sheds light on some of the dark corners of C# and on some critical differences between the ECMA standard and the implementations of the language.

An ASM Specification of C# Threads and the .NET Memory Model (Stärk and Börger, 2004)

Abstract

We present a high-level ASM model of C# threads and the .NET memory model. We focus on purely managed, fully portable threading features of C#. The sequential model interleaves the computation steps of the currently running threads and is suitable for uniprocessors. The parallel model addresses problems of true concurrency on multiprocessor systems. The models provide a sound basis for the development of multi-threaded applications in C#. The thread and memory models complete the abstract operational semantics of C# in.

Common Language Runtime : a new virtual machine (Ferreira, 2004)

Abstract

Virtual Machines provide a runtime execution platform combining bytecode portability with a performance close to native code. An overview of current approaches precedes an insight into Microsoft CLR (Common Language Runtime), comparing it to Sun JVM (Java Virtual Machine) and to a native execution environment (IA 32). A reference is also made to CLR in a Unix platform and to techniques on how CLR improves code execution.

JVM versus CLR: a comparative study (Singer, 2003)

Abstract

We present empirical evidence to demonstrate that there is little or no difference between the Java Virtual Machine and the .NET Common Language Runtime, as regards the compilation and execution of object-oriented programs. Then we give details of a case study that proves the superiority of the Common Language Runtime as a target for imperative programming language compilers (in particular GCC).

Runtime Code Generation with JVM And CLR (Sestoft, 2002)

Abstract

Modern bytecode execution environments with optimizing just-in-time compilers, such as Sun’s Hotspot Java Virtual Machine, IBM’s Java Virtual Machine, and Microsoft’s Common Language Runtime, provide an infrastructure for generating fast code at runtime. Such runtime code generation can be used for efficient implementation of parametrized algorithms. More generally, with runtime code generation one can introduce an additional binding-time without performance loss. This permits improved performance and improved static correctness guarantees.

Microsoft Research

Project Snowflake: Non-blocking safe manual memory management in .NET (Parkinson, Vaswani, Costa, Deligiannis, Blankstein, McDermott, Balkind & Vytiniotis, 2017)

Abstract

Garbage collection greatly improves programmer productivity and ensures memory safety. Manual memory management on the other hand often delivers better performance but is typically unsafe and can lead to system crashes or security vulnerabilities. We propose integrating safe manual memory management with garbage collection in the .NET runtime to get the best of both worlds. In our design, programmers can choose between allocating objects in the garbage collected heap or the manual heap. All existing applications run unmodified, and without any performance degradation, using the garbage collected heap. Our programming model for manual memory management is flexible: although objects in the manual heap can have a single owning pointer, we allow deallocation at any program point and concurrent sharing of these objects amongst all the threads in the program. Experimental results from our .NET CoreCLR implementation on real-world applications show substantial performance gains especially in multithreaded scenarios: up to 3x savings in peak working sets and 2x improvements in runtime.

Simple, Fast and Safe Manual Memory Management (Kedia, Costa, Vytiniotis, Parkinson, Vaswani & Blankstein, 2017)

Abstract

Safe programming languages are readily available, but many applications continue to be written in unsafe languages, because the latter are more efficient. As a consequence, many applications continue to have exploitable memory safety bugs. Since garbage collection is a major source of inefficiency in the implementation of safe languages, replacing it with safe manual memory management would be an important step towards solving this problem.

Previous approaches to safe manual memory management use programming models based on regions, unique pointers, borrowing of references, and ownership types. We propose a much simpler programming model that does not require any of these concepts. Starting from the design of an imperative type safe language (like Java or C#), we just add a delete operator to free memory explicitly and an exception which is thrown if the program dereferences a pointer to freed memory. We propose an efficient implementation of this programming model that guarantees type safety. Experimental results from our implementation based on the C# native compiler show that this design achieves up to 3x reduction in peak working set and run time.

Uniqueness and Reference Immutability for Safe Parallelism (Gordon, Parkinson, Parsons, Bromfield & Duffy, 2012)

Abstract

A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system’s flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.

A study of concurrent real-time garbage collectors (Pizlo, Petrank & Steensgaard, 2008)

Abstract

Concurrent garbage collection is highly attractive for real-time systems, because offloading the collection effort from the executing threads allows faster response, allowing for extremely short deadlines at the microseconds level. Concurrent collectors also offer much better scalability over incremental collectors. The main problem with concurrent real-time collectors is their complexity. The first concurrent real-time garbage collector that can support fine synchronization, STOPLESS, has recently been presented by Pizlo et al. In this paper, we propose two additional (and different) algorithms for concurrent real-time garbage collection: CLOVER and CHICKEN. Both collectors obtain reduced complexity over the first collector STOPLESS, but need to trade a benefit for it. We study the algorithmic strengths and weaknesses of CLOVER and CHICKEN and compare them to STOPLESS. Finally, we have implemented all three collectors on the Bartok compiler and runtime for C# and we present measurements to compare their efficiency and responsiveness.

Optimizing concurrency levels in the .NET ThreadPool: A case study of controller design and implementation (Hellerstein, Morrison & Eilebrecht, 2008)

Abstract

This paper presents a case study of developing a hill climbing concurrency controller (HC³) for the .NET ThreadPool. The intent of the case study is to provide insight into software considerations for controller design, testing, and implementation. The case study is structured as a series of issues encountered and approaches taken to their resolution. Examples of issues and approaches include: (a) addressing the need to combine a hill climbing control law with rule-based techniques by the use of hybrid control; (b) increasing the efficiency and reducing the variability of the test environment by using resource emulation; and (c) effectively assessing design choices by using test scenarios for which the optimal concurrency level can be computed analytically and hence desired test results are known a priori. We believe that these issues and approaches have broad application to controllers for resource management of software systems.

Stopless: a real-time garbage collector for multiprocessors (Pizlo, Frampton, Petrank & Steensgaard, 2007)

Abstract

We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that supports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors either restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting modern parallel platforms. STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C# and measurements demonstrate high responsiveness (a factor of a 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.

Securing the .NET Programming Model (Kennedy, 2006)

Abstract

The security of the .NET programming model is studied from the standpoint of fully abstract compilation of C#. A number of failures of full abstraction are identified, and fixes described. The most serious problems have recently been fixed for version 2.0 of the .NET Common Language Runtime.

Abstract

We describe problems that have arisen when combining the proposed design for generics for the Microsoft .NET Common Language Runtime (CLR) with two resource-related features supported by the Microsoft CLR implementation: application domains and pre-compilation. Application domains are “software based processes” and the interaction between application domains and generics stems from the fact that code and descriptors are generated on a per-generic-instantiation basis, and thus instantiations consume resources which are preferably both shareable and recoverable. Pre-compilation runs at install-time to reduce startup overheads. This interacts with application domain unloading: compilation units may contain shareable generated instantiations. The paper describes these interactions and the different approaches that can be used to avoid or ameliorate the problems.

Formalization of Generics for the .NET Common Language Runtime (Yu, Kennedy & Syme, 2004)

Abstract

We present a formalization of the implementation of generics in the .NET Common Language Runtime (CLR), focusing on two novel aspects of the implementation: mixed specialization and sharing, and efficient support for run-time types. Some crucial constructs used in the implementation are dictionaries and run-time type representations. We formalize these aspects type-theoretically in a way that corresponds in spirit to the implementation techniques used in practice. Both the techniques and the formalization also help us understand the range of possible implementation techniques for other languages, e.g., ML, especially when additional source language constructs such as run-time types are supported. A useful by-product of this study is a type system for a subset of the polymorphic IL proposed for the .NET CLR.

Runtime Verification of .NET Contracts (Barnett & Schulte, 2003)

Abstract

We propose a method for implementing behavioral interface specifications on the .NET platform. Our interface specifications are expressed as executable model programs. Model programs can be run either as stand-alone simulations or used as contracts to check the conformance of an implementation class to its specification. We focus on the latter, which we call runtime verification. In our framework, model programs are expressed in the new specification language AsmL. We describe how AsmL can be used to describe contracts independently from any implementation language, how AsmL allows properties of component interaction to be specified using mandatory calls, and how AsmL is used to check the behavior of a component written in any of the .NET languages, such as VB, C#, or C++.

Design and Implementation of Generics for the .NET Common Language Runtime (Kennedy & Syme, 2001)

Abstract

The Microsoft .NET Common Language Runtime provides a shared type system, intermediate language and dynamic execution environment for the implementation and inter-operation of multiple source languages. In this paper we extend it with direct support for parametric polymorphism (also known as generics), describing the design through examples written in an extended version of the C# programming language, and explaining aspects of implementation by reference to a prototype extension to the runtime. Our design is very expressive, supporting parameterized types, polymorphic static, instance and virtual methods, “F-bounded” type parameters, instantiation at pointer and value types, polymorphic recursion, and exact run-time types. The implementation takes advantage of the dynamic nature of the runtime, performing just-in-time type specialization, representation-based code sharing and novel techniques for efficient creation and use of run-time types. Early performance results are encouraging and suggest that programmers will not need to pay an overhead for using generics, achieving performance almost matching hand-specialized code.

Typing a Multi-Language Intermediate Code (Gordon & Syme, 2001)

Abstract

The Microsoft .NET Framework is a new computing architecture designed to support a variety of distributed applications and web-based services. .NET software components are typically distributed in an object-oriented intermediate language, Microsoft IL, executed by the Microsoft Common Language Runtime. To allow convenient multi-language working, IL supports a wide variety of high-level language constructs, including class-based objects, inheritance, garbage collection, and a security mechanism based on type safe execution. This paper precisely describes the type system for a substantial fragment of IL that includes several novel features: certain objects may be allocated either on the heap or on the stack; those on the stack may be boxed onto the heap, and those on the heap may be unboxed onto the stack; methods may receive arguments and return results via typed pointers, which can reference both the stack and the heap, including the interiors of objects on the heap. We present a formal semantics for the fragment. Our typing rules determine well-typed IL instruction sequences that can be assembled and executed. Of particular interest are rules to ensure no pointer into the stack outlives its target. Our main theorem asserts type safety, that well-typed programs in our IL fragment do not lead to untrapped execution errors. Our main theorem does not directly apply to the product. Still, the formal system of this paper is an abstraction of informal and executable specifications we wrote for the full product during its development. Our informal specification became the basis of the product team’s working specification of type-checking. The process of writing this specification, deploying the executable specification as a test oracle, and applying theorem proving techniques, helped us identify several security critical bugs during development.

Mono Runtime

Static and Dynamic Analysis of Android Malware and Goodware Written with Unity Framework (Shim, Lim, Cho, Han & Park, 2018)

Abstract

Unity is the most popular cross-platform development framework to develop games for multiple platforms such as Android, iOS, and Windows Mobile. While Unity developers can easily develop mobile apps for multiple platforms, adversaries can also easily build malicious apps based on the “write once, run anywhere” (WORA) feature. Even though malicious apps were discovered among Android apps written with Unity framework (Unity apps), little research has been done on analysing the malicious apps. We propose static and dynamic reverse engineering techniques for malicious Unity apps. We first inspect the executable file format of a Unity app and present an effective static analysis technique of the Unity app. Then, we also propose a systematic technique to analyse dynamically the Unity app. Using the proposed techniques, the malware analyst can statically and dynamically analyse Java code, native code in C or C++, and the Mono runtime layer where the C# code is running.

Reducing startup time of a deterministic virtualizing runtime environment (Däumler & Werner, 2013)

Abstract

Virtualized runtime environments like Java Virtual Machine (JVM) or Microsoft .NET’s Common Language Runtime (CLR) introduce additional challenges to real-time software development. Since applications for such environments are usually deployed in platform independent intermediate code, one issue is the timing of code transformation from intermediate code into native code. We have developed a solution for this problem, so that code transformation is suitable for real-time systems. It combines pre-compilation of intermediate code with the elimination of indirect references in native code. The gain of determinism comes with an increased application startup time. In this paper we present an optimization that utilizes an Ahead-of-Time compiler to reduce the startup time while keeping the real-time suitable timing behaviour. In an experiment we compare our approach with existing ones and demonstrate its benefits for certain application cases.

Detecting Clones Across Microsoft .NET Programming Languages (Al-Omari, Keivanloo, Roy & Rilling, 2012)

Abstract

The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such a multi-language development, the identification and traceability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.

Language-independent sandboxing of just-in-time compilation and self-modifying code (Ansel & Marchenko, 2012)

Abstract

When dealing with dynamic, untrusted content, such as on the Web, software behavior must be sandboxed, typically through use of a language like JavaScript. However, even for such specially-designed languages, it is difficult to ensure the safety of highly-optimized, dynamic language runtimes which, for efficiency, rely on advanced techniques such as Just-In-Time (JIT) compilation, large libraries of native-code support routines, and intricate mechanisms for multi-threading and garbage collection. Each new runtime provides a new potential attack surface and this security risk raises a barrier to the adoption of new languages for creating untrusted content. Removing this limitation, this paper introduces general mechanisms for safely and efficiently sandboxing software, such as dynamic language runtimes, that make use of advanced, low-level techniques like runtime code modification. Our language-independent sandboxing builds on Software-based Fault Isolation (SFI), a traditionally static technique. We provide a more flexible form of SFI by adding new constraints and mechanisms that allow safety to be guaranteed despite runtime code modifications. We have added our extensions to both the x86-32 and x86-64 variants of a production-quality, SFI-based sandboxing platform; on those two architectures SFI mechanisms face different challenges. We have also ported two representative language platforms to our extended sandbox: the Mono common language runtime and the V8 JavaScript engine. In detailed evaluations, we find that sandboxing slowdown varies between different benchmarks, languages, and hardware platforms. Overheads are generally moderate and they are close to zero for some important benchmark/platform combinations.

VMKit: a Substrate for Managed Runtime Environments (Geoffray, Thomas, Lawall, Muller & Folliot, 2010)

Abstract

Managed Runtime Environments (MREs), such as the JVM and the CLI, form an attractive environment for program execution, by providing portability and safety, via the use of a bytecode language and automatic memory management, as well as good performance, via just-in-time (JIT) compilation. Nevertheless, developing a fully featured MRE, including e.g. a garbage collector and JIT compiler, is a herculean task. As a result, new languages cannot easily take advantage of the benefits of MREs, and it is difficult to experiment with extensions of existing MRE based languages. This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs. We have successfully used VMKit to build two MREs: a Java Virtual Machine and a Common Language Runtime. We provide an extensive study of the lessons learned in developing this infrastructure, and assess the ease of implementing new MREs or MRE extensions and the resulting performance. In particular, it took one of the authors only one month to develop a Common Language Runtime using VMKit. VMKit furthermore has performance comparable to the well established open source MREs Cacao, Apache Harmony and Mono, and is 1.2 to 3 times slower than JikesRVM on most of the Dacapo benchmarks.

MMC: the Mono Model Checker (Ruys & Aan de Brugh, 2007)

Abstract

The Mono Model Checker (mmc) is a software model checker for cil bytecode programs. mmc has been developed on the Mono platform. mmc is able to detect deadlocks and assertion violations in cil programs. The design of mmc is inspired by the Java PathFinder (jpf), a model checker for Java programs. The performance of mmc is comparable to jpf. This paper introduces mmc and presents its main architectural characteristics.

Numeric performance in C, C# and Java (Sestoft, 2007)

Abstract

We compare the numeric performance of C, C# and Java on three small cases.

Mono versus .Net: A Comparative Study of Performance for Distributed Processing (Blajian, Eggen, Eggen & Pitts, 2006)

Abstract

Microsoft has released .NET, a platform dependent standard for the C# programming language. Sponsored by Ximian/Novell, Mono, the open source development platform based on the .NET framework, has been developed to be a platform independent version of the C# programming environment. While .NET is platform dependent, Mono allows developers to build Linux and cross-platform applications. Mono’s .NET implementation is based on the ECMA standards for C#. This paper examines both of these programming environments with the goal of evaluating the performance characteristics of each. Testing is done with various algorithms. We also assess the trade-offs associated with using a cross-platform versus a platform.

Automated detection of performance regressions: the mono experience (Kalibera, Bulej & Tuma, 2005)

Abstract

Engineering a large software project involves tracking the impact of development and maintenance changes on the software performance. An approach for tracking the impact is regression benchmarking, which involves automated benchmarking and evaluation of performance at regular intervals. Regression benchmarking must tackle the nondeterminism inherent to contemporary computer systems and execution environments and the impact of the nondeterminism on the results. On the example of a fully automated regression benchmarking environment for the mono open-source project, we show how the problems associated with nondeterminism can be tackled using statistical methods.

Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’

Efficient virtual machine support of runtime structural reflection (Ortin, Redondo & Perez-Schofield, 2009)

Abstract

Increasing trends towards adaptive, distributed, generative and pervasive software have made object-oriented dynamically typed languages become increasingly popular. These languages offer dynamic software evolution by means of reflection, facilitating the development of dynamic systems. Unfortunately, this dynamism commonly imposes a runtime performance penalty. In this paper, we describe how to extend a production JIT-compiler virtual machine to support runtime object-oriented structural reflection offered by many dynamic languages. Our approach improves runtime performance of dynamic languages running on statically typed virtual machines. At the same time, existing statically typed languages are still supported by the virtual machine.

We have extended the .Net platform with runtime structural reflection adding prototype-based object-oriented semantics to the statically typed class-based model of .Net, supporting both kinds of programming languages. The assessment of runtime performance and memory consumption has revealed that a direct support of structural reflection in a production JIT-based virtual machine designed for statically typed languages provides a significant performance improvement for dynamically typed languages.

Extending the SSCLI to Support Dynamic Inheritance (Redondo, Ortin & Perez-Schofield, 2008)

Abstract

This paper presents a step forward on a research trend focused on increasing runtime adaptability of commercial JIT-based virtual machines, describing how to include dynamic inheritance into this kind of platforms. A considerable amount of research aimed at improving runtime performance of virtual machines has converted them into the ideal support for developing different types of software products. Current virtual machines do not only provide benefits such as application interoperability, distribution and code portability, but they also offer a competitive runtime performance.

Since JIT compilation has played a very important role in improving runtime performance of virtual machines, we first extended a production JIT-based virtual machine to support efficient language-neutral structural reflective primitives of dynamically typed programming languages. This article presents the next step in our research work: supporting language-neutral dynamic inheritance for both statically and dynamically typed programming languages. Executing both kinds of programming languages over the same platform provides a direct interoperation between them.

Sampling profiler for Rotor as part of optimizing compilation system (Chilingarova & Safonov, 2006)

Abstract

This paper describes a low-overhead self-tuning sampling-based runtime profiler integrated into SSCLI virtual machine. Our profiler estimates how “hot” a method is and builds a call context graph based on managed stack samples analysis. The frequency of sampling is tuned dynamically at runtime, based on the information of how often the same activation record appears on top of the stack. The call graph is presented as a novel Call Context Map (CC-Map) structure that combines compact representation and accurate information about the context. It enables fast extraction of data helpful in making compilation decisions, as well as fast placing data into the map. Sampling mechanism is integrated with intrinsic Rotor mechanisms of thread preemption and stack walk. A separate system thread is responsible for organizing data in the CC-Map. This thread gathers and stores samples quickly queued by managed threads, thus decreasing the time they must hold up their user-scheduled job.

To JIT or not to JIT: The effect of code-pitching on the performance of .NET framework (Anthony, Leung & Srisa-an, 2005)

Abstract

The .NET Compact Framework is designed to be a high-performance virtual machine for mobile and embedded devices that operate on Windows CE (version 4.1 and later). It achieves fast execution time by compiling methods dynamically instead of using interpretation. Once compiled, these methods are stored in a portion of the heap called code-cache and can be reused quickly to satisfy future method calls. While code-cache provides a high-level of reusability, it can also use a large amount of memory. As a result, the Compact Framework provides a “code pitching” mechanism that can be used to discard the previously compiled methods as needed. In this paper, we study the effect of code pitching on the overall performance and memory utilization of .NET applications. We conduct our experiments using Microsoft’s Shared-Source Common Language Infrastructure (SSCLI). We profile the access behavior of the compiled methods. We also experiment with various code-cache configurations to perform pitching. We find that programs can operate efficiently with a small code-cache without incurring substantial recompilation and execution overheads.

Adding structural reflection to the SSCLI (Ortin, Redondo, Vinuesa & Lovelle, 2005)

Abstract

Although dynamic languages are becoming widely used due to the flexibility needs of specific software products, their major drawback is their runtime performance. Compiling the source program to an abstract machine’s intermediate language is the current technique used to obtain the best performance results. This intermediate code is then executed by a virtual machine developed as an interpreter. Although JIT adaptive optimizing compilation is currently used to speed up Java and .net intermediate code execution, this practice has not been employed successfully in the implementation of dynamically adaptive platforms yet. We present an approach to improve the runtime performance of a specific set of structural reflective primitives, extensively used in adaptive software development. Looking for a better performance, as well as interaction with other languages, we have employed the Microsoft Shared Source CLI platform, making use of its JIT compiler. The SSCLI computational model has been enhanced with semantics of the prototype-based object-oriented computational model. This model is much more suitable for reflective environments. The initial assessment of performance results reveals that augmenting the semantics of the SSCLI model, together with JIT generation of native code, produces better runtime performance than the existing implementations.

Static Analysis for Identifying and Allocating Clusters of Immortal Objects (Ravindar & Srikant, 2005)

Abstract

Long living objects lengthen the trace time which is a critical phase of the garbage collection process. However, it is possible to recognize object clusters i.e. groups of long living objects having approximately the same lifetime and treat them separately to reduce the load on the garbage collector and hence improve overall performance. Segregating objects this way leaves the heap for objects with shorter lifetimes and now a typical collection can find more garbage than before. In this paper, we describe a compile time analysis strategy to identify object clusters in programs. The result of the compile time analysis is the set of allocation sites that contribute towards allocating objects belonging to such clusters. All such allocation sites are replaced by a new allocation method that allocates objects into the cluster area rather than the heap. This study was carried out for a concurrent collector which we developed for Rotor, Microsoft’s Shared Source Implementation of .NET. We analyze the performance of the program with combinations of the cluster and stack allocation optimizations. Our results show that the clustering optimization reduces the number of collections by 66.5% on average, even eliminating the need for collection in some programs. As a result, the total pause time reduces by 62.8% on average. Using both stack allocation and the cluster optimizations brings down the number of collections by 91.5% thereby improving the total pause time by 79.33%.

An Optimizing Just-In-Time Compiler for Rotor (Trindade & Silva, 2005)

Abstract

The Shared Source CLI (SSCLI), also known as Rotor, is an implementation of the CLI released by Microsoft in source code. Rotor includes a single pass just-in-time compiler that generates non-optimized code for Intel IA-32 and IBM PowerPC processors. We extend Rotor with an optimizing just-in-time compiler for IA-32. This compiler has three passes: control flow graph generation, data dependence graph generation and final code generation. Dominance relations in the control flow graph are used to detect natural loops. A number of optimizations are performed during the generation of the data dependence graph. During native code generation, the rich address modes of IA32 are used for instruction folding, reducing code size and usage of register names. Despite the overhead of three passes and optimizations, this compiler is only 1.4 to 1.9 times slower than the original SSCLI compiler and generates code that runs 6.4 to 10 times faster.

Software Interactions into the SSCLI platform (Charfi & Emsellem, 2004)

Abstract

By using an Interaction Specification Language (ISL), interactions between components can be expressed in a language independent way. At class level, interaction pattern specified in ISL represent models of future interactions when applied on some component instances. The Interaction Server is in charge of managing the life cycle of interactions (interaction pattern registration and instantiation, destruction of interactions, merging). It acts as a central repository that keeps the global coherency of the adaptations realized on the component instances. The Interaction service allows creating interactions between heterogeneous components. Noah is an implementation of this Interaction Service. It can be thought as a dynamic aspect repository with a weaver that uses an aspect composition mechanism that insures commutable and associative adaptations. In this paper, we propose the implementation of the Interaction Service in the SSCLI. In contrast to other implementations such as Java where interaction management represents an additional layer, SSCLI enables us to integrate Interaction Management as an intrinsic part of the CLI runtime.

Experience Integrating a New Compiler and a New Garbage Collector Into Rotor (Anderson, Eng, Glew, Lewis, Menon & Stichnoth, 2004)

Abstract

Microsoft’s Rotor is a shared-source CLI implementation intended for use as a research platform. It is particularly attractive for research because of its complete implementation and extensive libraries, and because its modular design allows different implementations of certain components such as just-in-time compilers (JITs). Our group has independently developed our own high-performance JIT and garbage collector (GC) and wanted to take advantage of Rotor to experiment with these components in a CLI environment. In this paper, we describe our experience integrating these components into Rotor and evaluate the flexibility of Rotor’s design toward this goal. We found it easier to integrate our JIT than our GC because Rotor has a well-defined interface for the former but not the latter. However, our JIT integration still required significant changes to both Rotor and our JIT. For example, we modified Rotor to support multiple JITs. We also added support for a second JIT manager in Rotor, and implemented a new code manager compatible with our JIT. We had to change our JIT compiler to support Rotor’s calling conventions, helper functions, and exception model. Our GC integration was complicated by the many places in Rotor where components make assumptions about how its garbage collector is implemented, as well as Rotor’s lack of a well-defined GC interface. We also had to reconcile the different assumptions made by Rotor and our garbage collector about the layout of objects, virtual-method tables, and thread structures.

"Stubs" in the .NET Runtime

2019-09-26T00:00:00+00:00

As the saying goes:

“All problems in computer science can be solved by another level of indirection”

- David Wheeler

and it certainly seems like the ‘.NET Runtime’ Engineers took this advice to heart!

‘Stubs’, as they’re known in the runtime (sometimes ‘Thunks’), provide a level of indirection throughout the source code; there are almost 500 mentions of them!

This post will explore what they are, how they work and why they’re needed.

Table of Contents

What are stubs?
Types of stubs
Other Types of Stubs
Stubs in the Mono Runtime
Conclusion

What are stubs?

In the context of the .NET Runtime, ‘stubs’ look something like this:

   Call-site                                         Callee
+--------------+           +---------+           +-------------+
|              |           |         |           |             |
|              +---------->+  Stub   + - - - - ->+             |
|              |           |         |           |             |
+--------------+           +---------+           +-------------+

So they sit between a method ‘call-site’ (i.e. code such as var result = Foo(..);) and the ‘callee’ (where the method itself is implemented, the native/assembly code) and I like to think of them as doing tidy-up or fix-up work. Note that moving from the ‘stub’ to the ‘callee’ isn’t another full method call (hence the dotted line), it’s often just a single jmp or call assembly instruction, so the 2nd transition doesn’t involve all the same work that was initially done at the call-site (pushing/popping arguments into registers, increasing the stack space, etc).

The stubs themselves can be as simple as just a few assembly instructions or something more complicated, we’ll look at individual examples later on in this post.

Now, to be clear, not all method calls require a stub. If you’re doing a regular call to a static or instance method, that just goes directly from the ‘call-site’ to the ‘callee’. But once you involve virtual methods, delegates or generics things get a bit more complicated.
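To make the diagram above a bit more concrete, here’s a minimal sketch in C++ of the same shape: a call-site that only holds a function pointer, which may point either straight at the callee or at a stub that does some fix-up work first. All the names (`Callee`, `FixupStub`, `CallSite`) are hypothetical and the ‘fix-up’ is invented for illustration; real runtime stubs are snippets of assembly, not C++ functions.

```cpp
#include <cassert>

// Hypothetical callee: the "real" method body.
static int Callee(int value) { return value * 2; }

// Number of times the stub ran, so the indirection is observable.
static int g_stubCalls = 0;

// Hypothetical stub: does a small piece of fix-up work (here: counting the
// call and clamping a negative argument to zero), then forwards to the callee.
// A real runtime stub would typically end in a single jmp, not a nested call.
static int FixupStub(int value)
{
    ++g_stubCalls;
    if (value < 0) value = 0;
    return Callee(value);
}

// The call-site only knows about a function pointer; it cannot tell
// whether it points directly at the callee or at a stub.
using Target = int (*)(int);

int CallSite(Target target, int argument)
{
    assert(target != nullptr);
    return target(argument);
}
```

Calling `CallSite(&Callee, 21)` and `CallSite(&FixupStub, 21)` gives the same answer; the only difference is that the second route passes through the stub on the way.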

Why are stubs needed?

There are several reasons that stubs need to be created by the runtime:

Required Functionality
- For instance Delegates and Arrays must be provided by the runtime; their method bodies are not generated by the C#/F#/VB.NET compiler, nor do they exist in the Base-Class Libraries. This requirement is outlined in the ECMA-335 Spec, for instance ‘Partition I’ in section ‘8.9.1 Array types’ says:
  
  Exact array types are created automatically by the VES when they are required. Hence, the operations on an array type are defined by the CTS. These generally are: allocating the array based on size and lower-bound information, indexing the array to read and write a value, computing the address of an element of the array (a managed pointer), and querying for the rank, bounds, and the total number of values stored in the array.
  
  Likewise for delegates, which are covered in ‘I.8.9.3 Delegates’:
  
  While, for the most part, delegates appear to be simply another kind of user-defined class, they are tightly controlled. The implementations of the methods are provided by the VES, not user code. The only additional members that can be defined on delegate types are static or instance methods.
Performance
- Other types of ‘stubs’, such as Virtual Stub Dispatch and Generic Instantiation Stubs, are there to make those operations perform well or to have a positive impact on the entire runtime, such as reducing the memory footprint (in the case of ‘shared generic code’).
Consistent method calls
- A final factor is that having ‘stubs’ makes the work of the JIT compiler easier. As we will see in the rest of the post, stubs deal with a variety of different types of method calls. This means that the JIT can generate more straightforward code for any given ‘call site’, because it (mostly) doesn’t care what’s happening in the ‘callee’. If stubs didn’t exist, for a given method call the JIT would have to generate different code depending on whether generics were involved or not, if it was a virtual or non-virtual call, if it was going via a delegate, etc. Stubs abstract a lot of this behaviour away from the JIT, allowing it to deal with a simpler ‘Application Binary Interface’ (ABI).

CLR ‘Application Binary Interface’ (ABI)

Therefore, another way to think about ‘stubs’ is that they are part of what makes the CLR-specific ‘Application Binary Interface’ (ABI) work.

All code needs to work with the ABI or ‘calling convention’ of the CPU/OS that it’s running on, for instance by following the x86 calling convention, x64 calling convention or System V ABI. This applies across runtimes, for more on this see:

As an aside, if you want more information about ‘calling conventions’ here’s some links that I found useful:

However, on top of what the CLR has to support due to the CPU/OS conventions, it also has its own extended ABI for .NET-specific use cases, including:

“this” pointer:

The managed “this” pointer is treated like a new kind of argument not covered by the native ABI, so we chose to always pass it as the first argument in (AMD64) RCX or (ARM, ARM64) R0. AMD64-only: Up to .NET Framework 4.5, the managed “this” pointer was treated just like the native “this” pointer (meaning it was the second argument when the call used a return buffer and was passed in RDX instead of RCX). Starting with .NET Framework 4.5, it is always the first argument.
Generics or more specifically to handle ‘Shared generics’:

In cases where the code address does not uniquely identify a generic instantiation of a method, then a ‘generic instantiation parameter’ is required. Often the “this” pointer can serve dual-purpose as the instantiation parameter. When the “this” pointer is not the generic parameter, the generic parameter is passed as an additional argument.
Hidden Parameters, covering ‘Stub dispatch’, ‘Fast Pinvoke’, ‘Calli Pinvoke’ and ‘Normal PInvoke’. For instance, here’s why ‘PInvoke’ has a hidden parameter:

Normal PInvoke - The VM shares IL stubs based on signatures, but wants the right method to show up in call stack and exceptions, so the MethodDesc for the exact PInvoke is passed in the (x86) EAX / (AMD64) R10 / (ARM, ARM64) R12 (in the JIT: REG_SECRET_STUB_PARAM). Then in the IL stub, when the JIT gets CORJIT_FLG_PUBLISH_SECRET_PARAM, it must move the register into a compiler temp.

Not all of these scenarios need a stub, for instance the ‘this’ pointer is handled directly by the JIT, but many do as we’ll see in the rest of the post.
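As a rough illustration of the ‘generic instantiation parameter’ idea, here’s a sketch in C++ where one shared function body serves several instantiations, so it needs an extra hidden argument to know which one it is currently running as. The types and functions (`TypeHandle`, `SharedDescribe`, `DescribeListOfString`, ..) are hypothetical and only model the idea; they are not the runtime’s data structures.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the runtime's description of an exact type.
struct TypeHandle
{
    const char* name;
};

static const TypeHandle g_listOfString{"List<string>"};
static const TypeHandle g_listOfObject{"List<object>"};

// One body shared by several instantiations. The code address alone does not
// say which instantiation is running, so the exact type arrives as an extra
// (hidden) argument.
std::string SharedDescribe(const TypeHandle* instantiation, int count)
{
    assert(instantiation != nullptr);
    return std::string(instantiation->name) + " with " + std::to_string(count) + " items";
}

// Hypothetical wrappers: each one supplies the hidden argument for its
// instantiation and forwards to the shared body, so callers keep using the
// ordinary signature.
std::string DescribeListOfString(int count) { return SharedDescribe(&g_listOfString, count); }
std::string DescribeListOfObject(int count) { return SharedDescribe(&g_listOfObject, count); }
```

The caller of `DescribeListOfString(3)` never sees the extra argument, which is the point: the wrapper adds it on the way through.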

Stub Management

So we’ve seen why stubs are needed and what type of functionality they can provide. But before we look at all the specific examples that exist in the CoreCLR source, I just wanted to take some time to understand the common or shared concerns that apply to all stubs.

Stubs in the CLR are snippets of assembly code, but they have to be stored in memory and have their life-time managed. Also, they have to play nice with the debugger, from What Every CLR Developer Must Know Before Writing Code:

2.8 Is your code compatible with managed debugging?

..

If you add a new stub (or way to call managed code), make sure that you can source-level step-in (F11) it under the debugger. The debugger is not psychic. A source-level step-in needs to be able to go from the source-line before a call to the source-line after the call, or managed code developers will be very confused. If you make that call transition be a giant 500 line stub, you must cooperate with the debugger for it to know how to step-through it. (This is what StubManagers are all about. See src\vm\stubmgr.h). Try doing a step-in through your new codepath under the debugger.

So every type of stub has a StubManager which deals with the allocation, storage and lookup of the stubs. The lookup is significant, as it provides the mapping from an arbitrary memory address to the type of stub (if any) that created the code. As an example, here’s what the CheckIsStub_Internal(..) method here and DoTraceStub(..) method here look like for the DelegateInvokeStubManager:

BOOL DelegateInvokeStubManager::CheckIsStub_Internal(PCODE stubStartAddress)
{
    LIMITED_METHOD_DAC_CONTRACT;

    bool fIsStub = false;

#ifndef DACCESS_COMPILE
#ifndef _TARGET_X86_
    fIsStub = fIsStub || (stubStartAddress == GetEEFuncEntryPoint(SinglecastDelegateInvokeStub));
#endif
#endif // !DACCESS_COMPILE

    fIsStub = fIsStub || GetRangeList()->IsInRange(stubStartAddress);

    return fIsStub;
}

BOOL DelegateInvokeStubManager::DoTraceStub(PCODE stubStartAddress, TraceDestination *trace)
{
    LIMITED_METHOD_CONTRACT;

    LOG((LF_CORDB, LL_EVERYTHING, "DelegateInvokeStubManager::DoTraceStub called\n"));

    _ASSERTE(CheckIsStub_Internal(stubStartAddress));

    // If it's a MC delegate, then we want to set a BP & do a context-ful
    // manager push, so that we can figure out if this call will be to a
    // single multicast delegate or a multi multicast delegate
    trace->InitForManagerPush(stubStartAddress, this);

    LOG_TRACE_DESTINATION(trace, stubStartAddress, "DelegateInvokeStubManager::DoTraceStub");

    return TRUE;
}
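The lookup side of this (mapping an arbitrary address back to the stub manager that owns it) can be sketched in a few lines of C++. This is a simplified model under my own assumptions: `SimpleStubManager` and `FindStubManager` are hypothetical names, and the real StubManager classes do far more than a linear range check.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Address = std::uintptr_t;

// Hypothetical, simplified stub manager: it owns a list of [start, end)
// address ranges and can answer "is this address one of my stubs?".
class SimpleStubManager
{
public:
    explicit SimpleStubManager(std::string name) : m_name(std::move(name)) {}

    void AddRange(Address start, Address end)
    {
        assert(start < end);
        m_ranges.push_back({start, end});
    }

    bool CheckIsStub(Address address) const
    {
        for (const auto& range : m_ranges)
        {
            if (address >= range.first && address < range.second)
                return true;
        }
        return false;
    }

    const std::string& Name() const { return m_name; }

private:
    std::string m_name;
    std::vector<std::pair<Address, Address>> m_ranges;
};

// Walk every registered manager and return the first one that claims the
// address, or nullptr when the address is not a stub at all.
const SimpleStubManager* FindStubManager(const std::vector<SimpleStubManager>& managers,
                                         Address address)
{
    for (const auto& manager : managers)
    {
        if (manager.CheckIsStub(address))
            return &manager;
    }
    return nullptr;
}
```

With a list like this, a debugger-style query ‘what is at address X?’ becomes a call to `FindStubManager(..)`, which either names the owning manager or reports that the address isn’t a stub.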

The code to initialise the various stub managers is here in SystemDomain::Attach() and by working through the list we can get a sense of what each category of stub does (plus the informative comments in the code help!)

PrecodeStubManager implemented here
- ‘Stub manager functions & globals’
DelegateInvokeStubManager implemented here
- ‘Since we don’t generate delegate invoke stubs at runtime on IA64, we can’t use the StubLinkStubManager for these stubs. Instead, we create an additional DelegateInvokeStubManager instead.’
JumpStubStubManager implemented here
- ‘Stub manager for jump stubs created by ExecutionManager::jumpStub() These are currently used only on the 64-bit targets IA64 and AMD64’
RangeSectionStubManager implemented here
- ‘Stub manager for code sections. It forwards the query to the more appropriate stub manager, or handles the query itself.’
ILStubManager implemented here
- ‘This is the stub manager for IL stubs’
InteropDispatchStubManager implemented here
- ‘This is used to recognize GenericComPlusCallStub, VarargPInvokeStub, and GenericPInvokeCalliHelper.’
StubLinkStubManager implemented here
ThunkHeapStubManager implemented here
- ‘Note, the only reason we have this stub manager is so that we can recgonize UMEntryThunks for IsTransitionStub. ..’
TailCallStubManager implemented here
- ‘This is the stub manager to help the managed debugger step into a tail call. It helps the debugger trace through JIT_TailCall().’ (from stubmgr.h)
ThePreStubManager implemented here (in prestub.cpp)
- ‘The following code manages the PreStub. All method stubs initially use the prestub.’
VirtualCallStubManager implemented here (in virtualcallstub.cpp)
- ‘VirtualCallStubManager is the heart of the stub dispatch logic. See the book of the runtime entry’ (BOTR - Virtual Stub Dispatch)

Finally, we can also see the ‘StubManagers’ in action if we use the eeheap SOS command to inspect the ‘heap dump’ of a .NET Process, as it helps report the size of the different ‘stub heaps’:

> !eeheap -loader

Loader Heap:
--------------------------------------
System Domain: 704fd058
LowFrequencyHeap: Size: 0x0(0)bytes.
HighFrequencyHeap: 002e2000(8000:1000) Size: 0x1000(4096)bytes.
StubHeap: 002ea000(2000:1000) Size: 0x1000(4096)bytes.
Virtual Call Stub Heap:
- IndcellHeap: Size: 0x0(0)bytes.
- LookupHeap: Size: 0x0(0)bytes.
- ResolveHeap: Size: 0x0(0)bytes.
- DispatchHeap: Size: 0x0(0)bytes.
- CacheEntryHeap: Size: 0x0(0)bytes.
Total size: 0x2000(8192)bytes
--------------------------------------

(output taken from .NET Generics and Code Bloat (or its lack thereof))

You can see that in this case the entire ‘stub heap’ is taking up 4096 bytes and in addition there are more in-depth statistics covering the heaps used by virtual call dispatch.

Types of stubs

The different stubs used by the runtime fall into 3 main categories:

Hand-written assembly code e.g. /vm/amd64/PInvokeStubs.asm
Dynamically emitted assembly code, implemented in C++, e.g. StubLinkerCPU::EmitShuffleThunk(..) in /vm/arm64/stubs.cpp
‘Stubs-as-IL’ which we discuss later on in this post, for example COMDelegate::GetMulticastInvoke(..) in /vm/comdelegate.cpp

Most stubs are wired up in MethodDesc::DoPrestub(..), in this section of code or this section for COM Interop. The stubs generated include the following (definitions taken from BOTR - ‘Kinds of MethodDescs’, also see enum MethodClassification here):

Instantiating (FEATURE_SHARE_GENERIC_CODE, on by default) in MakeInstantiatingStubWorker(..) here
- Used for less common IL methods that have generic instantiation or that do not have preallocated slot in method table.
P/Invoke (a.k.a NDirect) in GetStubForInteropMethod(..) here
- P/Invoke methods. These are methods marked with DllImport attribute.
FCall methods in ECall::GetFCallImpl(..) here
- Internal methods implemented in unmanaged code. These are methods marked with MethodImplAttribute(MethodImplOptions.InternalCall) attribute, delegate constructors and tlbimp constructors.
Array methods in GenerateArrayOpStub(..) here
- Array methods whose implementation is provided by the runtime (Get, Set, Address)
EEImpl in PCODE COMDelegate::GetInvokeMethodStub(EEImplMethodDesc* pMD) here
- Delegate methods, implementation provided by the runtime
COM Interop (FEATURE_COMINTEROP, on by default) in GetStubForInteropMethod(..) here
- COM interface methods. Since the non-generic interfaces can be used for COM interop by default, this kind is usually used for all interface methods.
Unboxing in Stub * MakeUnboxingStubWorker(MethodDesc *pMD) here

Right, now let’s look at the individual stubs in more detail.

Precode

First up, we’ll take a look at ‘precode’ stubs, because they are used by all other types of stubs, as explained in the BotR page on Method Descriptors:

The precode is a small fragment of code used to implement temporary entry points and an efficient wrapper for stubs. Precode is a niche code-generator for these two cases, generating the most efficient code possible. In an ideal world, all native code dynamically generated by the runtime would be produced by the JIT. That’s not feasible in this case, given the specific requirements of these two scenarios. The basic precode on x86 may look like this:
mov eax,pMethodDesc // Load MethodDesc into scratch register
jmp target          // Jump to a target
Efficient Stub wrappers: The implementation of certain methods (e.g. P/Invoke, delegate invocation, multi dimensional array setters and getters) is provided by the runtime, typically as hand-written assembly stubs. Precode provides a space-efficient wrapper over stubs, to multiplex them for multiple callers.

The worker code of the stub is wrapped by a precode fragment that can be mapped to the MethodDesc and that jumps to the worker code of the stub. The worker code of the stub can be shared between multiple methods this way. It is an important optimization used to implement P/Invoke marshalling stubs.

By providing a ‘pointer’ to the MethodDesc class, the precode allows any subsequent stub to have access to a lot of information about a method call and its containing Type via the MethodTable (‘hot’) and EEClass (‘cold’) data structures. The MethodDesc data-structure is one of the most fundamental types in the runtime, hence why it has its own BotR page.
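Here’s a small C++ model of the ‘load MethodDesc, then jump’ idea, showing how one worker can be shared by several methods because each precode hands it a different MethodDesc. The types here (`MethodDescSketch`, `PrecodeSketch`) and the worker are hypothetical and deliberately tiny; a real precode is a couple of machine instructions rather than a struct.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily reduced MethodDesc: just enough to identify a method.
struct MethodDescSketch
{
    std::string name;
};

// The shared 'worker' receives the MethodDesc that the precode loaded for it.
using StubWorker = std::string (*)(const MethodDescSketch*);

// A precode modelled as data: "load MethodDesc, then jump to target".
struct PrecodeSketch
{
    const MethodDescSketch* methodDesc; // mov eax, pMethodDesc
    StubWorker target;                  // jmp target

    std::string Invoke() const
    {
        assert(methodDesc != nullptr && target != nullptr);
        return target(methodDesc);
    }
};

// One worker body shared by many methods; it uses the MethodDesc to find
// out which method it is currently running on behalf of.
std::string SharedMarshallingWorker(const MethodDescSketch* methodDesc)
{
    return "marshalling call to " + methodDesc->name;
}
```

Two `PrecodeSketch` values can point at the same `SharedMarshallingWorker` and still produce different results, because the only per-method state lives in the precode itself.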

Each ‘precode’ is created in MethodDesc::GetOrCreatePrecode() here and there are several different types as we can see in this enum from /vm/precode.h:

enum PrecodeType {
    PRECODE_INVALID         = InvalidPrecode::Type,
    PRECODE_STUB            = StubPrecode::Type,
#ifdef HAS_NDIRECT_IMPORT_PRECODE
    PRECODE_NDIRECT_IMPORT  = NDirectImportPrecode::Type,
#endif // HAS_NDIRECT_IMPORT_PRECODE
#ifdef HAS_FIXUP_PRECODE
    PRECODE_FIXUP           = FixupPrecode::Type,
#endif // HAS_FIXUP_PRECODE
#ifdef HAS_THISPTR_RETBUF_PRECODE
    PRECODE_THISPTR_RETBUF  = ThisPtrRetBufPrecode::Type,
#endif // HAS_THISPTR_RETBUF_PRECODE
};

As always, the BotR page describes the different types in great detail, but in summary:

StubPrecode - .. is the basic precode type. It loads MethodDesc into a scratch register and then jumps. It must be implemented for precodes to work. It is used as fallback when no other specialized precode type is available.
FixupPrecode - .. is used when the final target does not require MethodDesc in scratch register. The FixupPrecode saves a few cycles by avoiding loading MethodDesc into the scratch register. The most common usage of FixupPrecode is for method fixups in NGen images.
ThisPtrRetBufPrecode - .. is used to switch a return buffer and the this pointer for open instance delegates returning valuetypes. It is used to convert the calling convention of MyValueType Bar(Foo x) to the calling convention of MyValueType Foo::Bar().
NDirectImportPrecode (a.k.a P/Invoke) - .. is used for lazy binding of unmanaged P/Invoke targets. This precode is for convenience and to reduce amount of platform specific plumbing.

Finally, to give you an idea of some real-world scenarios for ‘precode’ stubs, take a look at this comment from the DoesSlotCallPrestub(..) method (AMD64):

// AMD64 has the following possible sequences for prestub logic:
// 1. slot -> temporary entrypoint -> prestub
// 2. slot -> precode -> prestub
// 3. slot -> precode -> jumprel64 (jump stub) -> prestub
// 4. slot -> precode -> jumprel64 (NGEN case) -> prestub

‘Just-in-time’ (JIT) and ‘Tiered’ Compilation

However, another piece of functionality that ‘precodes’ provide is related to ‘just-in-time’ (JIT) compilation, again from the BotR page:

Temporary entry points: Methods must provide entry points before they are jitted so that jitted code has an address to call them. These temporary entry points are provided by precode. They are a specific form of stub wrappers.

This technique is a lazy approach to jitting, which provides a performance optimization in both space and time. Otherwise, the transitive closure of a method would need to be jitted before it was executed. This would be a waste, since only the dependencies of taken code branches (e.g. if statement) require jitting.

Each temporary entry point is much smaller than a typical method body. They need to be small since there are a lot of them, even at the cost of performance. The temporary entry points are executed just once before the actual code for the method is generated.

So these ‘temporary entry points’ provide something concrete that can be referenced before a method has been JITted. They then trigger the JIT-compilation which does the job of generating the native code for a method. The entire process looks like this (dotted lines represent a pointer indirection, solid lines are a ‘control transfer’ e.g. a jmp/call assembly instruction):

Before JITing

Here we see the ‘temporary entry point’ pointing to the ‘fixup precode’, which ultimately calls into the PrestubWorker() function here.

After JITing

Once the method has been JITted, we can see that the PrestubWorker is now out of the picture and instead we have the native code for the function. In addition, there is now a ‘stable entry point’ that can be used by any other code that wants to execute the function. Also, we can see that the ‘fixup precode’ has been ‘backpatched’ to also point at the ‘native code’. For an idea of how this ‘back-patching’ works, see the StubPrecode::SetTargetInterlocked(..) method here (ARM64).
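The before/after picture can be modelled in C++ with a single slot that initially points at a ‘prestub’ and is then overwritten with the real code on first call. This is only a sketch of the lazy-compilation pattern: `LazyMethod`, `SquarePrestub` and friends are hypothetical, ‘compiling’ is faked by picking an existing function, and the atomic store merely stands in for the interlocked update the runtime performs.

```cpp
#include <atomic>
#include <cassert>

using Code = int (*)(int);

// Hypothetical method record: 'entryPoint' plays the role of the slot that
// callers go through; it starts at the prestub and is later backpatched.
struct LazyMethod
{
    std::atomic<Code> entryPoint;
    int compileCount;
};

static LazyMethod g_square{};

// The 'native code' that the pretend JIT produces.
static int SquareNative(int x) { return x * x; }

// Stand-in for the prestub: "compile" the method, backpatch the slot with an
// atomic store so later calls skip this path, then run the new code.
static int SquarePrestub(int x)
{
    ++g_square.compileCount;
    g_square.entryPoint.store(&SquareNative);
    return SquareNative(x);
}

// Put the method back into its 'not yet jitted' state.
void ResetSquare()
{
    g_square.compileCount = 0;
    g_square.entryPoint.store(&SquarePrestub);
}

// A call-site: always an indirect call through the current entry point.
int CallSquare(int x)
{
    Code code = g_square.entryPoint.load();
    assert(code != nullptr);
    return code(x);
}
```

After `ResetSquare()`, the first `CallSquare(..)` goes via the prestub and bumps `compileCount`; every later call lands directly on `SquareNative` and the count stays at 1.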

After JITing - Tiered Compilation

However, there is also another ‘after’ scenario, now that .NET Core has ‘Tiered Compilation’. Here we see that the ‘stable entry point’ still goes via the ‘fixup precode’, it doesn’t directly call into the ‘native code’. This is because ‘tiered compilation’ counts how many times a method is called and once it decides the method is ‘hot’, it re-compiles a more optimised version that will give better performance. This ‘call counting’ takes place in this code in MethodDesc::DoPrestub(..) which calls into CodeVersionManager::PublishNonJumpStampVersionableCodeIfNecessary(..) here and then if shouldCountCalls is true, it ends up calling CallCounter::OnMethodCodeVersionCalledSubsequently(..) here.

What’s been interesting to watch during the development of ‘tiered compilation’ is that (not surprisingly) there has been a significant amount of work to ensure that the extra level of indirection doesn’t make the entire process slower, for instance see Patch vtable slots and similar when tiering is enabled #21292.
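To show why the indirection has to stay in place while calls are being counted, here’s a toy C++ version of ‘count calls, then swap in better code’. Everything in it is an assumption made for the sketch: the `TieredMethod` type, the threshold of 3 calls and the synchronous swap are mine, whereas the real runtime has its own threshold and does the re-compilation elsewhere.

```cpp
#include <atomic>
#include <cassert>

using TieredCode = int (*)(int);

// Assumed threshold for this sketch only; it is not the runtime's value.
constexpr int kCallCountThreshold = 3;

// Quick-to-produce but slower code (the initial tier).
static int SumToTier0(int n)
{
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}

// The 'optimised' replacement that is swapped in once the method is hot.
static int SumToTier1(int n)
{
    return n * (n + 1) / 2;
}

struct TieredMethod
{
    std::atomic<TieredCode> activeCode{&SumToTier0};
    std::atomic<int> callCount{0};

    // Every call passes through here (the extra level of indirection), so
    // calls can be counted until the method is considered 'hot'.
    int Invoke(int n)
    {
        int calls = ++callCount;
        if (calls == kCallCountThreshold)
            activeCode.store(&SumToTier1);   // promote: swap in the new code
        TieredCode code = activeCode.load();
        assert(code != nullptr);
        return code(n);
    }

    bool IsPromoted() const { return activeCode.load() == &SumToTier1; }
};
```

The results are identical before and after promotion; only the code behind `activeCode` changes, which is exactly what the caller shouldn’t have to care about.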

Like all the other stubs, ‘precodes’ have different versions for different CPU architectures. As a reference, the list below contains links to all of them:

Precodes (a.k.a ‘Precode Fixup Thunk’):
- x86 in /vm/i386/asmhelpers.S
- x64 in /vm/amd64/AsmHelpers.asm
- ARM in /vm/arm/asmhelpers.S
- ARM64 in /vm/arm64/asmhelpers.asm
ThePreStub:
- x86 in /vm/i386/asmhelpers.S
- x64 in /vm/amd64/ThePreStubAMD64.asm
- ARM in /vm/arm/asmhelpers.S
- ARM64 in /vm/arm64/asmhelpers.asm
PreStubWorker(..) in /vm/prestub.cpp
MethodDesc::DoPrestub(..) here
MethodDesc::DoBackpatch(..) here

Finally, for even more information on the JITing process, see:

Stubs-as-IL

‘Stubs as IL’ actually describes several types of individual stubs, but what they all have in common is they’re generated from ‘Intermediate Language’ (IL) which is then compiled by the JIT, in exactly the same way it handles the code we write (after it’s first been compiled from C#/F#/VB.NET into IL by another compiler).

This makes sense, it’s far easier to write the IL once and then have the JIT worry about compiling it for different CPU architectures, rather than having to write raw assembly each time (for x86/x64/arm/etc). However all stubs were hand-written assembly in .NET Framework 1.0:

What you have described is how it actually works. The only difference is that the shuffle thunk is hand-emitted in assembly and not generated by the JIT for historic reasons. All stubs (including all interop stubs) were hand-emitted like this in .NET Framework 1.0. Starting with .NET Framework 2.0, we have been converting the stubs to be generated by the JIT (the runtime generates IL for the stub, and then the JIT compiles the IL as regular method). The shuffle thunk is one of the few remaining ones not converted yet. Also, we have the IL path on some platforms but not others - FEATURE_STUBS_AS_IL is related to it.

In the CoreCLR source code, ‘stubs as IL’ are controlled by the feature flag FEATURE_STUBS_AS_IL, with the following additional flags for each specific type:

StubsAsIL
ArrayStubAsIL
MulticastStubAsIL

On Windows only some features are implemented with IL stubs, see this code, e.g. ‘ArrayStubAsIL’ is disabled on ‘x86’, but enabled elsewhere.

<PropertyGroup Condition="'$(TargetsWindows)' == 'true'">
   <FeatureArrayStubAsIL Condition="'$(Platform)' != 'x86'">true</FeatureArrayStubAsIL>
   <FeatureMulticastStubAsIL Condition="'$(Platform)' != 'x86'">true</FeatureMulticastStubAsIL>
   <FeatureStubsAsIL Condition="'$(Platform)' == 'arm64'">true</FeatureStubsAsIL>
    ...
</PropertyGroup>

On Unix they are all done in IL, regardless of CPU Arch, as this code shows:

<PropertyGroup Condition="'$(TargetsUnix)' == 'true'">
   ...
   <FeatureArrayStubAsIL>true</FeatureArrayStubAsIL>
   <FeatureMulticastStubAsIL>true</FeatureMulticastStubAsIL>
   <FeatureStubsAsIL>true</FeatureStubsAsIL>
</PropertyGroup>

Finally, here’s the complete list of stubs that can be implemented in IL from /vm/ilstubresolver.h:

 enum ILStubType
 {
     Unassigned = 0,
     CLRToNativeInteropStub,
     CLRToCOMInteropStub,
     CLRToWinRTInteropStub,
     NativeToCLRInteropStub,
     COMToCLRInteropStub,
     WinRTToCLRInteropStub,
#ifdef FEATURE_ARRAYSTUB_AS_IL 
     ArrayOpStub,
#endif
#ifdef FEATURE_MULTICASTSTUB_AS_IL
     MulticastDelegateStub,
#endif
#ifdef FEATURE_STUBS_AS_IL
     SecureDelegateStub,
     UnboxingILStub,
     InstantiatingStub,
#endif
 };

But the usage of IL stubs has grown over time and it seems that they are the preferred mechanism where possible as they’re easier to write and debug. See [x86/Linux] Enable FEATURE_ARRAYSTUB_AS_IL, Switch multicast delegate stub on Windows x64 to use stubs-as-il and Fix GenerateShuffleArray to support cyclic shuffles #26169 (comment) for more information.

P/Invoke, Reverse P/Invoke and ‘calli’

All these stubs have one thing in common, they allow a transition between ‘managed’ and ‘un-managed’ (or native) code. To make this safe and to preserve the guarantees that the .NET runtime provides, stubs are used every time the transition is made.

This entire process is outlined in great detail in the BotR page CLR ABI - PInvokes, from the ‘Per-call-site PInvoke work’ section:

For direct calls, the JITed code sets InlinedCallFrame->m_pDatum to the MethodDesc of the call target.

For JIT64, indirect calls within IL stubs sets it to the secret parameter (this seems redundant, but it might have changed since the per-frame initialization?).

For JIT32 (ARM) indirect calls, it sets this member to the size of the pushed arguments, according to the comments. The implementation however always passed 0.

For JIT64/AMD64 only: Next for non-IL stubs, the InlinedCallFrame is ‘pushed’ by setting Thread->m_pFrame to point to the InlinedCallFrame (recall that the per-frame initialization already set InlinedCallFrame->m_pNext to point to the previous top). For IL stubs this step is accomplished in the per-frame initialization.

The Frame is made active by setting InlinedCallFrame->m_pCallerReturnAddress.

The code then toggles the GC mode by setting Thread->m_fPreemptiveGCDisabled = 0.

Starting now, no GC pointers may be live in registers. RyuJit LSRA meets this requirement by adding special refPositon RefTypeKillGCRefs before unmanaged calls and special helpers.

Then comes the actual call/PInvoke.

The GC mode is set back by setting Thread->m_fPreemptiveGCDisabled = 1.

Then we check to see if g_TrapReturningThreads is set (non-zero). If it is, we call CORINFO_HELP_STOP_FOR_GC.

For ARM, this helper call preserves the return register(s): R0, R1, S0, and D0.

For AMD64, the generated code must manually preserve the return value of the PInvoke by moving it to a non-volatile register or a stack location.

Starting now, GC pointers may once again be live in registers.

Clear the InlinedCallFrame->m_pCallerReturnAddress back to 0.

For JIT64/AMD64 only: For non-IL stubs ‘pop’ the Frame chain by resetting Thread->m_pFrame back to InlinedCallFrame.m_pNext.

Saving/restoring all the non-volatile registers helps by preventing any registers that are unused in the current frame from accidentally having a live GC pointer value from a parent frame. The argument and return registers are ‘safe’ because they cannot be GC refs. Any refs should have been pinned elsewhere and instead passed as native pointers.

For IL stubs, the Frame chain isn’t popped at the call site, so instead it must be popped right before the epilog and right before any jmp calls. It looks like we do not support tail calls from PInvoke IL stubs?

As you can see, quite a bit of the work is to keep the Garbage Collector (GC) happy. This makes sense because once execution moves into un-managed/native code the .NET runtime has no control over what’s happening, so it needs to ensure that the GC doesn’t clean up or move around objects that are being used in the native code. It achieves this by constraining what the GC can do (on the current thread) from the time execution moves into un-managed code and keeps that in place until it returns back to the managed side.
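A few of the steps quoted above (push a frame, flip the GC mode, make the call, flip it back, pop the frame) map quite naturally onto an RAII guard, so here’s a C++ sketch of that shape. It is a model only: `ThreadSketch`, `FrameSketch` and `NativeCallTransition` are hypothetical, the fields loosely mirror Thread->m_pFrame and m_fPreemptiveGCDisabled, and it leaves out the g_TrapReturningThreads check and all the register handling.

```cpp
#include <cassert>

// Hypothetical, reduced versions of the runtime structures named above.
struct FrameSketch
{
    FrameSketch* next = nullptr;
};

struct ThreadSketch
{
    FrameSketch* topFrame = nullptr;     // plays the role of Thread->m_pFrame
    int preemptiveGCDisabled = 1;        // 1 while running managed code
};

// RAII guard for the duration of one native call: push a frame and switch
// the GC mode on entry, then undo both on exit.
class NativeCallTransition
{
public:
    explicit NativeCallTransition(ThreadSketch& thread) : m_thread(thread)
    {
        m_frame.next = m_thread.topFrame;        // link to the previous top
        m_thread.topFrame = &m_frame;            // 'push' the frame
        m_thread.preemptiveGCDisabled = 0;       // toggle the GC mode
    }

    ~NativeCallTransition()
    {
        m_thread.preemptiveGCDisabled = 1;       // set the GC mode back
        assert(m_thread.topFrame == &m_frame);
        m_thread.topFrame = m_frame.next;        // 'pop' the frame
    }

    NativeCallTransition(const NativeCallTransition&) = delete;
    NativeCallTransition& operator=(const NativeCallTransition&) = delete;

private:
    ThreadSketch& m_thread;
    FrameSketch m_frame;
};

// Pretend native function: records the mode it observed while running.
static int g_modeSeenByNative = -1;
static int NativeAdd(const ThreadSketch& thread, int a, int b)
{
    g_modeSeenByNative = thread.preemptiveGCDisabled;
    return a + b;
}

// The 'call-site': everything inside the guard's scope is the native call.
int CallNativeAdd(ThreadSketch& thread, int a, int b)
{
    NativeCallTransition transition(thread);
    return NativeAdd(thread, a, b);
}
```

The useful property of the guard is that the ‘undo’ half can’t be forgotten: once `CallNativeAdd(..)` returns, the thread’s flag and frame chain are back to what they were before the call.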

On top of that, there needs to be support for ‘stack walking’ or ‘unwinding’, to allow debugging and to produce meaningful stack traces. This is done by setting up frames that are put in place when control transitions from managed -> un-managed, before being removed (‘popped’) when transitioning back. Here’s a list of the different scenarios that are covered, from /vm/frames.h:

This is the list of Interop stubs & transition helpers with information
regarding what (if any) Frame they used and where they were set up:

P/Invoke:
 JIT inlined: The code to call the method is inlined into the caller by the JIT.
    InlinedCallFrame is erected by the JITted code.
 Requires marshaling: The stub does not erect any frames explicitly but contains
    an unmanaged CALLI which turns it into the JIT inlined case.

Delegate over a native function pointer:
 The same as P/Invoke but the raw JIT inlined case is not present (the call always
 goes through an IL stub).

Calli:
 The same as P/Invoke.
 PInvokeCalliFrame is erected in stub generated by GenerateGetStubForPInvokeCalli
 before calling to GetILStubForCalli which generates the IL stub. This happens only
 the first time a call via the corresponding VASigCookie is made.

ClrToCom:
 Late-bound or eventing: The stub is generated by GenerateGenericComplusWorker
    (x86) or exists statically as GenericComPlusCallStub[RetBuffArg] (64-bit),
    and it erects a ComPlusMethodFrame frame.
 Early-bound: The stub does not erect any frames explicitly but contains an
    unmanaged CALLI which turns it into the JIT inlined case.

ComToClr:
 Normal stub:
 Interpreted: The stub is generated by ComCall::CreateGenericComCallStub
    (in ComToClrCall.cpp) and it erects a ComMethodFrame frame.
 Prestub:
  The prestub is ComCallPreStub (in ComCallableWrapper.cpp) and it erects a ComPrestubMethodFrame frame.

Reverse P/Invoke (used for C++ exports & fixups as well as delegates
obtained from function pointers):
 Normal stub:
  x86: The stub is generated by UMEntryThunk::CompileUMThunkWorker
    (in DllImportCallback.cpp) and it is frameless. It calls directly
    the managed target or to IL stub if marshaling is required.
  non-x86: The stub exists statically as UMThunkStub and calls to IL stub.
 Prestub:
  The prestub is generated by GenerateUMThunkPrestub (x86) or exists statically
  as TheUMEntryPrestub (64-bit), and it erects an UMThkCallFrame frame.

Reverse P/Invoke AppDomain selector stub:
 The asm helper is IJWNOADThunkJumpTarget (in asmhelpers.asm) and it is frameless.
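
The ‘erect a frame, then pop it’ pattern that runs through that list can be sketched in a few lines of C++. This is a simplified model under my own assumptions: Frame, Thread, FrameScope and WalkFrames are hypothetical types, far smaller than the real classes in /vm/frames.h.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model; the real Frame classes live in /vm/frames.h.
struct Frame {
    const char* kind;  // e.g. "InlinedCallFrame"
    Frame* next;       // the previous top of the chain
};

struct Thread {
    Frame* topFrame = nullptr;
};

// RAII helper: links a frame on construction and unlinks it on destruction.
class FrameScope {
public:
    FrameScope(Thread& thread, const char* kind)
        : thread_(thread), frame_{kind, thread.topFrame} {
        thread_.topFrame = &frame_;           // 'push'
    }
    ~FrameScope() {
        assert(thread_.topFrame == &frame_);  // frames are popped in LIFO order
        thread_.topFrame = frame_.next;       // 'pop'
    }
    FrameScope(const FrameScope&) = delete;
    FrameScope& operator=(const FrameScope&) = delete;

private:
    Thread& thread_;
    Frame frame_;
};

// A tiny 'stack walker' that lists the transition frames currently linked.
std::vector<std::string> WalkFrames(const Thread& thread) {
    std::vector<std::string> kinds;
    for (const Frame* f = thread.topFrame; f != nullptr; f = f->next) {
        kinds.push_back(f->kind);
    }
    return kinds;
}
```

The point of the chain is that a debugger or stack walker can start at the thread’s top frame and find every managed/un-managed boundary, even though it can’t interpret the native frames in between.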

The P/Invoke IL stubs are wired up in the MethodDesc::DoPrestub(..) method (note that P/Invoke is also known as ‘NDirect’), in addition they are also created here when being used for ‘COM Interop’. That code then calls into GetStubForInteropMethod(..) in /vm/dllimport.cpp, before branching off to handle each case:

P/Invoke calls into NDirect::GetStubForILStub(..) here
Reverse P/Invoke calls into another overload of NDirect::GetStubForILStub(..) here
COM Interop goes to ComPlusCall::GetStubForILStub(..) here in /vm/clrtocomcall.cpp
EE implemented methods end up in COMDelegate::GetStubForILStub(..) here (for more info on EEImpl methods see ‘Kinds of MethodDescs’)

There are also hand-written assembly stubs for the different scenarios, such as JIT_PInvokeBegin, JIT_PInvokeEnd and VarargPInvokeStub, these can be seen in the files below:

As an example, calli method calls (see OpCodes.Calli) end up in GenericPInvokeCalliHelper, which has a nice bit of ASCII art in the i386 version:

// stack layout at this point:
//
// |         ...          |
// |   stack arguments    | ESP + 16
// +----------------------+
// |     VASigCookie*     | ESP + 12
// +----------------------+
// |    return address    | ESP + 8
// +----------------------+
// | CALLI target address | ESP + 4
// +----------------------+
// |   stub entry point   | ESP + 0
// ------------------------

However, all these stubs can have an adverse impact on start-up time, see Large numbers of Pinvoke stubs created on startup for example. This impact has been mitigated by compiling the stubs ‘Ahead-of-Time’ (AOT) and storing them in the ‘Ready-to-Run’ images (replacement format for NGEN (Native Image Generator)). From R2R ilstubs:

IL stub generation for interop takes measurable time at startup, and it is possible to generate some of them in an ahead of time

This change introduces ahead of time R2R compilation of IL stubs

Related work was done in Enable R2R compilation/inlining of PInvoke stubs where no marshalling is required and PInvoke stubs for Unix platforms (‘Enables inlining of PInvoke stubs for Unix platforms’).

Finally, for even more information on the issues involved, see:

Better diagnostic for collected delegate #15465
Fill freed loader heap chunk with non-zero value #12731
[Arm64] Implement Poison() #13125
Collected delegate diagnostic #15809
AdvancedDLSupport:
- Delegate-based C# P/Invoke alternative - compatible with all platforms and runtimes.
- Also see ‘The Developer Documentation’ for the project.

Marshalling

However, dealing with the ‘managed’ to ‘un-managed’ transition is only one part of the story. The other is that there are also stubs created to deal with the ‘marshalling’ of arguments between the 2 sides. This process of ‘Interop Marshalling’ is explained nicely in the Microsoft docs:

Interop marshaling governs how data is passed in method arguments and return values between managed and unmanaged memory during calls. Interop marshaling is a run-time activity performed by the common language runtime’s marshaling service.

Most data types have common representations in both managed and unmanaged memory. The interop marshaler handles these types for you. Other types can be ambiguous or not represented at all in managed memory.
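
As a rough illustration of what a marshalling stub has to do, here’s a small C++ sketch. Everything in it (ManagedArgs, NativeArgs, NarrowAscii, MarshalToNative) is hypothetical and only models the idea: an int can be copied as-is, a bool is widened to a 4-byte Win32-style BOOL, and a UTF-16 string is re-encoded for the native side.

```cpp
#include <cstdint>
#include <string>

// Hypothetical 'managed' view of the arguments.
struct ManagedArgs {
    std::int32_t count;   // same representation on both sides
    bool enabled;         // 1 byte here, but the native side wants a 4-byte BOOL
    std::u16string name;  // UTF-16 here, a narrow string on the native side
};

// Hypothetical 'native' view of the same arguments.
struct NativeArgs {
    std::int32_t count;
    std::int32_t enabled;  // Win32-style BOOL
    std::string name;      // stands in for a char* buffer owned by the marshaller
};

// Converts UTF-16 to a narrow string, replacing anything outside ASCII with '?'.
// A real marshaller would do a proper code page or UTF-8 conversion instead.
std::string NarrowAscii(const std::u16string& text) {
    std::string out;
    out.reserve(text.size());
    for (char16_t ch : text) {
        out.push_back(ch < 0x80 ? static_cast<char>(ch) : '?');
    }
    return out;
}

NativeArgs MarshalToNative(const ManagedArgs& managed) {
    NativeArgs native;
    native.count = managed.count;              // copied as-is
    native.enabled = managed.enabled ? 1 : 0;  // widened
    native.name = NarrowAscii(managed.name);   // re-encoded
    return native;
}
```

A real stub also has to marshal results back and clean up any temporary buffers after the call, which this sketch leaves out.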

Like many stubs in the CLR, the marshalling stubs have evolved over time. As we can read in the excellent post Improvements to Interop Marshaling in V4: IL Stubs Everywhere:

History The 1.0 and 1.1 versions of the CLR had several different techniques for creating and executing these stubs that were each designed for marshaling different types of signatures. These techniques ranged from directly generated x86 assembly instructions for simple signatures to generating specialized ML (an internal marshaling language) and running them through an internal interpreter for the most complicated signatures. This system worked well enough – although not without difficulties – in 1.0 and 1.1 but presented us with a serious maintenance problem when 2.0, and its support for multiple processor architectures, came around.

That’s right, there was an internal interpreter built into early versions of the .NET CLR that had the job of running the ‘marshalling language’ (ML) code!

However, it then goes on to explain why this process wasn’t sustainable:

We realized early in the process of adding 64 bit support to 2.0 that this approach was not sustainable across multiple architectures. Had we continued with the same strategy we would have had to create parallel marshaling infrastructures for each new architecture we supported (remember in 2.0 we introduced support for both x64 and IA64) which would, in addition to the initial cost, at least triple the cost of every new marshaling feature or bug fix. We needed one marshaling stub technology that would work on multiple processor architectures and could be efficiently executed on each one: enter IL stubs.

The solution was to implement all stubs using ‘Intermediate Language’ (IL) that is CPU-agnostic. Then the JIT-compiler is used to convert the IL into machine code for each CPU architecture, which makes sense because it’s exactly what the JIT is good at. Also worth noting is that this work still continues today, for instance see Implement struct marshalling via IL Stubs instead of via FieldMarshalers #26340.

Finally, there is a really nice investigation into the whole process in PInvoke: beyond the magic (also Compile time marshalling). What’s also nice is that you can use PerfView to see the stubs that the runtime generates.

Generics

It is reasonably well known that generics in .NET use ‘code sharing’ to save space. That is, given a generic method such as public void Insert<T>(..), one method body of ‘native code’ will be created and shared by the instantiated types of Insert<Foo>(..) and Insert<Bar>(..) (assuming that Foo and Bar are reference types), but different versions will be created for Insert<int>(..) and Insert<double>(..) (as int/double are value types). This is possible for the reasons outlined by Jon Skeet in a StackOverflow question:

.. consider what the CLR needs to know about a type. It includes:

The size of a value of that type (i.e. if you have a variable of some type, how much space will that memory need?)

How to treat the value in terms of garbage collection: is it a reference to an object, or a value which may in turn contain other references?

For all reference types, the answers to these questions are the same. The size is just the size of a pointer, and the value is always just a reference (so if the variable is considered a root, the GC needs to recursively descend into it).

For value types, the answers can vary significantly.

But this poses a problem. What if the ‘shared’ method needs to do something specific for each type, like call typeof(T)?

This whole issue is explained in these 2 great posts, which I really recommend you take the time to read:

I’m not going to repeat what they cover here, except to say that (not surprisingly) ‘stubs’ are used to solve this issue, in conjunction with a ‘hidden’ parameter. These stubs are known as ‘instantiating’ stubs and we can find out more about them in this comment:

Instantiating Stubs - Return TRUE if this is this a special stub used to implement an instantiated generic method or per-instantiation static method. The action of an instantiating stub is - pass on a MethodTable or InstantiatedMethodDesc extra argument to shared code
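
To make the ‘hidden’ parameter idea concrete, here’s a minimal C++ sketch. It’s a model only: TypeContext, SharedDescribe, DescribeFoo and DescribeBar are hypothetical names, standing in for the MethodTable/InstantiatedMethodDesc argument, the shared method body and two instantiating stubs.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the per-instantiation data
// (a MethodTable or InstantiatedMethodDesc in the runtime).
struct TypeContext {
    const char* typeName;
};

// One 'shared' body for all reference-type instantiations. It can't know T statically,
// so anything type-specific (like typeof(T)) is answered from the hidden argument.
std::string SharedDescribe(const TypeContext* hiddenContext, const void* /*item*/) {
    assert(hiddenContext != nullptr);
    return std::string("Insert<") + hiddenContext->typeName + ">";
}

const TypeContext kFooContext{"Foo"};
const TypeContext kBarContext{"Bar"};

// 'Instantiating stubs': each one supplies the right hidden argument and then forwards
// to the shared code. Callers only ever see the original signature.
std::string DescribeFoo(const void* item) { return SharedDescribe(&kFooContext, item); }
std::string DescribeBar(const void* item) { return SharedDescribe(&kBarContext, item); }
```

Both stubs end up in the same body, but DescribeFoo(..) yields "Insert<Foo>" and DescribeBar(..) yields "Insert<Bar>", because the only thing that differs is the extra argument.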

The different scenarios are handled in MakeInstantiatingStubWorker(..) in /vm/prestub.cpp, you can see the check for HasMethodInstantiation and the fall-back to a ‘per-instantiation static method’:

    // It's an instantiated generic method
    // Fetch the shared code associated with this instantiation
    pSharedMD = pMD->GetWrappedMethodDesc();
    _ASSERTE(pSharedMD != NULL && pSharedMD != pMD);

    if (pMD->HasMethodInstantiation())
    {
        extraArg = pMD;
    }
    else
    {
        // It's a per-instantiation static method
        extraArg = pMD->GetMethodTable();
    }
    Stub *pstub = NULL;

#ifdef FEATURE_STUBS_AS_IL
    pstub = CreateInstantiatingILStub(pSharedMD, extraArg);
#else
    CPUSTUBLINKER sl;
    _ASSERTE(pSharedMD != NULL && pSharedMD != pMD);
    sl.EmitInstantiatingMethodStub(pSharedMD, extraArg);

    pstub = sl.Link(pMD->GetLoaderAllocator()->GetStubHeap());
#endif

As a reminder, FEATURE_STUBS_AS_IL is defined for all Unix versions of the CoreCLR, but on Windows it’s only used with ARM64.

When FEATURE_STUBS_AS_IL is defined, the code calls into CreateInstantiatingILStub(..) here. To get an overview of what it’s doing, we can take a look at the steps called-out in the code comments:
- // 1. Build the new signature here
- // 2. Emit the method body here
- // 2.2 Push the rest of the arguments for x86 here
- // 2.3 Push the hidden context param here
- // 2.4 Push the rest of the arguments for not x86 here
- // 2.5 Push the target address here
- // 2.6 Do the calli here
When FEATURE_STUBS_AS_IL is not defined, per CPU/OS versions of EmitInstantiatingMethodStub(..) are used, they exist for:
- i386 in /vm/i386/stublinkerx86.cpp
- ARM in /vm/arm/stubs.cpp

In the last case, (EmitInstantiatingMethodStub(..) on ARM), the stub shares code with the instantiating version of the unboxing stub, so the heavy-lifting is done in StubLinkerCPU::ThumbEmitCallWithGenericInstantiationParameter(..) here. This method is over 400 lines of fairly complex code, although there is also a nice piece of ASCII art (for info on why this ‘complex’ case is needed see this comment):

// Complex case where we need to emit a new stack frame and copy the arguments.

// Calculate the size of the new stack frame:
//
//            +------------+
//      SP -> |            | <-- Space for helper arg, if isRelative is true
//            +------------+
//            |            | <-+
//            :            :   | Outgoing arguments
//            |            | <-+
//            +------------+
//            | Padding    | <-- Optional, maybe required so that SP is 64-bit aligned
//            +------------+
//            | GS Cookie  |
//            +------------+
//        +-> | vtable ptr |
//        |   +------------+
//        |   | m_Next     |
//        |   +------------+
//        |   | R4         | <-+
//   Stub |   +------------+   |
// Helper |   :            :   |
//  Frame |   +------------+   | Callee saved registers
//        |   | R11        |   |
//        |   +------------+   |
//        |   | LR/RetAddr | <-+
//        |   +------------+
//        |   | R0         | <-+
//        |   +------------+   |
//        |   :            :   | Argument registers
//        |   +------------+   |
//        +-> | R3         | <-+
//            +------------+
//  Old SP -> |            |
//

Delegates

Delegates in .NET provide a nice abstraction over the top of a function call, from Delegates (C# Programming Guide):

A delegate is a type that represents references to methods with a particular parameter list and return type. When you instantiate a delegate, you can associate its instance with any method with a compatible signature and return type. You can invoke (or call) the method through the delegate instance.

But under the hood there is quite a bit going on, for the full story take a look at How do .NET delegates work?, but in summary, there are several different types of delegates, as shown in this table from /vm/comdelegate.cpp:

// DELEGATE KINDS TABLE
//
//                                  _target         _methodPtr              _methodPtrAux       _invocationList     _invocationCount
//
// 1- Instance closed               'this' ptr      target method           null                null                0
// 2- Instance open non-virt        delegate        shuffle thunk           target method       null                0
// 3- Instance open virtual         delegate        Virtual-stub dispatch   method id           null                0
// 4- Static closed                 first arg       target method           null                null                0
// 5- Static closed (special sig)   delegate        specialSig thunk        target method       first arg           0
// 6- Static opened                 delegate        shuffle thunk           target method       null                0
// 7- Secure                        delegate        call thunk              MethodDesc (frame)  target delegate     creator assembly 
//
// Delegate invoke arg count == target method arg count - 2, 3, 6
// Delegate invoke arg count == 1 + target method arg count - 1, 4, 5
//
// 1, 4     - MulticastDelegate.ctor1 (simply assign _target and _methodPtr)
// 5        - MulticastDelegate.ctor2 (see table, takes 3 args)
// 2, 6     - MulticastDelegate.ctor3 (take shuffle thunk)
// 3        - MulticastDelegate.ctor4 (take shuffle thunk, retrieve MethodDesc) ???
//
// 7 - Needs special handling

The difference between Open Delegates vs. Closed Delegates is nicely illustrated in this code sample from the linked post:

Func<string> closed = new Func<string>("a".ToUpperInvariant);
Func<string, string> open = (Func<string, string>)
    Delegate.CreateDelegate(
        typeof(Func<string, string>),
        typeof(string).GetMethod("ToUpperInvariant")
    );

closed();     //Returns "A"
open("abc");  //Returns "ABC"

Stubs are used in several scenarios, including the intriguingly named ‘shuffle thunk’, whose job it is to literally shuffle arguments around! In the simplest case, this process looks a bit like the following:

Delegate Call: [delegateThisPtr, arg1, arg2, ...]

Method Call:   [targetThisPtr, arg1, arg2, ...]

So when you invoke a delegate, the Invoke(..) method (generated by the CLR) expects a ‘this’ pointer of the delegate object itself. However, when the target method is called (i.e. the method the delegate ‘wraps’), the ‘this’ pointer needs to be the one for the type/class that the target method exists in, hence all the swapping/shuffling.

Of course things get more complicated when you deal with static methods (no ‘this’ pointer) and different CPU calling conventions, as this answer to the question ‘What in the world is a shuffle thunk cache?’ explains:

When you use a delegate to call a method, the JIT doesn’t know at the time it generates the code what the delegate points to. It can e.g. be a member method or a static method. So the JIT generates arguments to registers and stack based on the signature of the delegate and the call then doesn’t call the target method directly, but a shuffle thunk instead. This thunk is generated based on the caller side signature and the real target method signature and shuffles the arguments in registers and on stack to correspond to the target calling convention. So if it needs to add “this” pointer into the first argument register, it needs to move the first argument register to the second, the second to the third and the last to the stack (obviously in the right order so that nothing gets overwritten). And e.g. Unix amd64 calling convention makes it even more interesting when there are arguments that are structs that can be passed in multiple registers.
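
The shuffling itself can be modelled in a few lines of C++. This is a sketch under my own assumptions, not the runtime’s code: Registers, Move, ApplyShuffle and ShuffleForStaticTarget are hypothetical, and the model uses four argument ‘registers’ with no stack slots. It covers the static-method case, where the delegate’s ‘this’ is dropped and every other argument slides down one slot.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: four argument 'registers' and a list of moves between them.
using Registers = std::array<std::intptr_t, 4>;

struct Move {
    std::size_t src;  // register to read
    std::size_t dst;  // register to write
};

// Applies the moves in order. The order matters: a destination may only be
// overwritten once its old value has been moved on (or is no longer needed).
void ApplyShuffle(Registers& regs, const std::vector<Move>& moves) {
    for (const Move& m : moves) {
        assert(m.src < regs.size() && m.dst < regs.size());
        regs[m.dst] = regs[m.src];
    }
}

// Static target: drop the delegate 'this' in slot 0 and slide the arguments down.
std::vector<Move> ShuffleForStaticTarget(std::size_t argCount) {
    std::vector<Move> moves;
    for (std::size_t i = 0; i < argCount; ++i) {
        moves.push_back({i + 1, i});
    }
    return moves;
}
```

Starting from [delegateThis, arg1, arg2, arg3], applying ShuffleForStaticTarget(3) leaves arg1, arg2 and arg3 in slots 0 to 2, which is where a static target expects to find them.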

Singlecast Delegates

‘Singlecast’ delegates (as opposed to the ‘multicast’ variants) are the most common scenario and so they’re written as optimised ‘stubs’, starting in:

MethodDesc::DoPrestub(..) here, specifically when IsEEImpl() is true which calls into
COMDelegate::GetInvokeMethodStub(..) here, that then calls
COMDelegate::TheDelegateInvokeStub(..) here
- If FEATURE_STUBS_AS_IL is not defined, it calls into EmitDelegateInvoke() in /vm/i386/stublinkerx86.cpp (for x86)
- If FEATURE_STUBS_AS_IL is defined, a per-CPU/OS version of SinglecastDelegateInvokeStub is wired up:
  - Windows
    - AMD64 /vm/amd64/AsmHelpers.asm
    - ARM /vm/arm/asmhelpers.asm
    - ARM64 /vm/arm64/asmhelpers.asm
  - Unix
    - i386 /vm/i386/asmhelpers.S
    - AMD64 /vm/amd64/unixasmhelpers.S
    - ARM /vm/arm/asmhelpers.S
    - ARM64 /vm/arm64/asmhelpers.S

For example, this is the AMD64 (Windows) version of SinglecastDelegateInvokeStub:

LEAF_ENTRY SinglecastDelegateInvokeStub, _TEXT

        test    rcx, rcx
        jz      NullObject

        mov     rax, [rcx + OFFSETOF__DelegateObject___methodPtr]
        mov     rcx, [rcx + OFFSETOF__DelegateObject___target]  ; replace "this" pointer

        jmp     rax

NullObject:
        mov     rcx, CORINFO_NullReferenceException_ASM
        jmp     JIT_InternalThrow

LEAF_END SinglecastDelegateInvokeStub, _TEXT

As you can see, it reaches into the internals of the DelegateObject, pulls out the values in the methodPtr and target fields and puts them into the rax and rcx registers.
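
For anyone who doesn’t read assembly, the same steps look roughly like this in C++. It’s a model rather than the real thing: this DelegateObject is a hypothetical two-field struct, and InvokeSinglecast, Counter and AddToBase exist only for the example.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified delegate layout; the field names mirror the table above.
struct DelegateObject {
    void* target;                       // _target: the 'this' for the wrapped method
    int (*methodPtr)(void* self, int);  // _methodPtr: the code to jump to
};

// Same steps as the assembly stub: null check, load both fields, swap 'this', jump.
int InvokeSinglecast(DelegateObject* del, int arg) {
    if (del == nullptr) {
        throw std::runtime_error("NullReferenceException");  // the NullObject path
    }
    assert(del->methodPtr != nullptr);
    auto method = del->methodPtr;  // mov rax, [rcx + offset of _methodPtr]
    void* self = del->target;      // mov rcx, [rcx + offset of _target]
    return method(self, arg);      // jmp rax (a plain call here, C++ has no portable jump)
}

// Example target used for illustration.
struct Counter {
    int base;
};

int AddToBase(void* self, int amount) {
    return static_cast<Counter*>(self)->base + amount;
}
```

With a Counter holding 40 wrapped in a DelegateObject, InvokeSinglecast(&delegate, 2) ends up in AddToBase with ‘this’ pointing at the Counter, not at the delegate.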

Shuffle Thunks

Finally, let’s look at ‘shuffle thunks’ in more detail (cases 2, 3, 6 from the table above).

These are created in several places in the CoreCLR source, which all call into COMDelegate::SetupShuffleThunk(..) here
1. COMDelegate::BindToMethod(..) here
2. COMDelegate::DelegateConstruct(..) here
3. COMDelegate::GetDelegateCtor(..) here
COMDelegate::SetupShuffleThunk(..) then calls GenerateShuffleArray(..) here
Followed by a call to StubCacheBase::Canonicalize(..) here, that ends up in ShuffleThunkCache::CompileStub(..) here
This ends up calling the CPU-specific method EmitShuffleThunk(..):
- src/vm/i386 (also does AMD64 and UNIX_AMD64_ABI)
- src/vm/arm
- src/vm/arm64

Note how the stubs are cached in the ShuffleThunkCache where possible. This is because the thunks don’t have to be unique per method; they can be shared across multiple methods as long as the signatures are compatible.
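
The caching idea is simple enough to sketch in C++. Note that ShuffleThunkCacheSketch and ShuffleArray are hypothetical and the ‘stub’ is just an integer id; the real cache goes through StubCacheBase::Canonicalize(..) and stores generated code.

```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical model: a shuffle is a list of (source, destination) slots
// and a 'stub' is just an id standing in for the generated code.
using ShuffleArray = std::vector<std::pair<int, int>>;

class ShuffleThunkCacheSketch {
public:
    // Returns the stub for this shuffle, 'compiling' one only on a cache miss.
    int GetOrCreate(const ShuffleArray& shuffle) {
        auto it = stubs_.find(shuffle);
        if (it != stubs_.end()) {
            return it->second;  // an identical shuffle has been seen before, so share it
        }
        int stubId = ++compiled_;  // stands in for CompileStub(..)
        stubs_.emplace(shuffle, stubId);
        return stubId;
    }

    int CompiledCount() const { return compiled_; }

private:
    std::map<ShuffleArray, int> stubs_;
    int compiled_ = 0;
};
```

Because the key is the shuffle itself and not the method, two unrelated delegates whose arguments need to move in the same way get the same stub back.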

However, these stubs are not straight-forward and sometimes they go wrong, for instance Infinite loop in GenerateShuffleArray on unix64 #26054, fixed in PR #26169. Also see Corrupted struct passed to delegate constructed via reflection #16833 and Fix shuffling thunk for Unix AMD64 #16904 for more examples.

To give a flavour of what they need to do, here’s the code of the ARM64 version, which is by far the simplest one!! If you want to understand the full complexities, take a look at the ARM version which is 182 LOC or the x86 one at 281 LOC!!

// Emits code to adjust arguments for static delegate target.
VOID StubLinkerCPU::EmitShuffleThunk(ShuffleEntry *pShuffleEntryArray)
{
    // On entry x0 holds the delegate instance. Look up the real target address stored in the MethodPtrAux
    // field and save it in x16(ip). Tailcall to the target method after re-arranging the arguments
    // ldr x16, [x0, #offsetof(DelegateObject, _methodPtrAux)]
    EmitLoadStoreRegImm(eLOAD, IntReg(16), IntReg(0), DelegateObject::GetOffsetOfMethodPtrAux());
    //add x11, x0, DelegateObject::GetOffsetOfMethodPtrAux() - load the indirection cell into x11 used by ResolveWorkerAsmStub
    EmitAddImm(IntReg(11), IntReg(0), DelegateObject::GetOffsetOfMethodPtrAux());

    for (ShuffleEntry* pEntry = pShuffleEntryArray; pEntry->srcofs != ShuffleEntry::SENTINEL; pEntry++)
    {
        if (pEntry->srcofs & ShuffleEntry::REGMASK)
        {
            // If source is present in register then destination must also be a register
            _ASSERTE(pEntry->dstofs & ShuffleEntry::REGMASK);

            EmitMovReg(IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), IntReg(pEntry->srcofs & ShuffleEntry::OFSMASK));
        }
        else if (pEntry->dstofs & ShuffleEntry::REGMASK)
        {
            // source must be on the stack
            _ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));

            EmitLoadStoreRegImm(eLOAD, IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), RegSp, pEntry->srcofs * sizeof(void*));
        }
        else
        {
            // source must be on the stack
            _ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));

            // dest must be on the stack
            _ASSERTE(!(pEntry->dstofs & ShuffleEntry::REGMASK));

            EmitLoadStoreRegImm(eLOAD, IntReg(9), RegSp, pEntry->srcofs * sizeof(void*));
            EmitLoadStoreRegImm(eSTORE, IntReg(9), RegSp, pEntry->dstofs * sizeof(void*));
        }
    }

    // Tailcall to target
    // br x16
    EmitJumpRegister(IntReg(16));
}

Unboxing

I’ve written about this type of ‘stub’ before in A look at the internals of ‘boxing’ in the CLR, but in summary the unboxing stub needs to handle steps 2) and 3) from the diagram below:

1. MyStruct:         [0x05 0x00 0x00 0x00]

                     |   Object Header   |   MethodTable  |   MyStruct    |
2. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
                                          ^
                    object 'this' pointer | 

                     |   Object Header   |   MethodTable  |   MyStruct    |
3. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
                                                           ^
                                   adjusted 'this' pointer | 

Key to the diagram

1. Original struct, on the stack
2. The struct being boxed into an object that lives on the heap
3. Adjustment made to this pointer so MyStruct::ToString() will work

These stubs make it possible for ‘value types’ (structs) to override methods from System.Object, such as ToString() and GetHashCode(). The fix-up is needed because structs don’t have an ‘object header’, but when they’re boxed into an Object they do. So the stub has the job of moving or adjusting the ‘this’ pointer so that the code in the ToString() method can work the same, regardless of whether it’s operating on a regular ‘struct’ or one that’s been boxed into an ‘object’.
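
Here’s a minimal C++ sketch of that pointer adjustment. It’s a model under my own assumptions: BoxedMyStruct is a hypothetical layout with just a type pointer followed by the struct’s fields, and the function names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>

// The method body is compiled once and expects 'this' to point at the raw struct data.
struct MyStruct {
    std::int32_t value;
};

std::int32_t MyStruct_GetValue(const MyStruct* self) { return self->value; }

// Hypothetical boxed layout: a type pointer followed by the struct's fields.
// (In the diagram the object header sits before the address a reference points to.)
struct BoxedMyStruct {
    const void* methodTable;  // an object reference points here
    MyStruct data;            // the unboxing stub moves 'this' to here
};

// 'Unboxing stub': step over the MethodTable pointer, then forward to the real method.
std::int32_t MyStruct_GetValue_UnboxingStub(const void* boxedThis) {
    const auto* bytes = static_cast<const unsigned char*>(boxedThis);
    const auto* adjusted =
        reinterpret_cast<const MyStruct*>(bytes + offsetof(BoxedMyStruct, data));
    return MyStruct_GetValue(adjusted);
}
```

The stub contains no logic of its own; it only fixes up the pointer, so the same MyStruct_GetValue body serves both the boxed and un-boxed callers.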

The unboxing stubs are created in MethodDesc::DoPrestub(..) here, which in turn calls into MakeUnboxingStubWorker(..) here

When FEATURE_STUBS_AS_IL is disabled, it then calls EmitUnboxMethodStub(..) to create the stub, there are per-CPU versions:
- i386
- ARM
- ARM64
When FEATURE_STUBS_AS_IL is enabled, it instead calls into CreateUnboxingILStubForSharedGenericValueTypeMethods(..) here

For more information on some of the internal details of unboxing stubs and how they interact with ‘generic instantiations’ see this informative comment and one in the code for MethodDesc::FindOrCreateAssociatedMethodDesc(..) here.

Arrays

As discussed at the beginning, the method bodies for arrays are provided by the runtime, that is the array access methods, ‘get’ and ‘set’, that allow var a = myArray[5] and myArray[7] = 5 to work. Not surprisingly, these are done as stubs to allow them to be as small and efficient as possible.

Here is the flow for wiring up ‘array stubs’. It all starts up in MethodDesc::DoPrestub(..) here:

If FEATURE_ARRAYSTUB_AS_IL is defined (see ‘Stubs-as-IL’), it happens in GenerateArrayOpStub(ArrayMethodDesc* pMD) here
- Then ArrayOpLinker::EmitStub() here, which is responsible for generating 3 types of stubs { ILSTUB_ARRAYOP_GET, ILSTUB_ARRAYOP_SET, ILSTUB_ARRAYOP_ADDRESS }.
- Before calling ILStubCache::CreateAndLinkNewILStubMethodDesc(..) here
- Finally ending up in JitILStub(..) here
When FEATURE_ARRAYSTUB_AS_IL isn’t defined, it happens in another version of GenerateArrayOpStub(ArrayMethodDesc* pMD) lower down
- Then void GenerateArrayOpScript(..) here
- Followed by a call to StubCacheBase::Canonicalize(..) here, that ends up in ArrayStubCache::CompileStub(..) here.
- Eventually, we end up in StubLinkerCPU::EmitArrayOpStub(..) here, which does the heavy lifting (despite being under ‘\src\vm\i386’, it seems to support x86 and AMD64?)

I’m not going to include the code for the ‘stub-as-IL’ (ArrayOpLinker::EmitStub()) or the assembly code (StubLinkerCPU::EmitArrayOpStub(..)) versions of the array stubs because they’re both 100’s of lines long, dealing with type and bounds checking, computing address, multi-dimensional arrays and more. But to give an idea of the complexities, take a look at this comment from StubLinkerCPU::EmitArrayOpStub(..) here:

// Register usage
//
//                                          x86                 AMD64
// Inputs:
//  managed array                           THIS_kREG (ecx)     THIS_kREG (rcx)
//  index 0                                 edx                 rdx
//  index 1/value                           <stack>             r8
//  index 2/value                           <stack>             r9
//  expected element type for LOADADDR      eax                 rax                 rdx
// Working registers:
//  total (accumulates unscaled offset)     edi                 r10
//  factor (accumulates the slice factor)   esi                 r11
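
To show what ‘total’ and ‘factor’ are for, here’s a C++ sketch of the address calculation for a multi-dimensional array. It’s my own simplified model and not necessarily the order or the exact checks the real stub uses: Dimension and ElementOffset are hypothetical, and the result is an unscaled element offset, not a byte address.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical description of one dimension of a multi-dimensional array.
struct Dimension {
    int lowerBound;
    int length;
};

// Returns the unscaled element offset, walking from the last dimension to the first.
// 'total' and 'factor' play the roles named in the register usage comment.
std::size_t ElementOffset(const std::vector<Dimension>& dims, const std::vector<int>& indices) {
    if (dims.size() != indices.size()) {
        throw std::invalid_argument("rank mismatch");
    }
    std::size_t total = 0;
    std::size_t factor = 1;
    for (std::size_t d = dims.size(); d-- > 0;) {
        // One unsigned compare covers both 'below the lower bound' and 'past the end'.
        unsigned adjusted =
            static_cast<unsigned>(indices[d]) - static_cast<unsigned>(dims[d].lowerBound);
        if (adjusted >= static_cast<unsigned>(dims[d].length)) {
            throw std::out_of_range("IndexOutOfRangeException");
        }
        total += adjusted * factor;
        factor *= static_cast<std::size_t>(dims[d].length);
    }
    return total;
}
```

For a 3 x 4 array, the element at [2, 1] is at offset 2 * 4 + 1 = 9; a real stub would then scale that by the element size and add the address of the first element.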

Finally, these stubs are still being improved, for example see Use unsigned index extension in muldi-dimensional array stubs.

Tail Calls

The .NET runtime provides a nice optimisation when doing ‘tail calls’, that (among other things) will prevent a StackOverflowException in recursive scenarios. For more on why these tail call optimisations are useful and how they work, take a look at:

In summary, a tail call optimisation allows the same stack frame to be re-used if in the caller, there is no work done after the function call to the callee (see Tail call JIT conditions (2007) for a more precise definition).

And why is this beneficial? From Tail Call Improvements in .NET Framework 4:

The primary reason for a tail call as an optimization is to improve data locality, memory usage, and cache usage. By doing a tail call the callee will use the same stack space as the caller. This reduces memory pressure. It marginally improves the cache because the same memory is reused for subsequent callers and thus can stay in the cache, rather than evicting some older cache line to make room for a new cache line.

To make this clear, the code below can benefit from the optimisation, because both functions return straight after calling each other:

public static long Ping(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    return Pong(cnt, val + cnt);
}

public static long Pong(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    return Ping(cnt, val + cnt);
}

However, if the code was changed to the version below, the optimisation would no longer work because PingNotOptimised(..) does some extra work between calling Pong(..) and when it returns:

public static long PingNotOptimised(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    var result = Pong(cnt, val + cnt);
    result += 1; // prevents the Tail-call optimization
    return result;
}

public static long Pong(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    return PingNotOptimised(cnt, val + cnt);
}

You can see the difference in the code emitted by the JIT compiler for the different scenarios in SharpLab.

But where do the ‘tail call optimisation stubs’ come into play? Helpfully there is a tail call related design doc that explains, from ‘current way of handling tail-calls’:

Fast tail calls These are tail calls that are handled directly by the jitter and no runtime cooperation is needed. They are limited to cases where:

Return value and call target arguments are all either primitive types, reference types, or valuetypes with a single primitive type or reference type fields

The aligned size of call target arguments is less or equal to aligned size of caller arguments

So, the stubs aren’t always needed; sometimes the work can be done by the JIT, if the scenario is simple enough.

However for the more complex cases, a ‘helper’ stub is needed:

Tail calls using a helper Tail calls in cases where we cannot perform the call in a simple way are implemented using a tail call helper. Here is a rough description of how it works:

For each tail call target, the jitter asks runtime to generate an assembler argument copying routine. This routine reads vararg list of arguments and places the arguments in their proper slots in the CONTEXT or on the stack. Together with the argument copying routine, the runtime also builds a list of offsets of references and byrefs for return value of reference type or structs returned in a hidden return buffer and for structs passed by ref. The gc layout data block is stored at the end of the argument copying thunk.

At the time of the tail call, the caller generates a vararg list of all arguments of the tail called function and then calls JIT_TailCall runtime function. It passes it the copying routine address, the target address and the vararg list of the arguments.

The JIT_TailCall then performs the following: …

To see the rest of the steps that JIT_TailCall takes you can read the design doc or if you’re really keen you can look at the code in /vm/jithelpers.cpp. Also, there’s a useful explanation of what it needs to handle in the JIT code, see here and here.

However, we’re just going to focus on the stubs, referred to as an ‘assembler argument copying routine’. Firstly, we can see that they have their own stub manager, TailCallStubManager, which is implemented here and allows the stubs to play nicely with the debugger. Also interesting to look at is the TailCallFrame here that is used to ensure that the ‘stack walker’ can work well with tail calls.

Now, onto the stubs themselves, the ‘copying routines’ are provided by the runtime via a call to CEEInfo::getTailCallCopyArgsThunk(..) in /vm/jitinterface.cpp. This in turn calls the CPU specific versions of CPUSTUBLINKER::CreateTailCallCopyArgsThunk(..):

X86 /vm/i386/stublinkerx86.cpp
ARM /vm/arm/stubs.cpp

These routines have the complex and hairy job of dealing with the CPU registers and calling conventions. They achieve this by dynamically emitting assembly instructions, to create a function that looks like the following pseudo code (X86 version):

    // size_t CopyArguments(va_list args,         (RCX)
    //                      CONTEXT *pCtx,        (RDX)
    //                      DWORD64 *pvStack,     (R8)
    //                      size_t cbStack)       (R9)
    // {
    //     if (pCtx != NULL) {
    //         foreach (arg in args) {
    //             copy into pCtx or pvStack
    //         }
    //     }
    //     return <size of stack needed>;
    // }

In addition there is one other type of stub that is used. Known as the TailCallHelperStub, they also come in per-CPU versions:

AMD64 /vm/amd64/JitHelpers_Fast.asm
ARM /vm/arm/asmhelpers.asm

Going forward, there are several limitations to this approach of using per-CPU stubs, as the design doc explains:

It is expensive to port to new platforms

Parsing the vararg list is not possible to do in a portable way on Unix. Unlike on Windows, the list is not stored a linear sequence of the parameter data bytes in memory. va_list on Unix is an opaque data type, some of the parameters can be in registers and some in the memory.

Generating the copying asm routine needs to be done for each target architecture / platform differently. And it is also very complex, error prone and impossible to do on platforms where code generation at runtime is not allowed.

It is slower than it has to be

The parameters are copied possibly twice - once from the vararg list to the stack and then one more time if there was not enough space in the caller’s stack frame.

RtlRestoreContext restores all registers from the CONTEXT structure, not just a subset of them that is really necessary for the functionality, so it results in another unnecessary memory accesses.

Stack walking over the stack frames of the tail calls requires runtime assistance.

Fortunately, it then goes into great depth discussing how a new approach could be implemented and how it would solve these issues. Even better, work has already started and we can follow along in Implement portable tailcall helpers #26418 (currently sitting at ‘31 of 55’ tasks completed, with over 50 files modified, it’s not a small job!).

Finally, for other PRs related to tail calls, see:

Virtual Stub Dispatch (VSD)

I’ve saved the best for last: ‘Virtual Stub Dispatch’ or VSD is such an in-depth topic that it has an entire BotR page devoted to it! From the introduction:

Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.

It then goes on to say:

Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.

So despite having the word ‘virtual’ in the title, it’s not actually used for C# methods with the virtual modifier on them. However, if you look at the IL for interface methods you can see why they are also known as ‘virtual’.
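
To make the point about interface methods being ‘virtual’ in IL concrete, here is a small sketch that reads the metadata flags via reflection. The IGreeter and Greeter types and the VirtualFlags helper are made up for this illustration; only the reflection APIs (Type.GetMethod, MethodBase.IsVirtual, MethodBase.IsFinal) are real.

```csharp
using System;
using System.Reflection;

// Hypothetical types, invented for this example.
public interface IGreeter
{
    string Greet();
}

public class Greeter : IGreeter
{
    // No 'virtual' modifier in C#, but this method implements an interface method.
    public string Greet() => "Hello";

    // An ordinary instance method, for comparison.
    public string Other() => "Other";
}

public static class VirtualFlags
{
    // Returns the 'virtual' and 'final' flags recorded in the method's metadata.
    public static (bool IsVirtual, bool IsFinal) Describe(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(methodName)
            ?? throw new ArgumentException("No such method: " + methodName);
        return (method.IsVirtual, method.IsFinal);
    }
}

public static class Program
{
    public static void Main()
    {
        // The interface method itself: virtual, not final.
        Console.WriteLine(VirtualFlags.Describe(typeof(IGreeter), "Greet"));
        // The implementation: virtual *and* final, even without the C# 'virtual' keyword.
        Console.WriteLine(VirtualFlags.Describe(typeof(Greeter), "Greet"));
        // A plain instance method: neither virtual nor final.
        Console.WriteLine(VirtualFlags.Describe(typeof(Greeter), "Other"));
    }
}
```

The implementing method is marked both virtual and final because it has to occupy a slot that interface dispatch can find, while still not being overridable from C#.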

Virtual Stub Dispatch is so complex, it actually has several different stub types, from /vm/virtualcallstub.h:

enum StubKind { 
  SK_UNKNOWN, 
  SK_LOOKUP,      // Lookup Stubs are SLOW stubs that simply call into the runtime to do all work.
  SK_DISPATCH,    // Dispatch Stubs have a fast check for one type otherwise jumps to runtime.  Works for monomorphic sites
  SK_RESOLVE,     // Resolve Stubs do a hash lookup before fallling back to the runtime.  Works for polymorphic sites.
  SK_VTABLECALL,  // Stub that jumps to a target method using vtable-based indirections. Works for non-interface calls.
  SK_BREAKPOINT 
};

So there are the following types (these are links to the AMD64 versions, x86 versions are in /vm/i386/virtualcallstubcpu.hpp):

Lookup Stubs:
- // Virtual and interface call sites are initially setup to point at LookupStubs. This is because the runtime type of the <this> pointer is not yet known, so the target cannot be resolved.
Dispatch Stubs:
- // Monomorphic and mostly monomorphic call sites eventually point to DispatchStubs. A dispatch stub has an expected type (expectedMT), target address (target) and fail address (failure). If the calling frame does in fact have the <this> type be of the expected type, then control is transfered to the target address, the method implementation. If not, then control is transfered to the fail address, a fail stub (see below) where a polymorphic lookup is done to find the correct address to go to.
- There’s also specific versions, DispatchStubShort and DispatchStubLong, see this comment for why they are both needed.
Resolve Stubs:
- // Polymorphic call sites and monomorphic calls that fail end up in a ResolverStub. There is only one resolver stub built for any given token, even though there may be many call sites that use that token and many distinct <this> types that are used in the calling call frames. A resolver stub actually has two entry points, one for polymorphic call sites and one for dispatch stubs that fail on their expectedMT test. There is a third part of the resolver stub that enters the ee when a decision should be made about changing the callsite.
V-Table or Virtual Call Stubs
- //These are jump stubs that perform a vtable-base virtual call. These stubs assume that an object is placed in the first argument register (this pointer). From there, the stub extracts the MethodTable pointer, followed by the vtable pointer, and finally jumps to the target method at a given slot in the vtable.

The diagram below shows the general control flow between these stubs:

(Image from ‘Design of Virtual Stub Dispatch’)

Finally, if you want even more in-depth information see this comment.
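
As a rough mental model of how the stub kinds above cooperate, the following sketch simulates a single interface call site in plain C#. It is only an illustration: the real stubs are small pieces of machine code, the call site can be re-patched over time, and every name here (CallSiteModel, FastHits and so on) is invented for the example rather than taken from the runtime.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of one interface call site; none of these types exist in the runtime.
public sealed class CallSiteModel
{
    // Stands in for the slow path into the runtime that a lookup stub would take.
    private readonly Func<Type, Func<object, string>> _slowLookup;

    // Stands in for the hash lookup a resolve stub performs.
    private readonly Dictionary<Type, Func<object, string>> _resolveCache =
        new Dictionary<Type, Func<object, string>>();

    // Stand in for the dispatch stub's 'expectedMT' and 'target'.
    private Type _expectedType;
    private Func<object, string> _target;

    public int FastHits { get; private set; }
    public int CacheHits { get; private set; }
    public int SlowLookups { get; private set; }

    public CallSiteModel(Func<Type, Func<object, string>> slowLookup) => _slowLookup = slowLookup;

    public string Invoke(object instance)
    {
        Type actual = instance.GetType();

        // 'Dispatch stub': one type comparison, then straight to the target.
        if (actual == _expectedType)
        {
            FastHits++;
            return _target(instance);
        }

        // 'Resolve stub': hash lookup keyed on the runtime type.
        if (_resolveCache.TryGetValue(actual, out var cached))
        {
            CacheHits++;
            return cached(instance);
        }

        // 'Lookup stub': ask the runtime, then remember the answer.
        SlowLookups++;
        var resolved = _slowLookup(actual);
        _resolveCache[actual] = resolved;
        if (_expectedType == null)
        {
            // The first type seen becomes the monomorphic expectation.
            _expectedType = actual;
            _target = resolved;
        }
        return resolved(instance);
    }
}

public static class Program
{
    public static void Main()
    {
        var site = new CallSiteModel(type => obj => type.Name + ": " + obj);
        foreach (object value in new object[] { "a", "b", 42, 7, "c" })
            Console.WriteLine(site.Invoke(value));

        // Prints: FastHits=2, CacheHits=1, SlowLookups=2
        Console.WriteLine($"FastHits={site.FastHits}, CacheHits={site.CacheHits}, SlowLookups={site.SlowLookups}");
    }
}
```

A call site that only ever sees one type stays on the single-comparison path, which is why monomorphic sites are the cheap case.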

However, these stubs come at a cost, which makes virtual method calls more expensive than direct ones. This is why de-virtualization is so important, i.e. the process of the .NET JIT detecting when a virtual call can instead be replaced by a direct one. There has been some work done in .NET Core to improve this, see Simple devirtualization #9230 which covers sealed classes/methods and when the object type is known exactly. However there is still more to be done, as shown in JIT: devirtualization next steps #9908, where ‘5 of 23’ tasks have been completed.

Other Types of Stubs

This post is already way too long, so I don’t intend to offer any analysis of the following stubs. Instead I’ve just included some links to more information so you can read up on any that interest you!

‘Jump’ stubs

‘Function Pointer’ stubs

‘Function Pointer’ Stubs, see /vm/fptrstubs.cpp and /vm/fptrstubs.h
// FuncPtrStubs contains stubs that is used by GetMultiCallableAddrOfCode() if the function has not been jitted. Using a stub decouples ldftn from the prestub, so prestub does not need to be backpatched. This stub is also used in other places which need a function pointer

‘Thread Hijacking’ stubs

From the BotR page on ‘Threading’:

If fully interruptable, it is safe to perform a GC at any point, since the thread is, by definition, at a safe point. It is reasonable to leave the thread suspended at this point (because it’s safe) but various historical OS bugs prevent this from working, because the CONTEXT retrieved earlier may be corrupt). Instead, the thread’s instruction pointer is overwritten, redirecting it to a stub that will capture a more complete CONTEXT, leave cooperative mode, wait for the GC to complete, reenter cooperative mode, and restore the thread to its previous state.

If partially-interruptable, the thread is, by definition, not at a safe point. However, the caller will be at a safe point (method transition). Using that knowledge, the CLR “hijacks” the top-most stack frame’s return address (physically overwrite that location on the stack) with a stub similar to the one used for fully-interruptable code. When the method returns, it will no longer return to its actual caller, but rather to the stub (the method may also perform a GC poll, inserted by the JIT, before that point, which will cause it to leave cooperative mode and undo the hijack).

This is done with the OnHijackTripThread method in /vm/amd64/AsmHelpers.asm, which calls into OnHijackWorker(..) in /vm/threadsuspend.cpp.

‘NGEN Fixup’ stubs

From CLR Inside Out - The Performance Benefits of NGen (2006):

Throughput of NGen-compiled code is lower than that of JIT-compiled code primarily for one reason: cross-assembly references. In JIT-compiled code, cross-assembly references can be implemented as direct calls or jumps since the exact addresses of these references are known at run time. For statically compiled code, however, cross-assembly references need to go through a jump slot that gets populated with the correct address at run time by executing a method pre-stub. The method pre-stub ensures, among other things, that the native images for assemblies referenced by that method are loaded into memory before the method is executed. The pre-stub only needs to be executed the first time the method is called; it is short-circuited out for subsequent calls. However, every time the method is called, cross-assembly references do need to go through a level of indirection. This is principally what accounted for the 5-10 percent drop in throughput for NGen-compiled code when compared to JIT-compiled code.

Also see the ‘NGEN’ section of the ‘jump stub’ design doc.

Stubs in the Mono Runtime

Mono refers to ‘Stubs’ as ‘Trampolines’ and they’re widely used in the source code.

The Mono docs have an excellent page all about ‘Trampolines’, that lists the following types:

Also the docs page on Generic Sharing has some good, in-depth information.

Conclusion

So it turns out that ‘stubs’ are way more prevalent in the .NET Core Runtime than I imagined when I first started on this post. They are an interesting technique and they contain a fair amount of complexity. In addition, I only covered each stub in isolation; in reality many of them have to play nicely together. For instance, imagine a delegate calling a virtual method that has generic type parameters and you can see that things start to get complex! (That scenario might contain 3 separate stubs, although they are also shared where possible.) If you were then to add array methods, P/Invoke marshalling and un-boxing to the mix, things get even more hairy and even more complex!

If anyone has read this far and wants a fun challenge, try and figure out what’s the most stubs you can force a single method call to go via! If you do, let me know in the comments or via Twitter.

Finally, by knowing where and when stubs are involved in our method calls, we can start to understand the overhead of each scenario. For instance, it explains why delegate method calls are a bit slower than calling a method directly and why ‘de-virtualization’ is so important. Having the JIT be able to perform extra analysis to determine that a virtual call can be converted into a direct one skips an entire level of indirection, for more on this see:

Simple devirtualization #9230 (already implemented)
JIT: devirtualization next steps #9908
‘Guarded Devirtualization’ design doc

ASCII Art in .NET Code

2019-04-25T00:00:00+00:00

Who doesn’t like a nice bit of ‘ASCII Art’? I know I certainly do!

To see what Matt’s CLR was all about you can watch the recording of my talk ‘From ‘dotnet run’ to ‘Hello World!’’ (from about ~24:30 in)

So armed with a trusty regex /\*(.*?)\*/|//(.*?)\r?\n|"((\\[^\n]|[^"\n])*)"|@("[^"]*")+, I set out to find all the interesting ASCII Art used in source code comments in the following .NET related repositories:

dotnet/CoreCLR - “the runtime for .NET Core. It includes the garbage collector, JIT compiler, primitive data types and low-level classes.”
Mono - “open source ECMA CLI, C# and .NET implementation.”
dotnet/CoreFX - “the foundational class libraries for .NET Core. It includes types for collections, file systems, console, JSON, XML, async and many others.”
dotnet/Roslyn - “provides C# and Visual Basic languages with rich code analysis APIs”
aspnet/AspNetCore - “a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.”
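
For anyone wanting to try something similar, here is a minimal sketch of how the regex above could be applied to a source file. The CommentFinder class, the choice of RegexOptions.Singleline and the filtering on the match prefix are my assumptions for this illustration, not a record of the exact tooling used for the post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentFinder
{
    // The regex from the post, written as a C# verbatim string ("" is an escaped quote).
    // The string-literal alternatives are there so that comment markers inside
    // strings are consumed as strings rather than reported as comments.
    private static readonly Regex Pattern = new Regex(
        @"/\*(.*?)\*/|//(.*?)\r?\n|""((\\[^\n]|[^""\n])*)""|@(""[^""]*"")+",
        RegexOptions.Singleline); // Singleline lets /* ... */ span several lines

    public static List<string> FindComments(string source)
    {
        var comments = new List<string>();
        foreach (Match match in Pattern.Matches(source))
        {
            // Keep only the matches that are comments, dropping string literals.
            if (match.Value.StartsWith("/*", StringComparison.Ordinal) ||
                match.Value.StartsWith("//", StringComparison.Ordinal))
            {
                comments.Add(match.Value.TrimEnd());
            }
        }
        return comments;
    }
}

public static class Program
{
    public static void Main()
    {
        string source =
            "int x = 1; // first\n" +
            "string s = \"// not a comment\";\n" +
            "/* block\n   comment */\n";

        // Reports the line comment and the block comment, but not the string literal.
        foreach (string comment in CommentFinder.FindComments(source))
            Console.WriteLine(comment);
    }
}
```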

Note: Yes, I shamelessly ‘borrowed’ this idea from John Regehr. I was motivated to write this because his excellent post ‘Explaining Code using ASCII Art’ didn’t have any .NET related code in it!

If you’ve come across any interesting examples I’ve missed out, please let me know!

To make the examples easier to browse, I’ve split them up into categories:

Dave Cutler
Syntax Trees
Timelines
Logic Tables
Class Hierarchies
Component Diagrams
Algorithms
Bit Packing
Data Structures
State Machines
RFC’s and Specs
Dates & Times
Stack Layouts
The Rest

Dave Cutler

There’s no art in this one, but it deserves its own category as it quotes the amazing Dave Cutler, who led the development of Windows NT. There’s no better person to ask a deep, technical question about how Thread Suspension works on Windows, from coreclr/src/vm/threadsuspend.cpp:

// Message from David Cutler
/*
    After SuspendThread returns, can the suspended thread continue to execute code in user mode?

    [David Cutler] The suspended thread cannot execute any more user code, but it might be currently "running"
    on a logical processor whose other logical processor is currently actually executing another thread.
    In this case the target thread will not suspend until the hardware switches back to executing instructions
    on its logical processor. In this case even the memory barrier would not necessarily work - a better solution
    would be to use interlocked operations on the variable itself.

    After SuspendThread returns, does the store buffer of the CPU for the suspended thread still need to drain?

    Historically, we've assumed that the answer to both questions is No.  But on one 4/8 hyper-threaded machine
    running Win2K3 SP1 build 1421, we've seen two stress failures where SuspendThread returns while writes seem to still be in flight.

    Usually after we suspend a thread, we then call GetThreadContext.  This seems to guarantee consistency.
    But there are places we would like to avoid GetThreadContext, if it's safe and legal.

    [David Cutler] Get context delivers a APC to the target thread and waits on an event that will be set
    when the target thread has delivered its context.

    Chris.
*/

For more info on Dave Cutler, see this excellent interview ‘Internets of Interest #6: Dave Cutler on Dave Cutler’ or ‘The engineer’s engineer: Computer industry luminaries salute Dave Cutler’s five-decade-long quest for quality’

Syntax Trees

The inner workings of the .NET ‘Just-in-Time’ (JIT) Compiler have always been a bit of a mystery to me. But informative comments like this one from coreclr/src/jit/lsra.cpp go some way to showing what it’s doing:

// For example, for this tree (numbers are execution order, lower is earlier and higher is later):
//
//                                   +---------+----------+
//                                   |       GT_ADD (3)   |
//                                   +---------+----------+
//                                             |
//                                           /   \
//                                         /       \
//                                       /           \
//                   +-------------------+           +----------------------+
//                   |         x (1)     | "tree"    |         y (2)        |
//                   +-------------------+           +----------------------+
//
// generate this tree:
//
//                                   +---------+----------+
//                                   |       GT_ADD (4)   |
//                                   +---------+----------+
//                                             |
//                                           /   \
//                                         /       \
//                                       /           \
//                   +-------------------+           +----------------------+
//                   |  GT_RELOAD (3)    |           |         y (2)        |
//                   +-------------------+           +----------------------+
//                             |
//                   +-------------------+
//                   |         x (1)     | "tree"
//                   +-------------------+

There’s also a more in-depth example in coreclr/src/jit/morph.cpp

Also from roslyn/src/Compilers/VisualBasic/Portable/Semantics/TypeInference/RequiredConversion.vb

'// These restrictions form a partial order composed of three chains: from less strict to more strict, we have:
'//    [reverse chain] [None] < AnyReverse < ReverseReference < Identity
'//    [middle  chain] None < [Any,AnyReverse] < AnyConversionAndReverse < Identity
'//    [forward chain] [None] < Any < ArrayElement < Reference < Identity
'//
'//            =           KEY:
'//         /  |  \           =     Identity
'//        /   |   \         +r     Reference
'//      -r    |    +r       -r     ReverseReference
'//       |  +-any  |       +-any   AnyConversionAndReverse
'//       |   /|\   +arr     +arr   ArrayElement
'//       |  / | \  |        +any   Any
'//      -any  |  +any       -any   AnyReverse
'//         \  |  /           none  None
'//          \ | /
'//           none
'//

Timelines

This example from coreclr/src/vm/comwaithandle.cpp was unique! I didn’t find another example of ASCII Art used to illustrate time-lines, it’s a really novel approach.

// In case the CLR is paused inbetween a wait, this method calculates how much 
// the wait has to be adjusted to account for the CLR Freeze. Essentially all
// pause duration has to be considered as "time that never existed".
//
// Two cases exists, consider that 10 sec wait is issued 
// Case 1: All pauses happened before the wait completes. Hence just the 
// pause time needs to be added back at the end of wait
// 0           3                   8       10
// |-----------|###################|------>
//                 5-sec pause    
//             ....................>
//                                            Additional 5 sec wait
//                                        |=========================> 
//
// Case 2: Pauses ended after the wait completes. 
// 3 second of wait was left as the pause started at 7 so need to add that back
// 0                           7           10
// |---------------------------|###########>
//                                 5-sec pause   12
//                             ...................>
//                                            Additional 3 sec wait
//                                                |==================> 
//
// Both cases can be expressed in the same calculation
// pauseTime:   sum of all pauses that were triggered after the timer was started
// expDuration: expected duration of the wait (without any pauses) 10 in the example
// actDuration: time when the wait finished. Since the CLR is frozen during pause it's
//              max of timeout or pause-end. In case-1 it's 10, in case-2 it's 12
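
The two cases in the comment can be checked with a few lines of code. The formula below is one that is consistent with both worked cases; the quoted excerpt stops before stating the calculation, so treat the PauseAdjustment class, the method name and the clamp to zero as assumptions of this sketch rather than the runtime’s code.

```csharp
using System;

public static class PauseAdjustment
{
    // All values are in seconds, matching the example in the comment.
    // expDuration: the wait that was requested (10 in the example)
    // actDuration: when the wait finished (the max of timeout or pause-end)
    // pauseTime:   total time the CLR was paused after the wait started
    public static long AdditionalWait(long expDuration, long actDuration, long pauseTime)
    {
        // Paused time counts as "time that never existed", so subtract it.
        long timeActuallyWaited = actDuration - pauseTime;

        // Clamping at zero is an assumption of this sketch.
        return Math.Max(0, expDuration - timeActuallyWaited);
    }
}

public static class Program
{
    public static void Main()
    {
        // Case 1: pause from 3 to 8, wait finishes at 10 -> 5 more seconds
        Console.WriteLine(PauseAdjustment.AdditionalWait(10, 10, 5));

        // Case 2: pause from 7 to 12, wait finishes at 12 -> 3 more seconds
        Console.WriteLine(PauseAdjustment.AdditionalWait(10, 12, 5));
    }
}
```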

Logic Tables

A sweet-spot for ASCII Art seems to be tables, there are so many examples. Starting with coreclr/src/vm/methodtablebuilder.cpp (bonus points for combining comments and code together!)

//               |        Base type
// Subtype       |        mdPrivateScope  mdPrivate   mdFamANDAssem   mdAssem     mdFamily    mdFamORAssem    mdPublic
// --------------+-------------------------------------------------------------------------------------------------------
/*mdPrivateScope | */ { { e_SM,           e_NO,       e_NO,           e_NO,       e_NO,       e_NO,           e_NO    },
/*mdPrivate      | */   { e_SM,           e_YES,      e_NO,           e_NO,       e_NO,       e_NO,           e_NO    },
/*mdFamANDAssem  | */   { e_SM,           e_YES,      e_SA,           e_NO,       e_NO,       e_NO,           e_NO    },
/*mdAssem        | */   { e_SM,           e_YES,      e_SA,           e_SA,       e_NO,       e_NO,           e_NO    },
/*mdFamily       | */   { e_SM,           e_YES,      e_YES,          e_NO,       e_YES,      e_NSA,          e_NO    },
/*mdFamORAssem   | */   { e_SM,           e_YES,      e_YES,          e_SA,       e_YES,      e_YES,          e_NO    },
/*mdPublic       | */   { e_SM,           e_YES,      e_YES,          e_YES,      e_YES,      e_YES,          e_YES   } };

Also coreclr/src/jit/importer.cpp which shows how the JIT deals with boxing/un-boxing

/*
    ----------------------------------------------------------------------
    | \ helper  |                         |                              |
    |   \       |                         |                              |
    |     \     | CORINFO_HELP_UNBOX      | CORINFO_HELP_UNBOX_NULLABLE  |
    |       \   | (which returns a BYREF) | (which returns a STRUCT)     |
    | opcode  \ |                         |                              |
    |---------------------------------------------------------------------
    | UNBOX     | push the BYREF          | spill the STRUCT to a local, |
    |           |                         | push the BYREF to this local |
    |---------------------------------------------------------------------
    | UNBOX_ANY | push a GT_OBJ of        | push the STRUCT              |
    |           | the BYREF               | For Linux when the           |
    |           |                         |  struct is returned in two   |
    |           |                         |  registers create a temp     |
    |           |                         |  which address is passed to  |
    |           |                         |  the unbox_nullable helper.  |
    |---------------------------------------------------------------------
*/

Finally, there are some other nice examples showing the rules for operator overloading in the C# (Roslyn) Compiler and which .NET data-types can be converted via the System.ToXXX() functions.

Class Hierarchies

Of course, most IDEs come with tools that will generate class-hierarchies for you, but it’s much nicer to see them in ASCII, from coreclr/src/vm/object.h

 * COM+ Internal Object Model
 *
 *
 * Object              - This is the common base part to all COM+ objects
 *  |                        it contains the MethodTable pointer and the
 *  |                        sync block index, which is at a negative offset
 *  |
 *  +-- code:StringObject       - String objects are specialized objects for string
 *  |                        storage/retrieval for higher performance
 *  |
 *  +-- BaseObjectWithCachedData - Object Plus one object field for caching.
 *  |       |
 *  |       +-  ReflectClassBaseObject    - The base object for the RuntimeType class
 *  |       +-  ReflectMethodObject       - The base object for the RuntimeMethodInfo class
 *  |       +-  ReflectFieldObject        - The base object for the RtFieldInfo class
 *  |
 *  +-- code:ArrayBase          - Base portion of all arrays
 *  |       |
 *  |       +-  I1Array    - Base type arrays
 *  |       |   I2Array
 *  |       |   ...
 *  |       |
 *  |       +-  PtrArray   - Array of OBJECTREFs, different than base arrays because of pObjectClass
 *  |              
 *  +-- code:AssemblyBaseObject - The base object for the class Assembly

There’s also an even larger one that I stumbled across when writing “Stack Walking” in the .NET Runtime.

Component Diagrams

When you have several different components in a code-base it’s always nice to see how they fit together. From coreclr/src/vm/codeman.h we can see how the top-level parts of the .NET JIT work together

                                               ExecutionManager
                                                       |
                           +-----------+---------------+---------------+-----------+--- ...
                           |           |                               |           |
                        CodeType       |                            CodeType       |
                           |           |                               |           |
                           v           v                               v           v
+---------------+      +--------+<---- R    +---------------+      +--------+<---- R
|ICorJitCompiler|<---->|IJitMan |<---- R    |ICorJitCompiler|<---->|IJitMan |<---- R
+---------------+      +--------+<---- R    +---------------+      +--------+<---- R
                           |       x   .                               |       x   .
                           |        \  .                               |        \  .
                           v         \ .                               v         \ .
                       +--------+      R                           +--------+      R
                       |ICodeMan|                                  |ICodeMan|     (RangeSections)
                       +--------+                                  +--------+       

Other notable examples are:

Finally, from coreclr/src/vm/ceeload.cpp we see the inner-workings of the Native Image Generator (NGEN)

        This diagram illustrates the layout of fixups in the ngen image.
        This is the case where function foo2 has a class-restore fixup
        for class C1 in b.dll.

                                  zapBase+curTableVA+rva /         FixupList (see Fixup Encoding below)
                                  m_pFixupBlobs
                                                            +-------------------+
                  pEntry->VA +--------------------+         |     non-NULL      | foo1
                             |Handles             |         +-------------------+
ZapHeader.ImportTable        |                    |         |     non-NULL      |
                             |                    |         +-------------------+
   +------------+            +--------------------+         |     non-NULL      |
   |a.dll       |            |Class cctors        |<---+    +-------------------+
   |            |            |                    |     \   |         0         |
   |            |     p->VA/ |                    |<---+ \  +===================+
   |            |      blobs +--------------------+     \ +-------non-NULL      | foo2
   +------------+            |Class restore       |      \  +-------------------+
   |b.dll       |            |                    |       +-------non-NULL      |
   |            |            |                    |         +-------------------+
   |  token_C1  |<--------------blob(=>fixedUp/0) |<--pBlob--------index        |
   |            | \          |                    |         +-------------------+
   |            |  \         +--------------------+         |     non-NULL      |
   |            |   \        |                    |         +-------------------+
   |            |    \       |        .           |         |         0         |
   |            |     \      |        .           |         +===================+
   +------------+      \     |        .           |         |         0         | foo3
                        \    |                    |         +===================+
                         \   +--------------------+         |     non-NULL      | foo4
                          \  |Various fixups that |         +-------------------+
                           \ |need too happen     |         |         0         |
                            \|                    |         +===================+
                             |(CorCompileTokenTable)
                             |                    |
               pEntryEnd->VA +--------------------+

Algorithms

They say ‘a picture paints a thousand words’ and that definitely applies when describing complex algorithms, from roslyn/src/Workspaces/Core/Portable/Utilities/EditDistance.cs

// If we fill out the matrix fully we'll get:
//          
//           s u n d a y <-- source
//      ----------------
//      |∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞
//      |∞ 0 1 2 3 4 5 6
//    s |∞ 1 0 1 2 3 4 5 
//    a |∞ 2 1 1 2 3 3 4 
//    t |∞ 3 2 2 2 3 4 4 
//    u |∞ 4 3 2 3 3 4 5 
//    r |∞ 5 4 3 3 4 4 5 
//    d |∞ 6 5 4 4 3 4 5 
//    a |∞ 7 6 5 5 4 3 4 
//    y |∞ 8 7 6 6 5 4 3 <--
//                     ^
//                     |
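
To see where the bottom-right value in that matrix comes from, here is a minimal sketch of the textbook Levenshtein dynamic-programming algorithm. It is not the Roslyn implementation: it leaves out the ∞ border shown in the quoted matrix, and the EditDistanceSketch name is made up for this example.

```csharp
using System;

public static class EditDistanceSketch
{
    // Plain Levenshtein distance, keeping only two rows of the matrix in memory.
    public static int Levenshtein(string source, string target)
    {
        var previous = new int[source.Length + 1];
        var current = new int[source.Length + 1];

        // First row: turning a prefix of 'source' into the empty string costs its length.
        for (int j = 0; j <= source.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= target.Length; i++)
        {
            // First column: building a prefix of 'target' from nothing costs its length.
            current[0] = i;
            for (int j = 1; j <= source.Length; j++)
            {
                int substitution = source[j - 1] == target[i - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1), // edits that add or drop a character
                    previous[j - 1] + substitution);               // match or substitution
            }
            (previous, current) = (current, previous); // the row just filled becomes 'previous'
        }
        return previous[source.Length];
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints 3, the same value as the bottom-right cell of the matrix above.
        Console.WriteLine(EditDistanceSketch.Levenshtein("sunday", "saturday"));
    }
}
```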

Next, this gem that explains how the DOS wild-card matching works, corefx/src/System.IO.FileSystem/src/System/IO/Enumeration/FileSystemName.cs

// Matching routine description
// ============================
// (copied from native impl)
//
// This routine compares a Dbcs name and an expression and tells the caller
// if the name is in the language defined by the expression.  The input name
// cannot contain wildcards, while the expression may contain wildcards.
//
// Expression wild cards are evaluated as shown in the nondeterministic
// finite automatons below.  Note that ~* and ~? are DOS_STAR and DOS_QM.
//
//        ~* is DOS_STAR, ~? is DOS_QM, and ~. is DOS_DOT
//
//                                  S
//                               <-----<
//                            X  |     |  e       Y
//        X * Y ==       (0)----->-(1)->-----(2)-----(3)
//
//                                 S-.
//                               <-----<
//                            X  |     |  e       Y
//        X ~* Y ==      (0)----->-(1)->-----(2)-----(3)
//
//                           X     S     S     Y
//        X ?? Y ==      (0)---(1)---(2)---(3)---(4)
//
//                           X     .        .      Y
//        X ~.~. Y ==    (0)---(1)----(2)------(3)---(4)
//                              |      |________|
//                              |           ^   |
//                              |_______________|
//                                 ^EOF or .^
//
//                           X     S-.     S-.     Y
//        X ~?~? Y ==    (0)---(1)-----(2)-----(3)---(4)
//                              |      |________|
//                              |           ^   |
//                              |_______________|
//                                 ^EOF or .^
//
//    where S is any single character
//          S-. is any single character except the final .
//          e is a null character transition
//          EOF is the end of the name string
//
//   In words:
//
//       * matches 0 or more characters.
//       ? matches exactly 1 character.
//       DOS_STAR matches 0 or more characters until encountering and matching
//           the final . in the name.
//       DOS_QM matches any single character, or upon encountering a period or
//           end of name string, advances the expression to the end of the
//           set of contiguous DOS_QMs.
//       DOS_DOT matches either a . or zero characters beyond name string.
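
The simpler half of those rules (plain * and ?) can be tried out through the public FileSystemName.MatchesSimpleExpression API in System.IO.Enumeration. This is a small sketch; the WildcardDemo wrapper is made up for the example, and the DOS_STAR, DOS_QM and DOS_DOT behaviour described above is not exercised here.

```csharp
using System;
using System.IO.Enumeration;

public static class WildcardDemo
{
    // Thin wrapper over the framework's simple-expression matcher ('*' and '?' only).
    public static bool Matches(string expression, string name) =>
        FileSystemName.MatchesSimpleExpression(expression, name, ignoreCase: true);
}

public static class Program
{
    public static void Main()
    {
        // '*' matches zero or more characters.
        Console.WriteLine(WildcardDemo.Matches("*.txt", "notes.txt"));

        // '?' matches exactly one character, so a two-character stem does not match.
        Console.WriteLine(WildcardDemo.Matches("?.txt", "ab.txt"));

        // Matching here ignores case because of the ignoreCase argument.
        Console.WriteLine(WildcardDemo.Matches("a?c", "ABC"));
    }
}
```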

Finally from roslyn/src/Workspaces/Core/Portable/Shared/Collections/IntervalTree`1.Node.cs we have per-method comments with samples, this is a great idea!

// Sample:
//   1            1                  3
//  / \          / \              /     \
// a   2        a   3            1       2
//    / \   =>     / \     =>   / \     / \
//   3   d        b   2        a   b   c   d
//  / \              / \
// b   c            c   d
internal Node InnerRightOuterLeftRotation(IIntervalIntrospector<T> introspector)
{
    ...
}

// Sample:
//     1              1              3
//    / \            / \          /     \
//   2   d          3   d        2       1
//  / \     =>     / \     =>   / \     / \
// a   3          2   c        a   b   c   d
//    / \        / \
//   b   c      a   b
internal Node InnerLeftOuterRightRotation(IIntervalIntrospector<T> introspector)
{
    ...
}

Bit Packing

Maybe you can visualise which individual bits are set given a Hexadecimal value, but I can’t, so I’m always grateful for comments like this one from roslyn/src/Compilers/CSharp/Portable/Symbols/Source/SourceMemberContainerSymbol.cs

// We current pack everything into two 32-bit ints; layouts for each are given below.

// First int:
//
// | |d|yy|xxxxxxxxxxxxxxxxxxxxxxx|wwwwww|
//
// w = special type.  6 bits.
// x = modifiers.  23 bits.
// y = IsManagedType.  2 bits.
// d = FieldDefinitionsNoted. 1 bit
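
As a rough illustration of how such a layout is used, the sketch below packs and unpacks the four fields with shifts and masks. The field widths come from the comment; the type and member names are hypothetical, and putting w in the lowest bits is an assumption read off the diagram, not something taken from the Roslyn source.

```csharp
using System;

static class PackedFlagsSketch
{
    // Widths from the comment: w = 6 bits, x = 23 bits, y = 2 bits, d = 1 bit.
    // Assumption: w occupies the lowest bits, as the diagram suggests.
    const int SpecialTypeSize = 6, ModifiersSize = 23, ManagedKindSize = 2;

    const int SpecialTypeOffset = 0;
    const int ModifiersOffset = SpecialTypeOffset + SpecialTypeSize;             // 6
    const int ManagedKindOffset = ModifiersOffset + ModifiersSize;               // 29
    const int FieldDefinitionsNotedOffset = ManagedKindOffset + ManagedKindSize; // 31

    const uint SpecialTypeMask = (1u << SpecialTypeSize) - 1;
    const uint ModifiersMask = (1u << ModifiersSize) - 1;
    const uint ManagedKindMask = (1u << ManagedKindSize) - 1;

    public static uint Pack(uint specialType, uint modifiers, uint managedKind, bool noted) =>
        ((specialType & SpecialTypeMask) << SpecialTypeOffset)
        | ((modifiers & ModifiersMask) << ModifiersOffset)
        | ((managedKind & ManagedKindMask) << ManagedKindOffset)
        | ((noted ? 1u : 0u) << FieldDefinitionsNotedOffset);

    public static uint SpecialType(uint packed) => (packed >> SpecialTypeOffset) & SpecialTypeMask;
    public static uint Modifiers(uint packed) => (packed >> ModifiersOffset) & ModifiersMask;
    public static uint ManagedKind(uint packed) => (packed >> ManagedKindOffset) & ManagedKindMask;
    public static bool FieldDefinitionsNoted(uint packed) =>
        ((packed >> FieldDefinitionsNotedOffset) & 1u) != 0;

    static void Main()
    {
        uint packed = Pack(specialType: 5, modifiers: 0x123, managedKind: 2, noted: true);
        Console.WriteLine($"0x{packed:X8}"); // 0xC00048C5
        Console.WriteLine(Modifiers(packed) == 0x123); // True
    }
}
```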

This one from corefx/src/System.Runtime.WindowsRuntime/src/System/Threading/Tasks/TaskToAsyncInfoAdapter.cs also does a great job of showing the different bit-flags and how they interact

// ! THIS DIAGRAM ILLUSTRATES THE CONSTANTS BELOW. UPDATE THIS IF UPDATING THE CONSTANTS BELOW!:
//     3         2         1         0
//    10987654321098765432109876543210
//    X...............................   Reserved such that we can use Int32 and not worry about negative-valued state constants
//    ..X.............................   STATEFLAG_COMPLETED_SYNCHRONOUSLY
//    ...X............................   STATEFLAG_MUST_RUN_COMPLETION_HNDL_WHEN_SET
//    ....X...........................   STATEFLAG_COMPLETION_HNDL_NOT_YET_INVOKED
//    ................................   STATE_NOT_INITIALIZED
//    ...............................X   STATE_STARTED
//    ..............................X.   STATE_RUN_TO_COMPLETION
//    .............................X..   STATE_CANCELLATION_REQUESTED
//    ............................X...   STATE_CANCELLATION_COMPLETED
//    ...........................X....   STATE_ERROR
//    ..........................X.....   STATE_CLOSED
//    ..........................XXXXXX   STATEMASK_SELECT_ANY_ASYNC_STATE
//    XXXXXXXXXXXXXXXXXXXXXXXXXX......   STATEMASK_CLEAR_ALL_ASYNC_STATES
//     3         2         1         0
//    10987654321098765432109876543210

Finally, we have some helpful explanations of how different encodings work. Firstly UTF-8 from corefx/src/Common/src/CoreLib/System/Text/UTF8Encoding.cs

/*
    bytes   bits    UTF-8 representation
    -----   ----    -----------------------------------
    1        7      0vvvvvvv
    2       11      110vvvvv 10vvvvvv
    3       16      1110vvvv 10vvvvvv 10vvvvvv
    4       21      11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    -----   ----    -----------------------------------
    Surrogate:
    Real Unicode value = (HighSurrogate - 0xD800) * 0x400 + (LowSurrogate - 0xDC00) + 0x10000
*/
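
The table maps directly onto a handful of shifts and masks. The following is a minimal sketch of an encoder for a single code point, written only to mirror the table. It is not how UTF8Encoding is implemented, and the names are hypothetical.

```csharp
using System;

static class Utf8Sketch
{
    // Encodes one Unicode code point by following the bytes/bits table.
    public static byte[] Encode(int codePoint)
    {
        // Surrogates (0xD800-0xDFFF) are not valid on their own, so they are rejected.
        if (codePoint < 0 || codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        if (codePoint < 0x80)      // 7 bits  -> 0vvvvvvv
            return new[] { (byte)codePoint };

        if (codePoint < 0x800)     // 11 bits -> 110vvvvv 10vvvvvv
            return new[]
            {
                (byte)(0xC0 | (codePoint >> 6)),
                (byte)(0x80 | (codePoint & 0x3F)),
            };

        if (codePoint < 0x10000)   // 16 bits -> 1110vvvv 10vvvvvv 10vvvvvv
            return new[]
            {
                (byte)(0xE0 | (codePoint >> 12)),
                (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
                (byte)(0x80 | (codePoint & 0x3F)),
            };

        return new[]               // 21 bits -> 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
        {
            (byte)(0xF0 | (codePoint >> 18)),
            (byte)(0x80 | ((codePoint >> 12) & 0x3F)),
            (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
            (byte)(0x80 | (codePoint & 0x3F)),
        };
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encode(0x20AC))); // E2-82-AC (the euro sign)
    }
}
```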

and then UTF-32 in corefx/src/Common/src/CoreLib/System/Text/UTF32Encoding.cs

/*
    words   bits    UTF-32 representation
    -----   ----    -----------------------------------
    1       16      00000000 00000000 xxxxxxxx xxxxxxxx
    2       21      00000000 000xxxxx hhhhhhll llllllll
    -----   ----    -----------------------------------
    Surrogate:
    Real Unicode value = (HighSurrogate - 0xD800) * 0x400 + (LowSurrogate - 0xDC00) + 0x10000
*/
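
The ‘Surrogate’ formula that appears in both comments can be tried out directly. This small sketch (hypothetical names) applies it to one surrogate pair and compares the result with the built-in char.ConvertToUtf32.

```csharp
using System;

static class SurrogateSketch
{
    // Real Unicode value = (HighSurrogate - 0xD800) * 0x400 + (LowSurrogate - 0xDC00) + 0x10000
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }

    static void Main()
    {
        int codePoint = Combine('\uD83D', '\uDE00');
        Console.WriteLine($"U+{codePoint:X}");                                  // U+1F600
        Console.WriteLine(codePoint == char.ConvertToUtf32('\uD83D', '\uDE00')); // True
    }
}
```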

Data Structures

This comment from mono/utils/dlmalloc.c does a great job of showing how chunks of memory are arranged by malloc

  A chunk that's in use looks like:

   chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           | Size of previous chunk (if P = 1)                             |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
         | Size of this chunk                                         1| +-+
   mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |                                                               |
         +-                                                             -+
         |                                                               |
         +-                                                             -+
         |                                                               :
         +-      size - sizeof(size_t) available payload bytes          -+
         :                                                               |
 chunk-> +-                                                             -+
         |                                                               |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |1|
       | Size of next chunk (may or may not be in use)               | +-+
 mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    And if it's free, it looks like this:

   chunk-> +-                                                             -+
           | User payload (must be in use, or we would have merged!)       |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
         | Size of this chunk                                         0| +-+
   mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Next pointer                                                  |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Prev pointer                                                  |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |                                                               :
         +-      size - sizeof(struct chunk) unused bytes               -+
         :                                                               |
 chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Size of this chunk                                            |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0|
       | Size of next chunk (must be in use, or we would have merged)| +-+
 mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                                                               :
       +- User payload                                                -+
       :                                                               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                                     |0|
                                                                     +-+

Also, from corefx/src/Common/src/CoreLib/System/MemoryExtensions.cs we can see how overlapping memory regions are detected:

//  Visually, the two sequences are located somewhere in the 32-bit
//  address space as follows:
//
//      [----------------------------------------------)                            normal address space
//      0                                             2³²
//                            [------------------)                                  first sequence
//                            xRef            xRef + xLength
//              [--------------------------)     .                                  second sequence
//              yRef          .         yRef + yLength
//              :             .            .     .
//              :             .            .     .
//                            .            .     .
//                            .            .     .
//                            .            .     .
//                            [----------------------------------------------)      relative address space
//                            0            .     .                          2³²
//                            [------------------)             :                    first sequence
//                            x1           .     x2            :
//                            -------------)                   [-------------       second sequence
//                                         y2                  y1
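
The ‘relative address space’ trick in that diagram boils down to unsigned subtraction that is allowed to wrap around. The sketch below shows the idea with plain 32-bit numbers standing in for addresses. That substitution is an assumption made for illustration (the real code works with references into memory), and the names are hypothetical.

```csharp
using System;

static class OverlapSketch
{
    // xRef/yRef play the role of the start addresses, xLength/yLength the sizes in bytes.
    public static bool Overlaps(uint xRef, uint xLength, uint yRef, uint yLength)
    {
        if (xLength == 0 || yLength == 0) return false;

        unchecked
        {
            // Move into the relative address space: x now starts at 0, y starts at y1.
            // If y begins before x, the subtraction wraps around to a very large value.
            uint y1 = yRef - xRef;

            // Overlap if y starts inside x, or if y starts "before" x (wrapped) and is
            // long enough to reach back past 0.
            return y1 < xLength || y1 > 0u - yLength;
        }
    }

    static void Main()
    {
        Console.WriteLine(Overlaps(100, 10, 105, 10)); // True:  y starts inside x
        Console.WriteLine(Overlaps(100, 10, 95, 10));  // True:  y ends inside x
        Console.WriteLine(Overlaps(100, 10, 110, 10)); // False: y starts where x ends
        Console.WriteLine(Overlaps(100, 10, 90, 10));  // False: y ends where x starts
    }
}
```

The appeal of this formulation is that two comparisons cover every arrangement of the two ranges, without separate branches for ‘y before x’ and ‘y after x’.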

State Machines

This comment from mono/benchmark/zipmark.cs gives a great overview of the implementation of RFC 1951 - DEFLATE Compressed Data Format Specification

/*
 * The Deflater can do the following state transitions:
    *
    * (1) -> INIT_STATE   ----> INIT_FINISHING_STATE ---.
    *        /  | (2)      (5)                         |
    *       /   v          (5)                         |
    *   (3)| SETDICT_STATE ---> SETDICT_FINISHING_STATE |(3)
    *       \   | (3)                 |        ,-------'
    *        |  |                     | (3)   /
    *        v  v          (5)        v      v
    * (1) -> BUSY_STATE   ----> FINISHING_STATE
    *                                | (6)
    *                                v
    *                           FINISHED_STATE
    *    \_____________________________________/
    *          | (7)
    *          v
    *        CLOSED_STATE
    *
    * (1) If we should produce a header we start in INIT_STATE, otherwise
    *     we start in BUSY_STATE.
    * (2) A dictionary may be set only when we are in INIT_STATE, then
    *     we change the state as indicated.
    * (3) Whether a dictionary is set or not, on the first call of deflate
    *     we change to BUSY_STATE.
    * (4) -- intentionally left blank -- :)
    * (5) FINISHING_STATE is entered, when flush() is called to indicate that
    *     there is no more INPUT.  There are also states indicating, that
    *     the header wasn't written yet.
    * (6) FINISHED_STATE is entered, when everything has been flushed to the
    *     internal pending output buffer.
    * (7) At any time (7)
    *
    */

This might be pushing the definition of ‘state machine’ a bit far, but I wanted to include it because it shows just how complex ‘exception handling’ can be, from coreclr/src/jit/jiteh.cpp

// fgNormalizeEH: Enforce the following invariants:
//
//   1. No block is both the first block of a handler and the first block of a try. In IL (and on entry
//      to this function), this can happen if the "try" is more nested than the handler.
//
//      For example, consider:
//
//               try1 ----------------- BB01
//               |                      BB02
//               |--------------------- BB03
//               handler1
//               |----- try2 ---------- BB04
//               |      |               BB05
//               |      handler2 ------ BB06
//               |      |               BB07
//               |      --------------- BB08
//               |--------------------- BB09
//
//      Thus, the start of handler1 and the start of try2 are the same block. We will transform this to:
//
//               try1 ----------------- BB01
//               |                      BB02
//               |--------------------- BB03
//               handler1 ------------- BB10 // empty block
//               |      try2 ---------- BB04
//               |      |               BB05
//               |      handler2 ------ BB06
//               |      |               BB07
//               |      --------------- BB08
//               |--------------------- BB09
//

RFCs and Specs

Next up, how the Kestrel web-server handles RFC 7540 - Hypertext Transfer Protocol Version 2 (HTTP/2).

Firstly, from aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.cs

/* https://tools.ietf.org/html/rfc7540#section-4.1
    +-----------------------------------------------+
    |                 Length (24)                   |
    +---------------+---------------+---------------+
    |   Type (8)    |   Flags (8)   |
    +-+-------------+---------------+-------------------------------+
    |R|                 Stream Identifier (31)                      |
    +=+=============================================================+
    |                   Frame Payload (0...)                      ...
    +---------------------------------------------------------------+
*/
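
To show how that diagram translates into code, here is a minimal sketch that reads the 9-byte frame header from a byte array. It is not Kestrel's parser; the names and the tuple shape are hypothetical.

```csharp
using System;

static class Http2FrameHeaderSketch
{
    public const int HeaderLength = 9; // 24 + 8 + 8 + 1 + 31 bits

    public static (int Length, byte Type, byte Flags, int StreamId) Parse(byte[] buffer)
    {
        if (buffer == null || buffer.Length < HeaderLength)
            throw new ArgumentException("A frame header needs at least 9 bytes.", nameof(buffer));

        // Length (24): three bytes, most significant first.
        int length = (buffer[0] << 16) | (buffer[1] << 8) | buffer[2];
        byte type = buffer[3];
        byte flags = buffer[4];

        // Mask off the reserved bit (R) before reading the 31-bit stream identifier.
        int streamId = ((buffer[5] & 0x7F) << 24) | (buffer[6] << 16) | (buffer[7] << 8) | buffer[8];

        return (length, type, flags, streamId);
    }

    static void Main()
    {
        // Payload length 5, type 0x1, flags 0x4, stream 3.
        var header = Parse(new byte[] { 0x00, 0x00, 0x05, 0x01, 0x04, 0x00, 0x00, 0x00, 0x03 });
        Console.WriteLine(header); // (5, 1, 4, 3)
    }
}
```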

and then in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.Headers.cs

/* https://tools.ietf.org/html/rfc7540#section-6.2
    +---------------+
    |Pad Length? (8)|
    +-+-------------+-----------------------------------------------+
    |E|                 Stream Dependency? (31)                     |
    +-+-------------+-----------------------------------------------+
    |  Weight? (8)  |
    +-+-------------+-----------------------------------------------+
    |                   Header Block Fragment (*)                 ...
    +---------------------------------------------------------------+
    |                           Padding (*)                       ...
    +---------------------------------------------------------------+
*/

There are other notable examples in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameReader.cs and aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameWriter.cs.

Also RFC 3986 - Uniform Resource Identifier (URI) is discussed in corefx/src/Common/src/System/Net/IPv4AddressHelper.Common.cs

Finally, RFC 7541 - HPACK: Header Compression for HTTP/2, is covered in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/HPack/HPackDecoder.cs

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.1
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 1 |        Index (7+)         |
// +---+---------------------------+
private const byte IndexedHeaderFieldMask = 0x80;
private const byte IndexedHeaderFieldRepresentation = 0x80;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.1
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 1 |      Index (6+)       |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithIncrementalIndexingMask = 0xc0;
private const byte LiteralHeaderFieldWithIncrementalIndexingRepresentation = 0x40;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.2
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 0 |  Index (4+)   |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithoutIndexingMask = 0xf0;
private const byte LiteralHeaderFieldWithoutIndexingRepresentation = 0x00;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.3
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 1 |  Index (4+)   |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldNeverIndexedMask = 0xf0;
private const byte LiteralHeaderFieldNeverIndexedRepresentation = 0x10;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.3
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 1 |   Max size (5+)   |
// +---+---------------------------+
private const byte DynamicTableSizeUpdateMask = 0xe0;
private const byte DynamicTableSizeUpdateRepresentation = 0x20;

// http://httpwg.org/specs/rfc7541.html#rfc.section.5.2
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | H |    String Length (7+)     |
// +---+---------------------------+
private const byte HuffmanMask = 0x80;
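
Each mask/representation pair above is meant to be used as ‘AND with the mask, compare with the representation’. The following sketch shows that pattern by classifying the first byte of a header field. The constant values are copied from the snippet, while the class, the method and the returned strings are hypothetical and not taken from HPackDecoder.

```csharp
using System;

static class HPackFirstByteSketch
{
    // Mask/representation pairs copied from the constants above.
    const byte IndexedHeaderFieldMask = 0x80, IndexedHeaderFieldRepresentation = 0x80;
    const byte IncrementalIndexingMask = 0xc0, IncrementalIndexingRepresentation = 0x40;
    const byte WithoutIndexingMask = 0xf0, WithoutIndexingRepresentation = 0x00;
    const byte NeverIndexedMask = 0xf0, NeverIndexedRepresentation = 0x10;
    const byte DynamicTableSizeUpdateMask = 0xe0, DynamicTableSizeUpdateRepresentation = 0x20;

    public static string Classify(byte b)
    {
        if ((b & IndexedHeaderFieldMask) == IndexedHeaderFieldRepresentation)
            return "Indexed Header Field";
        if ((b & IncrementalIndexingMask) == IncrementalIndexingRepresentation)
            return "Literal Header Field with Incremental Indexing";
        if ((b & WithoutIndexingMask) == WithoutIndexingRepresentation)
            return "Literal Header Field without Indexing";
        if ((b & NeverIndexedMask) == NeverIndexedRepresentation)
            return "Literal Header Field Never Indexed";
        if ((b & DynamicTableSizeUpdateMask) == DynamicTableSizeUpdateRepresentation)
            return "Dynamic Table Size Update";

        // The five bit patterns cover every possible byte, so this is never reached.
        throw new InvalidOperationException("Unreachable");
    }

    static void Main()
    {
        Console.WriteLine(Classify(0x82)); // Indexed Header Field
        Console.WriteLine(Classify(0x3F)); // Dynamic Table Size Update
    }
}
```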

Dates & Times

It is pretty widely accepted that dates and times are hard and that’s reflected in the number of comments explaining different scenarios. For example from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.cs

// startTime and endTime represent the period from either the start of DST to the end and
// ***does not include*** the potentially overlapped times
//
//         -=-=-=-=-=- Pacific Standard Time -=-=-=-=-=-=-
//    April 2, 2006                            October 29, 2006
// 2AM            3AM                        1AM              2AM
// |      +1 hr     |                        |       -1 hr      |
// | <invalid time> |                        | <ambiguous time> |
//                  [========== DST ========>)
//
//        -=-=-=-=-=- Some Weird Time Zone -=-=-=-=-=-=-
//    April 2, 2006                          October 29, 2006
// 1AM              2AM                    2AM              3AM
// |      -1 hr       |                      |       +1 hr      |
// | <ambiguous time> |                      |  <invalid time>  |
//                    [======== DST ========>)
//

Also, from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.Unix.cs we see some details on how ‘leap-years’ are handled:

// should be n Julian day format which we don't support. 
// 
// This specifies the Julian day, with n between 0 and 365. February 29 is counted in leap years.
//
// n would be a relative number from the begining of the year. which should handle if the 
// the year is a leap year or not.
// 
// In leap year, n would be counted as:
// 
// 0                30 31              59 60              90      335            365
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
// while in non leap year we'll have 
// 
// 0                30 31              58 59              89      334            364
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
// 
// For example if n is specified as 60, this means in leap year the rule will start at Mar 1,
// while in non leap year the rule will start at Mar 2.
// 
// If we need to support n format, we'll have to have a floating adjustment rule support this case.
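
The ‘n = 60’ example in that comment is easy to check with DateTime. This is a small sketch with hypothetical names, treating n as a zero-based offset from January 1st as the diagram does. It only illustrates the arithmetic, not how TimeZoneInfo handles the rule.

```csharp
using System;
using System.Globalization;

static class JulianDaySketch
{
    // 'n' is the zero-based day of the year (0 = Jan 1), as in the diagram.
    public static DateTime FromZeroBasedDay(int year, int n)
    {
        int daysInYear = DateTime.IsLeapYear(year) ? 366 : 365;
        if (n < 0 || n >= daysInYear) throw new ArgumentOutOfRangeException(nameof(n));

        return new DateTime(year, 1, 1).AddDays(n);
    }

    static void Main()
    {
        var culture = CultureInfo.InvariantCulture;
        Console.WriteLine(FromZeroBasedDay(2020, 60).ToString("MMM d", culture)); // Mar 1 (leap year)
        Console.WriteLine(FromZeroBasedDay(2019, 60).ToString("MMM d", culture)); // Mar 2 (non-leap year)
    }
}
```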

Finally, this comment from corefx/src/System.Runtime/tests/System/TimeZoneInfoTests.cs discusses invalid and ambiguous times that are covered in tests:

//    March 26, 2006                            October 29, 2006
// 2AM            3AM                        2AM              3AM
// |      +1 hr     |                        |       -1 hr      |
// | <invalid time> |                        | <ambiguous time> |
//                  *========== DST ========>*

//
// * 00:59:59 Sunday March 26, 2006 in Universal converts to
//   01:59:59 Sunday March 26, 2006 in Europe/Amsterdam (NO DST)
//
// * 01:00:00 Sunday March 26, 2006 in Universal converts to
//   03:00:00 Sunday March 26, 2006 in Europe/Amsterdam (DST)
//
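
The ‘invalid’ and ‘ambiguous’ windows in these diagrams can be queried through TimeZoneInfo.IsInvalidTime and TimeZoneInfo.IsAmbiguousTime. The sketch below builds a hypothetical custom zone (UTC+1, DST from the last Sunday of March at 02:00 to the last Sunday of October at 03:00) rather than looking up ‘Europe/Amsterdam’, so that it does not depend on which time zone ids the operating system provides. The zone id and display names are made up.

```csharp
using System;

static class DstEdgeCasesSketch
{
    public static TimeZoneInfo CreateZone()
    {
        // 02:00 on the last Sunday of March: clocks jump forward to 03:00.
        var start = TimeZoneInfo.TransitionTime.CreateFloatingDateRule(
            new DateTime(1, 1, 1, 2, 0, 0), 3, 5, DayOfWeek.Sunday);

        // 03:00 on the last Sunday of October: clocks fall back to 02:00.
        var end = TimeZoneInfo.TransitionTime.CreateFloatingDateRule(
            new DateTime(1, 1, 1, 3, 0, 0), 10, 5, DayOfWeek.Sunday);

        var rule = TimeZoneInfo.AdjustmentRule.CreateAdjustmentRule(
            DateTime.MinValue.Date, DateTime.MaxValue.Date, TimeSpan.FromHours(1), start, end);

        return TimeZoneInfo.CreateCustomTimeZone(
            "Sketch/CentralEurope", TimeSpan.FromHours(1),
            "Sketch zone", "Sketch standard time", "Sketch daylight time",
            new[] { rule });
    }

    static void Main()
    {
        TimeZoneInfo zone = CreateZone();

        // 02:30 on March 26, 2006 falls inside the hour that is skipped.
        Console.WriteLine(zone.IsInvalidTime(new DateTime(2006, 3, 26, 2, 30, 0)));

        // 02:30 on October 29, 2006 happens twice, once in DST and once in standard time.
        Console.WriteLine(zone.IsAmbiguousTime(new DateTime(2006, 10, 29, 2, 30, 0)));
    }
}
```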

Stack Layouts

To finish off, I wanted to look at ‘stack layouts’ because they seem to be a favourite of the .NET/Mono Runtime Engineers, there’s sooo many examples!

First-up, x86 from coreclr/src/jit/lclvars.cpp (you can also see the x64, ARM and ARM64 versions).

 *  The frame is laid out as follows for x86:
 *
 *              ESP frames                
 *
 *      |                       |         
 *      |-----------------------|         
 *      |       incoming        |         
 *      |       arguments       |         
 *      |-----------------------| <---- Virtual '0'         
 *      |    return address     |         
 *      +=======================+
 *      |Callee saved registers |         
 *      |-----------------------|         
 *      |       Temps           |         
 *      |-----------------------|         
 *      |       Variables       |         
 *      |-----------------------| <---- Ambient ESP
 *      |   Arguments for the   |         
 *      ~    next function      ~ 
 *      |                       |         
 *      |       |               |         
 *      |       | Stack grows   |         
 *              | downward                
 *              V                         
 *
 *
 *              EBP frames
 *
 *      |                       |
 *      |-----------------------|
 *      |       incoming        |
 *      |       arguments       |
 *      |-----------------------| <---- Virtual '0'         
 *      |    return address     |         
 *      +=======================+
 *      |    incoming EBP       |
 *      |-----------------------| <---- EBP
 *      |Callee saved registers |         
 *      |-----------------------|         
 *      |   security object     |
 *      |-----------------------|
 *      |     ParamTypeArg      |
 *      |-----------------------|
 *      |  Last-executed-filter |
 *      |-----------------------|
 *      |                       |
 *      ~      Shadow SPs       ~
 *      |                       |
 *      |-----------------------|
 *      |                       |
 *      ~      Variables        ~
 *      |                       |
 *      ~-----------------------|
 *      |       Temps           |
 *      |-----------------------|
 *      |       localloc        |
 *      |-----------------------| <---- Ambient ESP
 *      |   Arguments for the   |
 *      |    next function      ~
 *      |                       |
 *      |       |               |
 *      |       | Stack grows   |
 *              | downward
 *              V
 *

Not to be left out, Mono has some nice examples covering MIPS (below), PPC and ARM

/*
 * Stack frame layout:
 * 
 *   ------------------- sp + cfg->stack_usage + cfg->param_area
 *      param area		incoming
 *   ------------------- sp + cfg->stack_usage + MIPS_STACK_PARAM_OFFSET
 *      a0-a3			incoming
 *   ------------------- sp + cfg->stack_usage
 *	ra
 *   ------------------- sp + cfg->stack_usage-4
 *   	spilled regs
 *   ------------------- sp + 
 *   	MonoLMF structure	optional
 *   ------------------- sp + cfg->arch.lmf_offset
 *   	saved registers		s0-s8
 *   ------------------- sp + cfg->arch.iregs_offset
 *   	locals
 *   ------------------- sp + cfg->param_area
 *   	param area		outgoing
 *   ------------------- sp + MIPS_STACK_PARAM_OFFSET
 *   	a0-a3			outgoing
 *   ------------------- sp
 *   	red zone
 */

Finally, there’s another example covering [DllImport] callbacks and one more involving funclet frames in ARM64. I told you there were lots!!

The Rest

If you aren’t sick of ‘ASCII Art’ by now, here’s a few more examples for you to look at!!

Is C# a low-level language?

2019-03-01T00:00:00+00:00

I’m a massive fan of everything Fabien Sanglard does, I love his blog and I’ve read both his books cover-to-cover (for more info on his books, check out the recent Hanselminutes podcast).

Recently he wrote an excellent post where he deciphered a postcard sized raytracer, un-packing the obfuscated code and providing a fantastic explanation of the maths involved. I really recommend you take the time to read it!

But it got me thinking, would it be possible to port that C++ code to C#?

Partly because in my day job I’ve been having to write a fair amount of C++ recently and I’ve realised I’m a bit rusty, so I thought this might help!

But more significantly, I wanted to get a better insight into the question is C# a low-level language?

A slightly different, but related question is how suitable is C# for ‘systems programming’? For more on that I really recommend Joe Duffy’s excellent post from 2013.

Line-by-line port

I started by simply porting the un-obfuscated C++ code line-by-line to C#. Turns out that this was pretty straightforward, I guess the story about C# being C++++ is true after all!!

Let’s look at an example, the main data structure in the code is a ‘vector’, here’s the code side-by-side, C++ on the left and C# on the right:

So there’s a few syntax differences, but because .NET lets you define your own ‘Value Types’ I was able to get the same functionality. This is significant because treating the ‘vector’ as a struct means we can get better ‘data locality’ and the .NET Garbage Collector (GC) doesn’t need to be involved as the data will go onto the stack (probably, yes I know it’s an implementation detail).

For more info on structs or ‘value types’ in .NET see:

In particular that last post from Eric Lippert contains this helpful quote that makes it clear what ‘value types’ really are:

Surely the most relevant fact about value types is not the implementation detail of how they are allocated, but rather the by-design semantic meaning of “value type”, namely that they are always copied “by value”. If the relevant thing was their allocation details then we’d have called them “heap types” and “stack types”. But that’s not relevant most of the time. Most of the time the relevant thing is their copying and identity semantics.
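
The ‘copied by value’ semantics from that quote can be seen in a few lines. This is a minimal sketch with hypothetical types (not the Vec struct from the port) that contrasts assigning a struct with assigning a class instance.

```csharp
using System;

struct PointStruct { public float X; } // value type: assignment copies the data
class PointClass { public float X; }   // reference type: assignment copies the reference

static class CopySemanticsSketch
{
    public static (float StructX, float ClassX) Run()
    {
        var s1 = new PointStruct { X = 1f };
        var s2 = s1;   // s2 is an independent copy
        s2.X = 2f;     // s1 is unaffected

        var c1 = new PointClass { X = 1f };
        var c2 = c1;   // c1 and c2 refer to the same object
        c2.X = 2f;     // visible through c1 as well

        return (s1.X, c1.X);
    }

    static void Main() => Console.WriteLine(Run()); // (1, 2)
}
```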

Now let’s look at how some other methods look side-by-side (again C++ on the left, C# on the right), first up RayTracing(..):

Next QueryDatabase(..):

(see Fabien’s post for an explanation of what these 2 functions are doing)

But the point is that again, C# lets us very easily write C++ code! In this case what helps us out the most is the ref keyword which lets us pass a value by reference. We’ve been able to use ref in method calls for quite a while, but recently there’s been an effort to allow ref in more places:

Now sometimes using ref can provide a performance boost because it means that the struct doesn’t need to be copied, see the benchmarks in Adam Sitnik’s post and Performance traps of ref locals and ref returns in C# for more information.
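
As a quick illustration of the three forms mentioned here (ref parameters, ref returns and ref locals), the sketch below uses a hypothetical Vec3 struct. It is not code from the ray tracer port and it makes no claim about performance; it only shows which operations work on a copy and which work on the original.

```csharp
using System;

struct Vec3 { public float X, Y, Z; }

static class RefSketch
{
    // Receives a copy: the caller's struct is not changed.
    public static void ScaleCopy(Vec3 v, float factor)
    {
        v.X *= factor; v.Y *= factor; v.Z *= factor;
    }

    // Receives a managed reference: the caller's struct is updated in place.
    public static void ScaleInPlace(ref Vec3 v, float factor)
    {
        v.X *= factor; v.Y *= factor; v.Z *= factor;
    }

    // 'ref return': hands back a reference to the array element, not a copy of it.
    public static ref Vec3 First(Vec3[] items) => ref items[0];

    static void Main()
    {
        var v = new Vec3 { X = 1, Y = 2, Z = 3 };

        ScaleCopy(v, 10);
        Console.WriteLine(v.X); // 1

        ScaleInPlace(ref v, 10);
        Console.WriteLine(v.X); // 10

        var items = new[] { v };
        ref Vec3 first = ref First(items); // 'ref local' aliasing items[0]
        first.X = 42;
        Console.WriteLine(items[0].X); // 42
    }
}
```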

However what’s most important for this scenario is that it allows us to have the same behaviour in our C# port as the original C++ code. Although I want to point out that ‘Managed References’ as they’re known aren’t exactly the same as ‘pointers’, most notably you can’t do arithmetic on them, for more on this see:

Performance

So, it’s all well and good being able to port the code, but ultimately the performance also matters. Especially in something like a ‘ray tracer’ that can take minutes to run! The C++ code contains a variable called sampleCount that controls the final quality of the image, with sampleCount = 2 it looks like this:

Which clearly isn’t that realistic!

However once you get to sampleCount = 2048 things look a lot better:

But, running with sampleCount = 2048 means the rendering takes a long time, so all the following results were run with it set to 2, which means the test runs completed in ~1 minute. Changing sampleCount only affects the number of iterations of the outermost loop of the code, see this gist for an explanation.

Results after a ‘naive’ line-by-line port

To be able to give a meaningful side-by-side comparison of the C++ and C# versions I used the time-windows tool that’s a port of the Unix time command. My initial results looked like this:

	C++ (VS 2017)	.NET Framework (4.7.2)	.NET Core (2.2)
Elapsed time (secs)	47.40	80.14	78.02
Kernel time	0.14 (0.3%)	0.72 (0.9%)	0.63 (0.8%)
User time	43.86 (92.5%)	73.06 (91.2%)	70.66 (90.6%)
page fault #	1,143	4,818	5,945
Working set (KB)	4,232	13,624	17,052
Paged pool (KB)	95	172	154
Non-paged pool	7	14	16
Page file size (KB)	1,460	10,936	11,024

So initially we see that the C# code is quite a bit slower than the C++ version, but it does get better (see below).

However let’s first look at what the .NET JIT is doing for us even with this ‘naive’ line-by-line port. Firstly, it’s doing a nice job of in-lining the smaller ‘helper methods’, we can see this by looking at the output of the brilliant Inlining Analyzer tool (green overlay = inlined):

However, it doesn’t inline all methods, for example QueryDatabase(..) is skipped because of its complexity:

Another feature that the .NET Just-In-Time (JIT) compiler provides is converting specific method calls into corresponding CPU instructions. We can see this in action with the sqrt wrapper function, here’s the original C# code (note the call to Math.Sqrt):

// inverse square root
public static Vec operator !(Vec q) {
    return q * (1.0f / (float)Math.Sqrt(q % q));
}

And here’s the assembly code that the .NET JIT generates, there’s no call to Math.Sqrt and it makes use of the vsqrtsd CPU instruction:

; Assembly listing for method Program:sqrtf(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M8216_IG01:
       vzeroupper 

G_M8216_IG02:
       vcvtss2sd xmm0, xmm0
       vsqrtsd  xmm0, xmm0
       vcvtsd2ss xmm0, xmm0

G_M8216_IG03:
       ret      

; Total bytes of code 16, prolog size 3 for method Program:sqrtf(float):float
; ============================================================

(to get this output you need to follow these instructions, use the ‘Disasmo’ VS2019 Add-in or take a look at SharpLab.io)

These replacements are also known as ‘intrinsics’ and we can see the JIT generating them in the code below. This snippet just shows the mapping for AMD64, the JIT also targets X86, ARM and ARM64, the full method is here

bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId)
{
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND))
    switch (intrinsicId)
    {
        // AMD64/x86 has SSE2 instructions to directly compute sqrt/abs and SSE4.1
        // instructions to directly compute round/ceiling/floor.
        //
        // TODO: Because the x86 backend only targets SSE for floating-point code,
        //       it does not treat Sine, Cosine, or Round as intrinsics (JIT32
        //       implemented those intrinsics as x87 instructions). If this poses
        //       a CQ problem, it may be necessary to change the implementation of
        //       the helper calls to decrease call overhead or switch back to the
        //       x87 instructions. This is tracked by #7097.
        case CORINFO_INTRINSIC_Sqrt:
        case CORINFO_INTRINSIC_Abs:
            return true;

        case CORINFO_INTRINSIC_Round:
        case CORINFO_INTRINSIC_Ceiling:
        case CORINFO_INTRINSIC_Floor:
            return compSupports(InstructionSet_SSE41);

        default:
            return false;
    }
    ...
}

As you can see, some methods are implemented like this, e.g. Sqrt and Abs, but for others the CLR instead uses the C++ runtime functions, for instance powf.

This entire process is explained very nicely in How is Math.Pow() implemented in .NET Framework?, but we can also see it in action in the CoreCLR source:

COMSingle::Pow implementation, i.e. the method that’s executed if you call MathF.Pow(..) from C# code
Mapping to C runtime method implementations
Cross-platform version of powf implementation that ensures the same behaviour across OSes

Results after simple performance improvements

However, I wanted to see if my ‘naive’ line-by-line port could be improved. After some profiling I made two main changes:

Remove in-line array initialisation
Switch from Math.XXX(..) functions to the MathF.XXX() counterparts.

These changes are explained in more depth below

Remove in-line array initialisation

For more information about why this is necessary see this excellent Stack Overflow answer from Andrey Akinshin complete with benchmarks and assembly code! It comes to the following conclusion:

Conclusion

Does .NET caches hardcoded local arrays? Kind of: the Roslyn compiler put it in the metadata.

Do we have any overhead in this case? Unfortunately, yes: JIT will copy the array content from the metadata for each invocation; it will work longer than the case with a static array. Runtime also allocates objects and produce memory traffic.

Should we care about it? It depends. If it’s a hot method and you want to achieve a good level of performance, you should use a static array. If it’s a cold method which doesn’t affect the application performance, you probably should write “good” source code and put the array in the method scope.
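
To make the difference concrete, here is a minimal sketch of the two patterns being compared. It is not the actual change from the ray tracer (the names and the array contents are hypothetical), and how much the difference matters will depend on the runtime and JIT version in use.

```csharp
using System;

static class ArrayInitSketch
{
    // Allocated and filled once, when the type is initialised.
    static readonly int[] CachedOffsets = { 1, 2, 4, 8 };

    // In-line initialisation: a new array is allocated and filled on every call.
    public static int SumWithLocalArray()
    {
        int[] offsets = { 1, 2, 4, 8 };
        int sum = 0;
        foreach (int offset in offsets) sum += offset;
        return sum;
    }

    // Static array: every call reuses the same instance, so nothing is allocated here.
    public static int SumWithStaticArray()
    {
        int sum = 0;
        foreach (int offset in CachedOffsets) sum += offset;
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumWithLocalArray());  // 15
        Console.WriteLine(SumWithStaticArray()); // 15
    }
}
```

Both methods return the same value. The only difference is where and how often the array is created, which is why the change is safe to make in a hot method.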

You can see the change I made in this diff.

Using MathF functions instead of Math

Secondly and most significantly I got a big perf improvement by making the following changes:

#if NETSTANDARD2_1 || NETCOREAPP2_0 || NETCOREAPP2_1 || NETCOREAPP2_2 || NETCOREAPP3_0
    // inverse square root
    public static Vec operator !(Vec q) {
      return q * (1.0f / MathF.Sqrt(q % q));
    }
#else
    public static Vec operator !(Vec q) {
      return q * (1.0f / (float)Math.Sqrt(q % q));
    }
#endif

As of ‘.NET Standard 2.1’ there are now specific float implementations of the common maths functions, located in the System.MathF class. For more information on this API and its implementation see:

After these changes, the C# code running on .NET Core is ~10% slower than the C++ version:

	C++ (VS C++ 2017)	.NET Framework (4.7.2)	.NET Core (2.2) TC OFF	.NET Core (2.2) TC ON
Elapsed time (secs)	41.38	58.89	46.04	44.33
Kernel time	0.05 (0.1%)	0.06 (0.1%)	0.14 (0.3%)	0.13 (0.3%)
User time	41.19 (99.5%)	58.34 (99.1%)	44.72 (97.1%)	44.03 (99.3%)
page fault #	1,119	4,749	5,776	5,661
Working set (KB)	4,136	13,440	16,788	16,652
Paged pool (KB)	89	172	150	150
Non-paged pool	7	13	16	16
Page file size (KB)	1,428	10,904	10,960	11,044

TC = Tiered Compilation (I believe that it’ll be on by default in .NET Core 3.0)

For completeness, here are the results across several runs:

Run	C++ (VS C++ 2017)	.NET Framework (4.7.2)	.NET Core (2.2) TC OFF	.NET Core (2.2) TC ON
TestRun-01	41.38	58.89	46.04	44.33
TestRun-02	41.19	57.65	46.23	45.96
TestRun-03	42.17	62.64	46.22	48.73

Note: the difference between .NET Core and .NET Framework is due to the lack of the MathF API in .NET Framework v4.7.2. For more info see Support .Net Framework (4.8?) for netstandard 2.1.

Further performance improvements

However, I’m sure that others can do better!

If you’re interested in trying to close the gap, the C# code is available. For comparison, you can see the assembly produced by the C++ compiler courtesy of the brilliant Compiler Explorer.

Finally, if it helps, here’s the output from the Visual Studio Profiler showing the ‘hot path’ (after the perf improvement described above):

Is C# a low-level language?

Or more specifically:

What language features of C#/F#/VB.NET or BCL/Runtime functionality enable ‘low-level’* programming?

* yes, I know ‘low-level’ is a subjective term 😊

Note: Every C# developer is going to have a different idea of what ‘low-level’ means, and these features would be taken for granted by C++ or Rust programmers.

Here’s the list that I came up with:

ref returns and ref locals
- “tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even faster than unsafe!”
Unsafe code in .NET
- “The core C# language, as defined in the preceding chapters, differs notably from C and C++ in its omission of pointers as a data type. Instead, C# provides references and the ability to create objects that are managed by a garbage collector. This design, coupled with other features, makes C# a much safer language than C or C++.”
Managed pointers in .NET
- “There is, however, another pointer type in CLR – a managed pointer. It could be defined as a more general type of reference, which may point to other locations than just the beginning of an object.”
C# 7 Series, Part 10: Span<T> and universal memory management
- “System.Span<T> is a stack-only type (ref struct) that wraps all memory access patterns, it is the type for universal contiguous memory access. You can think the implementation of the Span contains a dummy reference and a length, accepting all 3 memory access types."
Interoperability (C# Programming Guide)
- “The .NET Framework enables interoperability with unmanaged code through platform invoke services, the System.Runtime.InteropServices namespace, C++ interoperability, and COM interoperability (COM interop).”

However, I know my limitations and so I asked on Twitter and got a lot more replies to add to the list:

Ben Adams “Platform intrinsics (CPU instruction access)”
Marc Gravell “SIMD via Vector (which mixes well with Span) is *fairly* low; .NET Core should (soon?) offer direct CPU intrinsics for more explicit usage targeting particular CPU ops"
Marc Gravell “powerful JIT: things like range elision on arrays/spans, and the JIT using per-struct-T rules to remove huge chunks of code that it knows can’t be reached for that T, or on your particular CPU (BitConverter.IsLittleEndian, Vector.IsHardwareAccelerated, etc)”
Kevin Jones “I would give a special shout-out to the MemoryMarshal and Unsafe classes, and probably a few other things in the System.Runtime.CompilerServices namespace.”
Theodoros Chatzigiannakis “You could also include __makeref and the rest.”
damageboy “Being able to dynamically generate code that fits the expected input exactly, given that the latter will only be known at runtime, and might change periodically?”
Robert Haken “dynamic IL emission”
Victor Baybekov “Stackalloc was not mentioned. Also ability to write raw IL (not dynamic, so save on a delegate call), e.g. to use cached ldftn and call them via calli. VS2017 has a proj template that makes this trivial via extern methods + MethodImplOptions.ForwardRef + ilasm.exe rewrite.”
Victor Baybekov “Also MethodImplOptions.AggressiveInlining “does enable ‘low-level’ programming” in a sense that it allows to write high-level code with many small methods and still control JIT behavior to get optimized result. Otherwise uncomposable 100s LOCs methods with copy-paste…”
Ben Adams “Using the same calling conventions (ABI) as the underlying platform and p/invokes for interop might be more of a thing though?”
Victor Baybekov “Also since you mentioned #fsharp - it does have inline keyword that does the job at IL level before JIT, so it was deemed important at the language level. C# lacks this (so far) for lambdas which are always virtual calls and workarounds are often weird (constrained generics).”
Alexandre Mutel “new SIMD intrinsics, Unsafe Utility class/IL post processing (e.g custom, Fody…etc.). For C#8.0, upcoming function pointers…”
Alexandre Mutel “related to IL, F# has support for direct IL within the language for example”
OmariO “BinaryPrimitives. Low-level but safe.” (https://docs.microsoft.com/en-us/dotnet/api/system.buffers.binary.binaryprimitives?view=netcore-3.0)
Kouji (Kozy) Matsui “How about native inline assembler? It’s difficult for how relation both toolchains and runtime, but can replace current P/Invoke solution and do inlining if we have it.”
Frank A. Krueger “Ldobj, stobj, initobj, initblk, cpyblk.”
Konrad Kokosa “Maybe Thread Local Storage? Fixed Size Buffers? unmanaged constraint and blittable types should be probably mentioned:)”
Sebastiano Mandalà “Just my two cents as everything has been said: what about something as simple as struct layout and how padding and memory alignment and order of the fields may affect the cache line performance? It’s something I have to investigate myself too”
Nino Floris “Constants embedding via readonlyspan, stackalloc, finalizers, WeakReference, open delegates, MethodImplOptions, MemoryBarriers, TypedReference, varargs, SIMD, Unsafe.AsRef can coerce struct types if layout matches exactly (used for a.o. TaskAwaiter and its version)"

So in summary, I would say that C# certainly lets you write code that looks a lot like C++ and, in conjunction with the Runtime and Base-Class Libraries, it gives you a lot of low-level functionality.


"Stack Walking" in the .NET Runtime

2019-01-21T00:00:00+00:00

What is ‘stack walking’? Well, as always the ‘Book of the Runtime’ (BotR) helps us. From the relevant page:

The CLR makes heavy use of a technique known as stack walking (or stack crawling). This involves iterating the sequence of call frames for a particular thread, from the most recent (the thread’s current function) back down to the base of the stack.

The runtime uses stack walks for a number of purposes:

The runtime walks the stacks of all threads during garbage collection, looking for managed roots (local variables holding object references in the frames of managed methods that need to be reported to the GC to keep the objects alive and possibly track their movement if the GC decides to compact the heap).

On some platforms the stack walker is used during the processing of exceptions (looking for handlers in the first pass and unwinding the stack in the second).

The debugger uses the functionality when generating managed stack traces.

Various miscellaneous methods, usually those close to some public managed API, perform a stack walk to pick up information about their caller (such as the method, class or assembly of that caller).

The rest of this post will explore what ‘Stack Walking’ is, how it works and why so many parts of the runtime need to be involved.

Table of Contents

Where does the CLR use ‘Stack Walking’?
The ‘Stack Walking’ API
Unwinding ‘Native’ Code
Unwinding ‘JITted’ Code
- Help from the ‘JIT Compiler’
Further Reading
- Stack Unwinding (general)
- Stack Unwinding (other runtimes)

Where does the CLR use ‘Stack Walking’?

Before we dig into the ‘internals’, let’s take a look at where the runtime utilises ‘stack walking’. Below is the full list (as of .NET Core CLR ‘Release 2.2’). All these examples end up calling into the Thread::StackWalkFrames(..) method here and provide a callback that is triggered whenever the API encounters a new section of the stack (see How to use it below for more info).

Common Scenarios

Garbage Collection (GC)
- ScanStackRoots(..) here -> callback
Exception Handling (unwinding)
- x86 - UnwindFrames(..) here -> callback
- x64 - ResetThreadAbortState(..) here -> callback
Exception Handling (resumption):
- ExceptionTracker::FindNonvolatileRegisterPointers(..) here -> callback
- ExceptionTracker::RareFindParentStackFrame(..) here -> callback
Threads:
- Thread::IsRunningIn(..) (AppDomain) here -> callback
- Thread::DetectHandleILStubsForDebugger(..) here -> callback
Thread Suspension:
- Thread::IsExecutingWithinCer() (‘Constrained Execution Region’) here (wrapper and callback)
- Thread::HandledJITCase(..) here -> callback, alternative callback

Debugging/Diagnostics

Debugger
- DebuggerWalkStack(..) here -> callback
- DebuggerWalkStackProc() here (called from DebuggerWalkStack(..)) -> callback
Managed APIs (e.g System.Diagnostics.StackTrace)
- Managed code calls via an InternalCall (C#) here into DebugStackTrace::GetStackFramesInternal(..) (C++) here
- Before ending up in DebugStackTrace::GetStackFramesHelper(..) here -> callback
DAC (via SOS) - Scan for GC ‘Roots’
- DacStackReferenceWalker::WalkStack<..>(..) here -> callback
Profiling API
- ProfToEEInterfaceImpl::ProfilerStackWalkFramesWrapper(..) here -> callback
Event Pipe (Diagnostics)
- EventPipe::WalkManagedStackForThread(..) here -> callback
CLR prints a Stack Trace (to the console/log, DEBUG builds only)
- PrintStackTrace() here (and other functions) -> callback

Obscure Scenarios

Reflection
- RuntimeMethodHandle::GetCurrentMethod(..) here (callback)
Application (App) Domains (See ‘Stack Crawl Marks’ below)
- SystemDomain::GetCallersMethod(..) here (also GetCallersType(..) and GetCallersModule(..)) (callback)
- SystemDomain::GetCallersModule(..) here (callback)
‘Code Pitching’
- CheckStacksAndPitch() here (wrapper and callback)
Extensible Class Factory (System.Runtime.InteropServices.ExtensibleClassFactory)
- RegisterObjectCreationCallback(..) here (callback)
Stack Sampler (unused?)
- StackSampler::ThreadProc() here (wrapper and callback)

Stack Crawl Marks

One of the above scenarios deserves a closer look, but firstly, why are ‘stack crawl marks’ used? From coreclr/issues/#21629 (comment):

Unfortunately, there is a ton of legacy APIs that were added during netstandard2.0 push whose behavior depend on the caller. The caller is basically passed in as an implicit argument to the API. Most of these StackCrawlMarks are there to support these APIs…

So we can see that multiple functions within the CLR itself need to have knowledge of their caller. To understand this some more, let’s look at an example, the GetType(string typeName) method. Here’s the flow from the externally-visible method all the way down to where the work is done. Note how a StackCrawlMark instance is passed through:

Type::GetType(string typeName) implementation (Creates StackCrawlMark.LookForMyCaller)
RuntimeType::GetType(.., ref StackCrawlMark stackMark) implementation
RuntimeType::GetTypeByName(.., ref StackCrawlMark stackMark, ..) implementation
extern void GetTypeByName(.., ref StackCrawlMark stackMark, ..) definition (call into native code, i.e. [DllImport(JitHelpers.QCall, ..)])
RuntimeTypeHandle::GetTypeByName(.., QCall::StackCrawlMarkHandle pStackMark, ..) implementation
TypeHandle TypeName::GetTypeManaged(.., StackCrawlMark* pStackMark, ..) implementation
TypeHandle TypeName::GetTypeWorker(.. , StackCrawlMark* pStackMark, ..) implementation
SystemDomain::GetCallersAssembly(StackCrawlMark *stackMark,..) implementation
SystemDomain::GetCallersModule(StackCrawlMark* stackMark, ..) implementation
SystemDomain::CallersMethodCallbackWithStackMark(..) callback implementation

In addition, the JIT (via the VM) has to ensure that all relevant methods are available in the call-stack, i.e. they can’t be removed:

Prevent in-lining CEEInfo::canInline(..) implementation
Prevent removal via a ‘tail call’ CEEInfo::canTailCall(..) implementation
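
To see why in-lining has to be blocked, here is a small simulation of a ‘stack crawl mark’ lookup. It is only a sketch under stated assumptions: the enum values are modelled on the managed StackCrawlMark enum, while `MarkedFrame`, `find_by_mark`, `example_stack` and the address ranges are all invented for illustration and are not the runtime's implementation. The idea is that the mark is a local variable, so the walker finds the frame whose stack range contains the mark's address and then counts outwards from there:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Modelled on the managed StackCrawlMark enum (subset only).
enum class StackCrawlMark { LookForMe = 0, LookForMyCaller = 1, LookForMyCallersCaller = 2 };

// Hypothetical simulated frame: a method name plus the stack range holding its locals.
struct MarkedFrame {
    std::string method;
    std::uintptr_t lowAddr;   // inclusive
    std::uintptr_t highAddr;  // exclusive
};

// 'stack' is ordered from the most recent frame to the oldest.
// Finds the frame that contains the mark, then steps 0, 1 or 2 frames outwards.
// Returns an empty string when the mark or the requested frame is not found.
std::string find_by_mark(const std::vector<MarkedFrame>& stack,
                         std::uintptr_t markAddr, StackCrawlMark mark) {
    for (std::size_t i = 0; i < stack.size(); ++i) {
        const MarkedFrame& f = stack[i];
        if (markAddr >= f.lowAddr && markAddr < f.highAddr) {
            const std::size_t target = i + static_cast<std::size_t>(mark);
            return target < stack.size() ? stack[target].method : std::string();
        }
    }
    return std::string();
}

// Made-up stack for a call to Type.GetType(..) from user code.
std::vector<MarkedFrame> example_stack() {
    return {
        {"RuntimeType::GetTypeByName", 0x100, 0x140},
        {"Type::GetType", 0x140, 0x180},  // the mark lives in this frame
        {"UserCode::Run", 0x180, 0x200},
        {"Program::Main", 0x200, 0x280},
    };
}
```

With the mark at an address inside the Type::GetType frame, LookForMyCaller resolves to UserCode::Run. If Type::GetType were in-lined into UserCode::Run, the mark would sit inside the caller's frame and the same lookup would report Program::Main instead, which is the kind of mistake the canInline(..) and canTailCall(..) checks guard against.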

However, the StackCrawlMark feature is currently being cleaned up, so it may look different in the future.

Exception Handling

The place that most .NET Developers will run into ‘stack traces’ is when dealing with exceptions. I originally intended to also describe ‘exception handling’ here, but then I opened up /src/vm/exceptionhandling.cpp and saw that it contained over 7,000 lines of code!! So I decided that it can wait for a future post 😁.

However, if you want to learn more about the ‘internals’ I really recommend Chris Brumme’s post The Exception Model (2003) which is the definitive guide on the topic (also see his Channel9 Videos) and as always, the ‘BotR’ chapter ‘What Every (Runtime) Dev needs to Know About Exceptions in the Runtime’ is well worth a read.

Also, I recommend taking a look at the slides from the ‘Internals of Exceptions’ talk and the related post .NET Inside Out Part 2 — Handling and rethrowing exceptions in C#, both by Adam Furmanek.

The ‘Stack Walking’ API

Now that we’ve seen where it’s used, let’s look at the ‘stack walking’ API itself. Firstly, how is it used?

How to use it

It’s worth pointing out that the only way you can access it from C#/F#/VB.NET code is via the StackTrace class; only the runtime itself can call into Thread::StackWalkFrames(..) directly. The simplest usage in the runtime is EventPipe::WalkManagedStackForThread(..) (see here), which is shown below. As you can see, it’s as simple as specifying the relevant flags, in this case ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS, and then providing the callback, which in the EventPipe class is the StackWalkCallback method (here).

bool EventPipe::WalkManagedStackForThread(Thread *pThread, StackContents &stackContents)
{
    CONTRACTL
    {
        NOTHROW;
        GC_NOTRIGGER;
        MODE_ANY;
        PRECONDITION(pThread != NULL);
    }
    CONTRACTL_END;

    // Calling into StackWalkFrames in preemptive mode violates the host contract,
    // but this contract is not used on CoreCLR.
    CONTRACT_VIOLATION( HostViolation );

    stackContents.Reset();

    StackWalkAction swaRet = pThread->StackWalkFrames(
        (PSTACKWALKFRAMESCALLBACK) &StackWalkCallback,
        &stackContents,
        ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS);

    return ((swaRet == SWA_DONE) || (swaRet == SWA_CONTINUE));
}

The StackWalkFrames(..) function then does the heavy-lifting of actually walking the stack, before triggering the callback shown below. In this case it just records the ‘Instruction Pointer’ (IP/PC) and the ‘managed function’, which is an instance of the MethodDesc obtained via the pCf->GetFunction() call:

StackWalkAction EventPipe::StackWalkCallback(CrawlFrame *pCf, StackContents *pData)
{
    CONTRACTL
    {
        NOTHROW;
        GC_NOTRIGGER;
        MODE_ANY;
        PRECONDITION(pCf != NULL);
        PRECONDITION(pData != NULL);
    }
    CONTRACTL_END;

    // Get the IP.
    UINT_PTR controlPC = (UINT_PTR)pCf->GetRegisterSet()->ControlPC;
    if (controlPC == 0)
    {
        if (pData->GetLength() == 0)
        {
            // This happens for pinvoke stubs on the top of the stack.
            return SWA_CONTINUE;
        }
    }

    _ASSERTE(controlPC != 0);

    // Add the IP to the captured stack.
    pData->Append(controlPC, pCf->GetFunction());

    // Continue the stack walk.
    return SWA_CONTINUE;
}
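
Stripped of the CLR specifics, the pattern in these two functions is ‘a walker that visits frames and a callback that decides whether to carry on’. The following is a minimal stand-alone sketch of that pattern, assuming a simulated stack. `SimFrame`, `WalkAction`, `walk_frames` and `record_frame` are hypothetical names and this `StackContents` is a simplified stand-in, so none of it is the runtime's code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Modelled on the SWA_CONTINUE / SWA_ABORT style of return value.
enum class WalkAction { Continue, Abort };

// Hypothetical simulated frame: a control PC plus the name of the managed method.
struct SimFrame {
    std::uintptr_t controlPC;
    std::string method;
};

// The callback receives each frame plus an opaque pointer to caller-owned state.
using WalkCallback = WalkAction (*)(const SimFrame& frame, void* data);

// Visits frames from the most recent to the oldest.
// Returns true if the end of the stack was reached, false if the callback aborted.
bool walk_frames(const std::vector<SimFrame>& stack, WalkCallback callback, void* data) {
    for (const SimFrame& frame : stack) {
        if (callback(frame, data) == WalkAction::Abort) {
            return false;
        }
    }
    return true;
}

// Simplified stand-in for the state that the callback fills in.
struct StackContents {
    std::vector<std::uintptr_t> ips;
    std::vector<std::string> methods;
};

// Records the PC and method of each frame. A zero PC is skipped only while
// nothing has been recorded yet, i.e. for a stub at the top of the stack.
WalkAction record_frame(const SimFrame& frame, void* data) {
    StackContents* contents = static_cast<StackContents*>(data);
    if (frame.controlPC == 0 && contents->ips.empty()) {
        return WalkAction::Continue;
    }
    contents->ips.push_back(frame.controlPC);
    contents->methods.push_back(frame.method);
    return WalkAction::Continue;
}
```

Calling `walk_frames(stack, &record_frame, &contents)` on a simulated stack fills `contents` in ‘most recent first’ order, the same order in which the real callback sees frames.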

How it works

Now onto the most interesting part, how the runtime actually walks the stack. Well, first let’s understand what the stack looks like, from the ‘BotR’ page:

The main thing to note is that a .NET ‘stack’ can contain 3 types of methods:

Managed - this represents code that started off as C#/F#/VB.NET, was turned into IL and then finally compiled to native code by the ‘JIT Compiler’.
Unmanaged - completely native code that exists outside of the runtime, i.e. an OS function the runtime calls into or a user call via P/Invoke. The runtime only cares about transitions into or out of regular unmanaged code, it doesn’t care about the stack frames within it.
Runtime Managed - still native code, but this is slightly different because the runtime cares more about this code. For example there are quite a few parts of the Base-Class libraries that make use of InternalCall methods, for more on this see the ‘Helper Method’ Frames section later on.

So the ‘stack walk’ has to deal with these different scenarios as it proceeds. Now let’s look at the ‘code flow’ starting with the entry-point method StackWalkFrames(..):

Thread::StackWalkFrames(..) here
- the entry-point function, the type of ‘stack walk’ can be controlled via these flags
Thread::StackWalkFramesEx(..) here
- worker-function that sets up the StackFrameIterator, via a call to StackFrameIterator::Init(..) here
StackFrameIterator::Next() here, then hands off to the primary worker method StackFrameIterator::NextRaw() here that does 5 things:
1. CheckForSkippedFrames(..) here, deals with frames that may have been allocated inside a managed stack frame (e.g. an inlined p/invoke call).
2. UnwindStackFrame(..) here, in-turn calls:
  - x64 - Thread::VirtualUnwindCallFrame(..) here, then calls VirtualUnwindNonLeafCallFrame(..) here or VirtualUnwindLeafCallFrame(..) here. All of these functions make use of the Windows API function RtlLookupFunctionEntry(..) to do the actual unwinding.
  - x86 - ::UnwindStackFrame(..) here, in turn calls UnwindEpilog(..) here and UnwindEspFrame(..) here. Unlike x64, under x86 all the ‘stack-unwinding’ is done manually, within the CLR code.
3. PostProcessingForManagedFrames(..) here, determines if the stack-walk is actually within a managed method rather than a native frame.
4. ProcessIp(..) here has the job of looking up the current managed method (if any) based on the current instruction pointer (IP). It does this by calling into EECodeInfo::Init(..) here and then ends up in one of:
  - EEJitManager::JitCodeToMethodInfo(..) here, that uses a very cool looking data structure referred to as a ‘nibble map’
  - NativeImageJitManager::JitCodeToMethodInfo(..) here
  - ReadyToRunJitManager::JitCodeToMethodInfo(..) here
5. ProcessCurrentFrame(..) here, does some final house-keeping and tidy-up.
CrawlFrame::GotoNextFrame() here
- in-turn calls pFrame->Next() here to walk through the ‘linked list’ of frames which drive the ‘stack walk’ (more on these ‘frames’ later)
StackFrameIterator::Filter() here
- essentially a huge switch statement that handles all the different Frame States and decides whether or not the ‘stack walk’ should continue.

When the stack walker gets a valid frame it triggers the callback in Thread::MakeStackwalkerCallback(..) here and passes in a pointer to the current CrawlFrame class defined here. This exposes methods such as IsFrameless(), GetFunction() and GetThisPointer(). The CrawlFrame actually represents 2 scenarios, based on the current IP:

Native code, represented by a Frame class defined here, which we’ll discuss more in a moment.
Managed code, well technically ‘managed code’ that was JITted to ‘native code’, so more accurately a managed stack frame. In this situation the MethodDesc class defined here is provided, you can read more about this key CLR data-structure in the corresponding BotR chapter.
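
The ‘nibble map’ mentioned in step 4 above is worth a quick illustration, because it is what turns an arbitrary instruction pointer into ‘the start of the method that contains it’. The version below is a simplified sketch of the idea and not the runtime's implementation. The bucket size, the alignment, the one-method-start-per-bucket rule and the names `NibbleMap`, `set_method_start` and `find_method_start` are all assumptions made for this example. Each 4-bit nibble covers one bucket of code: 0 means ‘no method starts here’, any other value encodes where in the bucket a method starts:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class NibbleMap {
public:
    static constexpr std::size_t kBucketSize = 32;  // bytes of code per nibble (assumed)
    static constexpr std::size_t kAlign = 4;        // method starts are 4-byte aligned (assumed)
    static constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

    // One nibble per bucket, eight nibbles packed into each 32-bit word.
    explicit NibbleMap(std::size_t heapSize)
        : words_(((heapSize + kBucketSize - 1) / kBucketSize + 7) / 8, 0) {}

    // Records that a method starts at 'offset' (relative to the start of the code heap).
    // This sketch allows at most one method start per bucket.
    void set_method_start(std::size_t offset) {
        assert(offset % kAlign == 0);
        const std::size_t bucket = offset / kBucketSize;
        assert(bucket / 8 < words_.size());
        const std::uint32_t value =
            static_cast<std::uint32_t>((offset % kBucketSize) / kAlign) + 1;  // 1..8
        const std::size_t shift = (bucket % 8) * 4;
        std::uint32_t& word = words_[bucket / 8];
        word = (word & ~(0xFu << shift)) | (value << shift);
    }

    // Returns the start offset of the nearest method at or before 'offset', or kNotFound.
    // A real lookup would also check that 'offset' lies before the end of that method.
    std::size_t find_method_start(std::size_t offset) const {
        std::size_t bucket = offset / kBucketSize;
        if (bucket >= words_.size() * 8) {
            return kNotFound;
        }
        // A start in the same bucket only counts if it is not after 'offset'.
        std::uint32_t value = nibble(bucket);
        if (value != 0) {
            const std::size_t start = bucket * kBucketSize + (value - 1) * kAlign;
            if (start <= offset) {
                return start;
            }
        }
        // Otherwise scan backwards for the closest earlier method start.
        while (bucket > 0) {
            --bucket;
            value = nibble(bucket);
            if (value != 0) {
                return bucket * kBucketSize + (value - 1) * kAlign;
            }
        }
        return kNotFound;
    }

private:
    std::uint32_t nibble(std::size_t bucket) const {
        return (words_[bucket / 8] >> ((bucket % 8) * 4)) & 0xFu;
    }

    std::vector<std::uint32_t> words_;
};
```

The appeal of this layout is how small it is: 4 bits of map per 32 bytes of code, and a lookup that normally inspects only a few nibbles before it finds a method start.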

See it ‘in Action’

Fortunately we’re able to turn on some nice diagnostics in a debug build of the CLR (COMPLUS_LogEnable, COMPLUS_LogToFile & COMPLUS_LogFacility). With that in place, given C# code like this:

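using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

internal class Program {
    private static void Main() {
        MethodA();
    }

    // NoInlining keeps each method as a separate frame on the stack
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MethodA() {
        MethodB();
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MethodB() {
        MethodC();
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MethodC() {
        // Capturing a StackTrace is what triggers the 'stack walk'
        var stackTrace = new StackTrace(fNeedFileInfo: true);
        Console.WriteLine(stackTrace.ToString());
    }
}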

We get the output shown below, in which you can see the ‘stack walking’ process. It starts in InitializeSourceInfo and CaptureStackTrace which are methods internal to the StackTrace class (see here), before moving up the stack MethodC -> MethodB -> MethodA and then finally stopping in the Main function. Along the way it does a ‘FILTER’ and ‘CONSIDER’ step before actually unwinding (‘finished unwind for …’):

TID 4740: STACKWALK    starting with partial context
TID 4740: STACKWALK: [000] FILTER  : EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cc48  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [001] CONSIDER: EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cc48  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [001] FILTER  : EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cc48  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [002] CONSIDER: EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cdd8  vtbl= 00007ffd`74995220 
TID 4740: STACKWALK    LazyMachState::unwindLazyState(ip:00007FFD7439C45C,sp:000000029977C338)
TID 4740: STACKWALK: [002] CALLBACK: EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cdd8  vtbl= 00007ffd`74995220 
TID 4740: STACKWALK    HelperMethodFrame::UpdateRegDisplay cached ip:00007FFD72FE9258, sp:000000029977D300
TID 4740: STACKWALK: [003] CONSIDER: FRAMELESS: PC= 00007ffd`72fe9258  SP= 00000002`9977d300  method=InitializeSourceInfo 
TID 4740: STACKWALK: [003] CALLBACK: FRAMELESS: PC= 00007ffd`72fe9258  SP= 00000002`9977d300  method=InitializeSourceInfo 
TID 4740: STACKWALK: [004] about to unwind for 'InitializeSourceInfo', SP: 00000002`9977d300 , IP: 00007ffd`72fe9258 
TID 4740: STACKWALK: [004] finished unwind for 'InitializeSourceInfo', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671 
TID 4740: STACKWALK: [004] CONSIDER: FRAMELESS: PC= 00007ffd`72eeb671  SP= 00000002`9977d480  method=CaptureStackTrace 
TID 4740: STACKWALK: [004] CALLBACK: FRAMELESS: PC= 00007ffd`72eeb671  SP= 00000002`9977d480  method=CaptureStackTrace 
TID 4740: STACKWALK: [005] about to unwind for 'CaptureStackTrace', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671 
TID 4740: STACKWALK: [005] finished unwind for 'CaptureStackTrace', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0 
TID 4740: STACKWALK: [005] CONSIDER: FRAMELESS: PC= 00007ffd`72eeadd0  SP= 00000002`9977d5b0  method=.ctor 
TID 4740: STACKWALK: [005] CALLBACK: FRAMELESS: PC= 00007ffd`72eeadd0  SP= 00000002`9977d5b0  method=.ctor 
TID 4740: STACKWALK: [006] about to unwind for '.ctor', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0 
TID 4740: STACKWALK: [006] finished unwind for '.ctor', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3 
TID 4740: STACKWALK: [006] CONSIDER: FRAMELESS: PC= 00007ffd`14c620d3  SP= 00000002`9977d5f0  method=MethodC 
TID 4740: STACKWALK: [006] CALLBACK: FRAMELESS: PC= 00007ffd`14c620d3  SP= 00000002`9977d5f0  method=MethodC 
TID 4740: STACKWALK: [007] about to unwind for 'MethodC', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3 
TID 4740: STACKWALK: [007] finished unwind for 'MethodC', SP: 00000002`9977d630 , IP: 00007ffd`14c62066 
TID 4740: STACKWALK: [007] CONSIDER: FRAMELESS: PC= 00007ffd`14c62066  SP= 00000002`9977d630  method=MethodB 
TID 4740: STACKWALK: [007] CALLBACK: FRAMELESS: PC= 00007ffd`14c62066  SP= 00000002`9977d630  method=MethodB 
TID 4740: STACKWALK: [008] about to unwind for 'MethodB', SP: 00000002`9977d630 , IP: 00007ffd`14c62066 
TID 4740: STACKWALK: [008] finished unwind for 'MethodB', SP: 00000002`9977d660 , IP: 00007ffd`14c62016 
TID 4740: STACKWALK: [008] CONSIDER: FRAMELESS: PC= 00007ffd`14c62016  SP= 00000002`9977d660  method=MethodA 
TID 4740: STACKWALK: [008] CALLBACK: FRAMELESS: PC= 00007ffd`14c62016  SP= 00000002`9977d660  method=MethodA 
TID 4740: STACKWALK: [009] about to unwind for 'MethodA', SP: 00000002`9977d660 , IP: 00007ffd`14c62016 
TID 4740: STACKWALK: [009] finished unwind for 'MethodA', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65 
TID 4740: STACKWALK: [009] CONSIDER: FRAMELESS: PC= 00007ffd`14c61f65  SP= 00000002`9977d690  method=Main 
TID 4740: STACKWALK: [009] CALLBACK: FRAMELESS: PC= 00007ffd`14c61f65  SP= 00000002`9977d690  method=Main 
TID 4740: STACKWALK: [00a] about to unwind for 'Main', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65 
TID 4740: STACKWALK: [00a] finished unwind for 'Main', SP: 00000002`9977d6d0 , IP: 00007ffd`742f9073 
TID 4740: STACKWALK: [00a] FILTER  : NATIVE   : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0 
TID 4740: STACKWALK: [00b] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977de58  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [00b] FILTER  : EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977de58  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [00c] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977e7e0  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [00c] FILTER  : EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977e7e0  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: SWA_DONE: reached the end of the stack

To find out more, you can search for these diagnostic messages in \vm\stackwalk.cpp, e.g. in Thread::DebugLogStackWalkInfo(..) here

Unwinding ‘Native’ Code

As explained in this excellent article:

There are fundamentally two main ways to implement exception propagation in an ABI (Application Binary Interface):

“dynamic registration”, with frame pointers in each activation record, organized as a linked list. This makes stack unwinding fast at the expense of having to set up the frame pointer in each function that calls other functions. This is also simpler to implement.

“table-driven”, where the compiler and assembler create data structures alongside the program code to indicate which addresses of code correspond to which sizes of activation records. This is called “Call Frame Information” (CFI) data in e.g. the GNU tool chain. When an exception is generated, the data in this table is loaded to determine how to unwind. This makes exception propagation slower but the general case faster.

It turns out that .NET uses the ‘table-driven’ approach, for the reason explained in the ‘BotR’:

The exact definition of a frame varies from platform to platform and on many platforms there isn’t a hard definition of a frame format that all functions adhere to (x86 is an example of this). Instead the compiler is often free to optimize the exact format of frames. On such systems it is not possible to guarantee that a stackwalk will return 100% correct or complete results (for debugging purposes, debug symbols such as pdbs are used to fill in the gaps so that debuggers can generate more accurate stack traces).

This is not a problem for the CLR, however, since we do not require a fully generalized stack walk. Instead we are only interested in those frames that are managed (i.e. represent a managed method) or, to some extent, frames coming from unmanaged code used to implement part of the runtime itself. In particular there is no guarantee about fidelity of 3rd party unmanaged frames other than to note where such frames transition into or out of the runtime itself (i.e. one of the frame types we do care about).
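
To make the ‘table-driven’ idea a bit more concrete, here is a stand-alone sketch of the kind of lookup it implies: find the table row that covers a given program counter, then use the recorded frame size to work out the caller's stack pointer. This is an illustration only. `FunctionEntry`, `lookup_function_entry`, `unwind_one_frame` and every address are hypothetical, and real unwind data (such as the x64 unwind codes) holds far more than a single frame size:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One row of a hypothetical unwind table, generated alongside the code.
struct FunctionEntry {
    std::uint32_t beginAddress;  // inclusive
    std::uint32_t endAddress;    // exclusive
    std::uint32_t frameSize;     // bytes the function has pushed/allocated below its return address
};

// Binary search of a table sorted by beginAddress for the row that covers 'pc'.
// Returns nullptr when 'pc' falls outside every function in the table.
const FunctionEntry* lookup_function_entry(const std::vector<FunctionEntry>& table,
                                           std::uint32_t pc) {
    auto it = std::upper_bound(
        table.begin(), table.end(), pc,
        [](std::uint32_t value, const FunctionEntry& entry) { return value < entry.beginAddress; });
    if (it == table.begin()) {
        return nullptr;
    }
    --it;
    return pc < it->endAddress ? &*it : nullptr;
}

// Computes the caller's stack pointer: skip the callee's frame, then the
// 8-byte return address (assuming a 64-bit target and a completed prologue).
bool unwind_one_frame(const std::vector<FunctionEntry>& table, std::uint32_t pc,
                      std::uint64_t sp, std::uint64_t& callerSp) {
    const FunctionEntry* entry = lookup_function_entry(table, pc);
    if (entry == nullptr) {
        return false;
    }
    callerSp = sp + entry->frameSize + 8;
    return true;
}
```

Nothing has to be set up at run time for this to work, which is why the ‘general case’ stays fast. The cost is only paid when something needs to unwind.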

Frames

To enable ‘unwinding’ of native code or more strictly the transitions ‘into’ and ‘out of’ native code, the CLR uses a mechanism of Frames, which are defined in the source code here. These frames are arranged into a hierarchy and there is one type of Frame for each scenario. For more info on these individual Frames take a look at the excellent source-code comments here.

Frame (abstract/base class)
- GCFrame
- FaultingExceptionFrame
- HijackFrame
- ResumableFrame
  - RedirectedThreadFrame
- InlinedCallFrame
- HelperMethodFrame
  - HelperMethodFrame_1OBJ
  - HelperMethodFrame_2OBJ
  - HelperMethodFrame_3OBJ
  - HelperMethodFrame_PROTECTOBJ
- TransitionFrame
  - StubHelperFrame
  - SecureDelegateFrame
    - MulticastFrame
  - FramedMethodFrame
    - ComPlusMethodFrame
    - PInvokeCalliFrame
    - PrestubMethodFrame
    - StubDispatchFrame
    - ExternalMethodFrame
    - TPMethodFrame
- UnmanagedToManagedFrame
  - ComMethodFrame
    - ComPrestubMethodFrame
  - UMThkCallFrame
- ContextTransitionFrame
- TailCallFrame
- ProtectByRefsFrame
- ProtectValueClassFrame
- DebuggerClassInitMarkFrame
- DebuggerSecurityCodeMarkFrame
- DebuggerExitFrame
- DebuggerU2MCatchHandlerFrame
- FuncEvalFrame
- ExceptionFilterFrame

‘Helper Method’ Frames

But to make sense of this, let’s look at one type of Frame, known as HelperMethodFrame (above). This is used when .NET code in the runtime calls into C++ code to do the heavy-lifting, often for performance reasons. One example is if you call Environment.GetCommandLineArgs() you end up in this code (C#), but note that it ends up calling an extern method marked with InternalCall:

[MethodImplAttribute(MethodImplOptions.InternalCall)]
private static extern string[] GetCommandLineArgsNative();

This means that the rest of the method is implemented in the runtime in C++. You can see how the method call is wired up, before ending up in SystemNative::GetCommandLineArgs here, which is shown below:

FCIMPL0(Object*, SystemNative::GetCommandLineArgs)
{
    FCALL_CONTRACT;

    PTRARRAYREF strArray = NULL;

    HELPER_METHOD_FRAME_BEGIN_RET_1(strArray); // <-- 'Helper method Frame' started here

    // Error handling and setup code removed for clarity

    strArray = (PTRARRAYREF) AllocateObjectArray(numArgs, g_pStringClass);
    // Copy each argument into new Strings.
    for(unsigned int i=0; i<numArgs; i++)
    {
        STRINGREF str = StringObject::NewString(argv[i]);
        STRINGREF * destData = ((STRINGREF*)(strArray->GetDataPtr())) + i;
        SetObjectReference((OBJECTREF*)destData, (OBJECTREF)str, strArray->GetAppDomain());
    }
    delete [] argv;

    HELPER_METHOD_FRAME_END(); // <-- 'Helper method Frame' ended/closed here

    return OBJECTREFToObject(strArray);
}
FCIMPLEND

Note: this code makes heavy use of macros, see this gist for the original code and then the expanded versions (Release and Debug). In addition, if you want more information on these mysterious FCalls as they are known (and the related QCalls) see Mscorlib and Calling Into the Runtime in the ‘BotR’.

But the main thing to look at in the code sample is the HELPER_METHOD_FRAME_BEGIN_RET_1() macro, which ultimately installs an instance of the HelperMethodFrame_1OBJ class. The macro expands into code like this:

FrameWithCookie < HelperMethodFrame_1OBJ > __helperframe(__me, Frame::FRAME_ATTR_NONE, (OBJECTREF * ) & strArray); 
{
  __helperframe.Push(); // <-- 'Helper method Frame' pushed

  Thread * CURRENT_THREAD = __helperframe.GetThread();
  const bool CURRENT_THREAD_AVAILABLE = true;
  (void) CURRENT_THREAD_AVAILABLE;; {
	Exception * __pUnCException = 0;
	Frame * __pUnCEntryFrame = ( & __helperframe);
	bool __fExceptionCatched = false;;
	try {;

	  // Original code from SystemNative::GetCommandLineArgs goes in here

	} catch (Exception * __pException) {;
	  do {} while (0);
	  __pUnCException = __pException;
	  UnwindAndContinueRethrowHelperInsideCatch(__pUnCEntryFrame, __pUnCException);
	  __fExceptionCatched = true;;
	}
	if (__fExceptionCatched) {;
	  UnwindAndContinueRethrowHelperAfterCatch(__pUnCEntryFrame, __pUnCException);
	}
  };
  
  __helperframe.Pop(); // <-- 'Helper method Frame' popped
};

Note the Push() and Pop() calls on __helperframe, which make it available for ‘stack walking’. You can also see the try/catch block that the CLR puts in place to ensure any exceptions from native code are turned into managed exceptions that C#/F#/VB.NET code can handle. If you’re interested, the full macro-expansion is available in this gist.

So in summary, these Frames are pushed onto a ‘linked list’ when calling into native code and popped off the list when returning from native code. This means that at any moment the ‘linked list’ contains all the current or active Frames.
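To make the push/pop idea concrete, here is a minimal sketch of a per-thread chain of frames. It is not the CLR's implementation: the Frame, Thread, ScopedFrame and WalkFrames names below are simplified, hypothetical stand-ins, and the only thing the sketch tries to show is that pushing on entry and popping on exit keeps the list equal to the set of active frames.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the CLR's Frame class: one node in a
// per-thread singly linked list of active frames.
struct Frame {
    const char* name;
    Frame* next;
};

// Hypothetical stand-in for the runtime's Thread: it only tracks the
// head of the frame chain.
struct Thread {
    Frame* topFrame = nullptr;
};

// RAII helper playing the role of the Push()/Pop() pair: the frame is
// linked in on construction and unlinked on destruction.
class ScopedFrame {
public:
    ScopedFrame(Thread& thread, const char* name)
        : thread_(thread), frame_{name, thread.topFrame} {
        thread_.topFrame = &frame_;  // push
    }
    ~ScopedFrame() {
        assert(thread_.topFrame == &frame_);  // frames must be popped in LIFO order
        thread_.topFrame = frame_.next;       // pop
    }
    ScopedFrame(const ScopedFrame&) = delete;
    ScopedFrame& operator=(const ScopedFrame&) = delete;

private:
    Thread& thread_;
    Frame frame_;
};

// A very simple 'stack walk': collect the names of all active frames,
// most recently pushed first.
std::vector<std::string> WalkFrames(const Thread& thread) {
    std::vector<std::string> names;
    for (const Frame* f = thread.topFrame; f != nullptr; f = f->next) {
        names.push_back(f->name);
    }
    return names;
}
```

With two nested ScopedFrame instances alive, WalkFrames returns the inner name followed by the outer one, and once both scopes have exited the list is empty again.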

Native Unwind Information

In addition to creating ‘Frames’, the CLR also ensures that the C++ compiler emits ‘unwind info’ for native code. We can see this if we use the DUMPBIN tool and run dumpbin /UNWINDINFO coreclr.dll. We get the following output for SystemNative::GetCommandLineArgs(..) (that we looked at before):

  0002F064 003789B0 00378B7E 004ED1D8  ?GetCommandLineArgs@SystemNative@@SAPEAVObject@@XZ (public: static class Object * __cdecl SystemNative::GetCommandLineArgs(void))
    Unwind version: 1
    Unwind flags: EHANDLER UHANDLER
    Size of prologue: 0x3B
    Count of codes: 13
    Unwind codes:
      29: SAVE_NONVOL, register=r12 offset=0x1C8
      25: SAVE_NONVOL, register=rdi offset=0x1C0
      21: SAVE_NONVOL, register=rsi offset=0x1B8
      1D: SAVE_NONVOL, register=rbx offset=0x1B0
      10: ALLOC_LARGE, size=0x190
      09: PUSH_NONVOL, register=r15
      07: PUSH_NONVOL, register=r14
      05: PUSH_NONVOL, register=r13
    Handler: 00148F14 __GSHandlerCheck_EH
    EH Handler Data: 00415990
    GS Unwind flags: EHandler UHandler
    Cookie Offset: 00000180

  0002F070 00378B7E 00378BB4 004ED26C
    Unwind version: 1
    Unwind flags: EHANDLER UHANDLER
    Size of prologue: 0x0A
    Count of codes: 2
    Unwind codes:
      0A: ALLOC_SMALL, size=0x20
      06: PUSH_NONVOL, register=rbp
    Handler: 0014978C __CxxFrameHandler3
    EH Handler Data: 00415990

If you want to understand more of what’s going on here I really recommend reading the excellent article x64 Manual Stack Reconstruction and Stack Walking. But in essence the ‘unwind info’ describes which registers are used within a method and how big the stack is for that method. These pieces of information are enough to tell the runtime how to ‘unwind’ that particular method when walking the stack.
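As a rough illustration of how the unwind codes encode ‘how big the stack is’, the sketch below adds up the stack space a prologue claims. It is a simplified model rather than the real unwinder: the UnwindOp and UnwindCode types are hypothetical, and the only rules assumed are that each PUSH_NONVOL moves the stack pointer by 8 bytes, ALLOC_SMALL/ALLOC_LARGE move it by their stated size, and SAVE_NONVOL stores a register without moving it.

```cpp
#include <cstdint>
#include <vector>

// Simplified subset of the x64 unwind operations seen in the dump above.
enum class UnwindOp { PushNonVol, AllocSmall, AllocLarge, SaveNonVol };

// Hypothetical, flattened form of one unwind code.
struct UnwindCode {
    UnwindOp op;
    std::uint32_t size;  // allocation size in bytes; ignored for the other ops
};

// Number of bytes by which the prologue moved the stack pointer.
std::uint32_t PrologueStackBytes(const std::vector<UnwindCode>& codes) {
    std::uint32_t total = 0;
    for (const UnwindCode& code : codes) {
        switch (code.op) {
            case UnwindOp::PushNonVol:
                total += 8;  // one 8-byte register pushed
                break;
            case UnwindOp::AllocSmall:
            case UnwindOp::AllocLarge:
                total += code.size;
                break;
            case UnwindOp::SaveNonVol:
                break;  // stored with a MOV, the stack pointer does not change
        }
    }
    return total;
}
```

Fed the codes from the first entry of the dump (four SAVE_NONVOL, one ALLOC_LARGE of 0x190 and three PUSH_NONVOL), this gives 0x190 + 3 * 8 = 0x1A8 bytes. The second entry (ALLOC_SMALL of 0x20 plus one PUSH_NONVOL) gives 0x28.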

Differences between Windows and Unix

However, to further complicate things, the ‘native code unwinding’ uses a different mechanism for ‘Windows’ v. ‘Unix’, as explained in coreclr/issues/#177 (comment):

Stack walker for managed code. JIT will generate regular Windows style unwinding info. We will reuse Windows unwinder code that we currently have checked in for debugger components for unwinding calls in managed code on Linux/Mac. Unfortunately, this work requires changes in the runtime that currently cannot be tested in the CoreCLR repo so it is hard to do this in the public right now. But we are working on fixing that because, as I mentioned at the beginning, our goal is do most work in the public.

Stack walker for native code. Here, in addition to everything else, we need to allow GC to unwind native stack of any thread in the current process until it finds a managed frame. Currently we are considering using libunwind (http://www.nongnu.org/libunwind) for unwinding native call stacks. @janvorli did some prototyping/experiments and it seems to do what we need. If you have any experience with this library or have any comments/suggestions please let us know.

This also shows that there are 2 different ‘unwind’ mechanisms, one for ‘managed’ and one for ‘native’ code. We will discuss how the “stack walker for managed code” works in Unwinding ‘JITted’ Code.

There is also some more information in coreclr/issues/#177 (comment):

My current work has two parts, as @sergiy-k has already mentioned. The windows style unwinder that will be used for the jitted code and Unix unwinder for native code that uses the libunwind’s low level unw_xxxx functions like unw_step etc.

So, for ‘native code’ the runtime uses an OS specific mechanism, i.e. on Unix the Open Source ‘libunwind’ library is used. You can see the differences in the code below (from here), under Windows Thread::VirtualUnwindCallFrame(..) (implementation) is called, but on Unix (i.e. FEATURE_PAL) PAL_VirtualUnwind(..) (implementation) is called instead:

#ifndef FEATURE_PAL
    pvControlPc = Thread::VirtualUnwindCallFrame(&ctx, &nonVolRegPtrs);
#else // !FEATURE_PAL
    ...
    BOOL success = PAL_VirtualUnwind(&ctx, &nonVolRegPtrs);
    ...
    pvControlPc = GetIP(&ctx);
#endif // !FEATURE_PAL

Before we move on, here are some links to the work that was done to support ‘stack walking’ when .NET Core CLR was ported to Linux:

Unwinding ‘JITted’ Code

Finally, we’re going to look at what happens with ‘managed code’, i.e. code that started off as C#/F#/VB.NET, was turned into IL and then compiled into native code by the ‘JIT Compiler’. This is the code that you generally want to see in your ‘stack trace’, because it’s code you wrote yourself!

Help from the ‘JIT Compiler’

Simply put, when the code is ‘JITted’, the compiler also emits some extra information, stored via the EECodeInfo class, which is defined here. Also see the ‘Unwind Info’ section in the JIT Compiler <-> Runtime interface, and note how it features separate sections for TARGET_ARM, TARGET_ARM64, TARGET_X86 and TARGET_UNIX.

In addition, in CodeGen::genFnProlog() here the JIT emits a function ‘prologue’ that contains several pieces of ‘unwind’ related data. This is also implemented in CEEJitInfo::allocUnwindInfo(..) in this piece of code, which behaves differently for each CPU architecture:

#if defined(_TARGET_X86_)
    // Do NOTHING
#elif defined(_TARGET_AMD64_)
    pUnwindInfo->Flags = UNW_FLAG_EHANDLER | UNW_FLAG_UHANDLER;
    ULONG * pPersonalityRoutine = (ULONG*)ALIGN_UP(&(pUnwindInfo->UnwindCode[pUnwindInfo->CountOfUnwindCodes]), sizeof(ULONG));
    *pPersonalityRoutine = ExecutionManager::GetCLRPersonalityRoutineValue();
#elif defined(_TARGET_ARM64_)
    *(LONG *)pUnwindInfo |= (1 << 20); // X bit
    ULONG * pPersonalityRoutine = (ULONG*)((BYTE *)pUnwindInfo + ALIGN_UP(unwindSize, sizeof(ULONG)));
    *pPersonalityRoutine = ExecutionManager::GetCLRPersonalityRoutineValue();
#elif defined(_TARGET_ARM_)
    *(LONG *)pUnwindInfo |= (1 << 20); // X bit
    ULONG * pPersonalityRoutine = (ULONG*)((BYTE *)pUnwindInfo + ALIGN_UP(unwindSize, sizeof(ULONG)));
    *pPersonalityRoutine = (TADDR)ProcessCLRException - baseAddress;
#endif

Also, the JIT has several Compiler::unwindXXX(..) methods, that are all implemented in per-CPU source files:

Fortunately, we can ask the JIT to output the unwind info that it emits, however this only works with a Debug version of the CLR. Given a simple method like this:

private void MethodA() {
    try {
        MethodB();
    } catch (Exception ex) {
        Console.WriteLine(ex.ToString());
    }
}

If we run SET COMPlus_JitUnwindDump=MethodA, we get the following output with 2 ‘Unwind Info’ sections, one for the try and the other for the catch block:

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0x00004e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x07
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 11 * 8 + 8 = 96 = 0x60
    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
Unwind Info:
  >> Start offset   : 0x00004e (not in unwind data)
  >>   End offset   : 0x0000e2 (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x07
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 5 * 8 + 8 = 48 = 0x30
    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)

This ‘unwind info’ is then looked up during a ‘stack walk’ as explained in the How it works section above.
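The lookup itself can be pictured as ‘find the unwind info whose code range contains the current instruction offset’. The sketch below is only an illustration under stated assumptions: UnwindRange, MethodARanges and FindUnwindRange are hypothetical names, the end offset is treated as exclusive (which matches the first section ending where the second begins), and a linear scan stands in for whatever lookup structure the runtime really uses. The stack sizes are derived from the dump above: 0x60 + 3 * 8 = 0x78 and 0x30 + 3 * 8 = 0x48.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified record: one per 'Unwind Info' section in the dump.
struct UnwindRange {
    std::uint32_t startOffset;  // inclusive
    std::uint32_t endOffset;    // exclusive (an assumption of this sketch)
    std::uint32_t stackBytes;   // bytes pushed/allocated by the prologue
};

// The two sections printed for MethodA, with stack sizes worked out by hand.
std::vector<UnwindRange> MethodARanges() {
    return {
        {0x00, 0x4e, 0x78},  // ALLOC_SMALL 0x60 + three pushed registers
        {0x4e, 0xe2, 0x48},  // ALLOC_SMALL 0x30 + three pushed registers
    };
}

// Returns the range covering 'offset', or nullptr if no range does.
const UnwindRange* FindUnwindRange(const std::vector<UnwindRange>& ranges,
                                   std::uint32_t offset) {
    for (const UnwindRange& range : ranges) {
        if (offset >= range.startOffset && offset < range.endOffset) {
            return &range;
        }
    }
    return nullptr;
}
```

An offset of 0x10 resolves to the first section, 0x4e to the second, and anything at or beyond 0xe2 lies outside the method, so the function returns nullptr.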

So next time you encounter a ‘stack trace’ remember that a lot of work went into making it possible!!

Exploring the .NET Core Runtime (in which I set myself a challenge)

2018-12-13T00:00:00+00:00

It seems like this time of year anyone with a blog is doing some sort of ‘advent calendar’, i.e. 24 posts leading up to Christmas. For instance there’s an F# one which inspired a C# one (C# copying from F#, that never happens 😉)

However, that’s a bit of a problem for me, I struggled to write 24 posts in my most productive year, let alone a single month! Also, I mostly blog about ‘.NET Internals’, a subject which doesn’t necessarily lend itself to the more ‘light-hearted’ posts you get in these ‘advent calendar’ blogs.

Until now!

Recently I’ve been giving a talk titled from ‘dotnet run’ to ‘hello world’, which attempts to explain everything that the .NET Runtime does from the point you launch your application till “Hello World” is printed on the screen:

From 'dotnet run' to 'hello world' from Matt Warren

But as I was researching and presenting this talk, it made me think about the .NET Runtime as a whole, what does it contain and most importantly what can you do with it?

Note: this is mostly for informational purposes, for the recommended way of achieving the same thing, take a look at this excellent Deep-dive into .NET Core primitives by Nate McMaster.

In this post I will explore what you can do using only the code in the dotnet/coreclr repository and along the way we’ll find out more about how the runtime interacts with the wider .NET Ecosystem.

To make things clearer, there are 3 challenges that will need to be solved before a simple “Hello World” application can be run. That’s because in the dotnet/coreclr repository there is:

No compiler, that lives in dotnet/Roslyn
No Framework Class Library (FCL) a.k.a. ‘dotnet/CoreFX’
No dotnet run as it’s implemented in the dotnet/CLI repository

Building the CoreCLR

But before we even work through these ‘challenges’, we need to build the CoreCLR itself. Helpfully there is a really nice guide available in ‘Building the Repository’:

The build depends on Git, CMake, Python and of course a C++ compiler. Once these prerequisites are installed the build is simply a matter of invoking the ‘build’ script (build.cmd or build.sh) at the base of the repository.

The details of installing the components differ depending on the operating system. See the following pages based on your OS. There is no cross-building across OS (only for ARM, which is built on X64). You have to be on the particular platform to build that platform.

Windows Build Instructions

Linux Build Instructions

macOS Build Instructions

FreeBSD Build Instructions

NetBSD Build Instructions

If you follow these steps successfully, you’ll end up with the following files (at least on Windows, other OSes may produce something slightly different):

No Compiler

First up, how do we get around the fact that we don’t have a compiler? After all, we need some way of turning our simple “Hello World” code into an .exe.

using System;

namespace Hello_World
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Fortunately we do have access to the ILASM tool (IL Assembler), which can turn Common Intermediate Language (CIL) into an .exe file. But how do we get the correct IL code? Well, one way is to write it from scratch, maybe after reading Inside NET IL Assembler and Expert .NET 2.0 IL Assembler by Serge Lidin (yes, amazingly, 2 books have been written about IL!)

Another, much easier way, is to use the amazing SharpLab.io site to do it for us! If you paste the C# code from above into it, you’ll get the following IL code:

.class private auto ansi '<Module>'
{
} // end of class <Module>

.class private auto ansi beforefieldinit Hello_World.Program
    extends [mscorlib]System.Object
{
    // Methods
    .method private hidebysig static 
        void Main (
            string[] args
        ) cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 11 (0xb)
        .maxstack 8

        IL_0000: ldstr "Hello World!"
        IL_0005: call void [mscorlib]System.Console::WriteLine(string)
        IL_000a: ret
    } // end of method Program::Main

    .method public hidebysig specialname rtspecialname 
        instance void .ctor () cil managed 
    {
        // Method begins at RVA 0x205c
        // Code size 7 (0x7)
        .maxstack 8

        IL_0000: ldarg.0
        IL_0001: call instance void [mscorlib]System.Object::.ctor()
        IL_0006: ret
    } // end of method Program::.ctor

} // end of class Hello_World.Program

Then, if we save this to a file called ‘HelloWorld.il’ and run the cmd ilasm HelloWorld.il /out=HelloWorld.exe, we get the following output:

Microsoft (R) .NET Framework IL Assembler.  Version 4.5.30319.0
Copyright (c) Microsoft Corporation.  All rights reserved.
Assembling 'HelloWorld.il'  to EXE --> 'HelloWorld.exe'
Source file is ANSI

HelloWorld.il(38) : warning : Reference to undeclared extern assembly 'mscorlib'. Attempting autodetect
Assembled method Hello_World.Program::Main
Assembled method Hello_World.Program::.ctor
Creating PE file

Emitting classes:
Class 1:        Hello_World.Program

Emitting fields and methods:
Global
Class 1 Methods: 2;

Emitting events and properties:
Global
Class 1
Writing PE file
Operation completed successfully

Nice, so part 1 is done, we now have our HelloWorld.exe file!

No Base Class Library

Well, not exactly. One problem is that System.Console lives in dotnet/corefx, where you can see the different files that make up the implementation, such as Console.cs, ConsolePal.Unix.cs, ConsolePal.Windows.cs, etc.

Fortunately, the nice CoreCLR developers included a simple Console implementation in System.Private.CoreLib.dll, the managed part of the CoreCLR, which was previously known as ‘mscorlib’ (before it was renamed). This internal version of Console is pretty small and basic, but it provides enough for what we need.

To use this ‘workaround’ we need to edit our HelloWorld.il to look like this (note the change from mscorlib to System.Private.CoreLib and from System.Console to Internal.Console):

.class public auto ansi beforefieldinit C
       extends [System.Private.CoreLib]System.Object
{
    .method public hidebysig static void M () cil managed 
    {
        .entrypoint
        // Code size 11 (0xb)
        .maxstack 8

        IL_0000: ldstr "Hello World!"
        IL_0005: call void [System.Private.CoreLib]Internal.Console::WriteLine(string)
        IL_000a: ret
    } // end of method C::M
    ...
}

Note: You can achieve the same thing with C# code instead of raw IL, by invoking the C# compiler with the following cmd-line:

csc -optimize+ -nostdlib -reference:System.Private.CoreLib.dll -out:HelloWorld.exe HelloWorld.cs

So we’ve completed part 2, we are able to at least print “Hello World” to the screen without using the CoreFX repository!

Now this is a nice little trick, but I wouldn’t ever recommend writing real code like this. Compiling against System.Private.CoreLib isn’t the right way of doing things. What the compiler normally does is compile against the publicly exposed surface area that lives in dotnet/corefx, but then at run-time a process called ‘Type-Forwarding’ is used to make that ‘reference’ implementation in CoreFX map to the ‘real’ implementation in the CoreCLR. For more on this entire process see The Rough History of Referenced Assemblies.
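Conceptually, type-forwarding behaves like a table that says ‘this type, which you think lives in assembly A, actually lives in assembly B’. The sketch below models just that idea and nothing about how the runtime stores or resolves forwarders: TypeKey, ForwardTable and ResolveAssembly are hypothetical names, and the single entry used in the usage note is purely illustrative.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// (assembly name, fully qualified type name)
using TypeKey = std::pair<std::string, std::string>;

// Hypothetical forwarding table: the value is the assembly that the
// type in the key has been forwarded to.
using ForwardTable = std::map<TypeKey, std::string>;

// Follows forwarders until it reaches an assembly that has no forwarder
// for the type, and returns that assembly's name.
std::string ResolveAssembly(const ForwardTable& forwards,
                            std::string assembly,
                            const std::string& type) {
    // Bound the number of hops so a cyclic table cannot loop forever.
    for (std::size_t hops = 0; hops <= forwards.size(); ++hops) {
        ForwardTable::const_iterator it = forwards.find(TypeKey(assembly, type));
        if (it == forwards.end()) {
            return assembly;  // no forwarder: the type is defined here
        }
        assembly = it->second;
    }
    return assembly;
}
```

For example, with a single entry forwarding System.Object from System.Runtime to System.Private.CoreLib, resolving ("System.Runtime", "System.Object") yields "System.Private.CoreLib", while a type with no entry resolves to the assembly it was asked about.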

However, only a small amount of managed code (i.e. C#) actually exists in the CoreCLR. To show this, the directory tree for /dotnet/coreclr/src/System.Private.CoreLib is available here and the tree with all ~1280 .cs files included is here.

As a concrete example, if you look in CoreFX, you’ll see that the System.Reflection implementation is pretty empty! That’s because it’s a ‘partial facade’ that is eventually ‘type-forwarded’ to System.Private.CoreLib.

If you’re interested, the entire API that is exposed in CoreFX (but actually lives in CoreCLR) is contained in System.Runtime.cs. But back to our example, here is the code that describes all the GetMethod(..) functions in the ‘System.Reflection’ API.

To learn more about ‘type forwarding’, I recommend watching ‘.NET Standard - Under the Hood’ (slides) by Immo Landwerth and there is also some more in-depth information in ‘Evolution of design time assemblies’.

But why is this code split useful? From the CoreFX README:

Runtime-specific library code (mscorlib) lives in the CoreCLR repo. It needs to be built and versioned in tandem with the runtime. The rest of CoreFX is agnostic of runtime-implementation and can be run on any compatible .NET runtime (e.g. CoreRT).

And from the other point-of-view, in the CoreCLR README:

By itself, the Microsoft.NETCore.Runtime.CoreCLR package is actually not enough to do much. One reason for this is that the CoreCLR package tries to minimize the amount of the class library that it implements. Only types that have a strong dependency on the internal workings of the runtime are included (e.g, System.Object, System.String, System.Threading.Thread, System.Threading.Tasks.Task and most foundational interfaces).

Instead most of the class library is implemented as independent NuGet packages that simply use the .NET Core runtime as a dependency. Many of the most familiar classes (System.Collections, System.IO, System.Xml and so on), live in packages defined in the dotnet/corefx repository.

One huge benefit of this approach is that Mono can share large amounts of the CoreFX code, as shown in this tweet:

How Mono reuses .NET Core sources for BCL (doesn't include runtime, tools, etc) according to my calculations 🙂 pic.twitter.com/8JCDxqwnNi
— Egor Bogatov (@EgorBo) March 27, 2018

No Launcher

So far we’ve ‘compiled’ our code (well technically ‘assembled’ it) and we’ve been able to access a simple version of System.Console, but how do we actually run our .exe? Remember we can’t use the dotnet run command because that lives in the dotnet/CLI repository (and that would be breaking the rules of this slightly contrived challenge!!).

Again, fortunately those clever runtime engineers have thought of this exact scenario and they built the very helpful corerun application. You can read more about it in Using corerun To Run .NET Core Application, but the tl;dr is that it will only look for dependencies in the same folder as your .exe.

So, to complete the challenge, we can now run CoreRun HelloWorld.exe:

# CoreRun HelloWorld.exe
Hello World!

Yay, the least impressive demo you’ll see this year!!

For more information on how you can ‘host’ the CLR in your application I recommend this excellent tutorial Write a custom .NET Core host to control the .NET runtime from your native code. In addition, the docs page on ‘Runtime Hosts’ gives a nice overview of the different hosts that are available:

The .NET Framework ships with a number of different runtime hosts, including the hosts listed in the following table.

Runtime Host Description

ASP.NET Loads the runtime into the process that is to handle the Web request. ASP.NET also creates an application domain for each Web application that will run on a Web server.

Microsoft Internet Explorer Creates application domains in which to run managed controls. The .NET Framework supports the download and execution of browser-based controls. The runtime interfaces with the extensibility mechanism of Microsoft Internet Explorer through a mime filter to create application domains in which to run the managed controls. By default, one application domain is created for each Web site.

Shell executables Invokes runtime hosting code to transfer control to the runtime each time an executable is launched from the shell.

Open Source .NET – 4 years later

2018-12-04T00:00:00+00:00

A little over 4 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as this slide from New Features in .NET Core and ASP.NET Core 2.1 shows, the community has been contributing in a significant way:

Side-note: This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:

Runtime Changes

Before I look at the numbers, I just want to take a moment to look at the significant runtime changes that have taken place over the last 4 years. Partly because I really like looking at the ‘Internals’ of CoreCLR, but also because the runtime is the one repository that makes all the others possible, they rely on it!

To give some context, here are the slides from a presentation I did called ‘From ‘dotnet run’ to ‘hello world’’. If you flick through them you’ll see what components make up the CoreCLR code-base and what they do to make your application run.

From 'dotnet run' to 'hello world' from Matt Warren

So, after a bit of digging through the 19,059 commits, 5,790 issues and the 8 projects, here’s the list of significant changes in the .NET Core Runtime (CoreCLR) over the last few years (if I’ve missed any out, please let me know!!):

Span<T> (more info)
- Span<T> (‘umbrella’ issue for the whole feature)
  - Includes changes to multiple parts of the runtime, the VM, JIT and GC
- Will .NET Core 2.1’s Span-based APIs be made available on the .NET Framework? If so, when?
- Also needed CoreFX work such as Add initial Span/Buffer-based APIs across corefx and String-like extension methods to ReadOnlySpan<char> Epic and Compiler changes, e.g. Compile time enforcement of safety for ref-like types
ref-like types (to support Span<T>)
Tiered Compilation (more info)
- Tiered Compilation step 1, profiler changes for tiered compilation, Fix x86 steady state tiered compilation performance
- Also see the more general ‘Code Versioning’ design doc and Enable Tiered Compilation by default
Cross-platform (Unix, OS X, etc, see list of all ‘os-xxx’ labels)
New CPU Architectures
- ARM64 Project
- ARM32 Project
- List of all issues labelled ‘arch-xxx’
Hardware Intrinsics (project)
- Design Document
- Using .NET Hardware Intrinsics API to accelerate machine learning scenarios contains a nice overview of the implementation
Default Interface Methods (project)
- Runtime support for the default interface methods C# language feature.
Performance Monitoring and Diagnostics (project)
Ready-to-Run Images
- ReadyToRun Overview
- Bing.com runs on .NET Core 2.1! (section on ‘ReadyToRun Images’)
LocalGC (project)
- See it in action in Zero Garbage Collector for .NET Core and the follow-up Zero Garbage Collector for .NET Core 2.1 and ASP.NET Core 2.1
Unloadability (project)
- Support for unloading AssemblyLoadContext and all assemblies loaded into it.

So there have been quite a few large, fundamental changes to the runtime since it was open-sourced.

Repository activity over time

But onto the data, first we are going to look at an overview of the level of activity in each repo, by analysing the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (Sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.

Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.

This data gives a good indication of how healthy different repos are: are they growing over time, or staying the same? You can also see the different levels of activity each repo has and how they compare to other ones.

Whilst it’s clear that Visual Studio Code is way ahead of all the other repos (in ‘# of Issues’), it’s interesting to see that some of the .NET-only ones are still pretty large, notably CoreFX (base-class libraries), Roslyn (compiler) and CoreCLR (runtime).

Overall Participation - Community v. Microsoft

Next we will look at the total participation from the last 4 years, i.e. November 2014 to November 2018. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a speling mistake. Whilst this isn’t ideal it’s the simplest way to get an idea of the Microsoft/Community split. In addition, Community does include people paid by other companies to work on .NET Projects, for instance Samsung Engineers.

Note: You can hover over the bars to get the actual numbers, rather than percentages.

Issues: Microsoft Community

Pull Requests: Microsoft Community

Participation over time - Community v. Microsoft

Finally we can see the ‘per-month’ data from the last 4 years, i.e. November 2014 to November 2018.

Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on each graph is re-scaled, so the maximum value will change each time.

Issues: Microsoft Community

Pull Requests: Microsoft Community

Summary

It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!

A History of .NET Runtimes

2018-10-02T00:00:00+00:00

Recently I was fortunate enough to chat with Chris Bacon who wrote DotNetAnywhere (an alternative .NET Runtime) and I quipped with him:

.. you’re probably one of only a select group(*) of people who’ve written a .NET runtime, that’s pretty cool!

* if you exclude people who were paid to work on one, i.e. Microsoft/Mono/Xamarin engineers, it’s a very select group.

But it got me thinking, how many .NET Runtimes are there? I put together my own list, then enlisted a crack team of highly-paid researchers, a.k.a my twitter followers:

#LazyWeb, fun Friday quiz, how many different .NET Runtimes are there? (that implement ECMA-335 https://t.co/76stuYZLrw)
- .NET Framework
- .NET Core
- Mono
- Unity
- .NET Compact Framework
- DotNetAnywhere
- Silverlight
What have I missed out?
— Matt Warren (@matthewwarren) September 14, 2018

For the purposes of this post I’m classifying a ‘.NET Runtime’ as anything that implements the ECMA-335 Standard for .NET (more info here). I don’t know if there’s a more precise definition or even some way of officially verifying conformance, but in practice it means that the runtimes can take a .NET exe/dll produced by any C#/F#/VB.NET compiler and run it.

Once I had the list, I made copious use of Wikipedia (see the list of ‘References’) and came up with the following timeline:

(If the interactive timeline isn’t working for you, take a look at this version)

If I’ve missed out any runtimes, please let me know!

To make the timeline a bit easier to understand, I put each runtime into one of the following categories:

Microsoft .NET Frameworks
Other Microsoft Runtimes
Mono/Xamarin Runtimes
'Ahead-of-Time' (AOT) Runtimes
Community Projects
Research Projects

The rest of the post will look at the different runtimes in more detail: why they were created, what they can do and how they compare to each other.

Microsoft .NET Frameworks

The original ‘.NET Framework’ was started by Microsoft in the late 1990’s and has been going strong ever since. Recently they’ve changed course somewhat with the announcement of .NET Core, which is ‘open-source’ and ‘cross-platform’. In addition, by creating the .NET Standard they’ve provided a way for different runtimes to remain compatible:

.NET Standard is for sharing code. .NET Standard is a set of APIs that all .NET implementations must provide to conform to the standard. This unifies the .NET implementations and prevents future fragmentation.

As an aside, if you want more information on the ‘History of .NET’, I really recommend Anders Hejlsberg - What brought about the birth of the CLR? and this presentation by Richard Campbell who really knows how to tell a story!

(Also available as a podcast if you’d prefer and he’s working on a book covering the same subject. If you want to learn more about the history of the entire ‘.NET Ecosystem’ not just the Runtimes, check out ‘Legends of .NET’)

Other Microsoft Runtimes

But outside of the main general purpose ‘.NET Framework’, Microsoft have also released other runtimes, designed for specific scenarios.

.NET Compact Framework

The Compact (.NET CF) and Micro (.NET MF) Frameworks were both attempts to provide cut-down runtimes that would run on more constrained devices, for instance .NET CF:

… is designed to run on resource constrained mobile/embedded devices such as personal digital assistants (PDAs), mobile phones factory controllers, set-top boxes, etc. The .NET Compact Framework uses some of the same class libraries as the full .NET Framework and also a few libraries designed specifically for mobile devices such as .NET Compact Framework controls. However, the libraries are not exact copies of the .NET Framework; they are scaled down to use less space.

.NET Micro Framework

The .NET MF is even more constrained:

… for resource-constrained devices with at least 256 KB of flash and 64 KB of random-access memory (RAM). It includes a small version of the .NET Common Language Runtime (CLR) and supports development in C#, Visual Basic .NET, and debugging (in an emulator or on hardware) using Microsoft Visual Studio. NETMF features a subset of the .NET base class libraries (about 70 classes with about 420 methods),.. NETMF also features added libraries specific to embedded applications. It is free and open-source software released under Apache License 2.0.

If you want to try it out, Scott Hanselman did a nice write-up The .NET Micro Framework - Hardware for Software People.

Silverlight

Although now only in support mode (or ‘dead’/‘sunsetted’ depending on your POV), it’s interesting to go back to the original announcement and see what Silverlight was trying to do:

Silverlight is a cross platform, cross browser .NET plug-in that enables designers and developers to build rich media experiences and RIAs for browsers. The preview builds we released this week currently support Firefox, Safari and IE browsers on both the Mac and Windows.

Back in 2007, Silverlight 1.0 had the following features (it even worked on Linux!):

Built-in codec support for playing VC-1 and WMV video, and MP3 and WMA audio within a browser…

Silverlight supports the ability to progressively download and play media content from any web-server…

Silverlight also optionally supports built-in media streaming…

Silverlight enables you to create rich UI and animations, and blend vector graphics with HTML to create compelling content experiences…

Silverlight makes it easy to build rich video player interactive experiences…

Mono/Xamarin Runtimes

Mono came about when Miguel de Icaza and others explored the possibility of making .NET work on Linux (from Mono early history):

Who came first is not an important question to me, because Mono to me is a means to an end: a technology to help Linux succeed on the desktop.

The same post also talks about how it started:

On the Mono side, the events were approximately like this:

As soon as the .NET documents came out in December 2000, I got really interested in the technology, and started where everyone starts: at the byte code interpreter, but I faced a problem: there was no specification for the metadata though.

The last modification to the early VM sources was done on January 22 2001, around that time I started posting to the .NET mailing lists asking for the missing information on the metadata file format.

…

About this time Sam Ruby was pushing at the ECMA committee to get the binary file format published, something that was not part of the original agenda. I do not know how things developed, but by April 2001 ECMA had published the file format.

Over time, Mono (now Xamarin) has branched out into wider areas. It runs on Android and iOS/Mac and was acquired by Microsoft in Feb 2016. In addition, Unity and Mono/Xamarin have long worked together to provide C# support in Unity, and Unity is now a member of the .NET Foundation.

'Ahead-of-Time' (AOT) Runtimes

I wanted to include AOT runtimes as a separate category, because traditionally .NET has been ‘Just-in-Time’ compiled, but over time more and more ‘Ahead-of-Time’ compilation options have become available.

As far as I can tell, Mono was the first, with an ‘AOT’ mode since Aug 2006, but more recently Microsoft have released .NET Native and they’re working on CoreRT - A .NET Runtime for AOT.

Community Projects

However, not all ‘.NET Runtimes’ were developed by Microsoft, or companies that they later acquired. There are some ‘Community’ owned ones:

The oldest is DotGNU Portable.NET, which started at the same time as Mono, with the goal ‘to build a suite of Free Software tools to compile and execute applications for the Common Language Infrastructure (CLI)..’.
Secondly, there is DotNetAnywhere, the work of just one person, Chris Bacon. DotNetAnywhere has the claim to fame that it provided the initial runtime for the Blazor project. However it’s also an excellent resource if you want to look at what makes up a ‘.NET Compatible-Runtime’ and don’t have the time to wade through the millions of lines-of-code that make up the CoreCLR!
Next comes CosmosOS (GitHub project), which is not just a .NET Runtime, but a ‘Managed Operating System’. If you want to see how it achieves this I recommend reading through the excellent FAQ or taking a quick look under the hood. Another similar effort is SharpOS.
Finally, I recently stumbled across CrossNet, which takes a different approach, it ‘parses .NET assemblies and generates unmanaged C++ code that can be compiled on any standard C++ compiler.’ Take a look at the overview docs and example of generated code to learn more.

Research Projects

Finally, onto the more esoteric .NET Runtimes. These are the Research Projects run by Microsoft, with the aim of seeing just how far you can extend a ‘managed runtime’ and what it can be used for. Some of this research work has made its way back into commercial/shipping .NET Runtimes, for instance Span<T> came from Midori.

Shared Source Common Language Infrastructure (SSCLI) (a.k.a. ‘Rotor’):

is Microsoft’s shared source implementation of the CLI, the core of .NET. Although the SSCLI is not suitable for commercial use due to its license, it does make it possible for programmers to examine the implementation details of many .NET libraries and to create modified CLI versions. Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use.

An interesting side-effect of releasing Rotor is that they were also able to release the ‘Gyro’ Project, which gives an idea of how Generics were added to the .NET Runtime.

Midori:

Midori was the code name for a managed code operating system being developed by Microsoft with joint effort of Microsoft Research. It had been reported to be a possible commercial implementation of the Singularity operating system, a research project started in 2003 to build a highly dependable operating system in which the kernel, device drivers, and applications are all written in managed code. It was designed for concurrency, and could run a program spread across multiple nodes at once. It also featured a security model that sandboxes applications for increased security. Microsoft had mapped out several possible migration paths from Windows to Midori. The operating system was discontinued some time in 2015, though many of its concepts were rolled into other Microsoft projects.

Midori is the project that appears to have led to the most ideas making their way back into the ‘.NET Framework’. You can read more about this in Joe Duffy’s excellent series Blogging about Midori.

Singularity (operating system) (also Singularity RDK)

Singularity is an experimental operating system (OS) which was built by Microsoft Research between 2003 and 2010. It was designed as a high dependability OS in which the kernel, device drivers, and application software were all written in managed code. Internal security uses type safety instead of hardware memory protection.

Last, but not least, there is Redhawk:

Codename for experimental minimal managed code runtime that evolved into CoreRT.

References

Below are the Wikipedia articles I referenced when creating the timeline:

Fuzzing the .NET JIT Compiler

2018-08-28T00:00:00+00:00

I recently came across the excellent ‘Fuzzlyn’ project, created as part of the ‘Language-Based Security’ course at Aarhus University. As per the project description Fuzzlyn is a:

… fuzzer which utilizes Roslyn to generate random C# programs

And what is a ‘fuzzer’, from the Wikipedia page for ‘fuzzing’:

Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.

Or in other words, a fuzzer is a program that tries to create source code that finds bugs in a compiler.

Massive kudos to the developers behind Fuzzlyn, Jakob Botsch Nielsen (who helped answer my questions when writing this post), Chris Schmidt and Jonas Larsen, it’s an impressive project!! (to be clear, I have no link with the project and can’t take any of the credit for it)

Compilation in .NET

But before we dive into ‘Fuzzlyn’ and what it does, we’re going to take a quick look at ‘compilation’ in the .NET Framework. When you write C#/VB.NET/F# code (delete as appropriate) and compile it, the compiler converts it into Intermediate Language (IL) code. The IL is then stored in a .exe or .dll, which the Common Language Runtime (CLR) reads and executes when your program is actually run. However it’s the job of the Just-in-Time (JIT) Compiler to convert the IL code into machine code.

Why is this relevant? Because Fuzzlyn works by comparing the output of a Debug and a Release version of a program and if they are different, there’s a bug! But it turns out that very few optimisations are actually done by the ‘Roslyn’ compiler, compared to what the JIT does, from Eric Lippert’s excellent post What does the optimize switch do? (2009):

The /optimize flag does not change a huge amount of our emitting and generation logic. We try to always generate straightforward, verifiable code and then rely upon the jitter to do the heavy lifting of optimizations when it generates the real machine code. But we will do some simple optimizations with that flag set. For example, with the flag set:

He then goes on to list the 15 things that the C# Compiler will optimise, before finishing with this:

That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.

So in .NET, very few of the techniques that an ‘Optimising Compiler’ uses are done at compile-time. They are almost all done at run-time by the JIT Compiler (leaving aside AOT scenarios for the time being).

For reference, most of the differences in IL are there to make the code easier to debug, for instance given this C# code:

public void M() {
    foreach (var item in new [] { 1, 2, 3, 4 }) {
      Console.WriteLine(item);
    }
}

The differences in IL are shown below (‘Release’ on the left, ‘Debug’ on the right). As you can see there are a few extra nop instructions to allow the debugger to ‘step-through’ more locations in the code, plus an extra local variable, which makes it easier/possible to see the value when debugging.

(click for larger image or you can view the ‘Release’ version and the ‘Debug’ version on the excellent SharpLab)

For more information on the differences in Release/Debug code-gen see the ‘Release (optimized)’ section in this doc on CodeGen Differences. Also, because Roslyn is open-source we can see how this is handled in the code:

This all means that the ‘Fuzzlyn’ project has actually been finding bugs in the .NET JIT, not in the Roslyn Compiler.

(well, except this one Finally block belonging to unexecuted try runs anyway, which was fixed here)

How it works

At the simplest level, Fuzzlyn works by compiling and running a piece of randomly generated code in ‘Debug’ and ‘Release’ versions and comparing the output. If the 2 versions produce different results, then it’s a bug, specifically a bug in the optimisations that the JIT compiler has attempted.

The .NET JIT, known as ‘RyuJIT’, has several modes. It can produce fully optimised code that has the highest performance, or it can produce more ‘debug’ friendly code that has no optimisations, but is much simpler. You can find out more about the different ‘optimisations’ that RyuJIT performs in this excellent tutorial, in this design doc or you can search through the code for usages of the ‘compDbgCode’ flag.

From a high-level Fuzzlyn goes through the following steps:

Randomly generate a C# program
Check if the code produces an error (Debug v. Release)
Reduce the code to its simplest form

If you want to see this in action, I ran Fuzzlyn until it produced a randomly generated program with a bug. You can see the original source (6,802 LOC) and the reduced version (28 LOC). What’s interesting is that you can clearly see the buggy line-of-code in the original code, before it’s turned into a simplified version:

// Generated by Fuzzlyn v1.1 on 2018-08-22 15:19:26
// Seed: 14928117313359926641
// Reduced from 256.3 KiB to 0.4 KiB in 00:01:58
// Debug: Prints 0 line(s)
// Release: Prints 1 line(s)
public class Program
{
    static short s_18;
    static byte s_33 = 1;
    static int[] s_40 = new int[]{0};
    static short s_74 = 1;
    public static void Main()
    {
        s_18 = -1;
        // This comparison is the bug, in Debug it's False, in Release it's True
        // However, '(ushort)(s_18 | 2L)' is 65,535 in Debug *and* Release
        if (((ushort)(s_18 | 2L) <= s_40[0])) 
        {
            s_74 = 0;
        }

        bool vr10 = s_74 < s_33;
        if (vr10)
        {
            System.Console.WriteLine(0);
        }
    }
}
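
To see why the Debug result is the correct one, it helps to evaluate the condition one step at a time. The following is only a minimal sketch of that arithmetic: the CastDemo class and its Evaluate method are hypothetical names made up for this illustration and are not part of Fuzzlyn or the generated program.

```csharp
using System;

// Hypothetical helper, written only to illustrate the arithmetic in the reduced program.
public static class CastDemo
{
    // Evaluates '(ushort)(s_18 | 2L) <= s_40[0]' step by step.
    public static (long Widened, ushort Truncated, bool Comparison) Evaluate(short s18, int element)
    {
        // The short is sign-extended to a long before the OR is applied
        long widened = s18 | 2L;

        // The cast keeps only the low 16 bits of the long
        ushort truncated = unchecked((ushort)widened);

        // Both operands are promoted to int for the comparison
        bool comparison = truncated <= element;

        return (widened, truncated, comparison);
    }
}
```

Calling CastDemo.Evaluate(-1, 0) gives (-1, 65535, False): 65,535 is not less than or equal to 0, so the 'if' body should never run. That matches the Debug behaviour and is consistent with the 'Cast to ushort is dropped in release' entry in the bug list further down.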

Random Code Generation

Fuzzlyn can’t produce every type of C# program, however it does support quite a few language features, from Supported constructs:

Fuzzlyn generates only a limited subset of C#. Most importantly, it does not support loops yet. It supports structs and classes, though it does not generate member methods in these. We make no attempt to fully support all kinds of expressions and statements.

To see the code for these generators, follow the links below:

CodeGenerator
LiteralGenerator
FuncGenerator, with specific generator for a:
Binary Operation tables, which are themselves generated using Roslyn

All the statements and expressions that are currently supported are listed here. Interestingly enough the type of statement/expression chosen is not completely random, instead they are chosen using probability tables that look like this:

public ProbabilityDistribution StatementTypeDist { get; set; }
  = new TableDistribution(new Dictionary<int, double>
  {
      [(int)StatementKind.Assignment] = 0.57,
      [(int)StatementKind.If] = 0.17,
      [(int)StatementKind.Block] = 0.1,
      [(int)StatementKind.Call] = 0.1,
      [(int)StatementKind.TryFinally] = 0.05,
      [(int)StatementKind.Return] = 0.01,
  });
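
As a rough illustration of how a table like this can drive the choice, here is a minimal sketch of weighted selection. The WeightedTable type is hypothetical and written just for this post; Fuzzlyn's own TableDistribution may well be implemented differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of table-driven selection, not Fuzzlyn's implementation.
public sealed class WeightedTable<T>
{
    private readonly List<(T Item, double Weight)> _entries;
    private readonly double _total;

    public WeightedTable(IEnumerable<(T Item, double Weight)> entries)
    {
        _entries = new List<(T Item, double Weight)>(entries);
        if (_entries.Count == 0)
            throw new ArgumentException("At least one entry is required", nameof(entries));

        foreach (var entry in _entries)
            _total += entry.Weight;
    }

    // Maps a uniform sample in [0, 1) to an item by walking the cumulative weights.
    public T Pick(double uniformSample)
    {
        double target = uniformSample * _total;
        double cumulative = 0;
        foreach (var entry in _entries)
        {
            cumulative += entry.Weight;
            if (target < cumulative)
                return entry.Item;
        }

        // Only reached if floating-point rounding leaves the target just past the end
        return _entries[_entries.Count - 1].Item;
    }

    public T Pick(Random rng) => Pick(rng.NextDouble());
}
```

With the weights shown above, a sample of 0.6 falls past the first 0.57 but inside the next 0.17, so it selects the second entry ('If'). Taking the sample as a parameter, rather than always calling Random inside, keeps the selection deterministic for a given seed.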

As we saw before, the initial program that Fuzzlyn produces is quite large (over 5,000 LOC), so why does it create and execute a very large program?

Partly because it’s quicker to do this compared to working with lots of smaller programs, i.e. the steps of generation, compilation and starting new processes can be reduced by running large programs.

In addition, Jakob explained the other reasons:

Empirically, other similar projects have shown that larger programs are better. Csmith authors report that most bugs were found with examples of around 80 KB (I don’t remember the exact number). We actually found the same thing in v1.0 – our examples had an average size of 76 KB

Small programs do not get as many opportunities to generate a lot of patterns. For example, it is very unlikely that a small program will have a method taking a byte parameter and at the same time, a method returning a ref byte (this pattern has a bug on Linux: dotnet/coreclr#19256).

We mainly adjusted our probabilities based on how the examples looked. We strived for the generator to produce code that looked relatively like human code. This included going for a wide range of program sizes. By the way, you can run Fuzzlyn with --stats --num-programs=10000 to get a view of the distribution of program sizes – it will output stats for every 500 programs generated.

‘Checking’ for bugs

To check if the behaviour of 2 samples diverge (in ‘Release’ v ‘Debug’ mode), the tool inserts checksum-related code throughout the program. For example here’s a randomly generated method, note the calls to the Checksum(..) function at the end:

static sbyte M15(int arg0)
{
    bool var0 = -71 < s_1;
    uint var1 = (uint)(1UL & s_4++);
    if (var0)
    {
        var0 = var0;
        arg0 = arg0;
    }
    else
    {
        ref ushort var2 = ref s_4;
        var2 = var2;
        s_rt.Checksum("c_17", var2);
    }

    uint var3 = var1;
    short[] var4 = s_2[0][0];
    s_rt.Checksum("c_18", arg0);
    s_rt.Checksum("c_19", var0);
    s_rt.Checksum("c_20", var1);
    s_rt.Checksum("c_21", var3);
    s_rt.Checksum("c_22", var4[0]);
    return 0;
}

The checksum calls allow the execution of a program to be compared between ‘Release’ and ‘Debug’ modes: if a single variable has a different value at any point during execution, the checksums will be different.
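
To make the idea concrete, here is a minimal sketch of what such a checksum recorder could look like. It is an assumption-laden illustration: the ChecksumRecorder class and the choice of an FNV-1a style hash are mine, and the runtime object that Fuzzlyn actually uses (s_rt in the sample above) may work quite differently.

```csharp
using System;
using System.Globalization;

// Hypothetical recorder; the hash choice (FNV-1a style) is an assumption made for this sketch.
public sealed class ChecksumRecorder
{
    private ulong _hash = 14695981039346656037UL; // FNV-1a 64-bit offset basis

    public ulong Value => _hash;

    // Folds the call-site id and the current value into the running hash.
    public void Checksum<T>(string siteId, T value)
    {
        Mix(siteId);
        Mix(Convert.ToString((object)value, CultureInfo.InvariantCulture) ?? string.Empty);
    }

    private void Mix(string text)
    {
        unchecked
        {
            foreach (char c in text)
            {
                _hash ^= c;
                _hash *= 1099511628211UL; // FNV-1a 64-bit prime
            }
        }
    }
}
```

The intended usage is to run the same program twice, once per build configuration, each with its own recorder, and then compare the two Value results. Because every call folds into a single running hash, one differing variable anywhere in the run changes the final value.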

It’s also worth pointing out that Roslyn provides in-memory compilation that helps speed up this process because you don’t have to shell-out to an external process. As Jakob explains:

Additionally since we don’t have to start processes for every invocation when we use Roslyn’s in-memory compilation, we can compile and check for interesting behavior super fast. This allows our reducer to be really simple and dumb, while still giving great results.

‘Reducing’ the output

However, the checksums also help Fuzzlyn ‘Reduce’ the program from the large initial version to something much more readable. By using a ‘binary search’ technique it can remove a section of code and compare the checksums of the remaining code. If the checksums still differ then the remaining code contains the error/bug and Fuzzlyn can carry on reducing it, otherwise it can be discarded.
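
The sketch below shows the general shape of this kind of reduction, under some simplifying assumptions: the program is treated as a flat list of statements, and the 'stillFails' callback stands in for 'compile in Debug and Release, run both and compare the checksums'. The Reducer class is hypothetical and is not how Fuzzlyn's reducers are actually structured.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified reducer: tries removing progressively smaller chunks.
public static class Reducer
{
    public static List<T> Reduce<T>(IReadOnlyList<T> statements, Func<IReadOnlyList<T>, bool> stillFails)
    {
        var current = new List<T>(statements);

        // Start with half the program, then halve the chunk size on every pass
        for (int chunk = Math.Max(1, current.Count / 2); chunk >= 1; chunk /= 2)
        {
            int start = 0;
            while (start < current.Count)
            {
                int length = Math.Min(chunk, current.Count - start);
                var candidate = new List<T>(current);
                candidate.RemoveRange(start, length);

                if (stillFails(candidate))
                    current = candidate; // the chunk was irrelevant to the bug, drop it
                else
                    start += length;     // the chunk is needed, keep it and move on
            }
        }

        return current;
    }

    // Toy example: the 'bug' only reproduces when statements 3 and 11 are both present.
    public static List<int> Demo()
    {
        var program = Enumerable.Range(0, 16).ToList();
        return Reduce(program, candidate => candidate.Contains(3) && candidate.Contains(11));
    }
}
```

In the toy example, Reducer.Demo() shrinks 16 'statements' down to just 3 and 11. A real reducer also has to keep the program compilable after each removal, which is where the Roslyn syntax tree API comes in.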

In addition, Fuzzlyn makes good use of the Roslyn ‘syntax tree’ API when removing code. For instance the CoarseStatementRemover class makes use of the Roslyn CSharpSyntaxRewriter class, which is designed to allow syntax re-writing (also see Using a CSharp Syntax Rewriter).

The Results

What initially drew me to the Fuzzlyn project (aside from the great name) was the impressive results I saw it getting. As of the end of Aug 2018, they’ve reported 22 bugs, of which 11 have already been fixed (kudos to the .NET JIT devs for fixing them so quickly).

Here’s a list of some of them, taken from the project README:

NullReferenceException thrown for multi-dimensional arrays in release (fixed)

Wrong integer promotion in release (fixed)

Cast to ushort is dropped in release (fixed)

Wrong value passed to generic interface method in release

Constant-folding int.MinValue % -1

Deterministic program outputs indeterministic results on Linux in release (fixed)

RyuJIT incorrectly reorders expression containing a CSE, resulting in exception thrown in release

RyuJIT incorrectly narrows value on ARM32/x86 in release (fixed)

Invalid value numbering when morphing casts that changes signedness after global morph (fixed)

RyuJIT spills 16 bit value but reloads as 32 bits in ARM32/x86 in release

RyuJIT fails to preserve variable allocated to RCX around shift on x64 in release (fixed)

RyuJIT: Invalid ordering when assigning ref-return (fixed)

RyuJIT: Argument written to stack too early on Linux

RyuJIT: Morph forgets about side effects when optimizing casted shift

RyuJIT: By-ref assignment with null leads to runtime crash (fixed)

RyuJIT: Mishandling of subrange assertion for rewritten call parameter

RyuJIT: Incorrect ordering around Interlocked.Exchange and Interlocked.CompareExchange

(for the most up-to-date list see the GitHub Issues created by @jakobbotsch)

Summary

I think that Fuzzlyn is a fantastic project, anything that roots out bugs or undesired behaviour in the JIT is a great benefit to all .NET Developers. If you want to see what the potential side-effects of JIT bugs can be, take a look at Why you should wait on upgrading to .Net 4.6 by Nick Craver (one of the developers at Stack Overflow).

Now, you could argue that some of the code patterns that Fuzzlyn detects are not ones you’d normally write, e.g. if (((ushort)(s_18 | 2L) <= s_40[0])). But the wider point is that it’s valid C# code, which isn’t behaving as it should. Also, if you ever wrote this code you’d have a horrible time tracking down the problem because:

Everyone knows that The First Rule of Programming: It’s Always Your Fault or “select” Isn’t Broken, i.e. getting to the point where you’re sure it is the compiler’s fault could take a while!
If you tried to debug it, the problem would go away (Fuzzlyn only finds Debug v. Release differences). At which point you might begin to doubt your sanity!

Discuss this post on Hacker News, /r/dotnet or /r/csharp

Monitoring and Observability in the .NET Runtime

2018-08-21T00:00:00+00:00

.NET is a managed runtime, which means that it provides high-level features that ‘manage’ your program for you, from Introduction to the Common Language Runtime (CLR) (written in 2007):

The runtime has many features, so it is useful to categorize them as follows:

Fundamental features – Features that have broad impact on the design of other features. These include:

Garbage Collection

Memory Safety and Type Safety

High level support for programming languages.

Secondary features – Features enabled by the fundamental features that may not be required by many useful programs:

Program isolation with AppDomains

Program Security and sandboxing

Other Features – Features that all runtime environments need but that do not leverage the fundamental features of the CLR. Instead, they are the result of the desire to create a complete programming environment. Among them are:

Versioning

Debugging/Profiling

Interoperation

You can see that ‘Debugging/Profiling’, whilst not a Fundamental or Secondary feature, still makes it into the list because of a ‘desire to create a complete programming environment’.

The rest of this post will look at what Monitoring, Observability and Introspection features the Core CLR provides, why they’re useful and how it provides them.

To make it easier to navigate, the post is split up into 3 main sections (with some ‘extra-reading material’ at the end):

Diagnostics
- Perf View
- Common Infrastructure
- Future Plans
Profiling
- ICorProfiler API
- Profiling v. Debugging
Debugging
- ICorDebug API
- SOS and the DAC
- 3rd Party Debuggers
- Memory Dumps
Further Reading

Diagnostics

Firstly we are going to look at the diagnostic information that the CLR provides, which has traditionally been supplied via ‘Event Tracing for Windows’ (ETW).

There is quite a wide range of events that the CLR provides related to:

Garbage Collection (GC)
Just-in-Time (JIT) Compilation
Module and AppDomains
Threading and Lock Contention
and much more

For example this is where the AppDomain Load event is fired, this is the Exception Thrown event and here is the GC Allocation Tick event.

Perf View

If you want to see the ETW Events coming from your .NET program I recommend using the excellent PerfView tool and starting with these PerfView Tutorials or this excellent talk PerfView: The Ultimate .NET Performance Tool. PerfView is well regarded because it provides invaluable information, for instance Microsoft Engineers regularly use it for performance investigations.

Common Infrastructure

However, in case it wasn’t clear from the name, ETW events are only available on Windows, which doesn’t really fit into the new ‘cross-platform’ world of .NET Core. You can use PerfView for Performance Tracing on Linux (via LTTng), but that is only the cmd-line collection tool, known as ‘PerfCollect’; the analysis and rich UI (which includes flamegraphs) is currently Windows only.

But if you do want to analyse .NET Performance on Linux, there are some other approaches:

The 2nd link above discusses the new ‘EventPipe’ infrastructure that is being worked on in .NET Core (along with EventSources & EventListeners, can you spot a theme!), you can see its aims in Cross-Platform Performance Monitoring Design. At a high-level it will provide a single place for the CLR to push ‘events’ related to diagnostics and performance. These ‘events’ will then be routed to one or more loggers which may include ETW, LTTng, and BPF for example, with the exact logger being determined by which OS/Platform the CLR is running on. There is also more background information in .NET Cross-Plat Performance and Eventing Design that explains the pros/cons of the different logging technologies.
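
For a feel of the EventSource and EventListener side of this, here is a minimal in-process sketch using the System.Diagnostics.Tracing types. The source name, the RequestStarted event and the CollectingListener and EventDemo classes are all hypothetical, made up for this example. It only shows events being raised and observed inside one process; it says nothing about how EventPipe routes them to ETW, LTTng or anything else.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Hypothetical event source, named purely for this illustration.
[EventSource(Name = "Demo-Hypothetical-Source")]
public sealed class DemoEventSource : EventSource
{
    public static readonly DemoEventSource Log = new DemoEventSource();

    [Event(1, Level = EventLevel.Informational)]
    public void RequestStarted(string path) => WriteEvent(1, path);
}

// In-process listener that records the events raised by DemoEventSource.
public sealed class CollectingListener : EventListener
{
    public List<string> Seen { get; } = new List<string>();

    protected override void OnEventSourceCreated(EventSource source)
    {
        // Called for every event source in the process, so only opt in to ours
        if (source.Name == "Demo-Hypothetical-Source")
            EnableEvents(source, EventLevel.Informational);
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventId == 1)
            Seen.Add($"{e.EventName}:{e.Payload[0]}");
    }
}

public static class EventDemo
{
    // Raises one event while a listener is active and returns what it observed.
    public static List<string> Run()
    {
        using (var listener = new CollectingListener())
        {
            DemoEventSource.Log.RequestStarted("/home");
            return listener.Seen;
        }
    }
}
```

The same pattern can be pointed at event sources you don't own by matching on a different source name in OnEventSourceCreated.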

All the work being done on ‘Event Pipes’ is being tracked in the ‘Performance Monitoring’ project and the associated ‘EventPipe’ Issues.

Future Plans

Finally, there are also future plans for a Performance Profiling Controller which has the following goal:

The controller is responsible for control of the profiling infrastructure and exposure of performance data produced by .NET performance diagnostics components in a simple and cross-platform way.

The idea is for it to expose the following functionality via a HTTP server, by pulling all the relevant data from ‘Event Pipes’:

REST APIs

Pri 1: Simple Profiling: Profile the runtime for X amount of time and return the trace.

Pri 1: Advanced Profiling: Start tracing (along with configuration)

Pri 1: Advanced Profiling: Stop tracing (the response to calling this will be the trace itself)

Pri 2: Get the statistics associated with all EventCounters or a specified EventCounter.

Browsable HTML Pages

Pri 1: Textual representation of all managed code stacks in the process.

Provides an snapshot overview of what’s currently running for use as a simple diagnostic report.

Pri 2: Display the current state (potentially with history) of EventCounters.

Provides an overview of the existing counters and their values.

OPEN ISSUE: I don’t believe the necessary public APIs are present to enumerate EventCounters.

I’m excited to see where the ‘Performance Profiling Controller’ (PPC?) goes, I think it’ll be really valuable for .NET to have this built-in to the CLR, it’s something that other runtimes have.

Profiling

Another powerful feature the CLR provides is the Profiling API, which is (mostly) used by 3rd party tools to hook into the runtime at a very low-level. You can find out more about the API in this overview, but at a high-level, it allows you to wire up callbacks that are triggered when:

GC-related events happen
Exceptions are thrown
Assemblies are loaded/unloaded
much, much more

Image from the BOTR page Profiling API – Overview

In addition it has other very powerful features. Firstly you can set up hooks that are called every time a .NET method is executed, whether in the runtime or from user code. These callbacks are known as ‘Enter/Leave’ hooks and there is a nice sample that shows how to use them, however to make them work you need to understand ‘calling conventions’ across different OSes and CPU architectures, which isn’t always easy. Also, as a warning, the Profiling API is a COM component that can only be accessed via C/C++ code, you can’t use it from C#/F#/VB.NET!

Secondly, the Profiler is able to re-write the IL code of any .NET method before it is JITted, via the SetILFunctionBody() API. This API is hugely powerful and forms the basis of many .NET APM Tools, you can learn more about how to use it in my previous post How to mock sealed classes and static methods and the accompanying code.

ICorProfiler API

It turns out that the run-time has to perform all sorts of crazy tricks to make the Profiling API work, just look at what went into this PR Allow rejit on attach (for more info on ‘ReJIT’ see ReJIT: A How-To Guide).

The overall definition for all the Profiling API interfaces and callbacks is found in \vm\inc\corprof.idl (see Interface description language). But it’s divided into 2 logical parts, one is the Profiler -> ‘Execution Engine’ (EE) interface, known as ICorProfilerInfo:

// Declaration of class that implements the ICorProfilerInfo* interfaces, which allow the
// Profiler to communicate with the EE.  This allows the Profiler DLL to get
// access to private EE data structures and other things that should never be exported
// outside of the EE.

Which is implemented in the following files:

The other main part is the EE -> Profiler callbacks, which are grouped together under the ICorProfilerCallback interface:

// This module implements wrappers around calling the profiler's 
// ICorProfilerCallaback* interfaces. When code in the EE needs to call the
// profiler, it goes through EEToProfInterfaceImpl to do so.

These callbacks are implemented across the following files:

Finally, it’s worth pointing out that the Profiler APIs might not work across all OSes and CPU-archs that .NET Core runs on, e.g. ELT call stub issues on Linux, see Status of CoreCLR Profiler APIs for more info.

Profiling v. Debugging

As a quick aside, ‘Profiling’ and ‘Debugging’ do have some overlap, so it’s helpful to understand what the different APIs provide in the context of the .NET Runtime, from CLR Debugging vs. CLR Profiling

Debugging

Debugging means different things to different people, for instance I asked on Twitter “what are the ways that you’ve debugged a .NET program” and got a wide range of different responses, although both sets of responses contain a really good list of tools and techniques, so they’re worth checking out, thanks #LazyWeb!

But perhaps this quote best sums up what Debugging really is 😊

Debugging is like being the detective in a crime movie where you are also the murderer.
— Filipe Fortes (@fortes) November 10, 2013

The CLR provides a very extensive range of features related to Debugging, but why does it need to provide these services, the excellent post Why is managed debugging different than native-debugging? provides 3 reasons:

Native debugging can be abstracted at the hardware level but managed debugging needs to be abstracted at the IL level
Managed debugging needs a lot of information not available until runtime
A managed debugger needs to coordinate with the Garbage Collector (GC)

So to give a decent experience, the CLR has to provide the higher-level debugging API known as ICorDebug, which is shown in the image below of a ‘common debugging scenario’ from the BOTR:

In addition, there is a nice description of how the different parts interact in How do Managed Breakpoints work?:

Here’s an overview of the pipeline of components:
1) End-user
2) Debugger (such as Visual Studio or MDbg).
3) CLR Debugging Services (which we call "The Right Side"). This is the implementation of ICorDebug (in mscordbi.dll).
---- process boundary between Debugger and Debuggee ----
4) CLR. This is mscorwks.dll. This contains the in-process portion of the debugging services (which we call "The Left Side") which communicates directly with the RS in stage #3.
5) Debuggee's code (such as end users C# program)

ICorDebug API

But how is all this implemented and what are the different components, from CLR Debugging, a brief introduction:

All of .Net debugging support is implemented on top of a dll we call “The Dac”. This file (usually named mscordacwks.dll) is the building block for both our public debugging API (ICorDebug) as well as the two private debugging APIs: The SOS-Dac API and IXCLR.

In a perfect world, everyone would use ICorDebug, our public debugging API. However a vast majority of features needed by tool developers such as yourself is lacking from ICorDebug. This is a problem that we are fixing where we can, but these improvements go into CLR v.next, not older versions of CLR. In fact, the ICorDebug API only added support for crash dump debugging in CLR v4. Anyone debugging CLR v2 crash dumps cannot use ICorDebug at all!

(for an additional write-up, see SOS & ICorDebug)

The ICorDebug API is actually split up into multiple interfaces, there are over 70 of them!! I won’t list them all here, but I will show the categories they fall into, for more info see Partition of ICorDebug where this list came from, as it goes into much more detail.

Top-level: ICorDebug + ICorDebug2 are the top-level interfaces which effectively serve as a collection of ICorDebugProcess objects.
Callbacks: Managed debug events are dispatched via methods on a callback object implemented by the debugger
Process: This set of interfaces represents running code and includes the APIs related to eventing.
Code / Type Inspection: Could mostly operate on a static PE image, although there are a few convenience methods for live data.
Execution Control: Execution is the ability to “inspect” a thread’s execution. Practically, this means things like placing breakpoints (F9) and doing stepping (F11 step-in, F10 step-over, S+F11 step-out). ICorDebug’s Execution control only operates within managed code.
Threads + Callstacks: Callstacks are the backbone of the debugger’s inspection functionality. The following interfaces are related to taking a callstack. ICorDebug only exposes debugging managed code, and thus the stacks traces are managed-only.
Object Inspection: Object inspection is the part of the API that lets you see the values of the variables throughout the debuggee. For each interface, I list the “MVP” method that I think must succinctly conveys the purpose of that interface.

One other note, as with the Profiling APIs the level of support for the Debugging API varies across OS’s and CPU architectures. For instance, as of Aug 2018 there’s “no solution for Linux ARM of managed debugging and diagnostic”. For more info on ‘Linux’ support in general, see this great post Debugging .NET Core on Linux with LLDB and check-out the Diagnostics repository from Microsoft that has the goal of making it easier to debug .NET programs on Linux.

Finally, if you want to see what the ICorDebug APIs look like in C#, take a look at the wrappers included in the CLRMD library, including all the available callbacks (CLRMD will be covered in more depth later on in this post).

SOS and the DAC

The ‘Data Access Component’ (DAC) is discussed in detail in the BOTR page, but in essence it provides ‘out-of-process’ access to the CLR data structures, so that their internal details can be read from another process. This allows a debugger (via ICorDebug) or the ‘Son of Strike’ (SOS) extension to reach into a running instance of the CLR or a memory dump and find things like:

all the running threads
what objects are on the managed heap
full information about a method, including the machine code
the current ‘stack trace’

Quick aside, if you want an explanation of all the strange names and a bit of a ‘.NET History Lesson’ see this Stack Overflow answer.

The full list of SOS Commands is quite impressive and using it alongside WinDBG gives you a very low-level insight into what’s going on in your program and the CLR. To see how it’s implemented, let’s take a look at the !HeapStat command that gives you a summary of the size of the different Heaps that the .NET GC is using:

(image from SOS: Upcoming release has a few new commands – HeapStat)

Here’s the code flow, showing how SOS and the DAC work together:

SOS: The full !HeapStat command (link)
SOS: The code in the !HeapStat command that deals with the ‘Workstation GC’ (link)
SOS: The GCHeapUsageStats(..) function that does the heavy-lifting (link)
Shared: The DacpGcHeapDetails data structure that contains pointers to the main data in the GC heap, such as segments, card tables and individual generations (link)
DAC: The GetGCHeapStaticData function that fills-out the DacpGcHeapDetails struct (link)
Shared: The DacpHeapSegmentData data structure that contains details for an individual ‘segment’ within the GC Heap (link)
DAC: The GetHeapSegmentData(..) function that fills-out the DacpHeapSegmentData struct (link)
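
The flow above boils down to ‘read a structure out of the target’s memory, follow a pointer, repeat’. The sketch below shows that idea in C#, assuming a made-up segment layout and hypothetical names (ITargetMemory, FakeTargetMemory, HeapWalker); the real DAC structures and functions are far richer than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over "memory that belongs to another process or a dump".
public interface ITargetMemory
{
    ulong ReadUInt64(ulong address);
}

// Fake target: a sparse map from address to value, standing in for a memory dump.
public sealed class FakeTargetMemory : ITargetMemory
{
    private readonly Dictionary<ulong, ulong> _words = new Dictionary<ulong, ulong>();

    public void Write(ulong address, ulong value) => _words[address] = value;

    public ulong ReadUInt64(ulong address) =>
        _words.TryGetValue(address, out ulong value)
            ? value
            : throw new InvalidOperationException($"Unmapped address 0x{address:X}");
}

// Made-up segment layout: [+0] address of next segment (0 = end), [+8] bytes in use.
public static class HeapWalker
{
    public static ulong TotalBytesInUse(ITargetMemory memory, ulong firstSegment)
    {
        ulong total = 0;
        for (ulong segment = firstSegment; segment != 0; segment = memory.ReadUInt64(segment))
        {
            total += memory.ReadUInt64(segment + 8);
        }
        return total;
    }
}

public static class HeapDemo
{
    public static FakeTargetMemory BuildSample()
    {
        var memory = new FakeTargetMemory();
        memory.Write(0x1000, 0x2000); // segment 1: next -> segment 2
        memory.Write(0x1008, 4096);   // segment 1: bytes in use
        memory.Write(0x2000, 0);      // segment 2: end of the list
        memory.Write(0x2008, 1024);   // segment 2: bytes in use
        return memory;
    }

    public static void Main()
    {
        ulong total = HeapWalker.TotalBytesInUse(BuildSample(), 0x1000);
        Console.WriteLine($"Bytes in use across all segments: {total}");
    }
}
```

Note that the walker never touches the target’s objects directly; everything goes through the reader, which is what lets the same code work against a live process or a dump.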

3rd Party ‘Debuggers’

Because Microsoft published the debugging API, 3rd parties have been able to make use of the ICorDebug interfaces. Here’s a list of some that I’ve come across:

Debugger for .NET Core runtime from Samsung
- The debugger provides GDB/MI or VSCode debug adapter interface and allows to debug .NET apps under .NET Core runtime.
- Probably written as part of their work of porting .NET Core to their Tizen OS
dnSpy - “.NET debugger and assembly editor”
- A very impressive tool, it’s a ‘debugger’, ‘assembly editor’, ‘hex editor’, ‘decompiler’ and much more!
MDbg.exe (.NET Framework Command-Line Debugger)
- Available as a NuGet package and a GitHub repo or you can download it from Microsoft.
- However, at the moment MDBG doesn’t seem to work with .NET Core, see Port MDBG to CoreCLR and ETA for porting mdbg to coreclr for some more information.
JetBrains ‘Rider’ allows .NET Core debugging on Windows
- Although there was some controversy due to licensing issues
- For more info, see this HackerNews thread

Memory Dumps

The final area we are going to look at is ‘memory dumps’, which can be captured from a live system and analysed off-line. The .NET runtime has always had good support for creating ‘memory dumps’ on Windows and now that .NET Core is ‘cross-platform’, there are also tools available to do the same on other OSes.

One of the issues with ‘memory dumps’ is that it can be tricky to get hold of the correct, matching versions of the SOS and DAC files. Fortunately Microsoft have just released the dotnet symbol CLI tool that:

can download all the files needed for debugging (symbols, modules, SOS and DAC for the coreclr module given) for any given core dump, minidump or any supported platform’s file formats like ELF, MachO, Windows DLLs, PDBs and portable PDBs.

Finally, if you spend any length of time analysing ‘memory dumps’ you really should take a look at the excellent CLR MD library that Microsoft released a few years ago. I’ve previously written about what you can do with it, but in a nutshell, it allows you to interact with memory dumps via an intuitive C# API, with classes that provide access to the ClrHeap, GC Roots, CLR Threads, Stack Frames and much more. In fact, aside from the time needed to implement the work, CLR MD could implement most (if not all) of the SOS commands.

But how does it work? From the announcement post:

The ClrMD managed library is a wrapper around CLR internal-only debugging APIs. Although those internal-only APIs are very useful for diagnostics, we do not support them as a public, documented release because they are incredibly difficult to use and tightly coupled with other implementation details of the CLR. ClrMD addresses this problem by providing an easy-to-use managed wrapper around these low-level debugging APIs.

By making these APIs available, in an officially supported library, Microsoft have enabled developers to build a wide range of tools on top of CLRMD, which is a great result!

So in summary, the .NET Runtime provides a wide range of diagnostic, debugging and profiling features that allow a deep insight into what’s going on inside the CLR.

Discuss this post on HackerNews, /r/programming or /r/csharp

Presentations and Talks covering '.NET Internals'

2018-07-12T00:00:00+00:00

I’m constantly surprised at just how popular resources related to ‘.NET Internals’ are, for instance take this tweet and the thread that followed:

If you like learning about '.NET Internals' here's a few talks/presentations I've watched that you might also like. First 'Writing High Performance Code in .NET' by Bart de Smet https://t.co/L5S9BsBlWe
— Matt Warren (@matthewwarren) July 9, 2018

All I’d done was put together a list of Presentations/Talks (based on the criteria below) and people really seemed to appreciate it!!

Criteria

To keep things focussed, the talks or presentations:

Must explain some aspect of the ‘internals’ of the .NET Runtime (CLR)
- i.e. something ‘under-the-hood’, the more ‘low-level’ the better!
- e.g. how the GC works, what the JIT does, how assemblies are structured, how to inspect what’s going on, etc
Be entertaining and worth watching!
- i.e. worth someone giving up 40-50 mins of their time for
- this is hard when you’re talking about low-level details, not all speakers manage it!
Needs to be a talk that I’ve watched myself and actually learnt something from
- i.e. I don’t just hope it’s good based on the speaker/topic
Doesn’t have to be unique, fine if it overlaps with another talk
- it often helps having two people cover the same idea, from different perspectives

If you want more general lists of talks and presentations see Awesome talks and Awesome .NET Performance

List of Talks

Here’s the complete list of talks, including a few bonus ones that weren’t in the tweet:

PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein
Writing High Performance Code in .NET by Bart De Smet
State of the .NET Performance by Adam Sitnik
Let’s talk about microbenchmarking by Andrey Akinshin
Safe Systems Programming in C# and .NET (summary) by Joe Duffy
FlingOS - Using C# for an OS by Ed Nutting
Maoni Stephens on .NET GC by Maoni Stephens
What’s new for performance in .NET Core 2.0 by Ben Adams
Open Source Hacking the CoreCLR by Geoff Norton
.NET Core & Cross Platform by Matt Ellis
.NET Core on Unix by Jan Vorlicek
Multithreading Deep Dive by Gael Fraiteur
Everything you need to know about .NET memory by Ben Emmett

I also added these 2 categories:

‘Channel 9’ Talks
- So many great talks featuring the Microsoft Engineers who work on the .NET runtime
Talks I plan to watch (but haven’t yet)

If I’ve missed any out, please let me know in the comments (or on twitter)

PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein (slides)

In fact, just watch all the talks/presentations that Sasha has done, they’re great!! For example Modern Garbage Collection in Theory and Practice and Making .NET Applications Faster

This talk is a great ‘how-to’ guide for PerfView, what it can do and how to use it (JIT stats, memory allocations, CPU profiling). For more on PerfView see this interview with its creator, Vance Morrison: Performance and PerfView.

Writing High Performance Code in .NET by Bart De Smet (he also has some Pluralsight Courses on the same subject)

Features CLRMD, WinDBG, ETW Events and PerfView, plus some great ‘real world’ performance issues

State of the .NET Performance by Adam Sitnik (slides)

How to write high-perf code that plays nicely with the .NET GC, covering Span<T>, Memory<T> & ValueTask

Let’s talk about microbenchmarking by Andrey Akinshin (slides)

Primarily a look at how to benchmark .NET code, but along the way it demonstrates some of the internal behaviour of the JIT compiler (Andrey is the creator of BenchmarkDotNet)

Safe Systems Programming in C# and .NET (summary) by Joe Duffy (slides and blog)

Joe Duffy (worked on the Midori project) shows why C# is a good ‘System Programming’ language, including what low-level features it provides

FlingOS - Using C# for an OS by Ed Nutting (slides)

Shows what you need to do if you want to write an entire OS in C# (!!) The FlingOS project is worth checking out, it’s a great learning resource.

Maoni Stephens on .NET GC by Maoni Stephens who is the main (only?) .NET GC developer. In addition CLR 4.5 Server Background GC and .NET 4.5 in Practice: Bing are also worth a watch.

An in-depth Q&A on how the .NET GC works, why it does what it does and how to use it efficiently

What’s new for performance in .NET Core 2.0 by Ben Adams (slides)

Whilst it mostly focuses on performance, there are some great internal details on how the JIT generates code for ‘de-virtualisation’, ‘exception handling’ and ‘bounds checking’

Open Source Hacking the CoreCLR by Geoff Norton

Making .NET Core (the CoreCLR) work on OSX was mostly a ‘community contribution’, this talk is a ‘walk-through’ of what it took to make it happen

.NET Core & Cross Platform by Matt Ellis, one of the .NET Runtime Engineers (this one on how they made .NET Core ‘Open Source’ is also worth a watch)

Discussion of the early work done to make CoreCLR ‘cross-platform’, including the build setup, ‘Platform Abstraction Layer’ (PAL) and OS differences that had to be accounted for

.NET Core on Unix by Jan Vorlicek a .NET Runtime Engineer (slides)

This talk discusses which parts of the CLR had to be changed to run on Unix, including exception handling, calling conventions, runtime suspension and the PAL

Multithreading Deep Dive by Gael Fraiteur (creator of PostSharp)

Takes a really in-depth look at the CLR memory-model and threading primitives

Everything you need to know about .NET memory by Ben Emmett (slides)

Explains how the .NET GC works using Lego! A very innovative and effective approach!!

Channel 9

The Channel 9 videos recorded by Microsoft deserve their own category, because there’s so much deep, technical information in them. This list is just a selection, including some of my favourites, there are many, many more available!!

Ones to watch

I can’t recommend these yet, because I haven’t watched them myself! (I can’t break my own rules!!).

But they all look really interesting and I will watch them as soon as I get a chance, so I thought they were worth including:

If this post causes you to go off and watch hours and hours of videos, ignoring friends, family and work for the next few weeks, Don’t Blame Me

.NET JIT and CLR - Joined at the Hip

2018-07-05T00:00:00+00:00

I’ve been digging into .NET Internals for a while now, but never really looked closely at how the ‘Just-in-Time’ (JIT) compiler works. In my mind, the interaction between the .NET Runtime and the JIT has always looked like this:

Nice and straight-forward, the CLR asks the JIT to compile some ‘Intermediate Language’ (IL) code into machine code and the JIT hands back the bytes when it’s done.

However, it turns out the interaction is much more complicated, in reality it looks more like this:

The JIT and the CLR’s ‘Execution Engine’ (EE) or ‘Virtual Machine’ (VM) work closely with one another, they really are ‘joined at the hip’.

The rest of this post will explore the interaction between the 2 components, how they work together and why they need to.

The JIT Compiler

As a quick aside, this post will not be talking about the internals of the JIT compiler itself. If you want to find out more about how that works, I recommend reading the fantastic overview in the BOTR and this excellent tutorial, where this very helpful diagram comes from:

After all that, if you still want more, you can take a look at the ‘JIT’ section in the ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’.

Components within the CLR

Before we go any further it’s helpful to discuss how the ‘Common Language Runtime’ (CLR) is composed. It’s made up of several different components including the VM/EE, JIT, GC and others. The treemap below shows the different areas of the source code, grouped by colour into the top-level sections they fall under. You can clearly see that the VM and JIT dominate, as well as ‘mscorlib’, which is the only component written in C#.

You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)

Note: This treemap is from my previous post ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’ which was written over a year ago, so the exact numbers will have changed in the meantime.

You can also see these ‘components’ or ‘areas’ reflected in the classification scheme used for the CoreCLR GitHub issues (one difference is that area-CodeGen is used instead of JIT).

The CLR and the JIT Compiler

Onto the main subject, just how do the CLR and the JIT compiler work together to transform a method from IL to machine code? As always, the ‘Book of the Runtime’ is a good place to start, from the ‘Execution Environment and External Interface’ section of the RyuJIT Overview:

RyuJIT provides the just in time compilation service for the .NET runtime. The runtime itself is variously called the EE (execution engine), the VM (virtual machine) or simply the CLR (common language runtime). Depending upon the configuration, the EE and JIT may reside in the same or different executable files. RyuJIT implements the JIT side of the JIT/EE interfaces:

ICorJitCompiler – this is the interface that the JIT compiler implements. This interface is defined in src/inc/corjit.h and its implementation is in src/jit/ee_il_dll.cpp. The following are the key methods on this interface:

compileMethod is the main entry point for the JIT. The EE passes it a ICorJitInfo object, and the “info” containing the IL, the method header, and various other useful tidbits. It returns a pointer to the code, its size, and additional GC, EH and (optionally) debug info.

getVersionIdentifier is the mechanism by which the JIT/EE interface is versioned. There is a single GUID (manually generated) which the JIT and EE must agree on.

getMaxIntrinsicSIMDVectorLength communicates to the EE the largest SIMD vector length that the JIT can support.

ICorJitInfo – this is the interface that the EE implements. It has many methods defined on it that allow the JIT to look up metadata tokens, traverse type signatures, compute field and vtable offsets, find method entry points, construct string literals, etc. This bulk of this interface is inherited from ICorDynamicInfo which is defined in src/inc/corinfo.h. The implementation is defined in src/vm/jitinterface.cpp.

So there are 2 main interfaces. First there is ICorJitCompiler, which is implemented by the JIT compiler and allows the EE to control how a method is compiled. Second there is ICorJitInfo, which the EE implements to allow the JIT to request the information it needs during compilation.
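
To show the shape of that two-way relationship, here’s a toy model in C#. The real interfaces are C++ and vastly larger; IJitCompiler, IJitInfo, ToyJit and ToyExecutionEngine are hypothetical names, and the ‘IL’ and ‘machine code’ are just strings.

```csharp
using System;
using System.Collections.Generic;

// Implemented by the "EE": answers questions the JIT asks while compiling.
// Hypothetical and tiny compared with the real ICorJitInfo.
public interface IJitInfo
{
    int GetFieldOffset(string typeName, string fieldName);
}

// Implemented by the "JIT": the EE calls this to get a method compiled.
public interface IJitCompiler
{
    string CompileMethod(IJitInfo info, string il);
}

public sealed class ToyExecutionEngine : IJitInfo
{
    public List<string> Questions { get; } = new List<string>();

    public int GetFieldOffset(string typeName, string fieldName)
    {
        Questions.Add($"{typeName}.{fieldName}");
        return 8; // pretend every field lives at offset 8
    }
}

public sealed class ToyJit : IJitCompiler
{
    public string CompileMethod(IJitInfo info, string il)
    {
        // "ldfld Type.Field" is the only instruction this toy understands.
        string[] parts = il.Split(' ');
        string[] target = parts[1].Split('.');
        int offset = info.GetFieldOffset(target[0], target[1]);
        return $"mov rax, [rcx+{offset}]";
    }
}

public static class JitEeDemo
{
    public static void Main()
    {
        var ee = new ToyExecutionEngine();
        IJitCompiler jit = new ToyJit();

        // EE -> JIT: "compile this". JIT -> EE: "where does this field live?"
        string code = jit.CompileMethod(ee, "ldfld Point.X");

        Console.WriteLine(code);
        Console.WriteLine($"Questions asked by the JIT: {string.Join(", ", ee.Questions)}");
    }
}
```

Even in this tiny version the JIT can’t finish its job without calling back into the EE, which is the ‘joined at the hip’ relationship in miniature.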

Let’s now look at these interfaces in more detail.

EE ➜ JIT ICorJitCompiler

Firstly, we’ll examine ICorJitCompiler, the interface exposed by the JIT. It’s actually pretty straight-forward and only contains 7 methods:

CorJitResult __stdcall compileMethod (..)
void clearCache()
BOOL isCacheCleanupRequired()
void ProcessShutdownWork(ICorStaticInfo* info)
void getVersionIdentifier(..)
unsigned getMaxIntrinsicSIMDVectorLength(..)
void setRealJit(..)

Of these, the most interesting one is compileMethod(..), which has the following signature:

    virtual CorJitResult __stdcall compileMethod (
            ICorJitInfo                 *comp,               /* IN */
            struct CORINFO_METHOD_INFO  *info,               /* IN */
            unsigned /* code:CorJitFlag */   flags,          /* IN */
            BYTE                        **nativeEntry,       /* OUT */
            ULONG                       *nativeSizeOfCode    /* OUT */
            ) = 0;

The EE provides the JIT with information about the method it wants compiled (CORINFO_METHOD_INFO) as well as flags (CorJitFlag) which control the:

Level of optimisation
Whether the code is compiled in Debug or Release mode
If the code needs to be ‘Profilable’ or support ‘Edit-and-Continue’
Alignment of loops, i.e. should they be aligned on byte-boundaries
If SSE3/SSE4 should be used
and many other scenarios

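The first parameter (comp) is a pointer to the ICorJitInfo interface, which is covered in the next section. The two OUT parameters (nativeEntry and nativeSizeOfCode) are how the JIT hands back the address and size of the generated code.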
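
As a rough illustration of how a set of flags like this can steer compilation, here’s a small C# sketch. The flag names and the Describe helper are invented for the example; they are not the real CorJitFlag values from the CoreCLR headers.

```csharp
using System;

// Hypothetical flag names; the real values live in the CoreCLR headers.
[Flags]
public enum ToyJitFlags : uint
{
    None = 0,
    OptimizeForSpeed = 1 << 0,
    DebugCode = 1 << 1,
    EditAndContinue = 1 << 2,
    ProfilerHooks = 1 << 3,
    AlignLoops = 1 << 4,
}

public static class JitFlagsDemo
{
    // A made-up policy: debuggable code wins over optimisation if both are requested.
    public static string Describe(ToyJitFlags flags)
    {
        if ((flags & ToyJitFlags.DebugCode) != 0)
            return "unoptimised, debuggable code";
        if ((flags & ToyJitFlags.OptimizeForSpeed) != 0)
            return "optimised code";
        return "default code";
    }

    public static void Main()
    {
        ToyJitFlags flags = ToyJitFlags.OptimizeForSpeed | ToyJitFlags.AlignLoops;
        Console.WriteLine($"{flags}: {Describe(flags)}");
    }
}
```

Packing the options into a single value keeps the method signature stable, new scenarios only need a new bit.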

JIT ➜ EE ICorJitHost and ICorJitInfo

The APIs that the EE has to implement to work with the JIT are not simple, there are almost 180 functions or callbacks!!

Interface	Method Count
ICorJitHost	5
ICorJitInfo	19
ICorDynamicInfo	36
ICorStaticInfo	118
Total	178

Note: The links take you to the function ‘definitions’ for a given interface. Alternatively all the methods are listed together in this gist.

ICorJitHost makes available ‘functionality that would normally be provided by the operating system’, predominantly the ability to allocate the ‘pages’ of memory that the JIT uses during compilation.

ICorJitInfo (class ICorJitInfo : public ICorDynamicInfo) contains more specific memory allocation routines, including ones for the ‘GC Info’ data, a ‘method/funclet’s unwind information’, ‘.rdata and .pdata for a method’ and the ‘exception handler blocks’.

ICorDynamicInfo (class ICorDynamicInfo : public ICorStaticInfo) provides data that can change from ‘invocation to invocation’, i.e. the JIT cannot cache the results of these method calls. It includes functions that provide:

Thread Local Storage (TLS) index
Function Entry Point (address)
EE ‘helper functions’
Address of a Field
Constructor for a delegate
and much more

Finally, ICorStaticInfo, which is further sub-divided up into more specific interfaces:

Interface	Method Count
ICorMethodInfo	28
ICorModuleInfo	9
ICorClassInfo	49
ICorFieldInfo	7
ICorDebugInfo	4
ICorArgInfo	4
ICorErrorInfo	7
Diagnostic methods	6
General methods	2
Misc methods	2
Total	118

Because the interface is nicely composed we can easily see what it provides. The bulk of the functions are concerned with information about a module, class, method or field. For instance the JIT can query the class size, GC layout and obtain the address of a field within a class. It can also learn about a method’s signature, find its parent class and get ‘exception handling’ information (the full list of methods is available in this gist).
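
Here’s a small C# sketch of that layering, which also double-checks the totals in the two tables above. The interface names mirror the inheritance chain (ICorJitInfo : ICorDynamicInfo : ICorStaticInfo), but each one only has a single placeholder member that I’ve made up.

```csharp
using System;
using System.Linq;

// Placeholder members only; the real interfaces declare the counts listed in the tables.
public interface IStaticInfo { int GetClassSize(string typeName); }
public interface IDynamicInfo : IStaticInfo { long GetFunctionEntryPoint(string method); }
public interface IJitInfoSketch : IDynamicInfo { byte[] AllocateMemory(int size); }

public static class InterfaceCounts
{
    // Method counts copied from the tables above.
    public static readonly int[] StaticInfoParts = { 28, 9, 49, 7, 4, 4, 7, 6, 2, 2 };
    public const int JitHost = 5, JitInfo = 19, DynamicInfo = 36;

    public static int StaticInfoTotal => StaticInfoParts.Sum();
    public static int GrandTotal => JitHost + JitInfo + DynamicInfo + StaticInfoTotal;

    // Counts methods declared on an interface plus everything it inherits.
    public static int CountAllMethods(Type interfaceType) =>
        interfaceType.GetMethods().Length +
        interfaceType.GetInterfaces().Sum(i => i.GetMethods().Length);

    public static void Main()
    {
        Console.WriteLine($"ICorStaticInfo total: {StaticInfoTotal}");
        Console.WriteLine($"Grand total: {GrandTotal}");
        Console.WriteLine($"Methods reachable via the sketch: {CountAllMethods(typeof(IJitInfoSketch))}");
    }
}
```

Because each interface inherits from the one below it, a single object handed to the JIT exposes every layer at once.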

These interfaces and the methods they contain give a nice insight into what information the JIT requests from the runtime and therefore what knowledge it requires when compiling a single method.

Now, let’s look at the end-to-end flow of a couple of these methods and see where they are implemented in the CoreCLR source code.

EE ➜ JIT `getFunctionEntryPoint(..)`

First we’ll look at a method where the EE provides information to the JIT:

/src/inc/corinfo.h (shared definition)
/src/jit/lower.cpp (method call from the JIT)
/src/vm/jitinterface.h (VM definition)
/src/vm/jitinterface.cpp (implementation in the VM)
/src/zap/zapinfo.cpp (ZAP/NGEN implementation)
/src/jit/ICorJitInfo_API_wrapper.hpp (wrapper)
/src/ToolBox/superpmi/superpmi/icorjitinfo.cpp (SuperPMI implementation)

JIT ➜ EE `reportInliningDecision()`

Next we’ll look at a scenario where the data flows from the JIT back to the EE:

/src/inc/corinfo.h (shared definition)
/src/jit/inline.cpp (method call from the JIT)
/src/vm/jitinterface.h (VM definition)
/src/vm/jitinterface.cpp (implementation in the VM)
/src/zap/zapinfo.cpp (ZAP/NGEN implementation)
/src/jit/ICorJitInfo_API_wrapper.hpp (wrapper)
/src/ToolBox/superpmi/superpmi/icorjitinfo.cpp (SuperPMI implementation)

SuperPMI tool

Finally, I just want to cover the ‘SuperPMI’ tool that showed up in the previous 2 scenarios. What is this tool and what does it do? From the CoreCLR glossary:

SuperPMI - JIT component test framework (super fast JIT testing - it mocks/replays EE in EE-JIT interface)

So in a nutshell it allows JIT development and testing to be de-coupled from the EE, which is useful because we’ve just seen that the 2 components are tightly integrated.

But how does it work? From the README:

SuperPMI works in two phases: collection and playback. In the collection phase, the system is configured to collect SuperPMI data. Then, run any set of .NET managed programs. When these managed programs invoke the JIT compiler, SuperPMI gathers and captures all information passed between the JIT and its .NET host. In the playback phase, SuperPMI loads the JIT directly, and causes it to compile all the functions that it previously compiled, but using the collected data to provide answers to various questions that the JIT needs to ask. The .NET execution engine (EE) is not invoked at all.

This explains why there is a SuperPMI implementation for every method that is part of the JIT <-> EE interface. SuperPMI needs to ‘record’ or ‘collect’ each interaction with the EE and store the information so that it can be ‘played back’ at a later time, when the EE isn’t present.
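
The ‘record then replay’ idea can be sketched in a few lines of C#. To be clear, this is not how SuperPMI is written (it’s a C++ tool covering the whole interface); IEeQueries, LiveEe, RecordingEe, ReplayEe and ToyCompiler are hypothetical names and there is only one ‘question’ the toy JIT can ask.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical slice of the JIT-to-EE interface: one question the JIT can ask.
public interface IEeQueries
{
    int GetClassSize(string typeName);
}

// Stand-in for the real execution engine.
public sealed class LiveEe : IEeQueries
{
    public int GetClassSize(string typeName) => typeName.Length * 8;
}

// Collection phase: forward every call to the real EE and remember the answer.
public sealed class RecordingEe : IEeQueries
{
    private readonly IEeQueries _inner;
    public Dictionary<string, int> Recorded { get; } = new Dictionary<string, int>();

    public RecordingEe(IEeQueries inner) => _inner = inner;

    public int GetClassSize(string typeName)
    {
        int size = _inner.GetClassSize(typeName);
        Recorded[typeName] = size;
        return size;
    }
}

// Playback phase: no EE at all, only the recorded answers.
public sealed class ReplayEe : IEeQueries
{
    private readonly IReadOnlyDictionary<string, int> _recorded;

    public ReplayEe(IReadOnlyDictionary<string, int> recorded) => _recorded = recorded;

    public int GetClassSize(string typeName) => _recorded[typeName];
}

public static class ToyCompiler
{
    // The "JIT": its output depends only on the answers it receives.
    public static string Compile(IEeQueries ee, string typeName) =>
        $"alloc {ee.GetClassSize(typeName)} bytes for {typeName}";
}

public static class SuperPmiSketch
{
    public static void Main()
    {
        var recorder = new RecordingEe(new LiveEe());
        string live = ToyCompiler.Compile(recorder, "Point");

        var replay = new ReplayEe(recorder.Recorded);
        string replayed = ToyCompiler.Compile(replay, "Point");

        Console.WriteLine(live);
        Console.WriteLine($"Replay matches live run: {live == replayed}");
    }
}
```

Every method on the interface needs a recording and a replaying version, which is why each one shows up in the SuperPMI source as well.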

Discuss this post on Hacker News or /r/dotnet

Tools for Exploring .NET Internals

2018-06-15T00:00:00+00:00

Whether you want to look at what your code is doing ‘under-the-hood’ or you’re trying to see what the ‘internals’ of the CLR look like, there is a whole range of tools that can help you out.

To give ‘credit where credit is due’, this post is based on a tweet, so thanks to everyone who contributed to the list and if I’ve missed out any tools, please let me know in the comments below.

While you’re here, I’ve also written other posts that look at the ‘internals’ of the .NET Runtime:

Exploring the Internals of the .NET Runtime (a ‘how-to’ guide)
Resources for Learning about .NET Internals (other blogs that cover ‘internals’)

Honourable Mentions

Firstly I’ll start by mentioning that Visual Studio has a great debugger and so does VSCode. Also there are lots of very good (commercial) .NET Profilers and Application Monitoring Tools available that you should also take a look at. For example I’ve recently been playing around with Codetrack and I’m very impressed by what it can do!

However, the rest of the post is going to look at some more single-use tools that give an even deeper insight into what is going on. As an added bonus they’re all ‘open-source’, so you can take a look at the code and see how they work!!

PerfView by Vance Morrison

PerfView is simply an excellent tool and is the one that I’ve used most over the years. It uses ‘Event Tracing for Windows’ (ETW) Events to provide a deep insight into what the CLR is doing, as well as allowing you to profile Memory and CPU usage. It does have a fairly steep learning curve, but there are some nice tutorials to help you along the way and it’s absolutely worth the time and effort.

Also, if you need more proof of how useful it is, Microsoft Engineers themselves use it and many of the recent performance improvements in MSBuild were carried out after using PerfView to find the bottlenecks.

PerfView is built on-top of the Microsoft.Diagnostics.Tracing.TraceEvent library which you can use in your own tools. In addition, since it’s been open-sourced the community has contributed and it has gained some really nice features, including flame-graphs:

(Click for larger version)

SharpLab by Andrey Shchekin

SharpLab started out as a tool for inspecting the IL code emitted by the Roslyn compiler, but has now grown into much more:

SharpLab is a .NET code playground that shows intermediate steps and results of code compilation. Some language features are thin wrappers on top of other features – e.g. using() becomes try/catch. SharpLab allows you to see the code as compiler sees it, and get a better understanding of .NET languages.
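
As a small illustration of the kind of ‘thin wrapper’ that quote is talking about: strictly speaking the C# compiler expands a using statement into a try/finally rather than a try/catch. The sketch below writes both forms by hand so you can compare them; Resource and UsingLowering are names I’ve made up, and the second method is only an approximation of what the compiler generates.

```csharp
using System;
using System.Collections.Generic;

public sealed class Resource : IDisposable
{
    private readonly List<string> _log;

    public Resource(List<string> log)
    {
        _log = log;
        _log.Add("open");
    }

    public void Dispose() => _log.Add("dispose");
}

public static class UsingLowering
{
    // What you write.
    public static List<string> WithUsing()
    {
        var log = new List<string>();
        using (var r = new Resource(log))
        {
            log.Add("work");
        }
        return log;
    }

    // Roughly what the compiler produces for the method above.
    public static List<string> WithTryFinally()
    {
        var log = new List<string>();
        Resource r = new Resource(log);
        try
        {
            log.Add("work");
        }
        finally
        {
            if (r != null) r.Dispose();
        }
        return log;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" -> ", WithUsing()));
        Console.WriteLine(string.Join(" -> ", WithTryFinally()));
    }
}
```

Pasting the first method into SharpLab and switching the output to C# is a quick way to see the real expansion.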

It supports C#, Visual Basic and F#, but most impressive are the ‘Decompilation/Disassembly’ features:

There are currently four targets for decompilation/disassembly:

C#

Visual Basic

IL

JIT Asm (Native Asm Code)

That’s right, it will output the assembly code that the .NET JIT generates from your C#:

Object Layout Inspector by Sergey Teplyakov

This tool gives you an insight into the memory layout of your .NET objects, i.e. it will show you how the JITter has decided to arrange the fields within your class or struct. This can be useful when writing high-performance code and it’s helpful to have a tool that does it for us because doing it manually is tricky:

There is no official documentation about fields layout because the CLR authors reserved the right to change it in the future. But knowledge about the layout can be helpful if you’re curious or if you’re working on a performance critical application.

How can we inspect the layout? We can look at a raw memory in Visual Studio or use !dumpobj command in SOS Debugging Extension. These approaches are tedious and boring, so we’ll try to write a tool that will print an object layout at runtime.

From the example in the GitHub repo, if you use TypeLayout.Print<NotAlignedStruct>() with code like this:

public struct NotAlignedStruct
{
    public byte m_byte1;
    public int m_int;

    public byte m_byte2;
    public short m_short;
}

You’ll get the following output, showing exactly how the CLR will lay out the struct in memory, based on its padding and optimization rules.

Size: 12. Paddings: 4 (%33 of empty space)
|================================|
|     0: Byte m_byte1 (1 byte)   |
|--------------------------------|
|   1-3: padding (3 bytes)       |
|--------------------------------|
|   4-7: Int32 m_int (4 bytes)   |
|--------------------------------|
|     8: Byte m_byte2 (1 byte)   |
|--------------------------------|
|     9: padding (1 byte)        |
|--------------------------------|
| 10-11: Int16 m_short (2 bytes) |
|================================|
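
If you just want a quick sanity check of those offsets without pulling in the tool, one option is Marshal.OffsetOf and Marshal.SizeOf. Bear in mind these report the marshalled (unmanaged) layout, which I’m assuming matches the managed one for a simple struct of primitive fields like this; the tool itself inspects the real managed layout. LayoutCheck is just a helper name for this sketch.

```csharp
using System;
using System.Runtime.InteropServices;

public struct NotAlignedStruct
{
    public byte m_byte1;
    public int m_int;

    public byte m_byte2;
    public short m_short;
}

public static class LayoutCheck
{
    // Offset of a field in the marshalled layout of the struct.
    public static int OffsetOf(string field) =>
        (int)Marshal.OffsetOf<NotAlignedStruct>(field);

    // Total size of the marshalled struct, including padding.
    public static int Size => Marshal.SizeOf<NotAlignedStruct>();

    public static void Main()
    {
        foreach (string field in new[] { "m_byte1", "m_int", "m_byte2", "m_short" })
            Console.WriteLine($"{field}: offset {OffsetOf(field)}");
        Console.WriteLine($"size: {Size}");
    }
}
```

With the default sequential layout I’d expect this to agree with the output above: each field is aligned to its own size, so the int is pushed out to offset 4 and the short to offset 10.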

The Ultimate .NET Experiment (TUNE) by Konrad Kokosa

TUNE is a really intriguing tool, as it says on the GitHub page, its purpose is to help you

… learn .NET internals and performance tuning by experiments with C# code.

You can find out more information about what it does in this blog post, but at a high-level it works like this:

write a sample, valid C# script which contains at least one class with public method taking a single string parameter. It will be executed by hitting Run button. This script can contain as many additional methods and classes as you wish. Just remember that first public method from the first public class will be executed (with single parameter taken from the input box below the script). …

after clicking Run button, the script will be compiled and executed. Additionally, it will be decompiled both to IL (Intermediate Language) and assembly code in the corresponding tabs.

all the time Tune is running (including time during script execution) a graph with GC data is being drawn. It shows information about generation sizes and GC occurrences (illustrated as vertical lines with the number below indicating which generation has been triggered).

And looks like this:

(Click for larger version)
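
As a rough idea of the kind of data behind that graph, the standard GC class exposes per-generation collection counts. This is only a sketch of how you could sample them yourself, not how Tune is implemented; GcWatcher is a made-up helper.

```csharp
using System;

public static class GcWatcher
{
    // Snapshot of how many collections each generation has seen so far.
    public static int[] Snapshot()
    {
        var counts = new int[GC.MaxGeneration + 1];
        for (int gen = 0; gen <= GC.MaxGeneration; gen++)
            counts[gen] = GC.CollectionCount(gen);
        return counts;
    }

    public static void Main()
    {
        int[] before = Snapshot();

        // Allocate some short-lived garbage, then force a full collection.
        for (int i = 0; i < 100_000; i++)
        {
            var temp = new byte[128];
            temp[0] = 1;
        }
        GC.Collect();

        int[] after = Snapshot();
        for (int gen = 0; gen < after.Length; gen++)
            Console.WriteLine($"gen{gen}: {after[gen] - before[gen]} collection(s)");
    }
}
```

Sampling a snapshot like this on a timer and plotting the differences would give you a very basic version of the ‘GC occurrences’ part of the graph.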

Tools based on CLR Memory Diagnostics (ClrMD)

Finally, we’re going to look at a particular category of tools. Since .NET came out you’ve always been able to use WinDBG and the SOS Debugging Extension to get deep into the .NET runtime. However it’s not always the easiest tool to get started with and as this tweet says, it’s not always the most productive way to do things:

Besides how complex it is, the idea is to build better abstractions. Raw debugging at the low level is just usually too unproductive. That to me is the promise of ClrMD, that it lets us build specific extensions to extract quickly the right info
— Tomas Restrepo (@tomasrestrepo) March 14, 2018

Fortunately Microsoft made the ClrMD library available (a.k.a Microsoft.Diagnostics.Runtime), so now anyone can write a tool that analyses memory dumps of .NET programs. You can find out even more info in the official blog post and I also recommend taking a look at ClrMD.Extensions that “.. provide integration with LINPad and to make ClrMD even more easy to use”.

I wanted to pull together a list of all the existing tools, so I enlisted twitter to help. Note to self: careful what you tweet, the WinDBG Product Manager might read your tweets and get a bit upset!!

Well this just hurts my feelings :(
— Andy Luhrs (@aluhrs13) March 14, 2018

Most of these tools are based on ClrMD because it’s the easiest way to do things, however you can use the underlying COM interfaces directly if you want. Also, it’s worth pointing out that any tool based on ClrMD is not cross-platform, because ClrMD itself is Windows-only. For cross-platform options see Analyzing a .NET Core Core Dump on Linux

Finally, in the interest of balance, there have been lots of recent improvements to WinDBG and because it’s extensible there have been various efforts to add functionality to it:

Extending the new WinDbg, Part 1 – Buttons and commands
Extending the new WinDbg, Part 2 – Tool windows and command output
Extending the new WinDbg, Part 3 – Embedding a C# interpreter
WinDBG extension + UI tool extensions and here
NetExt a WinDBG application that makes .NET debugging much easier as compared to the current options: sos or psscor, also see this InfoQ article

Having said all that, onto the list:

SuperDump (GitHub)
- A service for automated crash-dump analysis (presentation)
msos (GitHub)
- Command-line environment a-la WinDbg for executing SOS commands without having SOS available.
MemoScope.Net (GitHub)
- A tool to analyze .Net process memory. Can dump an application’s memory in a file and read it later.
- The dump file contains all data (objects) and threads (state, stack, call stack). MemoScope.Net will analyze the data and help you to find memory leaks and deadlocks
dnSpy (GitHub)
- .NET debugger and assembly editor
- You can use it to edit and debug assemblies even if you don’t have any source code available!!
MemAnalyzer (GitHub)
- A command line memory analysis tool for managed code.
- Can show which objects use most space on the managed heap just like !DumpHeap from Windbg without the need to install and attach a debugger.
DumpMiner (GitHub)
- UI tool for playing with ClrMD, with more features coming soon
Trace CLI (GitHub)
- A production debugging and tracing tool
Shed (GitHub)
- Shed is an application that allow to inspect the .NET runtime of a program in order to extract useful information. It can be used to inspect malicious applications in order to have a first general overview of which information are stored once that the malware is executed. Shed is able to:
  - Extract all objects stored in the managed heap
  - Print strings stored in memory
  - Save the snapshot of the heap in a JSON format for post-processing
  - Dump all modules that are loaded in memory

You can also find many other tools that make use of ClrMD, it was a very good move by Microsoft to make it available.

Other Tools

A few other tools that are also worth mentioning:

DebugDiag
- The DebugDiag tool is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or memory fragmentation, and crashes in any user-mode process (now with ‘CLRMD Integration’)
SOSEX (might not be developed any more)
- … a debugging extension for managed code that begins to alleviate some of my frustrations with SOS
VMMap from Sysinternals
- VMMap is a process virtual and physical memory analysis utility.
- I’ve previously used it to look at Memory Usage Inside the CLR

Discuss this post on Hacker News or /r/programming

CoreRT - A .NET Runtime for AOT

2018-06-07T00:00:00+00:00

Firstly, what exactly is CoreRT? From its GitHub repo:

.. a .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying .NET native compiler toolchain

The rest of this post will look at what that actually means.

Existing .NET ‘AOT’ Implementations
High-Level Overview
The Compiler
The Runtime
‘Hello World’ Program
Limitations
Further Reading

Existing .NET ‘AOT’ Implementations

However, before we look at what CoreRT is, it’s worth pointing out there are existing .NET ‘Ahead-of-Time’ (AOT) implementations that have been around for a while:

Mono

Ahead of Time Compilation in Mono (August 2006)
Mono Docs - AOT (also see this link)
How Xamarin.Android AOT Works
Xamarin.iOS - Architecture - AOT

.NET Native (Windows 10/UWP apps only, a.k.a ‘Project N’)

So if there were existing implementations, why was CoreRT created? The official announcement gives us some idea:

If we want to shortcut this two-step compilation process and deliver a 100% native application on Windows, Mac, and Linux, we need an alternative to the CLR. The project that is aiming to deliver that solution with an ahead-of-time compilation process is called CoreRT.

The main difference is that CoreRT is designed to support .NET Core scenarios, i.e. .NET Standard, cross-platform, etc.

Also worth pointing out is that whilst .NET Native is a separate product, they are related and in fact “.NET Native shares many CoreRT parts”.

High-Level Overview

Because all the code is open source, we can very easily identify the main components and understand where the complexity is. Firstly, let’s look at where the most ‘lines of code’ are:

We clearly see that the majority of the code is written in C#, with only the Native component written in C++. The largest single component is System.Private.CoreLib which is all C# code, although there are other sub-components that contribute to it (‘System.Private.XXX’), such as System.Private.Interop (36,547 LOC), System.Private.TypeLoader (30,777) and System.Private.Reflection.Core (24,964). Other significant components are the ‘Intermediate Language (IL) Compiler’ and the Common code that is re-used by everything else.

All these components are discussed in more detail below.

The Compiler

So whilst CoreRT is a run-time, it also needs a compiler to put everything together, from Intro to .NET Native and CoreRT:

.NET Native is a native toolchain that compiles CIL byte code to machine code (e.g. X64 instructions). By default, .NET Native (for .NET Core, as opposed to UWP) uses RyuJIT as an ahead-of-time (AOT) compiler, the same one that CoreCLR uses as a just-in-time (JIT) compiler. It can also be used with other compilers, such as LLILC, UTC for UWP apps and IL to CPP (an IL to textual C++ compiler we have built as a reference prototype).

But what does this actually look like in practice? As they say, ‘a picture paints a thousand words’:

To give more detail, the main compilation phases (started from \ILCompiler\src\Program.cs) are the following:

Calculate the reachable modules/types/classes, i.e. the ‘compilation roots’ using the ILScanner.cs
Allow for reflection, via an optional rd.xml file and generate the necessary metadata using ILCompiler.MetadataWriter
Compile the IL using the specific back-end (generic/shared code is in Compilation.cs)
- RyuJIT RyuJitCompilation.cs
- Web Assembly (WASM) WebAssemblyCodegenCompilation.cs
- C++ Code CppCodegenCompilation.cs
Finally, write out the compiled methods using ObjectWriter which in turn uses LLVM under-the-hood

But it’s not just your code that ends up in the final .exe, along the way the CoreRT compiler also generates several ‘helper methods’ to cover the following scenarios:

IL Code (via the ‘EmitIL()’ method)
Assembly Code (via the ‘EmitCode()’ method) (different implementations for each CPU architecture)
- Unboxing (x64)
- Jump Stubs (ARM64)
- ‘Ready to Run’ Generic helper (x86)

Fortunately, the compiler doesn’t blindly include all the code it finds; it is intelligent enough to only include code that’s actually used:

We don’t use ILLinker, but everything gets naturally treeshaken by the compiler itself (we start with compiling Main/NativeCallable exports and continue compiling other methods and generating necessary data structures as we go). If there’s a type or method that is not used, the compiler doesn’t even look at it.
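The ‘start from the roots and keep compiling whatever you discover’ approach described in that quote can be sketched as a simple worklist traversal. This is only an illustration of the idea, not how ILCompiler is actually implemented: the `FindReachable` method, the method names and the call graph are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TreeShakingSketch
{
    // Returns every method reachable from the given roots.
    // 'callGraph' maps a method name to the methods it calls (hypothetical data).
    public static HashSet<string> FindReachable(
        IEnumerable<string> roots,
        IDictionary<string, string[]> callGraph)
    {
        var reachable = new HashSet<string>();
        var toCompile = new Queue<string>(roots);

        while (toCompile.Count > 0)
        {
            string method = toCompile.Dequeue();
            if (!reachable.Add(method))
                continue; // already processed, nothing new to discover

            // 'Compiling' a method reveals the methods it depends on
            if (callGraph.TryGetValue(method, out string[] callees))
            {
                foreach (string callee in callees)
                    toCompile.Enqueue(callee);
            }
        }

        return reachable;
    }

    public static void Main()
    {
        // Hypothetical call graph, 'Unused.Helper' is never called from Main
        var callGraph = new Dictionary<string, string[]>
        {
            ["Program.Main"] = new[] { "Console.WriteLine" },
            ["Console.WriteLine"] = new[] { "Console.get_Out" },
            ["Unused.Helper"] = new[] { "Console.WriteLine" },
        };

        HashSet<string> reachable = FindReachable(new[] { "Program.Main" }, callGraph);
        Console.WriteLine(string.Join(", ", reachable));
        Console.WriteLine($"Unused.Helper included: {reachable.Contains("Unused.Helper")}");
    }
}
```

Because the traversal only ever follows edges out of methods it has already reached, `Unused.Helper` is never looked at, which is the ‘natural tree-shaking’ the quote refers to.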

The Runtime

All the user/helper code then sits on-top of the CoreRT runtime, from Intro to .NET Native and CoreRT:

CoreRT is the .NET Core runtime that is optimized for AOT scenarios, which .NET Native targets. This is a refactored and layered runtime. The base is a small native execution engine that provides services such as garbage collection(GC). This is the same GC used in CoreCLR. Many other parts of the traditional .NET runtime, such as the type system, are implemented in C#. We’ve always wanted to implement runtime functionality in C#. We now have the infrastructure to do that. In addition, library implementations that were built deep into CoreCLR, have also been cleanly refactored and implemented as C# libraries.

This last point is interesting: why is it advantageous to implement ‘runtime functionality in C#’? Well, it turns out that it’s hard to do in an un-managed language because there are some very subtle and hard-to-track-down ways that you can get it wrong:

Reliability and performance. The C/C++ code has to manually managed. It means that one has to be very careful to report all GC references to the GC. The manually managed code is both very hard to get right and it has performance overhead.
— Jan Kotas (@JanKotas7) April 24, 2018

These are known as ‘GC Holes’ and the BOTR provides more detail on them. The author of that tweet is significant: Jan Kotas has worked on the .NET runtime for a long time, so if he thinks something is hard, it really is!!

Runtime Components

As previously mentioned, it’s a layered runtime, i.e. made up of several distinct components, as explained in this comment:

At the core of CoreRT, there’s a runtime that provides basic services for the code to run (think: garbage collection, exception handling, stack walking). This runtime is pretty small and mostly depends on C/C++ runtime (even the C++ runtime dependency is not a hard requirement as Jan pointed out - #3564). This code mostly lives in src/Native/Runtime, src/Native/gc, and src/Runtime.Base. It’s structured so that the places that do require interacting with the underlying platform (allocating native memory, threading, etc.) go through a platform abstraction layer (PAL). We have a PAL for Windows, Linux, and macOS, but others can be added.

And you can see the PAL Components in the following locations:

C# Code shared with CoreCLR

One interesting aspect of the CoreRT runtime is that, wherever possible, it shares code with the CoreCLR runtime. This is part of a larger effort to share code across multiple repositories:

This directory contains the shared sources for System.Private.CoreLib. These are shared between dotnet/corert, dotnet/coreclr and dotnet/corefx. The sources are synchronized with a mirroring tool that watches for new commits on either side and creates new pull requests (as @dotnet-bot) in the other repository.

Recently there has been a significant amount of work done to move more and more code over into the ‘shared partition’, to ensure work isn’t duplicated and any fixes are shared across both locations. You can see how this works by looking at the links below:

CoreRT
CoreCLR

What this means is that about 2/3 of the C# code in System.Private.CoreLib is shared with CoreCLR and only 1/3 is unique to CoreRT:

Group	C# LOC (Files)
shared	170,106 (759)
src	96,733 (351)
Total	266,839 (1,110)
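To make the ‘about 2/3’ figure concrete, here is a small sketch that works out the split from the LOC numbers in the table. The class and method names are just for illustration.

```csharp
using System;
using System.Globalization;

public static class SharedCodeShare
{
    // Percentage of 'part' within 'total', formatted to one decimal place
    public static string Percent(int part, int total) =>
        (part * 100.0 / total).ToString("F1", CultureInfo.InvariantCulture);

    public static void Main()
    {
        const int shared = 170106;      // 'shared' row of the table
        const int src = 96733;          // 'src' row of the table
        const int total = shared + src; // 266,839, matching the 'Total' row

        Console.WriteLine($"shared: {Percent(shared, total)}%"); // shared: 63.7%
        Console.WriteLine($"src:    {Percent(src, total)}%");    // src:    36.3%
    }
}
```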

Native Code

Finally, whilst it is advantageous to write as much code as possible in C#, there are certain components that have to be written in C++, these include the GC (the majority of which is one file, gc.cpp which is almost 37,000 LOC!!), the JIT Interface, ObjWriter (based on LLVM) and most significantly the Core Runtime that contains code for activities like:

Threading
Stack Frame handling
Debugging/Profiling
Interfacing to the OS
CPU specific helpers for:
- Exception handling
- GC Write Barriers
- Stubs/Thunks
- Optimised object allocation

‘Hello World’ Program

One of the first things people asked about CoreRT is “what is the size of a ‘Hello World’ app” and the answer is ~3.93 MB (if you compile in Release mode), but there is work being done to reduce this. At a ‘high-level’, the .exe that is produced looks like this:

Note that the different colours correspond to the original format of a component; the output itself is a single, native, executable file.

This file comes with a full .NET specific ‘base runtime’ or ‘class libraries’ (‘System.Private.XXX’) so you get a lot of functionality, it is not the absolute bare-minimum app. Fortunately there is a way to see what a ‘bare-minimum’ runtime would look like by compiling against the Test.CoreLib project included in the CoreRT source. By using this you end up with an .exe that looks like this:

But it’s so minimal that OOTB you can’t even write ‘Hello World’ to the console as there is no System.Console type! After a bit of hacking I was able to build a version that did have a working Console output (if you’re interested, this diff is available here). To make it work I had to include the following components:

System.Console
System.Text.UnicodeEncoding
String handling
P/Invoke and Marshalling support (to call an OS function)

So Test.CoreLib really is a minimal runtime!! But the difference in size is dramatic, it shrinks down to 0.49 MB compared to 3.93 MB for the fully-featured runtime!

Type	Standard (bytes)	Test.CoreLib (bytes)	Difference
.data	163,840	36,864	-126,976
.managed	1,540,096	65,536	-1,474,560
.pdata	147,456	20,480	-126,976
.rdata	1,712,128	81,920	-1,630,208
.reloc	98,304	4,096	-94,208
.text	360,448	299,008	-61,440
rdata	98,304	4,096	-94,208

Total (bytes)	4,120,576	512,000	-3,608,576
Total (MB)	3.93	0.49	-3.44

These data sizes were obtained by using the Microsoft DUMPBIN tool and the /DISASM cmd line switch (zip file of the full output), which produces the following summary (note: size values are in HEX):

  Summary

       28000 .data
      178000 .managed
       24000 .pdata
      1A2000 .rdata
       18000 .reloc
       58000 .text
       18000 rdata
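Since the summary is in HEX, a little conversion is needed to get to the byte counts shown in the table. The following is a minimal sketch of that step. It assumes the summary lines have been copied into a string array, and the `SectionSizes` class is just something I’ve made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SectionSizes
{
    // Parses DUMPBIN summary lines such as "      178000 .managed"
    // into a map of section name -> size in bytes.
    public static Dictionary<string, long> Parse(IEnumerable<string> lines)
    {
        var sizes = new Dictionary<string, long>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length != 2)
                continue; // skip blank lines and the 'Summary' header

            sizes[parts[1]] = long.Parse(parts[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        }
        return sizes;
    }

    public static void Main()
    {
        string[] summary =
        {
            "  Summary",
            "",
            "       28000 .data",
            "      178000 .managed",
            "       24000 .pdata",
            "      1A2000 .rdata",
            "       18000 .reloc",
            "       58000 .text",
            "       18000 rdata",
        };

        long total = 0;
        foreach (KeyValuePair<string, long> section in Parse(summary))
        {
            Console.WriteLine($"{section.Key}: {section.Value} bytes");
            total += section.Value;
        }

        string megabytes = (total / (1024.0 * 1024.0)).ToString("F2", CultureInfo.InvariantCulture);
        Console.WriteLine($"Total: {total} bytes ({megabytes} MB)"); // Total: 4120576 bytes (3.93 MB)
    }
}
```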

Also contained in the output is the assembly code for a simple Hello World method:

HelloWorld_HelloWorld_Program__Main:
  0000000140004C50: 48 8D 0D 19 94 37  lea         rcx,[__Str_Hello_World__E63BA1FD6D43904697343A373ECFB93457121E4B2C51AF97278C431E8EC85545]
                    00
  0000000140004C57: 48 8D 05 DA C5 00  lea         rax,[System_Console_System_Console__WriteLine_12]
                    00
  0000000140004C5E: 48 FF E0           jmp         rax
  0000000140004C61: 90                 nop
  0000000140004C62: 90                 nop
  0000000140004C63: 90                 nop

and if we dig further we can see the code for System.Console.WriteLine(..):

System_Console_System_Console__WriteLine_12:
  0000000140011238: 56                 push        rsi
  0000000140011239: 48 83 EC 20        sub         rsp,20h
  000000014001123D: 48 8B F1           mov         rsi,rcx
  0000000140011240: E8 33 AD FF FF     call        System_Console_System_Console__get_Out
  0000000140011245: 48 8B C8           mov         rcx,rax
  0000000140011248: 48 8B D6           mov         rdx,rsi
  000000014001124B: 48 8B 00           mov         rax,qword ptr [rax]
  000000014001124E: 48 8B 40 68        mov         rax,qword ptr [rax+68h]
  0000000140011252: 48 83 C4 20        add         rsp,20h
  0000000140011256: 5E                 pop         rsi
  0000000140011257: 48 FF E0           jmp         rax
  000000014001125A: 90                 nop
  000000014001125B: 90                 nop

Limitations

Missing Functionality

Some people have successfully run complex apps using CoreRT but, as it stands, CoreRT is still an alpha product, at least according to the NuGet package ‘1.0.0-alpha-26529-02’ that the official samples instruct you to use. I’ve not seen any information about when a full 1.0 Release will be available.

So there is some functionality that is not yet implemented, e.g. F# Support, GC.GetMemoryInfo or canGetCookieForPInvokeCalliSig (a calli to a p/invoke). For more information on this I recommend this entertaining presentation on Building Native Executables from .NET with CoreRT by Mark Rendle. In the 2nd half he chronicles all the issues that he ran into when he was trying to run an ASP.NET app under CoreRT (some of which may well be fixed now).

Reflection

But more fundamentally, because of the nature of AOT compilation, there are 2 main stumbling blocks that you may also run into: Reflection and Runtime Code-Generation.

Firstly, if you want to use reflection in your code you need to tell the CoreRT compiler about the types you expect to reflect over, because by default it only includes the types it knows about. You can do this by using a file called rd.xml as shown here. Unfortunately this will always require manual intervention for the reasons explained in this issue. More information is available in this comment ‘…some details about CoreRT’s restriction on MakeGenericType and MakeGenericMethod’.

To make reflection work the compiler adds the required metadata to the final .exe using this process:

This would reuse the same scheme we already have for the RyuJIT codegen path:

The compiler generates a blob of bytes that describes the metadata (namespaces, types, their members, their custom attributes, method parameters, etc.). The data is generated as a byte array in the ComputeMetadata method.

The metadata gets embedded as a data blob into the executable image. This is achieved by adding the blob to a “ready to run header”. Ready to run header is a well known data structure that can be located by the code in the framework at runtime.

The ready to run header along with the blobs it refers to is emitted into the final executable.

At runtime, pointer to the byte array is located using the RhFindBlob API, and a parser is constructed over the array, to be used by the reflection stack.

Runtime Code-Generation

In .NET you often use reflection once (because it can be slow) followed by ‘dynamic’ or ‘runtime’ code-generation with Reflection.Emit(..). This technique is widely used in .NET libraries for Serialisation/Deserialisation, Dependency Injection, Object Mapping and ORM.
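As a rough illustration of that pattern, the sketch below uses reflection once to find a property getter and then emits a small dynamic method that calls it directly. The `Person` type and the `BuildNameGetter` helper are made up for this example; real libraries do the same thing on a much larger scale.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public class Person
{
    public string Name { get; set; }
}

public static class EmitSketch
{
    // Uses reflection once to locate the getter, then generates IL at
    // runtime so that later calls don't go through reflection at all.
    public static Func<Person, string> BuildNameGetter()
    {
        MethodInfo getter = typeof(Person).GetProperty("Name").GetGetMethod();

        var method = new DynamicMethod("GetName", typeof(string), new[] { typeof(Person) });
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);          // load the Person argument
        il.Emit(OpCodes.Callvirt, getter); // call Person.get_Name
        il.Emit(OpCodes.Ret);              // return the resulting string

        return (Func<Person, string>)method.CreateDelegate(typeof(Func<Person, string>));
    }

    public static void Main()
    {
        Func<Person, string> getName = BuildNameGetter();
        Console.WriteLine(getName(new Person { Name = "Ada" })); // Ada
    }
}
```

The IL produced here only exists once the program is running and still has to be turned into machine code at that point, which is exactly what an ahead-of-time compiled app isn’t set up to do.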

The issue is that ‘runtime’ code generation is problematic in an ‘AOT’ scenario:

ASP.NET dependency injection introduced dependency on Reflection.Emit in aspnet/DependencyInjection#630 unfortunately. It makes it incompatible with CoreRT.

We can make it functional in CoreRT AOT environment by introducing IL interpretter (#5011), but it would still perform poorly. The dependency injection framework is using Reflection.Emit on performance critical paths.

It would be really up to ASP.NET to provide AOT-friendly flavor that generates all code at build time instead of runtime to make this work well. It would likely help the startup without CoreRT as well.

I’m sure this will be solved one way or the other (see #5011), but at the moment it’s still ‘work-in-progress’.

Discuss this post on HackerNews and /r/dotnet

Taking a look at the ECMA-335 Standard for .NET

2018-04-06T00:00:00+00:00

It turns out that the .NET Runtime has a technical standard (or specification), known by its full name ECMA-335 - Common Language Infrastructure (CLI) (not to be confused with ECMA-334 which is the ‘C# Language Specification’). The latest update is the 6th edition from June 2012.

The specification or standard was written before .NET Core existed, so it only applies to the .NET Framework. I’d be interested to know if there are any plans for an updated version.

The rest of this post will take a look at the standard, exploring the contents and investigating what we can learn from it (hint: lots of low-level details and information about .NET internals)

Why is it useful?

Having a standard means that different implementations, such as Mono and DotNetAnywhere can exist, from Common Language Runtime (CLR):

Compilers and tools are able to produce output that the common language runtime can consume because the type system, the format of metadata, and the runtime environment (the virtual execution system) are all defined by a public standard, the ECMA Common Language Infrastructure specification. For more information, see ECMA C# and Common Language Infrastructure Specifications.

and from the CoreCLR documentation on .NET Standards:

There was a very early realization by the founders of .NET that they were creating a new programming technology that had broad applicability across operating systems and CPU types and that advanced the state of the art of late 1990s (when the .NET project started at Microsoft) programming language implementation techniques. This led to considering and then pursuing standardization as an important pillar of establishing .NET in the industry.

The key addition to the state of the art was support for multiple programming languages with a single language runtime, hence the name Common Language Runtime. There were many other smaller additions, such as value types, a simple exception model and attributes. Generics and language integrated query were later added to that list.

Looking back, standardization was quite effective, leading to .NET having a strong presence on iOS and Android, with the Unity and Xamarin offerings, both of which use the Mono runtime. The same may end up being true for .NET on Linux.

The various .NET standards have been made meaningful by the collaboration of multiple companies and industry experts that have served on the working groups that have defined the standards. In addition (and most importantly), the .NET standards have been implemented by multiple commercial (ex: Unity IL2CPP, .NET Native) and open source (ex: Mono) implementors. The presence of multiple implementations proves the point of standardization.

As the last quote points out, the standard is not produced solely by Microsoft:

There is also a nice Wikipedia page that has some additional information.

What is in it?

At a high level, the specification is divided into the following ‘partitions’:

I: Concepts and Architecture
- A great introduction to the CLR itself, explaining many of the key concepts and components, as well as the rationale behind them
II: Metadata Definition and Semantics
- An explanation of the format of .NET dll/exe files, the different sections within them and how they’re laid out in-memory
III: CIL Instruction Set
- A complete list of all the Intermediate Language (IL) instructions that the CLR understands, along with a detailed description of what they do and how to use them
IV: Profiles and Libraries
- Describes the various different ‘Base Class libraries’ that make-up the runtime and how they are grouped into ‘Profiles’
V: Binary Formats (Debug Interchange Format)
- An overview of ‘Portable CILDB files’, which give a way for additional debugging information to be provided
VI: Annexes
- Annex A - Introduction
- Annex B - Sample programs
- Annex C - CIL assembler implementation
- Annex D - Class library design guidelines
- Annex E - Portability considerations
- Annex F - Imprecise faults
- Annex G - Parallel library

But working your way through the entire specification is a mammoth task, so I generally find it more useful to search for a particular word or phrase and locate the parts I need that way. However, if you do want to read through one section, I recommend ‘Partition I: Concepts and Architecture’; at just over 100 pages it is much easier to fully digest! This section is a very comprehensive overview of the key concepts and components contained within the CLR and well worth a read.

Also, I’m convinced that the authors of the spec wanted to help out any future readers, so to break things up they included lots of very helpful diagrams:

For more examples see:

On top of all that, they also dropped in some Comic Sans 😀, just to make it clear when the text is only ‘informative’:

How has it changed?

The spec has been through 6 editions and it’s interesting to look at the changes over time:

Edition	Release Date	CLR Version	Significant Changes
1st	December 2001	1.0 (February 2002)	N/A
2nd	December 2002	1.1 (April 2003)
3rd	June 2005	2.0 (January 2006)	See below (link)
4th	June 2006		None, revision of 3rd edition (link)
5th	December 2010	4.0 (April 2010)	See below (link)
6th	June 2012		None, revision of 5th edition (link)

However, only 2 editions contained significant updates, they are explained in more detail below:

3rd Edition (link)

Support for generic types and methods (see ‘How generics were added to .NET’)
New IL instructions - ldelem, stelem and unbox.any
Added the constrained., no. and readonly. IL instruction prefixes
Brand new ‘namespaces’ (with corresponding types) - System.Collections.Generic, System.Threading.Parallel
New types added, including Action<T>, Nullable<T> and ThreadStaticAttribute
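To see where these additions show up in everyday code, here is a small sketch of generic C# methods along with the IL I’d expect the C# compiler to emit for each one. The IL named in the comments is my understanding of typical compiler output rather than anything taken from the spec, so it’s worth confirming with a disassembler such as ildasm.

```csharp
using System;

public static class GenericsIlSketch
{
    // Casting from object to T is typically compiled to 'unbox.any'
    public static T Cast<T>(object value) => (T)value;

    // Reading and writing T[] elements typically uses 'ldelem' and 'stelem'
    public static T First<T>(T[] items) => items[0];
    public static void SetFirst<T>(T[] items, T value) => items[0] = value;

    // Calling a virtual method on a T typically uses the 'constrained.' prefix,
    // so value types don't need to be boxed first
    public static string Describe<T>(T value) => value.ToString();

    public static void Main()
    {
        int number = Cast<int>(42);
        string[] names = { "a", "b" };
        SetFirst(names, "z");

        Console.WriteLine(number);       // 42
        Console.WriteLine(First(names)); // z
        Console.WriteLine(Describe(3));  // 3
    }
}
```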

5th Edition (link)

Type-forwarding added
Semantics of ‘variance’ redefined, became a core feature
Multiple types added or updated, including System.Action, System.MulticastDelegate and System.WeakReference
System.Math and System.Double modified to better conform to IEEE

Microsoft Specific Implementation

Another interesting aspect to look at is the Microsoft specific implementation details and notes. The following links are to pdf documents that are modified versions of the 4th edition:

They all contain multiple occurrences of text like this ‘Implementation Specific (Microsoft)’:

More Information

Finally, if you want to find out more there’s a book available (affiliate link):

Exploring the internals of the .NET Runtime

2018-03-23T00:00:00+00:00

I recently appeared on Herding Code and Stackify ‘Developer Things’ podcasts and in both cases, the first question asked was ‘how do you figure out the internals of the .NET runtime’?

This post is an attempt to articulate that process, in the hope that it might be useful to others.

Here are my suggested steps:

Decide what you want to investigate
See if someone else has already figured it out (optional)
Read the ‘Book of the Runtime’
Build from the source
Debugging
Verify against .NET Framework (optional)

Note: As with all these types of lists, just because it worked for me doesn’t mean that it will for everyone. So, ‘your mileage may vary’.

Step One - Decide what you want to investigate

For me, this means working out what question I’m trying to answer, for example here are some previous posts I’ve written:

(it just goes to show, you don’t always need fancy titles!)

I put this as ‘Step 1’ because digging into .NET internals isn’t quick or easy work; some of my posts take weeks to research, so I need a motivation to keep me going, something to focus on. In addition, the CLR isn’t a small run-time, there’s a lot in there, so just blindly trying to find your way around it isn’t easy! That’s why having a specific focus helps, looking at one feature or section at a time is more manageable.

The very first post where I followed this approach was Strings and the CLR - a Special Relationship. I’d previously spent some time looking at the CoreCLR source and I knew a bit about how Strings in the CLR worked, but not all the details. During the research of that post I then found more and more areas of the CLR that I didn’t understand and the rest of my blog grew from there (delegates, arrays, fixed keyword, type loader, etc).

Aside: I think this is generally applicable, if you want to start blogging, but you don’t think you have enough ideas to sustain it, I’d recommend that you start somewhere and other ideas will follow.

Another tip is to look at HackerNews or /r/programming for posts about the ‘internals’ of other runtimes, e.g. Java, Ruby, Python, Go etc, then write the equivalent post about the CLR. One of my most popular posts A Hitchhikers Guide to the CoreCLR Source Code was clearly influenced by equivalent articles!

Finally, for more help with learning, ‘figuring things out’ and explaining them to others, I recommend that you read anything by Julia Evans. Start with Blogging principles I use and So you want to be a wizard (also available as a zine), then work your way through all the other posts related to blogging or writing.

I’ve been hugely influenced, for the better, by Julia’s approach to blogging.

Step Two - See if someone else has already figured it out (optional)

I put this in as ‘optional’, because it depends on your motivation. If you are trying to understand .NET internals for your own education, then feel-free to write about whatever you want. If you are trying to do it to also help others, I’d recommend that you first see what’s already been written about the subject. If, once you’ve done that you still think there is something new or different that you can add, then go ahead, but I try not to just re-hash what is already out there.

To see what’s already been written, you can start with Resources for Learning about .NET Internals or peruse the ‘Internals’ tag on this blog. Another really great resource is all the answers by Hans Passant on StackOverflow. He is prolific and amazingly knowledgeable, here are some examples to get you started:

Step Three - Read the ‘Book of the Runtime’

You won’t get far in investigating .NET internals without coming across the ‘Book of the Runtime’ (BOTR) which is an invaluable resource, even Scott Hanselman agrees!

It was written by the .NET engineering team, for the .NET engineering team, as per this HackerNews comment:

Having worked for 7 years on the .NET runtime team, I can attest that the BOTR is the official reference. It was created as documentation for the engineering team, by the engineering team. And it was (supposed to be) kept up to date any time a new feature was added or changed.

However, just a word of warning, this means that it’s an in-depth, non-trivial document and hard to understand when you are first learning about a particular topic. Several of my blog posts have consisted of the following steps:

Read the BOTR chapter on ‘Topic X’
Understand about 5% of what I read
Go away and learn more (read the source code, read other resources, etc)
GOTO ‘Step 1’, understanding more this time!

Related to this, the source code itself is often as helpful as the BOTR due to the extensive comments, for example this one describing the rules for prestubs really helped me out. The downside of the source code comments is that they are a bit harder to find, whereas the BOTR is all in one place.

Step Four - Build from the source

However, at some point, just reading about the internals of the CLR isn’t enough, you actually need to ‘get your hands dirty’ and see it in action. Now that the Core CLR is open source it’s very easy to build it yourself and, once you’ve done that, there are even more docs to help you out if you are building on different OSes, want to debug, test CoreCLR in conjunction with CoreFX, etc.

But why is building from source useful?

Because it lets you build a Debug/Diagnostic version of the runtime that gives you lots of additional information that isn’t available in the Release/Retail builds. For instance you can view JIT Dumps using COMPlus_JitDump=..., however this is just one of the hundreds of COMPlus_XXX settings you can use.

However, even more useful is the ability to turn on diagnostic logging for a particular area of the CLR. For instance, let’s imagine that we want to find out more about AppDomains and how they work under-the-hood, we can use the following logging configuration settings:

SET COMPLUS_LogEnable=1
SET COMPLUS_LogToFile=1
SET COMPLUS_LogFacility=02000000
SET COMPLUS_LogLevel=A

Where LogFacility is set to LF_APPDOMAIN. There are many other values you can provide as a HEX bit-mask; the full list is available in the source code. If you set these variables and then run an app, you will get a log output like this one. Once you have this log you can very easily search around in the code to find where the messages came from, for instance here are all the places that LF_APPDOMAIN is logged. This is a great technique to find your way into a section of the CLR that you aren’t familiar with, I’ve used it many times to great effect.
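Because LogFacility is a bit-mask, several facilities can be combined by OR-ing their values together. The sketch below shows the idea. Only the LF_APPDOMAIN value (02000000) comes from the settings above; the second facility is a hypothetical placeholder, so check the source code for the real values before using anything like this.

```csharp
using System;
using System.Globalization;

public static class LogFacilityMask
{
    // Value taken from the settings above (LF_APPDOMAIN)
    public const uint LfAppDomain = 0x02000000;

    // Hypothetical second facility, for illustration only
    public const uint SomeOtherFacility = 0x00000008;

    // Formats a mask as 8 hex digits, matching the value shown above
    public static string ToSetting(uint mask) =>
        mask.ToString("X8", CultureInfo.InvariantCulture);

    // Checks whether a facility is switched on in a given hex setting
    public static bool IsEnabled(string setting, uint facility)
    {
        uint mask = uint.Parse(setting, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return (mask & facility) != 0;
    }

    public static void Main()
    {
        string single = ToSetting(LfAppDomain);                       // 02000000
        string combined = ToSetting(LfAppDomain | SomeOtherFacility); // 02000008

        Console.WriteLine($"SET COMPLUS_LogFacility={single}");
        Console.WriteLine($"SET COMPLUS_LogFacility={combined}");
        Console.WriteLine(IsEnabled(combined, LfAppDomain)); // True
    }
}
```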

Step Five - Debugging

For me, the biggest boon of Microsoft open sourcing .NET is that you can discover so much more about the internals without having to resort to ‘old school’ debugging using WinDBG. But there still comes a time when it’s useful to step through the code line-by-line to see what’s going on. The added advantage of having the source code is that you can build a copy locally and then debug through that using Visual Studio which is slightly easier than WinDBG.

I always leave debugging to last, as it can be time-consuming and I only find it helpful when I already know where to set a breakpoint, i.e. I already know which part of the code I want to step through. I once tried to blindly step through the source of the CLR whilst it was starting up and it was very hard to see what was going on, as I’ve said before the CLR is a complex runtime, there are many things happening, so stepping through lots of code, line-by-line can get tricky.

Step Six - Verify against .NET Framework

I put this final step in because the .NET CLR source available on GitHub is the ‘.NET Core’ version of the runtime, which isn’t the same as the full/desktop .NET Framework that’s been around for years. So you may need to verify the behavior matches, if you want to understand the internals ‘as they were’, not just ‘as they will be’ going forward. For instance .NET Core has removed the ability to create App Domains as a way to provide isolation but interestingly enough the internal class lives on!
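A quick way to see this particular difference for yourself is to run the same small program on both runtimes. This is a minimal sketch, assuming that .NET Core signals the missing feature by throwing a PlatformNotSupportedException from AppDomain.CreateDomain; I’d expect it to report ‘False’ on .NET Core and ‘True’ on the .NET Framework, but do check it on the versions you care about.

```csharp
using System;

public static class AppDomainCheck
{
    // Returns true if this runtime can create (and unload) a second AppDomain
    public static bool CanCreateAppDomain()
    {
        try
        {
            AppDomain domain = AppDomain.CreateDomain("Test");
            AppDomain.Unload(domain);
            return true;
        }
        catch (PlatformNotSupportedException)
        {
            // Assumption: this is how the runtime reports the missing feature
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine($"Current domain: {AppDomain.CurrentDomain.FriendlyName}");
        Console.WriteLine($"Can create AppDomains: {CanCreateAppDomain()}");
    }
}
```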

To verify the behaviour, your main option is to debug the CLR using WinDBG. Beyond that, you can resort to looking at the ‘Rotor’ source code (roughly the same as .NET Framework 2.0), or petition Microsoft to release the .NET Framework Source Code (probably not going to happen)!

However, low-level internals don’t change all that often, so more often than not the way things behave in the CoreCLR is the same as they’ve always worked.

Resources

Finally, for your viewing pleasure, here are a few talks related to ‘.NET Internals’:

Discuss this post on /r/programming or /r/dotnet

How generics were added to .NET

2018-03-02T00:00:00+00:00

Discuss this post on HackerNews and /r/programming

Before we dive into the technical details, let’s start with a quick history lesson, courtesy of Don Syme who worked on adding generics to .NET and then went on to design and implement F#, which is a pretty impressive set of achievements!!

Background and History

1999 Initial research, design and planning
- .NET/C# Generics History: Some Photos From Feb 1999
1999 First ‘white paper’ published
- More C#/.NET Generics Research Project History – The MSR white paper
- MSR White Paper: Proposed Extensions to COM+ VOS (Draft) (pdf)
2001 C# Language Design Specification created
- Some History: 2001 “GC#” (Generic C#) research project draft
- MSR - .NET Generics Research Project - Generic C# Specification (pdf)
2001 Research paper published
- Design and Implementation of Generics for the .NET CLR (pdf)
2004 Work completed and all bugs fixed
- Some more .NET/C# Generics Research Project History

Update: Don Syme pointed out another research paper related to .NET generics, Combining Generics, Precompilation and Sharing Between Software Based Processes (pdf)

To give you an idea of how these events fit into the bigger picture, here are the dates of the .NET Framework releases up to 2.0, which was the first version to have generics:

Version number	CLR version	Release date
1.0	1.0	2002-02-13
1.1	1.1	2003-04-24
2.0	2.0	2005-11-07

Aside from the historical perspective, what I find most fascinating is just how much the addition of generics in .NET was due to the work done by Microsoft Research, from .NET/C# Generics History:

It was only through the total dedication of Microsoft Research, Cambridge during 1998-2004, to doing a complete, high quality implementation in both the CLR (including NGEN, debugging, JIT, AppDomains, concurrent loading and many other aspects), and the C# compiler, that the project proceeded.

He then goes on to say:

What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.

Wow, C# and .NET would look very different without all these features!!

The ‘Gyro’ Project - Generics for Rotor

Unfortunately there doesn’t exist a publicly accessible version of the .NET 1.0 and 2.0 source code, so we can’t go back and look at the changes that were made (if I’m wrong, please let me know as I’d love to read it).

However, we do have the next best thing, the ‘Gyro’ project in which the equivalent changes were made to the ‘Shared Source Common Language Implementation’ (SSCLI) code base (a.k.a ‘Rotor’). As an aside, if you want to learn more about the Rotor code base I really recommend the excellent book by Ted Neward, which you can download from his blog.

Gyro 1.0 was released in 2003, which implies that it was created after the work had been done in the real .NET Framework source code. I assume that Microsoft Research wanted to publish the ‘Rotor’ implementation so it could be studied more widely. Gyro is also referenced in one of Don Syme’s posts, Some History: 2001 “GC#” research project draft, from the MSR Cambridge team:

With Dave Berry’s help we later published a version of the corresponding code as the “Gyro” variant of the “Rotor” CLI implementation.

The rest of this post will look at how generics were implemented in the Rotor source code.

Note: There are some significant differences between the Rotor source code and the real .NET framework. Most notably the JIT and GC are completely different implementations (due to licensing issues, listen to DotNetRocks show 360 - Ted Neward and Joel Pobar on Rotor 2.0 for more info). However, the Rotor source does give us an accurate idea about how other core parts of the CLR are implemented, such as the Type-System, Debugger, AppDomains and the VM itself. It’s interesting to compare the Rotor source with the current CoreCLR source and see how much of the source code layout and class names have remained the same.

Implementation

To make things easier for anyone who wants to follow-along, I created a GitHub repo that contains the Rotor code for .NET 1.0 and then checked in the Gyro source code on top, which means that you can see all the changes in one place:

The first thing you notice in the Gyro source is that all the files contain this particular piece of legalese:

 ;    By using this software in any fashion, you are agreeing to be bound by the
 ;    terms of this license.
 ;   
+;    This file contains modifications of the base SSCLI software to support generic
+;    type definitions and generic methods. These modifications are for research
+;    purposes. They do not commit Microsoft to the future support of these or
+;    any similar changes to the SSCLI or the .NET product. -- 31st October, 2002.
+;   
 ;    You must not remove this notice, or any other, from this software.

It’s funny that they needed to add the line ‘They do not commit Microsoft to the future support of these or any similar changes to the SSCLI or the .NET product’, even though they were just a few months away from doing just that!!

Components (Directories) with the most changes

To see where the work was done, let’s start with a high-level view, showing the directories with a significant amount of changes (> 1% of the total changes):

$ git diff --dirstat=lines,1 464bf98 2714cca
0.1% bcl/
4% csharp/csharp/sccomp/
1% debug/di/
9% debug/ee/
1% debug/inc/
9% debug/shell/
5% fjit/
1% ilasm/
5% ildasm/
2% inc/
4% md/compiler/
9% vm/

Note: fjit is the “Fast JIT” compiler, i.e. the version released with Rotor, which was significantly different to the one available in the full .NET framework.

The full output from git diff --dirstat=lines,0 is available here and the output from git diff --stat is here.

0.1% bcl/ is included only to show that very few C# code changes were needed. These were mostly plumbing code to expose the underlying C++ methods and changes to the various ToString() methods to include generic type information, e.g. ‘Class[int,double]’. However there are 2 more significant ones:

bcl/system/reflection/emit/opcodes.cs (diff)
- Add the additional IL opcodes needed to make generics work (this just mirrors the main change made in the core of the runtime, so that the opcodes available in C# are consistent)
bcl/system/reflection/emit/signaturehelper.cs (diff)
- Add the ability to parse method metadata that contains generic related information, such as methods with generic parameters.

Files with the most changes

Next, we’ll take a look at the specific classes/files that had the most changes, as this gives us a really good idea about where the complexity was:

Added	Deleted	Net Change (Added - Deleted)	File
1794	323	1471	debug/di/module.cpp
1418	337	1081	vm/class.cpp
1335	308	1027	vm/jitinterface.cpp
1616	888	728	debug/ee/debugger.cpp
741	46	695	csharp/csharp/sccomp/symmgr.cpp
693	0	693	vm/genmeth.cpp
999	362	637	csharp/csharp/sccomp/clsdrec.cpp
926	321	605	csharp/csharp/sccomp/fncbind.cpp
559	0	559	vm/typeparse.cpp
605	156	449	vm/siginfo.cpp
417	29	388	vm/method.hpp
642	255	387	fjit/fjit.cpp
379	0	379	vm/jitinterfacegen.cpp
3045	2672	373	ilasm/parseasm.cpp
465	94	371	vm/class.h
515	163	352	debug/inc/cordb.h
339	0	339	vm/generics.cpp
733	418	315	csharp/csharp/sccomp/parser.cpp
471	169	302	debug/shell/dshell.cpp
382	88	294	csharp/csharp/sccomp/import.cpp

Components of the Runtime

Now we’ll look at individual components in more detail so we can get an idea of how different parts of the runtime had to change to accommodate generics.

Type System changes

Not surprisingly the bulk of the changes are in the Virtual Machine (VM) component of the CLR and related to the ‘Type System’. Obviously adding ‘parameterised types’ to a type system that didn’t already have them requires wide-ranging and significant changes, which are shown in the list below:

vm/class.cpp (diff)
- Allow the type system to distinguish between open and closed generic types and provide APIs to allow working with them, such as IsGenericVariable() and GetGenericTypeDefinition()
vm/genmeth.cpp (diff)
- Contains the bulk of the functionality to make ‘generic methods’ possible, i.e. MyMethod<T, U>(T item, U filter), including the work done to enable ‘shared instantiation’ of generic methods
vm/typeparse.cpp (diff)
- Changes needed to allow generic types to be looked-up by name, i.e. ‘MyClass[System.Int32]’
vm/siginfo.cpp (diff)
- Adds the ability to work with ‘generic-related’ method signatures
vm/method.hpp (diff) and vm/method.cpp (diff)
- Provides the runtime with generic related methods such as IsGenericMethodDefinition(), GetNumGenericMethodArgs() and GetNumGenericClassArgs()
vm/generics.cpp (diff)
- All the completely new ‘generics’ specific code is in here, mostly related to ‘shared instantiation’ which is explained below
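The open/closed distinction and the name-based lookup described in the list above are still visible from C# today. The following is only a minimal sketch that uses the public reflection APIs (IsGenericParameter, IsGenericTypeDefinition, GetGenericTypeDefinition() and Type.GetType()), not the internal Gyro methods; the class name GenericTypeInspection and its helpers are made up for illustration.

```csharp
using System;

// Hypothetical helper, for illustration only: it uses today's public
// reflection APIs, not the runtime-internal methods added in Gyro.
public static class GenericTypeInspection
{
    public static string Describe(Type type)
    {
        // A type parameter such as the 'T' in List<T>.
        if (type.IsGenericParameter)
            return $"{type}: generic variable";

        // An open definition such as List<> (no type arguments supplied).
        if (type.IsGenericTypeDefinition)
            return $"{type}: open generic type definition";

        // A fully closed instantiation such as List<int>.
        if (type.IsGenericType && !type.ContainsGenericParameters)
            return $"{type}: closed instantiation of {type.GetGenericTypeDefinition()}";

        return type.IsGenericType
            ? $"{type}: partially open generic type"
            : $"{type}: not generic";
    }

    // Looks a generic instantiation up by name, using the `arity[args] syntax.
    public static Type LookupByName(string name) => Type.GetType(name);
}
```

For example, Describe(typeof(System.Collections.Generic.List<>)) reports an open definition, whereas LookupByName("System.Collections.Generic.List`1[System.Int32]") should hand back the same Type object as typeof(List<int>).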

Bytecode or ‘Intermediate Language’ (IL) changes

The main place that the implementation of generics in the CLR differs from the JVM is that they are ‘fully reified’ instead of using ‘type erasure’. This was possible because the CLR designers were willing to break backwards compatibility, whereas the JVM had been around longer, so I assume that this was a much less appealing option there. For more discussion on this issue see Erasure vs reification and Reified Generics for Java. Update: this HackerNews discussion is also worth a read.
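To make ‘fully reified’ a bit more concrete, here is a small sketch of what it looks like from C#. The class and method names are hypothetical; the point is simply that the type argument still exists at run time, so typeof(T) works inside a generic method and different instantiations are different run-time types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, for illustration only.
public static class ReifiedGenerics
{
    // With reified generics the type argument is still known at run time,
    // so typeof(T) can be evaluated inside the generic method.
    public static string NameOf<T>() => typeof(T).FullName;

    // Each instantiation is a distinct run-time type, so this returns false.
    public static bool SameRuntimeType() =>
        new List<int>().GetType() == new List<string>().GetType();
}
```

Under type erasure the equivalent comparison in Java (ArrayList<Integer> against ArrayList<String>) reports the same class for both lists.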

The specific changes made to the .NET Intermediate Language (IL) op-codes can be seen in the inc/opcode.def (diff), in essence the following 3 instructions were added

In addition the IL Assembler tool (ILASM) needed significant changes, as did its counterpart, the IL Disassembler (ILDASM), so that they could handle the additional instructions.

There is also a whole section titled ‘Support for Polymorphism in IL’ that explains these changes in greater detail in Design and Implementation of Generics for the .NET Common Language Runtime.

Shared Instantiations

From Design and Implementation of Generics for the .NET Common Language Runtime

Two instantiations are compatible if for any parameterized class its compilation at these instantiations gives rise to identical code and other execution structures (e.g. field layout and GC tables), apart from the dictionaries described below in Section 4.4. In particular, all reference types are compatible with each other, because the loader and JIT compiler make no distinction for the purposes of field layout or code generation. On the implementation for the Intel x86, at least, primitive types are mutually incompatible, even if they have the same size (floats and ints have different parameter passing conventions). That leaves user-defined struct types, which are compatible if their layout is the same with respect to garbage collection i.e. they share the same pattern of traced pointers

ClassLoader::NewInstantiation(..) source code
TypeHandle::GetCanonicalFormAsGenericArgument() source code

From a comment with more info:

// For an generic type instance return the representative within the class of
// all type handles that share code.  For example, 
//    <int> --> <int>,
//    <object> --> <object>,
//    <string> --> <object>,
//    <List<string>> --> <object>,
//    <Struct<string>> --> <Struct<object>>
//
// If the code for the type handle is not shared then return 
// the type handle itself.
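The mapping in that comment can be mimicked with reflection. This is only a sketch under my own assumptions, not the runtime’s code: the CanonicalForm class and its OfTypeArgument method are hypothetical, and it follows the comment by using object as the shared representative for reference types.

```csharp
using System;
using System.Linq;

// Hypothetical helper that mirrors the mapping in the quoted comment.
public static class CanonicalForm
{
    public static Type OfTypeArgument(Type arg)
    {
        // All reference types share code, so they collapse to object.
        if (!arg.IsValueType)
            return typeof(object);

        // A generic struct is canonicalised argument by argument,
        // e.g. Struct<string> becomes Struct<object>.
        if (arg.IsGenericType && !arg.IsGenericTypeDefinition)
        {
            Type[] canonicalArgs = arg.GetGenericArguments()
                                      .Select(a => OfTypeArgument(a))
                                      .ToArray();
            return arg.GetGenericTypeDefinition().MakeGenericType(canonicalArgs);
        }

        // Primitives and non-generic structs are not shared.
        return arg;
    }
}
```

So int stays as int, both string and List<string> map to object, and a generic struct such as KeyValuePair<string, int> maps to KeyValuePair<object, int>.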

In addition, this comment explains the work that needs to take place to allow shared instantiations when working with generic methods.

Update: If you want more info on the ‘code-sharing’ that takes place, I recommend reading these 4 posts:

Compiler and JIT Changes

It seems like almost every part of the compiler had to change to accommodate generics, which is not surprising given that they touch so many parts of the code we write: Types, Classes and Methods. Some of the biggest changes were:

csharp/csharp/sccomp/clsdrec.cpp - +999 -362 - (diff)
csharp/csharp/sccomp/emitter.cpp - +347 -127 - (diff)
csharp/csharp/sccomp/fncbind.cpp - +926 -321 - (diff)
csharp/csharp/sccomp/import.cpp - +382 -88 - (diff)
csharp/csharp/sccomp/parser.cpp - +733 -418 - (diff)
csharp/csharp/sccomp/symmgr.cpp - +741 -46 - (diff)

In the ‘just-in-time’ (JIT) compiler extra work was needed because it’s responsible for implementing the additional ‘IL Instructions’. The bulk of these changes took place in fjit.cpp (diff) and fjitdef.h (diff).

Finally, a large amount of work was done in vm/jitinterface.cpp (diff) to enable the JIT to access the extra information it needed to emit code for generic methods.

Debugger Changes

Last, but by no means least, a significant amount of work was done to ensure that the debugger could understand and inspect generic types. It goes to show just how much inside information a debugger needs to have of the type system in a managed language.

debug/ee/debugger.cpp (diff)
debug/ee/debugger.h (diff)
debug/di/module.cpp (diff)
debug/di/rsthread.cpp (diff)
debug/shell/dshell.cpp (diff)

Resources for Learning about .NET Internals

2018-01-22T00:00:00+00:00

It all started with a tweet, which seemed to resonate with people:

If you like reading my posts on .NET internals, you'll like all these other blogs. So I've put them together in a thread for you!!
— Matt Warren (@matthewwarren) January 12, 2018

The aim was to list blogs that specifically cover .NET internals at a low-level or to put it another way, blogs that answer the question how does feature ‘X’ work, under-the-hood. The list includes either typical posts for that blog, or just some of my favourites!

Note: for a wider list of .NET and performance related blogs see Awesome .NET Performance by Adam Sitnik

I wouldn’t recommend reading through the entire list, at least not in one go, as your brain will probably melt. Pick some posts/topics that interest you and start with those.

Finally, bear in mind that some of the posts are over 10 years old, so there’s a chance that things have changed since then (however, in my experience, the low-level parts of the CLR are more stable). If you want to double-check the latest behaviour, your best option is to read the source!

Community or Non-Microsoft Blogs

These blogs are all written by non-Microsoft employees (AFAICT), or if they do work for Microsoft, they don’t work directly on the CLR. If I’ve missed any interesting blogs out, please let me know!

Special mention goes to Sasha Goldshtein, he’s been blogging about this longer than anyone!!

All Your Base Are Belong To Us by Sasha Goldshtein (@goldshtn)

Update: I missed out a few blogs and learnt about some new ones:

Honourable mention goes to .NET Type Internals - From a Microsoft CLR Perspective on CodeProject, it’s a great article!!

Book of the Runtime (BotR)

The BotR deserves its own section (thanks to svick for reminding me about it).

If you haven’t heard of the BotR before, there’s a nice FAQ that explains what it is:

The Book of the Runtime is a set of documents that describe components in the CLR and BCL. They are intended to focus more on architecture and invariants and not an annotated description of the codebase.

It was originally created within Microsoft in ~2007, including this document. Developers were responsible to document their feature areas. This helped new devs joining the team and also helped share the product architecture across the team.

To find your way around it, I recommend starting with the table of contents and then diving in.

Note: It’s written for developers working on the CLR, so it’s not an introductory document. I’d recommend reading some of the other blog posts first, then referring to the BotR once you have the basic knowledge. For instance many of my blog posts started with me reading a chapter from the BotR, not fully understanding it, going away and learning some more, writing up what I found and then pointing people to the relevant BotR page for more information.

Microsoft Engineers

The blogs below are written by the actual engineers who worked on, designed or managed various parts of the CLR, so they give a deep insight (again, if I’ve missed any blogs out, please let me know):

Books

Finally, if you prefer reading off-line there are some decent books that discuss .NET Internals (Note: all links are Amazon Affiliate links):

CLR via C#, 4ed by Jeffrey Richter
Shared Source CLI Essentials by David Stutz, Ted Neward and Geoff Shilling (Ted Neward also made a pdf version available to download from his web site)
Writing High-Performance .NET Code by Ben Watson
- His blog is also worth reading, e.g. Digging Into .NET Object Allocation Fundamentals and Digging Into .NET Loop Performance, Bounds-checking, Iteration, and Unrolling
Pro .NET Performance: Optimize Your C# Applications by Sasha Goldshtein

I own copies of all the books listed above and I’ve read them cover-to-cover, they’re fantastic resources.

I’ve also recently been recommended the 2 books below, they look good and certainly the authors know their stuff, but I haven’t read them yet:

*New Release*

Pro .NET Memory Management: For Better Code, Performance, and Scalability by Konrad Kokosa (Nov 2018)

Discuss this post on HackerNews and /r/programming

A look back at 2017

2017-12-31T00:00:00+00:00

I’ve now been blogging consistently for over 2 years (~2 times per month) and I decided it was time for my first ‘retrospective’ post.

Warning: this post contains a large amount of humble brags. If you’ve come here to read about ‘.NET internals’ you’d better check back in a few weeks, when normal service will be resumed!

Overall Stats

Firstly, let’s look at my Google Analytics stats for 2017, showing Page Views and Sessions:

Which clearly shows that I took a bit of a break during the summer! But I still managed over 800K page views, mostly because I was fortunate enough to end up on the front page of HackerNews a few times!

As a comparison, here’s what ‘2017 v 2016’ looks like:

This is cool because it shows a nice trend, more people read my blog posts in 2017 than in 2016 (but I have no idea if it will continue in 2018?!)

Next, here are my top 10 most read posts. Surprisingly enough, my most read post was literally just a list with 68 entries in it!!

Post	Page Views
The 68 things the CLR does before executing a single line of your code	101,382
A Hitchhikers Guide to the CoreCLR Source Code	61,169
A DoS Attack against the C# Compiler	50,884
Analysing C# code on GitHub with BigQuery	40,165
Adding a new Bytecode Instruction to the CLR	39,101
Open Source .NET – 3 years later	36,316
How do .NET delegates work?	36,047
Lowering in the C# Compiler (and what happens when you misuse it)	34,375
How the .NET Runtime loads a Type	32,813
DotNetAnywhere: An Alternative .NET Runtime	26,140

Traffic Sources

I was going to do a write-up on where/how I get my blog traffic, but instead I’d encourage you to read 6 Years of Thoughts on Programming by Henrik Warne as his experience exactly matches mine. But in summary, getting onto the front-page of HackerNews drives a lot of traffic to your site/blog.

Finally, a big thanks to everyone who has read, commented on or shared my blog posts, it means a lot!!

Open Source .NET – 3 years later

2017-12-19T00:00:00+00:00

A little over 3 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as Scott Hanselman said in his Connect 2016 keynote, the community has been contributing in a significant way:

This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:

In addition, I’ve recently done a talk covering this subject, the slides are below:

Microsoft & open source a 'brave new world' - CORESTART 2.0 from Matt Warren

Historical Perspective

Now that we are 3 years down the line, it’s interesting to go back and see what the aims were when it all started. If you want to know more about this, I recommend watching the 2 Channel 9 videos below, made by the Microsoft Engineers involved in the process:

It hasn’t always been plain sailing, it’s fair to say that there have been a few bumps along the way (I guess that’s what happens if you get to see “how the sausage gets made”), but I think that we’ve ended up in a good place.

During the past 3 years there have been a few notable events that I think are worth mentioning:

Samsung developers have made significant contributions to the CoreCLR source code, to support their Tizen OS
Microsoft really are developing ‘out in the open’, you can see this by how often GitHub issues are referenced in the source code
We saw the new Span<T> APIs move their way through the various repos, CoreFXLabs -> CoreCLR -> Roslyn -> CoreFX before turning into a complete feature!
There’s been deeper integration between .NET Core and Mono
Significant Performance Improvements have been made in .NET Core
.NET Core and .NET Desktop have now sufficiently diverged (even though they still share code, such as JIT, GC)
Microsoft have made a concerted effort to ensure that all their Open Source code can be built just using other Open Source code
The Local GC effort has been started, aiming to ‘decouple the GC from the rest of the runtime’
.NET will be finally getting Tiered Compilation

Repository activity over time

But onto the data, first we are going to look at an overview of the level of activity in each repo, by looking at the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (yay sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.

Whilst it’s clear that Visual Studio Code is way ahead of all the other repos in terms of ‘Issues’, it’s interesting to see that the .NET-only ones have the most ‘Pull-Requests’, notably CoreFX (Base Class Libraries), Roslyn (Compiler) and CoreCLR (Runtime).

Overall Participation - Community v. Microsoft

Next we will look at the total participation from the last 3 years, i.e. November 2014 to November 2017. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal, it’s the simplest way to get an idea of the Microsoft/Community split.

Note: You can hover over the bars to get the actual numbers, rather than percentages.

Issues: Microsoft Community

Pull Requests: Microsoft Community

Participation over time - Community v. Microsoft

Finally we can see the ‘per-month’ data from the last 3 years, i.e. November 2014 to November 2017.

Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.

Issues: Microsoft Community

Pull Requests: Microsoft Community

Summary

Discuss this post on Hacker News and /r/programming

A look at the internals of 'Tiered JIT Compilation' in .NET Core

2017-12-15T00:00:00+00:00

The .NET runtime (CLR) has predominantly used a just-in-time (JIT) compiler to convert your executable into machine code (leaving aside ahead-of-time (AOT) scenarios for the time being), as the official Microsoft docs say:

At execution time, a just-in-time (JIT) compiler translates the MSIL into native code. During this compilation, code must pass a verification process that examines the MSIL and metadata to find out whether the code can be determined to be type safe.

But how does that process actually work?

The same docs give us a bit more info:

JIT compilation takes into account the possibility that some code might never be called during execution. Instead of using time and memory to convert all the MSIL in a PE file to native code, it converts the MSIL as needed during execution and stores the resulting native code in memory so that it is accessible for subsequent calls in the context of that process. The loader creates and attaches a stub to each method in a type when the type is loaded and initialized. When a method is called for the first time, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to point directly to the generated native code. Therefore, subsequent calls to the JIT-compiled method go directly to the native code.

Simple really!! However if you want to know more, the rest of this post will explore this process in detail.

In addition, we will look at a new feature that is making its way into the Core CLR, called ‘Tiered Compilation’. This is a big change for the CLR: up till now .NET methods have only been JIT compiled once, on their first usage. Tiered compilation is looking to change that, allowing methods to be re-compiled into a more optimised version, much like the Java HotSpot compiler.

How it works

But before we look at future plans, how does the current CLR allow the JIT to transform a method from IL to native code? Well, they say ‘a picture speaks a thousand words’:

Before the method is JITed

After the method has been JITed

The main things to note are:

The CLR has put in a ‘precode’ and ‘stub’ to divert the initial method call to the PreStubWorker() method (which ultimately calls the JIT). These are hand-written assembly code fragments consisting of only a few instructions.
Once the method has been JITed into ‘native code’, a stable entry point is created. For the rest of the life-time of the method the CLR guarantees that this won’t change, so the rest of the run-time can depend on it remaining stable.
The ‘temporary entry point’ doesn’t go away, it’s still available because there may be other methods that are expecting to call it. However the associated ‘precode fixup’ has been re-written or ‘back patched’ to point to the newly created ‘native code’ instead of PreStubWorker().
The CLR doesn’t change the address of the call instruction in the method that called the method being JITed, it only changes the address inside the ‘precode’. But because all method calls in the CLR go via a precode, the 2nd time the newly JITed method is called, the call will end up at the ‘native code’.
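The points above can be modelled in a few lines of C#. This is purely a toy sketch, not how the CLR is written (the real precode is a handful of assembly instructions): the MethodSlot class and its members are hypothetical, a delegate field stands in for the address inside the ‘precode’, and a caller-supplied function stands in for the JIT. It is also not thread-safe.

```csharp
using System;

// Toy model of a precode slot, for illustration only.
public sealed class MethodSlot
{
    private readonly Func<Func<int, int>> _compile; // stands in for the JIT
    private Func<int, int> _target;                 // stands in for the precode target

    public int CompileCount { get; private set; }

    public MethodSlot(Func<Func<int, int>> compile)
    {
        _compile = compile;
        _target = PreStub; // before JITing, the slot points at the prestub
    }

    // Callers always go through the slot; they are never patched themselves.
    public int Invoke(int arg) => _target(arg);

    // Toy equivalent of PreStubWorker(): compile, back-patch, then run the method.
    private int PreStub(int arg)
    {
        Func<int, int> nativeCode = _compile();
        CompileCount++;
        _target = nativeCode; // 'back patch' the slot to point at the compiled code
        return nativeCode(arg);
    }
}
```

Calling Invoke() twice on the same slot only runs the ‘compile’ step once; the second call goes straight to the compiled delegate because the slot has been re-written, while the caller’s code is unchanged.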

For reference, the ‘stable entry point’ is the same memory location as the IntPtr that is returned when you call the RuntimeMethodHandle.GetFunctionPointer() method.
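If you want to see that address for yourself, something like the sketch below should do it. The EntryPointDemo class and the Square method are just made-up names; the calls themselves (MethodBase.MethodHandle, RuntimeHelpers.PrepareMethod() and RuntimeMethodHandle.GetFunctionPointer()) are regular framework APIs.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

// Hypothetical demo class, for illustration only.
public static class EntryPointDemo
{
    public static int Square(int x) => x * x;

    public static IntPtr GetEntryPoint()
    {
        MethodInfo method = typeof(EntryPointDemo).GetMethod(nameof(Square));
        RuntimeMethodHandle handle = method.MethodHandle;

        // Ask the runtime to prepare (JIT) the method now, rather than on its first call.
        RuntimeHelpers.PrepareMethod(handle);

        // The address the runtime hands out for calling this method.
        return handle.GetFunctionPointer();
    }
}
```

Printing the result with something like Console.WriteLine(EntryPointDemo.GetEntryPoint().ToString("X")) gives you an address that you can then look up in WinDbg.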

If you want to see this process in action for yourself, you can either re-compile the CoreCLR source and add the relevant debug information as I did or just use WinDbg and follow the steps in this excellent blog post (for more on the same topic see ‘Advanced Call Processing in the CLR’ and Vance Morrison’s excellent write-up ‘Digging into interface calls in the .NET Framework: Stub-based dispatch’).

Finally, the different parts of the Core CLR source code that are involved are listed below:

Note: this post isn’t going to look at how the JIT itself works. If you are interested in that, take a look at this excellent overview written by one of the main developers.

JIT and Execution Engine (EE) Interaction

To make all this work the JIT and the EE have to work together. To get an idea of what is involved, take a look at this comment describing the rules that determine which type of precode the JIT can use. All this info is stored in the EE as it’s the only place that has the full knowledge of what a method does, so the JIT has to ask which mode to work in.

In addition, the JIT has to ask the EE what the address of a function’s entry point is. This is done via the following methods:

Precode and Stubs

There are different types of ‘precode’ available: ‘FIXUP’, ‘REMOTING’ or ‘STUB’. You can see the rules for which one is used in MethodDesc::GetPrecodeType(). In addition, because they are such a low-level mechanism, they are implemented differently across CPU architectures, from a comment in the code:

There two implementation options for temporary entrypoints:

(1) Compact entrypoints. They provide as dense entrypoints as possible, but can’t be patched to point to the final code. The call to unjitted method is indirect call via slot.

(2) Precodes. The precode will be patched to point to the final code eventually, thus the temporary entrypoint can be embedded in the code. The call to unjitted method is direct call to direct jump.

We use (1) for x86 and (2) for 64-bit to get the best performance on each platform. For ARM (1) is used.

There’s also a whole lot more information about ‘precode’ available in the BOTR.

Finally, it turns out that you can’t go very far into the internals of the CLR without coming across ‘stubs’ (or ‘trampolines’, ‘thunks’, etc), for instance they’re used in

Tiered Compilation

Before we go any further I want to point out that Tiered Compilation is very much work-in-progress. As an indication, to get it working you currently have to set an environment variable called COMPLUS_EXPERIMENTAL_TieredCompilation. It appears that the current work is focussed on the infrastructure to make it possible (i.e. CLR changes); after that, I assume that there will have to be a fair amount of testing and performance analysis before it’s enabled by default.

If you want to learn about the goals of the feature and how it fits into the wider process of ‘code versioning’, I recommend reading the excellent design docs, including the future roadmap possibilities.

To give an indication of what has been involved so far, there has been work going on in the:

Debugger (e.g. Breakpoints aren’t hit if tiered jitting recompiled the method before the debugger was attached and Source line breakpoints stop working when tiered jitting replaces the code)
Profiling APIs - e.g. Tiered jitting: Implement additional profiler APIs
Diagnostics - (all tracked via Tiered jitting: Design/Implement appropriate diagnostics, e.g. Tiered Jitting: Fix IL to native mapping for ETW)
Interpreter - yes the CLR has a built-in Interpreter
Many other places

If you want to follow along you can take a look at the related issues/PRs, here are the main ones to get you started:

Tiered Compilation step 1
WIP - Tiered Jitting Part Deux
All PRs by Noah Falk (main Microsoft Developer working on the feature)

There is also some nice background information available in Introduce a tiered JIT and if you want to understand how it will eventually make use of changes in the JIT (‘MinOpts’), take a look at Low Tier Back-Off and JIT: enable aggressive inline policy for Tier1.

History - ReJIT

As a quick historical aside, you have previously been able to get the CLR to re-JIT a method for you, but it only worked with the Profiling APIs, which meant you had to write some C/C++ COM code to make it happen! In addition ReJIT only allowed the method to be re-compiled at the same level, so it wouldn’t ever produce more optimised code. It was mostly meant to help monitoring or profiling tools.

How it works

Finally, how does it work? Again, let’s look at some diagrams. Firstly, as a recap, let’s take a look at how things end up once a method has been JITed, with tiered compilation turned off (the same diagram as above):

Now, as a comparison, here’s what the same stage looks like with tiered compilation enabled:

The main difference is that tiered compilation has forced the method call to go through another level of indirection, the ‘pre stub’. This is to make it possible to count the number of times the method is called, then once it has hit the threshold (currently 30), the ‘pre stub’ is re-written to point to the ‘optimised native code’ instead:

Note that the original ‘native code’ is still available, so if needed the changes can be reverted and the method call can go back to the unoptimised version.

Using a counter

We can see a bit more detail about the counter in this comment from prestub.cpp:

    /***************************   CALL COUNTER    ***********************/
    // If we are counting calls for tiered compilation, leave the prestub
    // in place so that we can continue intercepting method invocations.
    // When the TieredCompilationManager has received enough call notifications
    // for this method only then do we back-patch it.
    BOOL fCanBackpatchPrestub = TRUE;
#ifdef FEATURE_TIERED_COMPILATION
    BOOL fEligibleForTieredCompilation = IsEligibleForTieredCompilation();
    if (fEligibleForTieredCompilation)
    {
        CallCounter * pCallCounter = GetCallCounter();
        fCanBackpatchPrestub = pCallCounter->OnMethodCalled(this);
    }
#endif

In essence the ‘stub’ calls back into the TieredCompilationManager until the ‘tiered compilation’ is triggered, once that happens the ‘stub’ is ‘back-patched’ to stop it being called any more.
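To show the idea in isolation, here is a toy C# model of the counting and back-patching. It is a sketch under my own assumptions rather than the CLR’s implementation: the TieredMethod class is hypothetical, the two delegates stand in for the unoptimised and optimised native code, the threshold of 30 is the value mentioned earlier, and it ignores threading and the actual recompilation entirely.

```csharp
using System;

// Toy model of call counting for tiered compilation, for illustration only.
public sealed class TieredMethod
{
    public const int Threshold = 30; // the call-count threshold mentioned above

    private readonly Func<int, int> _tier0; // stands in for the initial native code
    private readonly Func<int, int> _tier1; // stands in for the optimised native code
    private Func<int, int> _target;
    private int _callCount;

    public bool Promoted { get; private set; }

    public TieredMethod(Func<int, int> tier0, Func<int, int> tier1)
    {
        _tier0 = tier0;
        _tier1 = tier1;
        _target = CountingStub; // calls are intercepted until the threshold is hit
    }

    public int Invoke(int arg) => _target(arg);

    private int CountingStub(int arg)
    {
        _callCount++;
        if (_callCount >= Threshold)
        {
            _target = _tier1; // back-patch: later calls skip the counting stub
            Promoted = true;
        }
        return _tier0(arg);   // this call still runs the original code
    }
}
```

In this model the first 30 calls run the original code (the 30th one triggers the patch) and every call after that goes directly to the optimised version, without passing through the counter.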

Why not ‘Interpreted’?

If you’re wondering why tiered compilation doesn’t have an interpreted mode, you’re not alone; I asked the same question (for more info see my previous post on the .NET Interpreter).

And the answer I got was:

There’s already an Interpreter available, or is it not considered suitable for production code?

Its a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn’t the easiest place to start.

How different is the overhead between non-optimised and optimised JITting?

On my machine non-optimized jitting used about ~65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.

But that’s from a few months ago, maybe Mono’s New .NET Interpreter will change things, who knows?

Why not LLVM?

Finally, why aren’t they using LLVM to compile the code? From Introduce a tiered JIT (comment):

There were (and likely still are) significant differences in the LLVM support needed for the CLR versus what is needed for Java, both in GC and in EH, and in the restrictions one must place on the optimizer. To cite just one example: the CLRs GC currently cannot tolerate managed pointers that point off the end of objects. Java handles this via a base/derived paired reporting mechanism. We’d either need to plumb support for this kind of paired reporting into the CLR or restrict LLVM’s optimizer passes to never create these kinds of pointers. On top of that, the LLILC jit was slow and we weren’t sure ultimately what kind of code quality it might produce.

So, figuring out how LLILC might fit into a potential multi-tier approach that did not yet exist seemed (and still seems) premature. The idea for now is to get tiering into the framework and use RyuJit for the second-tier jit. As we learn more, we may discover there is indeed room for higher tier jits, or, at least, understand better what else we need to do before such things make sense.

There is more background info in Introduce a tiered JIT

Summary

One of my favourite side-effects of Microsoft making .NET Open Source and developing out in the open is that we can follow along with work-in-progress features. It’s great being able to download the latest code, try them out and see how they work under-the-hood, yay for OSS!!

Discuss this post on Hacker News

Exploring the BBC micro:bit Software Stack

2017-11-28T00:00:00+00:00

If you grew up in the UK and went to school during the 1980’s or 1990’s there’s a good chance that this picture brings back fond memories:

(image courtesy of Classic Acorn)

I’d imagine that for a large number of computer programmers (currently in their 30’s) the BBC Micro was their first experience of programming. If this applies to you and you want a trip down memory lane, have a read of Remembering: The BBC Micro and The BBC Micro in my education.

Programming the classic Turtle was done in Logo, with code like this:

FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90

Of course, once you knew what you were doing, you would re-write it like so:

REPEAT 4 [FORWARD 100 LEFT 90]

BBC micro:bit

The original Micro was launched as an education tool, as part of the BBC’s Computer Literacy Project and by most accounts was a big success. As a follow-up, in March 2016 the micro:bit was launched as part of the BBC’s ‘Make it Digital’ initiative and 1 million devices were given out to schools and libraries in the UK to ‘help develop a new generation of digital pioneers’ (i.e. get them into programming!)

Aside: I love the difference in branding across 30 years, ‘BBC Micro’ became ‘BBC micro:bit’ (you must include the colon) and ‘Computer Literacy Project’ changed to the ‘Make it Digital Initiative’.

A few weeks ago I walked into my local library, picked up a nice starter kit and then spent a fun few hours watching my son play around with it (I’m worried about how quickly he picked up the basics of programming, I think I might be out of a job in a few years’ time!!)

However once he’d gone to bed it was all mine! The result of my ‘playing around’ is this post, in which I will be exploring the software stack that makes up the micro:bit: what’s in it, what it does and how it all fits together.

If you want to learn about how to program the micro:bit, its hardware or anything else, take a look at this excellent list of resources.

Slightly off-topic, but if you enjoy reading source code you might like these other posts:

BBC micro:bit Software Stack

If we take a high-level view of the stack, it divides up into 3 discrete software components that all sit on top of the hardware itself:

If you would like to build this stack for yourself take a look at the Building with Yotta guide. I also found this post describing The First Video Game on the BBC micro:bit [probably] very helpful.

Runtimes

There are several high-level runtimes available, these are useful because they let you write code in a language other than C/C++ or even create programs by dragging blocks around on a screen. The main ones that I’ve come across are below (see ‘Programming’ for a full list):

Python via MicroPython
JavaScript with Microsoft Programming Experience Toolkit (PXT)
- well actually it’s TypeScript, which is good, we wouldn’t want to rot the brains of impressionable young children with the horrors of JavaScript - Wat!!

They both work in a similar way: the user’s code (Python or TypeScript) is bundled up along with the C/C++ code of the runtime itself and then the entire binary (hex) file is deployed to the micro:bit. When the device starts up, the runtime then looks for the user’s code at a known location in memory and starts interpreting it.

Update: It turns out that I was wrong about the Microsoft PXT, it actually compiles your TypeScript program to native code, very cool! Interestingly, they did it that way because:

Compared to a typical dynamic JavaScript engine, PXT compiles code statically, giving rise to significant time and space performance improvements:

user programs are compiled directly to machine code, and are never in any byte-code form that needs to be interpreted; this results in much faster execution than a typical JS interpreter

there is no RAM overhead for user-code - all code sits in flash; in a dynamic VM there are usually some data-structures representing code

due to lack of boxing for small integers and static class layout the memory consumption for objects is around half the one you get in a dynamic VM (not counting the user-code structures mentioned above)

while there is some runtime support code in PXT, it’s typically around 100KB smaller than a dynamic VM, bringing down flash consumption and leaving more space for user code

The execution time, RAM and flash consumption of PXT code is as a rule of thumb 2x of compiled C code, making it competitive to write drivers and other user-space libraries.

Memory Layout

Just before we go onto the other parts of the software stack I want to take a deeper look at the memory layout. This is important because memory is so constrained on the micro:bit, there is only 16KB of RAM. To put that into perspective, we’ll use the calculation from this StackOverflow question How many bytes of memory is a tweet?

Twitter uses UTF-8 encoded messages. UTF-8 code points can be up to four octets long, making the maximum message size 140 x 4 = 560 8-bit bytes.

If we re-calculate for the newer, longer tweets 280 x 4 = 1,120 bytes. So we could only fit 10 tweets into the available RAM on the micro:bit (it turns out that only ~11K out of the total 16K is available for general use). Which is why it’s worth using a custom version of atoi() to save 350 bytes of RAM!
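
The back-of-the-envelope sum above is easy to check. This is just a sketch of that arithmetic, treating ‘~11K’ as exactly 11 x 1,024 bytes, which is my own rounding assumption:

```python
BYTES_PER_CODE_POINT = 4        # UTF-8 worst case, per the quote above
TWEET_CHARS = 280               # the newer, longer tweets
USABLE_RAM_BYTES = 11 * 1024    # assumption: "~11K" taken as 11 KiB

tweet_bytes = TWEET_CHARS * BYTES_PER_CODE_POINT
tweets_that_fit = USABLE_RAM_BYTES // tweet_bytes

print(tweet_bytes, tweets_that_fit)
```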

The memory layout is specified by the linker at compile-time using NRF51822.ld, there is a sample output available if you want to take a look. Because it’s done at compile-time you run into build errors such as “region RAM overflowed with stack” if you configure it incorrectly.

The table below shows the memory layout from the ‘no SD’ version of a ‘Hello World’ app, i.e. with the maximum amount of RAM available as the Bluetooth (BLE) Soft-Device (SD) support has been removed. By comparison with BLE enabled, you instantly have 8K less RAM available, so things start to get tight!

Name	Start Address	End Address	Size	Percentage
.data	0x20000000	0x20000098	152 bytes	0.93%
.bss	0x20000098	0x20000338	672 bytes	4.10%
Heap (mbed)	0x20000338	0x20000b38	2,048 bytes	12.50%
Empty	0x20000b38	0x20003800	11,464 bytes	69.97%
Stack	0x20003800	0x20004000	2,048 bytes	12.50%

For more info on the section names in the first column see the Wikipedia pages for .data and .bss as well as text, data and bss: Code and Data Size Explained
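
The Size and Percentage columns in the table follow directly from the start and end addresses. As a small sketch (the describe helper is my own, only the addresses come from the table), they can be recomputed like this:

```python
RAM_BYTES = 16 * 1024  # total RAM on the micro:bit

# (name, start address, end address), copied from the table above
REGIONS = [
    (".data",       0x20000000, 0x20000098),
    (".bss",        0x20000098, 0x20000338),
    ("Heap (mbed)", 0x20000338, 0x20000B38),
    ("Empty",       0x20000B38, 0x20003800),
    ("Stack",       0x20003800, 0x20004000),
]


def describe(regions, total=RAM_BYTES):
    """Return (name, size in bytes, percentage of total RAM) per region."""
    rows = []
    for name, start, end in regions:
        size = end - start
        rows.append((name, size, 100 * size / total))
    return rows


rows = describe(REGIONS)
for name, size, percent in rows:
    print(f"{name:<12} {size:>6,} bytes {percent:>6.2f}%")
```

The regions are contiguous and their sizes add up to the full 16KB, so nothing is unaccounted for in the ‘no SD’ layout.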

As a comparison there is a nice image of the micro:bit RAM Layout in this article. It shows what things look like when running MicroPython and you can clearly see the main Python heap in the centre taking up all the remaining space.

microbit-dal

Sitting in the stack below the high-level runtime is the device abstraction layer (DAL), created at Lancaster University in the UK. It’s made up of 5 main components:

core
- High-level components, such as Device, Font, HeapAllocator, Listener and Fiber, often implemented on-top of 1 or more driver classes
types
- Helper types such as ManagedString, Image, Event and PacketBuffer
drivers
- For control of a specific hardware component, such as Accelerometer, Button, Compass, Display, Flash, IO, Serial and Pin
bluetooth
- All the code for the Bluetooth Low Energy (BLE) stack that is shipped with the micro:bit
asm
- Just 4 functions are implemented in assembly, they are swap_context, save_context, save_register_context and restore_register_context. As the names suggest, they handle the ‘context switching’ necessary to make the MicroBit Fiber scheduler work

The image below shows the distribution of ‘Lines of Code’ (LOC), as you can see the majority of the code is in the drivers and bluetooth components.

In addition to providing nice helper classes for working with the underlying devices, the DAL provides the Fiber abstraction to allow asynchronous functions to work. This is useful because you can asynchronously display text on the LED display and your code won’t block whilst it’s scrolling across the screen. In addition the Fiber class is used to handle the interrupts that signal when the buttons on the micro:bit are pushed. This comment from the code clearly lays out what the Fiber scheduler does:

This lightweight, non-preemptive scheduler provides a simple threading mechanism for two main purposes:

1) To provide a clean abstraction for application languages to use when building async behaviour (callbacks). 2) To provide ISR decoupling for EventModel events generated in an ISR context.
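
For a feel of what ‘non-preemptive’ means here, the following is a minimal sketch of a cooperative round-robin scheduler, using Python generators in place of fibers. The Scheduler class and the two example fibers are hypothetical and much simpler than the real microbit-dal code (there is no context switching in assembly, no sleeping and no event model); the point is only that a fiber runs until it chooses to yield.

```python
from collections import deque


class Scheduler:
    """Hypothetical round-robin, non-preemptive scheduler built on generators."""

    def __init__(self):
        self.ready = deque()

    def spawn(self, fiber):
        self.ready.append(fiber)

    def run(self):
        while self.ready:
            fiber = self.ready.popleft()
            try:
                next(fiber)               # run until the fiber yields
            except StopIteration:
                continue                  # fiber finished, drop it
            self.ready.append(fiber)      # otherwise queue it again


log = []


def scroll_text(text):
    for ch in text:
        log.append(f"display {ch}")
        yield                             # give other fibers a turn


def watch_button(presses):
    for n in range(presses):
        log.append(f"button event {n}")
        yield


scheduler = Scheduler()
scheduler.spawn(scroll_text("Hi!"))
scheduler.spawn(watch_button(2))
scheduler.run()
print(log)
```

The interleaving in log shows why scrolling text does not block the rest of the program: each fiber hands control back after a small unit of work.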

Finally, the high-level classes MicroBit.cpp and MicroBit.h are housed in the microbit repository. These classes define the API of the MicroBit runtime and set up the default configuration, as shown in the constructor of MicroBit.cpp:

/**
  * Constructor.
  *
  * Create a representation of a MicroBit device, which includes member variables
  * that represent various device drivers used to control aspects of the micro:bit.
  */
MicroBit::MicroBit() :
    serial(USBTX, USBRX),
    resetButton(MICROBIT_PIN_BUTTON_RESET),
    storage(),
    i2c(I2C_SDA0, I2C_SCL0),
    messageBus(),
    display(),
    buttonA(MICROBIT_PIN_BUTTON_A, MICROBIT_ID_BUTTON_A),
    buttonB(MICROBIT_PIN_BUTTON_B, MICROBIT_ID_BUTTON_B),
    buttonAB(MICROBIT_ID_BUTTON_A,MICROBIT_ID_BUTTON_B, MICROBIT_ID_BUTTON_AB),
    accelerometer(i2c),
    compass(i2c, accelerometer, storage),
    compassCalibrator(compass, accelerometer, display),
    thermometer(storage),
    io(MICROBIT_ID_IO_P0,MICROBIT_ID_IO_P1,MICROBIT_ID_IO_P2,
       MICROBIT_ID_IO_P3,MICROBIT_ID_IO_P4,MICROBIT_ID_IO_P5,
       MICROBIT_ID_IO_P6,MICROBIT_ID_IO_P7,MICROBIT_ID_IO_P8,
       MICROBIT_ID_IO_P9,MICROBIT_ID_IO_P10,MICROBIT_ID_IO_P11,
       MICROBIT_ID_IO_P12,MICROBIT_ID_IO_P13,MICROBIT_ID_IO_P14,
       MICROBIT_ID_IO_P15,MICROBIT_ID_IO_P16,MICROBIT_ID_IO_P19,
       MICROBIT_ID_IO_P20),
    bleManager(storage),
    radio(),
    ble(NULL)
{
...
}

mbed-classic

The software at the bottom of the stack is making use of the ARM mbed OS which is:

.. an open-source embedded operating system designed for the “things” in the Internet of Things (IoT). mbed OS includes the features you need to develop a connected product using an ARM Cortex-M microcontroller.

mbed OS provides a platform that includes:

Security foundations.

Cloud management services.

Drivers for sensors, I/O devices and connectivity.

mbed OS is modular, configurable software that you can customize it to your device and to reduce memory requirements by excluding unused software.

We can see this from the layout of its source. It’s based around common components, which can be combined with a hal (Hardware Abstraction Layer) and a target specific to the hardware you are running on.

More specifically the micro:bit uses the yotta target bbc-microbit-classic-gcc, but it can also use other targets as needed.

For reference, here are the files from the common section of mbed that are used by the micro:bit-dal:

And here are the hardware specific files, targeting the NORDIC - MCU NRF51822:

End-to-end (or top-to-bottom)

Finally, let’s look at a few examples of how the different components within the stack are used in specific scenarios.

Writing to the Display

microbit-dal
- MicroBitDisplay.cpp, handles scrolling, asynchronous updates and other high-level tasks, before handing off to:
mbed-classic
- void port_write(port_t *obj, int value) in port_api.c (‘NORDIC NRF51822’ version), via a call to void write(int value) in PortOut.h, using info from PinNames.h

Storing files on the Flash memory

microbit-dal
- Provides the high-level abstractions, such as:
- FileSystem
- File
- Flash
mbed-classic
- Allows low-level control of the hardware, such as writing to the flash itself either directly or via the SoftDevice (SD) layer

In addition, this comment from MicroBitStorage.h gives a nice overview of how the file system is implemented on-top of the raw flash storage:

* The first 8 bytes are reserved for the KeyValueStore struct which gives core
* information such as the number of KeyValuePairs in the store, and whether the
* store has been initialised.
*
* After the KeyValueStore struct, KeyValuePairs are arranged contiguously until
* the end of the block used as persistent storage.
*
* |-------8-------|--------48-------|-----|---------48--------|
* | KeyValueStore | KeyValuePair[0] | ... | KeyValuePair[N-1] |
* |---------------|-----------------|-----|-------------------|
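
The layout in that comment (an 8-byte header followed by contiguous 48-byte pairs) can be sketched with Python’s struct module. Only the 8 and 48 come from the comment; the split of each pair into a 16-byte key and 32-byte value, the two 32-bit header fields and the 0xCAFE marker are assumptions I’ve made for illustration, so treat this as a model of the idea rather than the actual on-flash format.

```python
import struct

HEADER_SIZE = 8     # KeyValueStore struct, from the comment above
PAIR_SIZE = 48      # one KeyValuePair, from the comment above
KEY_SIZE = 16       # assumption: how the 48 bytes are split
VALUE_SIZE = PAIR_SIZE - KEY_SIZE
PAIR_FORMAT = f"<{KEY_SIZE}s{VALUE_SIZE}s"


def pair_offset(index):
    """Byte offset of KeyValuePair[index] from the start of the block."""
    return HEADER_SIZE + index * PAIR_SIZE


def pack_store(magic, pairs):
    """Serialise a header (assumed: marker + pair count) followed by the pairs."""
    blob = struct.pack("<II", magic, len(pairs))
    for key, value in pairs:
        # struct pads short byte strings with zero bytes up to the field width
        blob += struct.pack(PAIR_FORMAT, key, value)
    return blob


def read_pair(blob, index):
    key, value = struct.unpack_from(PAIR_FORMAT, blob, pair_offset(index))
    return key.rstrip(b"\0"), value.rstrip(b"\0")


blob = pack_store(0xCAFE, [(b"name", b"micro:bit"), (b"mode", b"demo")])
print(len(blob), pair_offset(1), read_pair(blob, 1))
```

Fixed-size records keep lookups trivial: finding pair N is a multiplication and an addition, which suits a device with so little RAM.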

Summary

All-in-all the micro:bit is a very nice piece of kit and hopefully will achieve its goal ‘to help develop a new generation of digital pioneers’. However, it also has a really nice software stack, one that is easy to understand and find your way around.

Microsoft & Open Source a 'Brave New World' - CORESTART 2.0

2017-11-14T00:00:00+00:00

Recently I was fortunate enough to be invited to the CORESTART 2.0 conference to give a talk on Microsoft & Open Source a ‘Brave New World’. It was a great conference, well organised by Tomáš Herceg and the teams from .NET College and Riganti and I had a great time.

I encourage you to attend next year’s ‘Update’ conference if you can and as a bonus you’ll get to see the sights of Prague! Including the Head of Franz Kafka as well as the amazing buildings, castles and bridges that all the guide-books will tell you about!

I’ve not been ‘invited’ to speak at a conference before, so I wasn’t sure what to expect, but there was a great audience and they seemed happy to learn about the Open Source projects that Microsoft are running and what is being done to encourage us (the ‘Community’) to contribute.

The slides for my talk are embedded below and you can also ‘watch’ the entire recording (audio and slides only, no video).

Microsoft & open source a 'brave new world' - CORESTART 2.0 from Matt Warren

Talk Outline

But if you don’t fancy sitting through the whole thing, you can read the summary below and jump straight to the relevant parts

Before

[jump to slide] [direct video link]

Wait, didn’t that happen before? direct link
.NET goes ‘Open Source’ and onto Hacker News direct link
What did they Open Source? direct link
CoreFX, CoreCLR, CoreFX Labs, Roslyn direct link
TypeScript, VS Code and Kestrel direct link

During

[jump to slide] [direct video link]

First PR direct link
Comedy PRs direct link
Good direct link
Bad (‘we got to see how the sausage was made’) direct link
Ugly direct link

After

[jump to slide] [direct video link]

Do .NET Developers Care? direct link
Microsoft the organisation on GitHub direct link
Over 60% of Contributions to .NET Core come from the Community direct link
Are Microsoft telling the Truth? direct link
Analysis of GitHub Repositories - ‘Community v. Microsoft’ direct link
Issues Opened direct link
Pull Requests Created direct link
Do .NET Developers Care? - Conclusions direct link

What Now?

[jump to slide] [direct video link]

How do I Contribute? direct link
Domino Chain Reaction direct link
First CoreFX PR by Ben Adams direct link
First CoreCLR PR by Ben Adams direct link
My main Contributions to the CoreCLR direct link
Will I get told to RTM? direct link

Domino Chain Reaction

Finally, if you’re wondering what the section on ‘Domino Chain Reaction’ is all about, you’ll have to listen to that part of the talk, but the video itself is embedded below:

(Based on actual research, see The Curious Mathematics of Domino Chain Reactions)

A DoS Attack against the C# Compiler

2017-11-08T00:00:00+00:00

Generics in C# are certainly very useful and I find it amazing that we almost didn’t get them:

What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.

So a big thanks is due to Don Syme and the rest of the team at Microsoft Research in Cambridge!

But as well as being useful, I also find some usages of generics mind-bending, for instance I’m still not sure what this code actually means or how to explain it in words:

class Blah<T> where T : Blah<T>

As always, reading an Eric Lippert post helps a lot, but even he recommends against using this specific ‘circular’ pattern.

Recently I spoke at the CORESTART 2.0 conference in Prague, giving a talk on ‘Microsoft and Open-Source – A ‘Brave New World’. Whilst I was there I met the very knowledgeable Jiri Cincura, who blogs at tabs ↹ over ␣ ␣ ␣ spaces. He was giving a great talk on ‘C# 7.1 and 7.2 features’, but also shared with me an excellent code snippet that he called ‘Crazy Class’:

class Class<A, B, C, D, E, F>
{
    class Inner : Class<Inner, Inner, Inner, Inner, Inner, Inner>
    {
        Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner inner;
    }
}

He said:

this is the class that takes crazy amount of time to compile. You can add more Inner.Inner.Inner... to make it even longer (and also generic parameters).

After a bit of digging around I found that someone else had noticed this, see the StackOverflow question Why does field declaration with duplicated nested type in generic class results in huge source code increase? Helpfully the ‘accepted answer’ explains what is going on:

When you combine these two, the way you have done, something interesting happens. The type Outer<T>.Inner is not the same type as Outer<T>.Inner.Inner. Outer<T>.Inner is a subclass of Outer<Outer<T>.Inner> while Outer<T>.Inner.Inner is a subclass of Outer<Outer<Outer<T>.Inner>.Inner>, which we established before as being different from Outer<T>.Inner. So Outer<T>.Inner.Inner and Outer<T>.Inner are referring to different types.

When generating IL, the compiler always uses fully qualified names for types. You have cleverly found a way to refer to types with names whose lengths that grow at exponential rates. That is why as you increase the generic arity of Outer or add additional levels .Y to the field field in Inner the output IL size and compile time grow so quickly.

Clear? Good!!

You probably have to be Jon Skeet, Eric Lippert or a member of the C# Language Design Team (yay, ‘Matt Warren’) to really understand what’s going on here, but that doesn’t stop the rest of us having fun with the code!!

I can’t think of any reason why you’d actually want to write code like this, so please don’t!! (or at least if you do, don’t blame me!!)

For a simple idea of what’s actually happening, let’s take this code (with only 2 ‘Levels’):

class Class<A, B, C, D, E, F>
{
    class Inner : Class<Inner, Inner, Inner, Inner, Inner, Inner>
    {
        Inner.Inner inner;
    }
}

The ‘decompiled’ version actually looks like this:

internal class Class<A, B, C, D, E, F>
{
    private class Inner : Class<Class<A, B, C, D, E, F>.Inner, 
                                Class<A, B, C, D, E, F>.Inner, 
                                Class<A, B, C, D, E, F>.Inner, 
                                Class<A, B, C, D, E, F>.Inner, 
                                Class<A, B, C, D, E, F>.Inner, 
                                Class<A, B, C, D, E, F>.Inner>
    {
        private Class<Class<Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner>.Inner, 
                        Class<Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner>.Inner, 
                        Class<Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner>.Inner, 
                        Class<Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner>.Inner, 
                        Class<Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner>.Inner, 
                        Class<Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner, 
                            Class<A, B, C, D, E, F>.Inner>.Inner>.Inner inner;
    }
}

Wow, no wonder things go wrong quickly!!

Exponential Growth

Firstly let’s check the claim of exponential growth, if you don’t remember your Big O notation you can also think of this as O(very, very bad)!!

To test this out, I’m going to compile the code above, but vary the ‘level’ each time by adding a new .Inner, so ‘Level 5’ looks like this:

Inner.Inner.Inner.Inner.Inner inner;

‘Level 6’ like this, and so on

Inner.Inner.Inner.Inner.Inner.Inner inner;

We then get the following results:

Level	Compile Time (secs)	Working set (KB)	Binary Size (Bytes)
5	1.15	54,288	135,680
6	1.22	59,500	788,992
7	2.00	70,728	4,707,840
8	6.43	121,852	28,222,464
9	33.23	405,472	169,310,208
10	202.10	2,141,272	CRASH

If we look at these results in graphical form, it’s very obvious what’s going on

(the dotted lines are a ‘best fit’ trend-line and they are exponential)
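
To see where the growth comes from, here is a small sketch that builds the fully qualified name the same way the decompiled output above is built up: each extra .Inner wraps six copies of the previous name. It models only the textual name, not the compiler’s actual metadata encoding, and the qualified_name helper is my own. It also computes the level-to-level ratio of the binary sizes from the table.

```python
GENERIC_ARITY = 6  # Class<A, B, C, D, E, F>


def qualified_name(level):
    """Fully qualified name of a field declared with `level` chained Inners."""
    name = "Class<A, B, C, D, E, F>.Inner"
    for _ in range(level):
        name = "Class<" + ", ".join([name] * GENERIC_ARITY) + ">.Inner"
    return name


# Binary sizes in bytes for levels 5 to 9, copied from the table above
BINARY_SIZES = [135_680, 788_992, 4_707_840, 28_222_464, 169_310_208]
size_ratios = [b / a for a, b in zip(BINARY_SIZES, BINARY_SIZES[1:])]

# Keep the levels small: the string itself grows by roughly 6x per level
for level in range(5):
    print(level, len(qualified_name(level)))
print([round(r, 2) for r in size_ratios])
```

Both the name length and the measured binary size grow by a factor that approaches 6 per level, i.e. the number of generic parameters, which fits the explanation in the StackOverflow answer.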

If I compile the code with dotnet build (version 2.0.0), things go really wrong at ‘Level 10’ and the compiler throws an error (full stack trace):

System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.

However your mileage may vary, when I ran the code in Visual Studio 2015 it threw an OutOfMemoryException instead and then promptly restarted itself!! I assume this is because VS is a 32-bit application and it runs out of memory before it can go really wrong!

Mono Compiler

As a comparison, here are the results from the Mono compiler, thanks to Egor Bogatov for putting them together.

Level	Compile Time (secs)	Binary Size (Bytes)
5	0.480	134,144
6	0.502	786,944
7	0.745	4,706,304
8	2.053	28,220,928
9	10.134	169,308,672
10	57.307	1,015,835,136

At ‘Level 10’ it produced a 968.78 MB binary!!

Profiling the Compiler

Finally, I want to look at just where the compiler is spending all its time. From the results above we saw that it was taking over 3 minutes to compile a simple program, with a peak memory usage of 2.14 GB, so what was it actually doing??

Well clearly there’s lots of Types involved and the Compiler seems happy for you to write this code, so I guess it needs to figure it all out. Once it’s done that, it then needs to write all this Type metadata out to a .dll or .exe, which can be 100’s of MB in size.

At a high-level the profiling summary produced by VS looks like this:

However if we take a closer look, we can see the ‘hot-path’ is inside the SerializeTypeReference(..) method in Compilers/Core/Portable/PEWriter/MetadataWriter.cs.

Summary

I’m a bit torn about this, it is clearly an ‘abuse’ of generics!!

In some ways I think that it shouldn’t be fixed, it seems better that the compiler encourages you to not write code like this, rather than making it possible!!

So if it takes 3 mins to compile your code, allocates 2GB of memory and then crashes, take that as a warning!!

Discuss this post on Hacker News, /r/programming and /r/csharp

The post A DoS Attack against the C# Compiler first appeared on my blog Performance is a Feature!

DotNetAnywhere: An Alternative .NET Runtime

2017-10-19T00:00:00+00:00

Recently I was listening to the excellent DotNetRocks podcast and they had Steven Sanderson (of Knockout.js fame) talking about ‘WebAssembly and Blazor’.

In case you haven’t heard about it, Blazor is an attempt to bring .NET to the browser, using the magic of WebAssembly. If you want more info, Scott Hanselman has done a nice write-up of the various .NET/WebAssembly projects.

However, as much as the mention of WebAssembly was pretty cool, what interested me even more was how Blazor was using DotNetAnywhere as the underlying .NET runtime. This post will look at what DotNetAnywhere is, what you can do with it and how it compares to the full .NET framework.

DotNetAnywhere

Firstly it’s worth pointing out that DotNetAnywhere (DNA) is designed to be a fully compliant .NET runtime, which means that it can run .NET dlls/exes that have been compiled to run against the full framework. On top of that (at least in theory) it supports all the following .NET runtime features, which is a pretty impressive list!

Generics

Garbage collection and finalization

Weak references

Full exception handling - try/catch/finally

PInvoke

Interfaces

Delegates

Events

Nullable types

Single-dimensional arrays

Multi-threading

In addition there is some partial support for Reflection

Very limited read-only reflection

typeof(), .GetType(), Type.Name, Type.Namespace, Type.IsEnum(), <object>.ToString() only

Finally, there are a few features that are currently unsupported:

Attributes

Most reflection

Multi-dimensional arrays

Unsafe code

There are various bugs or missing functionality that might prevent your code running under DotNetAnywhere, however several of these have been fixed since Blazor came along, so it’s worth checking against the Blazor version of DotNetAnywhere.

At this point in time the original DotNetAnywhere repo is no longer active (the last sustained activity was in Jan 2012), so it seems that any future development or bug fixes will likely happen in the Blazor repo. If you have ever fixed something in DotNetAnywhere, consider sending a PR there, to help the effort.

Update: In addition there are other forks with various bug fixes and enhancements:

Source Code Layout

What I find most impressive about the DotNetAnywhere runtime is that it was developed by one person and is less than 40,000 lines of code!! For a comparison the .NET framework Garbage Collector is almost 37,000 lines on its own (more info available in my previous post A Hitchhikers Guide to the CoreCLR Source Code).

This makes DotNetAnywhere an ideal learning resource!

Firstly, let’s take a look at the Top-10 largest source files, to see where the complexity is:

Native Code - 17,710 lines in total

LOC	File
3,164	JIT_Execute.c
1,778	JIT.c
1,109	PInvoke_CaseCode.h
630	Heap.c
618	MetaData.c
563	MetaDataTables.h
517	Type.c
491	MetaData_Fill.c
467	MetaData_Search.c
452	JIT_OpCodes.h

Managed Code - 28,783 lines in total

LOC	File
2,393	corlib/System.Globalization/CalendricalCalculations.cs
2,314	corlib/System/NumberFormatter.cs
1,582	System.Drawing/System.Drawing/Pens.cs
1,443	System.Drawing/System.Drawing/Brushes.cs
1,405	System.Core/System.Linq/Enumerable.cs
745	corlib/System/DateTime.cs
693	corlib/System.IO/Path.cs
632	corlib/System.Collections.Generic/Dictionary.cs
598	corlib/System/String.cs
467	corlib/System.Text/StringBuilder.cs

Main areas of functionality

Next, let’s look at the key components in DotNetAnywhere as this gives us a really good idea about what you need to implement a .NET compatible runtime. Along the way, we will also see how they differ from the implementation found in Microsoft’s .NET Framework.

Reading .NET dlls

The first thing DotNetAnywhere has to do is read/understand/parse the .NET Metadata and Code that’s contained in a .dll/.exe. This all takes place in MetaData.c, primarily within the LoadSingleTable(..) function. By adding some debugging code, I was able to get a summary of all the different types of Metadata that are read in from a typical .NET dll, it’s quite an interesting list:

MetaData contains     1 Assemblies (MD_TABLE_ASSEMBLY)
MetaData contains     1 Assembly References (MD_TABLE_ASSEMBLYREF)
MetaData contains     0 Module References (MD_TABLE_MODULEREF)

MetaData contains    40 Type References (MD_TABLE_TYPEREF)
MetaData contains    13 Type Definitions (MD_TABLE_TYPEDEF)
MetaData contains    14 Type Specifications (MD_TABLE_TYPESPEC)
MetaData contains     5 Nested Classes (MD_TABLE_NESTEDCLASS)

MetaData contains    11 Field Definitions (MD_TABLE_FIELDDEF)
MetaData contains     0 Field RVA's (MD_TABLE_FIELDRVA)
MetaData contains     2 Propeties (MD_TABLE_PROPERTY)
MetaData contains    59 Member References (MD_TABLE_MEMBERREF)
MetaData contains     2 Constants (MD_TABLE_CONSTANT)

MetaData contains    35 Method Definitions (MD_TABLE_METHODDEF)
MetaData contains     5 Method Specifications (MD_TABLE_METHODSPEC)
MetaData contains     4 Method Semantics (MD_TABLE_PROPERTY)
MetaData contains     0 Method Implementations (MD_TABLE_METHODIMPL)
MetaData contains    22 Parameters (MD_TABLE_PARAM)

MetaData contains     2 Interface Implementations (MD_TABLE_INTERFACEIMPL)
MetaData contains     0 Implementation Maps? (MD_TABLE_IMPLMAP)

MetaData contains     2 Generic Parameters (MD_TABLE_GENERICPARAM)
MetaData contains     1 Generic Parameter Constraints (MD_TABLE_GENERICPARAMCONSTRAINT)

MetaData contains    22 Custom Attributes (MD_TABLE_CUSTOMATTRIBUTE)
MetaData contains     0 Security Info Items? (MD_TABLE_DECLSECURITY)

For more information on the Metadata see Introduction to CLR metadata, Anatomy of a .NET Assembly – PE Headers and the ECMA specification itself.

Executing .NET IL

Another large piece of functionality within DotNetAnywhere is the ‘Just-in-Time’ Compiler (JIT), i.e. the code that is responsible for executing the IL. This takes place initially in JIT_Execute.c and then in JIT.c. The main ‘execution loop’ is in the JITit(..) function, which contains an impressive 1,374 lines of code and over 200 case statements within a single switch!!

Taking a higher level view, the overall process that it goes through looks like this:

Where the .NET IL Op-Codes (CIL_XXX) are defined in CIL_OpCodes.h and the DotNetAnywhere JIT Op-Codes (JIT_XXX) are defined in JIT_OpCodes.h

Interestingly enough, the JIT is the only place in DotNetAnywhere that uses assembly code, and even then it’s only for win32. It is used to allow a ‘jump’ (a goto) to labels in the C source code, so as IL instructions are executed, control never actually leaves the JITit(..) function; it is just moved around without having to make a full method call.

#ifdef __GNUC__

#define GET_LABEL(var, label) var = &&label

#define GO_NEXT() goto **(void**)(pCurOp++)

#else
#ifdef WIN32

#define GET_LABEL(var, label) \
	{ __asm mov edi, label \
	__asm mov var, edi }

#define GO_NEXT() \
	{ __asm mov edi, pCurOp \
	__asm add edi, 4 \
	__asm mov pCurOp, edi \
	__asm jmp DWORD PTR [edi - 4] }

#endif // WIN32

#endif // __GNUC__
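To show what those macros buy you, here is a minimal sketch of the same ‘jump between labels’ technique. It is not DotNetAnywhere code: the three-instruction bytecode, the run_threaded name and the 64-slot limits are all made up for illustration, and it relies on the GNU C ‘labels as values’ extension (so GCC or Clang), just like the __GNUC__ branch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-instruction bytecode, for illustration only. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Translates opcodes into label addresses once, then executes by jumping
 * from label to label without ever leaving this function.
 * The program must end with OP_HALT and must not underflow the stack. */
int run_threaded(const int *program, size_t length)
{
    /* GNU C extension: &&label yields the address of a label. */
    static void *labels[] = { &&do_push1, &&do_add, &&do_halt };
    void *threaded[64];
    void **pCurOp = threaded;
    int stack[64];
    int sp = 0;
    size_t i;

    assert(length <= 64);
    for (i = 0; i < length; i++) {
        assert(program[i] >= OP_PUSH1 && program[i] <= OP_HALT);
        threaded[i] = labels[program[i]];
    }

/* Fetch the next label address, advance, and jump straight to it. */
#define GO_NEXT() goto **(pCurOp++)

    GO_NEXT();

do_push1:
    stack[sp++] = 1;
    GO_NEXT();

do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    GO_NEXT();

do_halt:
    return stack[sp - 1];

#undef GO_NEXT
}
```

Each handler ends by jumping directly to the next handler, so there is no central switch to return to after every instruction, which is the effect the GO_NEXT() macro is after.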

Differences with the .NET Framework

In the full .NET framework all IL code is turned into machine code by the Just-in-Time Compiler (JIT) before being executed by the CPU.

However, as we’ve already seen, DotNetAnywhere ‘interprets’ the IL instruction-by-instruction, and even though this is done in a file called JIT.c, no machine code is emitted, so the naming seems strange!?

Maybe it’s just a difference of perspective, but it’s not clear to me at what point you move from ‘interpreting’ code to ‘JITting’ it, even after reading the following links I’m not sure!! (can someone enlighten me?)

Garbage Collector

All the code for the DotNetAnywhere Garbage Collector (GC) is contained in Heap.c and is a very readable 600 lines of code. To give you an overview of what it does, here is the list of functions that it exposes:

void Heap_Init();
void Heap_SetRoots(tHeapRoots *pHeapRoots, void *pRoots, U32 sizeInBytes);
void Heap_UnmarkFinalizer(HEAP_PTR heapPtr);
void Heap_GarbageCollect();
U32 Heap_NumCollections();
U32 Heap_GetTotalMemory();

HEAP_PTR Heap_Alloc(tMD_TypeDef *pTypeDef, U32 size);
HEAP_PTR Heap_AllocType(tMD_TypeDef *pTypeDef);
void Heap_MakeUndeletable(HEAP_PTR heapEntry);
void Heap_MakeDeletable(HEAP_PTR heapEntry);

tMD_TypeDef* Heap_GetType(HEAP_PTR heapEntry);

HEAP_PTR Heap_Box(tMD_TypeDef *pType, PTR pMem);
HEAP_PTR Heap_Clone(HEAP_PTR obj);

U32 Heap_SyncTryEnter(HEAP_PTR obj);
U32 Heap_SyncExit(HEAP_PTR obj);

HEAP_PTR Heap_SetWeakRefTarget(HEAP_PTR target, HEAP_PTR weakRef);
HEAP_PTR* Heap_GetWeakRefAddress(HEAP_PTR target);
void Heap_RemovedWeakRefTarget(HEAP_PTR target);

Differences with the .NET Framework

However, like the JIT/Interpreter, the GC has some fundamental differences when compared to the .NET Framework.

Conservative Garbage Collection

Firstly, DotNetAnywhere implements what is known as a Conservative GC. In simple terms this means that it does not know (for sure) which areas of memory are actually references/pointers to objects and which are just a random number (that looks like a memory address). In the Microsoft .NET Framework the JIT calculates this information and stores it in the GCInfo structure so the GC can make use of it. But DotNetAnywhere doesn’t do this.

Instead, during the Mark phase the GC gets all the available ‘roots’, but it will consider all memory addresses within an object as ‘potential’ references (hence it is ‘conservative’). It then has to look up each possible reference, to see if it really points to an ‘object reference’. It does this by keeping track of all memory/heap references in a balanced binary search tree (ordered by memory address).
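To make that lookup step a bit more concrete, here is a minimal sketch in C. It is not DotNetAnywhere’s code: a sorted array with a binary search stands in for the balanced tree, the function names are hypothetical, and I’m assuming that only an exact match on a tracked address counts as a real reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if 'candidate' is one of the tracked heap addresses.
 * 'sorted' must be in ascending order (it stands in for the tree). */
int is_known_heap_address(const uintptr_t *sorted, size_t count,
                          uintptr_t candidate)
{
    size_t lo = 0, hi = count;

    assert(sorted != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == candidate)
            return 1;
        if (sorted[mid] < candidate)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Conservatively scans every word of an object: any word whose value
 * happens to equal a tracked address is treated as a reference. */
size_t count_possible_refs(const uintptr_t *words, size_t wordCount,
                           const uintptr_t *sorted, size_t count)
{
    size_t i, found = 0;

    for (i = 0; i < wordCount; i++)
        found += (size_t)is_known_heap_address(sorted, count, words[i]);
    return found;
}
```

Note that a plain integer field that happens to hold a value equal to a tracked address would be counted too, which is exactly the ‘conservative’ part: the GC may keep an object alive that is really garbage, but it will never free one that is still in use.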

However, this means that all object references have to be stored in the binary tree when they are allocated, which adds some overhead to allocation. In addition, extra memory is needed: 20 bytes per heap entry. We can see this by looking at the tHeapEntry data structure (all pointers are 4 bytes, U8 = 1 byte and padding is ignored); tHeapEntry *pLink[2] is the extra data that is needed just to enable the binary tree lookup.

struct tHeapEntry_ {
    // Left/right links in the heap binary tree
    tHeapEntry *pLink[2];
    // The 'level' of this node. Leaf nodes have lowest level
    U8 level;
    // Used to mark that this node is still in use.
    // If this is set to 0xff, then this heap entry is undeletable.
    U8 marked;
    // Set to 1 if the Finalizer needs to be run.
    // Set to 2 if this has been added to the Finalizer queue
    // Set to 0 when the Finalizer has been run (or there is no Finalizer in the first place)
    // Only set on types that have a Finalizer
    U8 needToFinalize;
    
    // unused
    U8 padding;

    // The type in this heap entry
    tMD_TypeDef *pTypeDef;

    // Used for locking sync, and tracking WeakReference that point to this object
    tSync *pSync;

    // The user memory
    U8 memory[0];
};
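If you want to check the ‘20 bytes’ figure for yourself, the sketch below mirrors the header fields using fixed 32-bit integers in place of the pointers, so the arithmetic matches a 32-bit build whatever machine you compile it on. The HeapEntryHeader32 type and the two helper functions are my own stand-ins, not part of DotNetAnywhere, and static_assert needs a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the tHeapEntry header on a 32-bit target: every pointer
 * is modelled as a uint32_t so the sizes do not depend on the host. */
typedef struct {
    uint32_t pLink[2];       /* left/right tree links: 2 x 4 bytes */
    uint8_t  level;          /* 1 byte */
    uint8_t  marked;         /* 1 byte */
    uint8_t  needToFinalize; /* 1 byte */
    uint8_t  padding;        /* 1 byte */
    uint32_t pTypeDef;       /* 4 bytes */
    uint32_t pSync;          /* 4 bytes */
} HeapEntryHeader32;

/* 8 + 4 + 4 + 4 = 20 bytes in front of every object's own memory. */
static_assert(sizeof(HeapEntryHeader32) == 20, "header should be 20 bytes");

/* Total per-object overhead, in bytes. */
size_t header_bytes(void)
{
    return sizeof(HeapEntryHeader32);
}

/* The part of the overhead that exists only for the tree lookup. */
size_t tree_link_bytes(void)
{
    return sizeof(((HeapEntryHeader32 *)0)->pLink);
}
```

So 8 of the 20 bytes (40%) are there purely to support the binary tree.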

But why does DotNetAnywhere work like this? Fortunately Chris Bacon, the author of DotNetAnywhere, explains:

Mind you, the whole heap code really needs a rewrite to reduce per-object memory overhead, and to remove the need for the binary tree of allocations. Not really thinking of a generational GC, that would probably add to much code. This was something I vaguely intended to do, but never got around to. The current heap code was just the simplest thing to get GC working quickly. The very initial implementation did no GC at all. It was beautifully fast, but ran out of memory rather too quickly.

For more info on ‘Conservative’ and ‘Precise’ GCs see:

GC only does ‘Mark-Sweep’, it doesn’t Compact

Another area in which the GC behaviour differs is that it doesn’t do any Compaction of memory after it’s cleaned up, as Steve Sanderson found out when working on Blazor:

.. During server-side execution we don’t actually need to pin anything, because there’s no interop outside .NET. During client-side execution, everything is (in effect) pinned regardless, because DNA’s GC only does mark-sweep - it doesn’t have any compaction phase.

In addition, when an object is allocated DotNetAnywhere just makes a call to malloc() (see the code that does this in the Heap_Alloc(..) function). So there is no concept of ‘Generations’ or ‘Segments’ that you have in the .NET Framework GC, i.e. no ‘Gen 0’, ‘Gen 1’, or ‘Large Object Heap’.

Threading Model

Finally, let’s take a look at the threading model, which is fundamentally different from the one found in the .NET Framework.

Differences with the .NET Framework

Whilst DotNetAnywhere will happily create new threads and execute them for you, it’s only providing the illusion of true multi-threading. In reality it only runs on one thread, but context switches between the different threads that your program creates:

You can see this in action in the code below (from the Thread_Execute() function). Note the call to JIT_Execute(..) with numInst set to 100:

for (;;) {
    U32 minSleepTime = 0xffffffff;
    I32 threadExitValue;

    status = JIT_Execute(pThread, 100);
    switch (status) {
        ....
    }
}
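The general idea of running several ‘threads’ in slices on one real thread can be sketched as follows. This is only a toy model under my own assumptions: FakeThread, run_slice and run_all are hypothetical names, a ‘thread’ is reduced to a count of instructions left to run, and the real scheduler also has to deal with sleeping, blocking and exit values, none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>

#define SLICE 100 /* instructions per turn, as in JIT_Execute(pThread, 100) */

/* A toy 'thread': just the number of instructions it still has to run. */
typedef struct {
    unsigned remaining;
    unsigned slicesUsed;
} FakeThread;

/* Runs at most numInst instructions; returns 1 once the thread is done. */
static int run_slice(FakeThread *t, unsigned numInst)
{
    unsigned n = t->remaining < numInst ? t->remaining : numInst;

    t->remaining -= n;
    t->slicesUsed++;
    return t->remaining == 0;
}

/* Round-robins over every thread on the single host thread until all
 * of them have finished; returns the total number of slices executed. */
unsigned run_all(FakeThread *threads, size_t count)
{
    unsigned total = 0;
    size_t finished = 0;

    assert(threads != NULL || count == 0);
    while (finished < count) {
        size_t i;

        finished = 0;
        for (i = 0; i < count; i++) {
            if (threads[i].remaining == 0) {
                finished++; /* already done, skip its turn */
                continue;
            }
            if (run_slice(&threads[i], SLICE))
                finished++;
            total++;
        }
    }
    return total;
}
```

Because only one slice is ever running at a time, two ‘threads’ can never be in the middle of the same operation at once, as long as that operation completes within a single slice.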

An interesting side-effect is that the threading code in the DotNetAnywhere corlib implementation is really simple. For instance, the internal implementation of the Interlocked.CompareExchange() function looks like the following. Note the lack of synchronisation that you would normally expect:

tAsyncCall* System_Threading_Interlocked_CompareExchange_Int32(
            PTR pThis_, PTR pParams, PTR pReturnValue) {
    U32 *pLoc = INTERNALCALL_PARAM(0, U32*);
    U32 value = INTERNALCALL_PARAM(4, U32);
    U32 comparand = INTERNALCALL_PARAM(8, U32);

    *(U32*)pReturnValue = *pLoc;
    if (*pLoc == comparand) {
        *pLoc = value;
    }

    return NULL;
}

Benchmarks

As a simple test, I ran some benchmarks from The Computer Language Benchmarks Game - binary-trees, using the simplest C# version

Note: DotNetAnywhere was designed to run on low-memory devices, so it was not meant to have the same performance as the full .NET Framework. Please bear that in mind when looking at the results!!

.NET Framework, 4.6.1 - 0.36 seconds

Invoked=TestApp.exe 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

Exit code      : 0
Elapsed time   : 0.36
Kernel time    : 0.06 (17.2%)
User time      : 0.16 (43.1%)
page fault #   : 6604
Working set    : 25720 KB
Paged pool     : 187 KB
Non-paged pool : 24 KB
Page file size : 31160 KB

DotNetAnywhere - 54.39 seconds

Invoked=dna TestApp.exe 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

Total execution time = 54288.33 ms
Total GC time = 36857.03 ms
Exit code      : 0
Elapsed time   : 54.39
Kernel time    : 0.02 (0.0%)
User time      : 54.15 (99.6%)
page fault #   : 5699
Working set    : 15548 KB
Paged pool     : 105 KB
Non-paged pool : 8 KB
Page file size : 13144 KB

So clearly DotNetAnywhere doesn’t work as fast in this benchmark (0.36 seconds v 54 seconds). However if we look at other benchmarks from the same site, it performs a lot better. It seems that DotNetAnywhere has a significant overhead when allocating objects (a class), which is less obvious when using structs.

	Benchmark 1 (using `classes`)	Benchmark 2 (using `structs`)
Elapsed Time (secs)	3.1	2.0
GC Collections	96	67
Total GC time (msecs)	983.59	439.73

Finally, I really want to thank Chris Bacon, DotNetAnywhere is a great code base and gives a fantastic insight into what needs to happen for a .NET runtime to work.

Discuss this post on Hacker News and /r/programming

The post DotNetAnywhere: An Alternative .NET Runtime first appeared on my blog Performance is a Feature!

Analysing C# code on GitHub with BigQuery

2017-10-12T00:00:00+00:00

Just over a year ago Google made all the open source code on GitHub available for querying within BigQuery and as if that wasn’t enough you can run a terabyte of queries each month for free!

So in this post I am going to be looking at all the C# source code on GitHub and what we can find out from it. Handily a smaller, C# only, dataset has been made available (in BigQuery you are charged per byte read), called fh-bigquery:github_extracts.contents_net_cs and has

5,885,933 unique ‘.cs’ files
792,166,632 lines of code (LOC)
37.17 GB of data

Which is a pretty comprehensive set of C# source code!

The rest of this post will attempt to answer the following questions:

Tabs or Spaces?
regions: ‘should be banned’ or ‘okay in some cases’?
‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
Do C# developers like writing functional code?

Then moving onto some less controversial C# topics:

Which using statements are most widely used?
What NuGet packages are most often included in a .NET project
How many lines of code (LOC) are in a typical C# file?
What is the most widely thrown Exception?
‘async/await all the things’ or not?
Do C# developers like using the var keyword? (Updated)

Before we end up looking at repositories, not just individual C# files:

What is the most popular repository with C# code in it?
Just how many files should you have in a repository?
What are the most popular C# class names?
‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss out some edge-cases, after all Regular Expressions: Now You Have Two Problems:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Tabs or Spaces?

In the entire data-set there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space

Tabs	Tabs %	Spaces	Spaces %	Total
799,055	17.15%	3,859,528	82.85%	4,658,583

Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default)
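The real work was done by a BigQuery query, but the counting logic is simple enough to sketch in C. Treat this as an approximation rather than a translation of the query: the names are hypothetical, and applying the ‘more than 10 lines’ threshold to the combined count, then picking whichever kind of indentation is more common, is my assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t tabs;   /* lines whose first character is a tab */
    size_t spaces; /* lines whose first character is a space */
} IndentCounts;

/* Counts how many lines of 'text' start with a tab or with a space. */
IndentCounts count_indent_lines(const char *text)
{
    IndentCounts counts = { 0, 0 };
    int atLineStart = 1;

    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (atLineStart) {
            if (*text == '\t')
                counts.tabs++;
            else if (*text == ' ')
                counts.spaces++;
        }
        atLineStart = (*text == '\n');
    }
    return counts;
}

/* Returns 't' or 's' for the dominant style, or 0 when the file has
 * 10 or fewer indented lines and is therefore left out of the totals. */
char classify_indent(const char *text)
{
    IndentCounts counts = count_indent_lines(text);

    if (counts.tabs + counts.spaces <= 10)
        return 0;
    return counts.tabs > counts.spaces ? 't' : 's';
}
```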

If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.

`regions`: ‘should be banned’ or ‘okay in some cases’?

It turns out that an impressive 712,498 C# files (out of 5.9 million) contain at least one #region statement (query used); that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)

‘K&R’ or ‘Allman’, where do C# devs like to put their braces?

C# developers overwhelmingly prefer putting an opening brace { on its own line (query used)

separate line	same line	same line (initializer)		total (with brace)	total (all code)
81,306,320 (67%)	40,044,603 (33%)	3,631,947 (2.99%)		121,350,923 (15.32%)	792,166,632

(‘same line initializers’ include code like new { Name = "", .. }, new [] { 1, 2, 3.. })

Do C# developers like writing functional code?

This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.

Here’s the raw percentiles:

Percentile	% of lines using lambdas
10	0.51
25	1.14
50	2.50
75	5.26
90	9.95
95	14.29
99	28.00

So we can say that:

50% of all C# files on GitHub use => on 2.50% (or less) of their lines.
10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%)
5% use => on 1 in 7 lines (14.29%)
1% of files have lambdas on over 1 in 4 of their lines of code (28%), that’s pretty impressive!
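For a single file, the figure behind those percentiles can be sketched like this. The function name is hypothetical and this is plain C rather than the BigQuery query I actually ran; it also shares the same weakness, in that any line containing => is counted, including expression-bodied members and occurrences inside strings or comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the percentage of lines in 'text' that contain "=>". */
double lambda_line_percent(const char *text)
{
    size_t lines = 0, withLambda = 0;

    assert(text != NULL);
    while (*text != '\0') {
        const char *end = strchr(text, '\n');
        size_t len = end ? (size_t)(end - text) : strlen(text);
        size_t i;

        lines++;
        for (i = 0; i + 1 < len; i++) {
            if (text[i] == '=' && text[i + 1] == '>') {
                withLambda++;
                break; /* count each line at most once */
            }
        }
        text += len + (end ? 1 : 0); /* skip the newline if there was one */
    }
    return lines ? 100.0 * (double)withLambda / (double)lines : 0.0;
}
```

Running something like this over every file and then sorting the results gives the percentile table above.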

Which `using` statements are most widely used?

Now on to something a bit more substantial: what are the most widely used using statements in C# code?

The top 10 looks like this (the full results are available):

using statement	count
using System.Collections.Generic;	1,780,646
using System;	1,477,019
using System.Linq;	1,319,830
using System.Text;	902,165
using System.Threading.Tasks;	628,195
using System.Runtime.InteropServices;	431,867
using System.IO;	407,848
using System.Runtime.CompilerServices;	338,686
using System.Collections;	289,867
using System.Reflection;	218,369

However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.

So if we adjust the list to take account of this, the top 10 looks like so:

using statement	count
using System.IO;	407,848
using System.Collections;	289,867
using System.Reflection;	218,369
using System.Diagnostics;	201,341
using System.Threading;	179,168
using System.ComponentModel;	160,681
using System.Web;	160,323
using System.Windows.Forms;	137,003
using System.Globalization;	132,113
using System.Drawing;	127,033

Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:

using statement	count
using NUnit.Framework;	119,463
using UnityEngine;	117,673
using Xunit;	99,099
using Newtonsoft.Json;	81,675
using Newtonsoft.Json.Linq;	29,416
using Moq;	23,546
using UnityEngine.UI;	20,355
using UnityEditor;	19,937
using Amazon.Runtime;	18,941
using log4net;	17,297

What NuGet packages are most often included in a .NET project?

It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub, it’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!

package	count
Newtonsoft.Json	45,055
Microsoft.Web.Infrastructure	16,022
Microsoft.AspNet.Razor	15,109
Microsoft.AspNet.WebPages	14,495
Microsoft.AspNet.Mvc	14,236
EntityFramework	14,191
Microsoft.AspNet.WebApi.Client	13,480
Microsoft.AspNet.WebApi.Core	12,210
Microsoft.Net.Http	11,625
jQuery	10,646
Microsoft.Bcl.Build	10,641
Microsoft.Bcl	10,349
NUnit	10,341
Owin	9,681
Microsoft.Owin	9,202
Microsoft.AspNet.WebApi.WebHost	9,007
WebGrease	8,743
Microsoft.AspNet.Web.Optimization	8,721
Microsoft.AspNet.WebApi	8,179

How many lines of code (LOC) are in a typical C# file?

Are C# developers prone to creating huge files that go on for 1000s of lines? Well, some are, but fortunately it’s the minority of us!!

Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.

Oh dear, Uncle Bob isn’t going to be happy, whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:

And in case you’re wondering, here’s the Top 10 longest C# files!!

File	Lines
MarMot/Input/test.marmot.cs	92663
src/CodenameGenerator/WordRepos/LastNamesRepository.cs	88810
cs_inputtest/cs_02_7000.cs	63004
cs_inputtest/cs_02_6000.cs	54004
src/ML NET20/Utility/UserName.cs	52014
MWBS/Dictionary/DefaultWordDictionary.cs	48912
Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs	48407
UrduProofReader/UrduLibs/Utils.cs	48255
cs_inputtest/cs_02_5000.cs	45004
css/style.cs	44366

What is the most widely thrown `Exception`?

There’s a few interesting results in this query, for instance who knew that so many ApplicationExceptions were thrown and NotSupportedException being so high up the list is a bit worrying!!

Exception	count
throw new ArgumentNullException	699,526
throw new ArgumentException	361,616
throw new NotImplementedException	340,361
throw new InvalidOperationException	260,792
throw new ArgumentOutOfRangeException	160,640
throw new NotSupportedException	110,019
throw new HttpResponseException	74,498
throw new ValidationException	35,615
throw new ObjectDisposedException	31,129
throw new ApplicationException	30,849
throw new UnauthorizedException	21,133
throw new FormatException	19,510
throw new SerializationException	17,884
throw new IOException	15,779
throw new IndexOutOfRangeException	14,778
throw new NullReferenceException	12,372
throw new InvalidDataException	12,260
throw new ApiException	11,660
throw new InvalidCastException	10,510

‘async/await all the things’ or not?

The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:

public async Task<int> GetDotNetCountAsync()
{
    // Suspends GetDotNetCountAsync() to allow the caller (the web server)
    // to accept another request, rather than blocking on this one.
    var html = await _httpClient.GetStringAsync("http://dotnetfoundation.org");

    return Regex.Matches(html, @"\.NET").Count;
}

But how much is it used? Using the query below:

SELECT Count(*) count
FROM
  [fh-bigquery:github_extracts.contents_net_cs]
WHERE
  REGEXP_MATCH(content, r'\sasync\s|\sawait\s')

I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.

Do C# developers like using the `var` keyword?

~~Less that they use async and await, there are 130,590 files that have at least one usage of the var keyword~~

Update: thanks to jairbubbles for pointing out that my var regex was wrong and supplying a fixed version!

More than they use async and await, there are 1,457,154 files that have at least one usage of the var keyword

Just how many files should you have in a repository?

90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.

(again the Y-axis (# files) is logarithmic)

The top 10 largest repositories, by number of C# files are shown below:

Repository	# Files
https://github.com/xen2/mcs	23389
https://github.com/mater06/LEGOChimaOnlineReloaded	14241
https://github.com/Microsoft/referencesource	13051
https://github.com/dotnet/corefx	10652
https://github.com/apo-j/Projects_Working	10185
https://github.com/Microsoft/CodeContracts	9338
https://github.com/drazenzadravec/nequeo	8060
https://github.com/ClearCanvas/ClearCanvas	7946
https://github.com/mwilliamson-firefly/aws-sdk-net	7860
https://github.com/151706061/MacroMedicalSystem	7765

What is the most popular repository with C# code in it?

This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):

repo	stars	files
https://github.com/grpc/grpc	11075	237
https://github.com/dotnet/coreclr	8576	6503
https://github.com/dotnet/roslyn	8422	6351
https://github.com/facebook/yoga	8046	73
https://github.com/bazelbuild/bazel	7123	132
https://github.com/dotnet/corefx	7115	10652
https://github.com/SeleniumHQ/selenium	7024	512
https://github.com/Microsoft/WinObjC	6184	81
https://github.com/qianlifeng/Wox	5674	207
https://github.com/Wox-launcher/Wox	5674	142
https://github.com/ShareX/ShareX	5336	766
https://github.com/Microsoft/Windows-universal-samples	5130	1501
https://github.com/NancyFx/Nancy	3701	957
https://github.com/chocolatey/choco	3432	248
https://github.com/JamesNK/Newtonsoft.Json	3340	650

Interesting that the top spot is a Google Repository! (the C# files in it are sample code for using the GRPC library from .NET)

What are the most popular C# `class` names?

Assuming that I got the regex correct, the most popular C# class names are the following:

Class name	Count
class C	182480
class Program	163462
class Test	50593
class Settings	40841
class Resources	39345
class A	34687
class App	28462
class B	24246
class Startup	18238
class Foo	15198

Yay for Foo, just sneaking into the Top 10!!

‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

Finally, let’s look at the most common file names; as with the using statements, they are dominated by the default ones used in the Visual Studio templates:

File	Count
AssemblyInfo.cs	386822
Program.cs	105280
Resources.Designer.cs	40881
Settings.Designer.cs	35392
App.xaml.cs	21928
Global.asax.cs	16133
Startup.cs	14564
HomeController.cs	13574
RouteConfig.cs	11278
MainWindow.xaml.cs	11169

Discuss this post on Hacker News and /r/csharp

More Information

As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!

How BigQuery Works (only put in at the end of the blog post)

BigQuery analysis of other Programming Languages

The post Analysing C# code on GitHub with BigQuery first appeared on my blog Performance is a Feature!

A look at the internals of 'boxing' in the CLR

2017-08-02T00:00:00+00:00

It’s a fundamental part of .NET and can often happen without you knowing, but how does it actually work? What is the .NET Runtime doing to make boxing possible?

Note: this post won’t be discussing how to detect boxing, how it can affect performance or how to remove it (speak to Ben Adams about that!). It will only be talking about how it works.

As an aside, if you like reading about CLR internals you may find these other posts interesting:

Boxing in the CLR Specification

Firstly it’s worth pointing out that boxing is mandated by the CLR specification ‘ECMA-335’, so the runtime has to provide it:

This means that there are a few key things that the CLR needs to take care of, which we will explore in the rest of this post.

Creating a ‘boxed’ Type

The first thing that the runtime needs to do is create the corresponding reference type (‘boxed type’) for any struct that it loads. You can see this in action, right at the beginning of the ‘Method Table’ creation where it first checks if it’s dealing with a ‘Value Type’, then behaves accordingly. So the ‘boxed type’ for any struct is created up front, when your .dll is imported, then it’s ready to be used by any ‘boxing’ that happens during program execution.

The comment in the linked code is pretty interesting, as it reveals some of the low-level details the runtime has to deal with:

// Check to see if the class is a valuetype; but we don't want to mark System.Enum
// as a ValueType. To accomplish this, the check takes advantage of the fact
// that System.ValueType and System.Enum are loaded one immediately after the
// other in that order, and so if the parent MethodTable is System.ValueType and
// the System.Enum MethodTable is unset, then we must be building System.Enum and
// so we don't mark it as a ValueType.

CPU-specific code-generation

But to see what happens during program execution, let’s start with a simple C# program. The code below creates a custom struct or Value Type, which is then ‘boxed’ and ‘unboxed’:

public struct MyStruct
{
    public int Value;
}

var myStruct = new MyStruct();

// boxing
var boxed = (object)myStruct;

// unboxing
var unboxed = (MyStruct)boxed;

This gets turned into the following IL code, in which you can see the box and unbox.any IL instructions:

L_0000: ldloca.s myStruct
L_0002: initobj TestNamespace.MyStruct
L_0008: ldloc.0 
L_0009: box TestNamespace.MyStruct
L_000e: stloc.1 
L_000f: ldloc.1 
L_0010: unbox.any TestNamespace.MyStruct

Runtime and JIT code

So what does the JIT do with these IL op codes? Well in the normal case it wires up and then inlines the optimised, hand-written, assembly code versions of the ‘JIT Helper Methods’ provided by the runtime. The links below take you to the relevant lines of code in the CoreCLR source:

CPU specific, optimised versions (which are wired-up at run-time):
- JIT_BoxFastMP_InlineGetThread (AMD64 - multi-proc or Server GC, implicit TLS)
- JIT_BoxFastMP (AMD64 - multi-proc or Server GC)
- JIT_BoxFastUP (AMD64 - single-proc and Workstation GC)
- JIT_TrialAlloc::GenBox(..) (x86), which is independently wired-up
JIT inlines the helper function call in the common case, see Compiler::impImportAndPushBox(..)
Generic, less-optimised version, used as a fall-back MethodTable::Box(..)
- Eventually calls into CopyValueClassUnchecked(..)
- Which ties in with the answer to this Stack Overflow question Why is struct better with being less than 16 bytes?

Interestingly enough, the only other ‘JIT Helper Methods’ that get this special treatment are object, string or array allocations, which goes to show just how performance sensitive boxing is.

In comparison, there is only one helper method for ‘unboxing’, called JIT_Unbox(..), which falls back to JIT_Unbox_Helper(..) in the uncommon case and is wired up here (CORINFO_HELP_UNBOX to JIT_Unbox). The JIT will also inline the helper call in the common case, to save the cost of a method call, see Compiler::impImportBlockCode(..).

Note that the ‘unbox helper’ only fetches a reference/pointer to the ‘boxed’ data, it has to then be put onto the stack. As we saw above, when the C# compiler does unboxing it uses the ‘Unbox_Any’ op-code not just the ‘Unbox’ one, see Unboxing does not create a copy of the value for more information.
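The difference between the three operations can be modelled in a few lines of C. This is a conceptual sketch only, nothing like the hand-written assembly helpers in the CLR: the Box type and the function names are hypothetical, a type is identified by an opaque pointer, and a failed type check returns NULL or 0 where the real runtime would throw an InvalidCastException.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A boxed value: a type identity followed by a copy of the value's bytes. */
typedef struct {
    const void *type;     /* stands in for the MethodTable pointer */
    unsigned char data[]; /* the copied value lives here, on the heap */
} Box;

/* 'box': allocate a heap object and copy the value into it. */
Box *box_value(const void *type, const void *value, size_t size)
{
    Box *boxed;

    assert(value != NULL);
    boxed = malloc(sizeof(Box) + size);
    if (boxed == NULL)
        return NULL;
    boxed->type = type;
    memcpy(boxed->data, value, size);
    return boxed;
}

/* 'unbox': check the type, then hand back a pointer INTO the box.
 * Nothing is copied at this point. */
void *unbox(Box *boxed, const void *type)
{
    return (boxed != NULL && boxed->type == type) ? boxed->data : NULL;
}

/* 'unbox.any': unbox, then copy the value out to the caller's storage. */
int unbox_any(Box *boxed, const void *type, void *out, size_t size)
{
    void *inside = unbox(boxed, type);

    if (inside == NULL)
        return 0;
    memcpy(out, inside, size);
    return 1;
}
```

The caller owns the returned box and releases it with free(), whereas in .NET that job falls to the GC.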

Unboxing Stub Creation

As well as ‘boxing’ and ‘unboxing’ a struct, the runtime also needs to help out during the time that a type remains ‘boxed’. To see why, let’s extend MyStruct and override the ToString() method, so that it displays the current Value:

public struct MyStruct
{
    public int Value;
	
    public override string ToString()
    {
        return "Value = " + Value.ToString();
    }
}

Now, if we look at the ‘Method Table’ the runtime creates for the boxed version of MyStruct (remember, value types have no ‘Method Table’), we can see something strange going on. Note that there are 2 entries for MyStruct::ToString, one of which I’ve labelled as an ‘Unboxing Stub’

 Method table summary for 'MyStruct':
 Number of static fields: 0
 Number of instance fields: 1
 Number of static obj ref fields: 0
 Number of static boxed fields: 0
 Number of declared fields: 1
 Number of declared methods: 1
 Number of declared non-abstract methods: 1
 Vtable (with interface dupes) for 'MyStruct':
   Total duplicate slots = 0

 SD: MT::MethodIterator created for MyStruct (TestNamespace.MyStruct).
   slot  0: MyStruct::ToString  0x000007FE41170C10 (slot =  0) (Unboxing Stub)
   slot  1: System.ValueType::Equals  0x000007FEC1194078 (slot =  1) 
   slot  2: System.ValueType::GetHashCode  0x000007FEC1194080 (slot =  2) 
   slot  3: System.Object::Finalize  0x000007FEC14A30E0 (slot =  3) 
   slot  4: MyStruct::ToString  0x000007FE41170C18 (slot =  4) 
   <-- vtable ends here

(full output is available)

So what is this ‘unboxing stub’ and why is it needed?

It’s there because if you call ToString() on a boxed version of MyStruct, it calls the overridden method declared within MyStruct itself (which is what you’d want it to do), not the Object::ToString() version. But, MyStruct::ToString() expects to be able to access any fields within the struct, such as Value in this case. To make that possible, the runtime/JIT has to adjust the this pointer before MyStruct::ToString() is called, as shown in the diagram below:

1. MyStruct:         [0x05 0x00 0x00 0x00]

                     |   Object Header   |   MethodTable  |   MyStruct    |
2. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
                                          ^
                    object 'this' pointer | 

                     |   Object Header   |   MethodTable  |   MyStruct    |
3. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
                                                           ^
                                   adjusted 'this' pointer | 

Key to the diagram

1. Original struct, on the stack
2. The struct being boxed into an object that lives on the heap
3. Adjustment made to this pointer so MyStruct::ToString() will work
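In C terms, the adjustment in step 3 amounts to something like the sketch below. It is a model of the idea, not the stub the runtime emits (that is a few CPU-specific assembly instructions): the type and function names are hypothetical, and the object header is left out because it sits in front of the address that an object reference points to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t Value;
} MyStruct;

/* Layout of the boxed object as seen from an object reference. */
typedef struct {
    const void *methodTable; /* the object 'this' pointer points here */
    MyStruct payload;        /* the struct's fields follow it */
} BoxedMyStruct;

/* The 'real' instance method: it expects a pointer to the raw struct. */
int32_t MyStruct_GetValue(const MyStruct *self)
{
    return self->Value;
}

/* The 'unboxing stub': it receives an object reference, skips over the
 * MethodTable pointer, and forwards to the real method. */
int32_t MyStruct_GetValue_UnboxingStub(const void *boxedThis)
{
    const MyStruct *adjusted;

    assert(boxedThis != NULL);
    /* For this layout the offset is equal to sizeof(void *). */
    adjusted = (const MyStruct *)((const char *)boxedThis
                                  + offsetof(BoxedMyStruct, payload));
    return MyStruct_GetValue(adjusted);
}
```

The real method never needs to know whether it was called on a struct on the stack or on a boxed copy on the heap, because by the time it runs its ‘this’ always points at the first field.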

(If you want more information on .NET object internals, see this useful article)

We can see this in action in the code linked below. Note that the stub only consists of a few assembly instructions (it’s not as heavy-weight as a method call) and there are CPU-specific versions:

MethodDesc::DoPrestub(..) (calls MakeUnboxingStubWorker(..))
MakeUnboxingStubWorker(..) (calls EmitUnboxMethodStub(..) to create the stub)
- i386
- arm
- arm64

The runtime/JIT has to do these tricks to help maintain the illusion that a struct can behave like a class, even though under-the-hood they are very different. See Eric Lippert’s answer to How do ValueTypes derive from Object (ReferenceType) and still be ValueTypes? for a bit more on this.

Hopefully this post has given you some idea of what happens under-the-hood when ‘boxing’ takes place.

Memory Usage Inside the CLR

2017-07-10T00:00:00+00:00

Have you ever wondered where and why the .NET Runtime (CLR) allocates memory? I don’t mean the ‘managed’ memory that your code allocates, e.g. via new MyClass(..), and that the Garbage Collector (GC) then cleans up. I mean the memory that the CLR itself allocates, all the internal data structures that it needs to make it possible for your code to run.

Note: just to clarify, this post will not be telling you how you can analyse the memory usage of your code. For that I recommend using one of the excellent .NET Profilers available, such as dotMemory by JetBrains or the ANTS Memory Profiler from Redgate (I’ve personally used both and they’re great).

The high-level view

Fortunately there’s a fantastic tool that makes it very easy for us to get an overview of memory usage within the CLR itself. It’s called VMMap and it’s part of the excellent Sysinternals Suite.

For this post I will just be using a simple HelloWorld program, so that we can observe what the CLR does in the simplest possible scenario. Obviously things may look a bit different in a more complex app.

Firstly, let’s look at the data over time, in 1 second intervals. The HelloWorld program just prints to the Console and then waits until you press <ENTER>, so once the memory usage has reached its peak it remains there till the program exits.

However, to get a more detailed view, we will now look at the snapshot from 2 seconds into the timeline, when the memory usage has stabilised.

Note: If you want to find out more about memory usage in general, but also specifically how to measure it in .NET applications, I recommend reading this excellent series of posts by Sasha Goldshtein

Also, if like me you always get the different types of memory mixed-up, please read this Stackoverflow answer first What is private bytes, virtual bytes, working set?

‘Image’ Memory

Now we’ve seen the high-level view, let’s take a closer look at the individual chunks, the largest of which is labelled Image, which according to the VMMap help page (see here for info on all memory types):

… represents an executable file such as a .exe or .dll and has been loaded into a process by the image loader. It does not include images mapped as data files, which would be included in the Mapped File memory type. Image mappings can include shareable memory like code. When data regions, like initialized data, is modified, additional private memory is created in the process.

At this point, it’s worth pointing out a few things:

This memory takes up a large amount of the total process memory because I’m using a simple HelloWorld program, in other types of programs it wouldn’t dominate the memory usage as much
I was using a DEBUG version of the CoreCLR, so the CLR specific files System.Private.CoreLib.dll, coreclr.dll, clrjit.dll and CoreRun.exe may well be larger than if they were compiled in RELEASE mode
Some of this memory is potentially ‘shared’ with other processes, compare the numbers in the ‘Total WS’, ‘Private WS’, ‘Shareable WS’ and ‘Shared WS’ columns to see this in action.

‘Managed Heaps’ created by the Garbage Collector

The next largest usage of memory is the GC itself, it pre-allocates several heaps that it can then give out whenever your program allocates an object, for example via code such as new MyClass() or new byte[].

The main thing to note about the image above is that you can clearly see the different heaps: there is 256 MB allocated for Generations (Gen 0, 1, 2) and 128 MB for the ‘Large Object Heap’. In addition, note the difference between the amounts in the Size and the Committed columns. Only the Committed memory is actually being used, the total Size is what the GC pre-allocates or reserves up front from the address space.
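The difference between Size (reserved) and Committed can be sketched in code. On Windows this is done with VirtualAlloc() using MEM_RESERVE and later MEM_COMMIT; the example below is only an analogue, assuming a POSIX system and using mmap()/mprotect() instead. ReservedRegion is a hypothetical helper, not something taken from the GC.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: claims address space up front, commits it on demand.
class ReservedRegion {
public:
    explicit ReservedRegion(std::size_t size) : size_(size) {
        // PROT_NONE: the range is claimed in the address space but not usable yet
        void* p = mmap(nullptr, size_, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        base_ = (p == MAP_FAILED) ? nullptr : static_cast<unsigned char*>(p);
    }
    ~ReservedRegion() { if (base_ != nullptr) munmap(base_, size_); }
    ReservedRegion(const ReservedRegion&) = delete;
    ReservedRegion& operator=(const ReservedRegion&) = delete;

    // Make the next 'bytes' (rounded up to whole pages) readable and writable.
    bool Commit(std::size_t bytes) {
        if (base_ == nullptr) return false;
        const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        const std::size_t rounded = (bytes + page - 1) / page * page;
        if (rounded > size_ - committed_) return false; // would exceed the reservation
        if (mprotect(base_ + committed_, rounded, PROT_READ | PROT_WRITE) != 0) return false;
        committed_ += rounded;
        return true;
    }

    unsigned char* Base() const { return base_; }
    std::size_t Size() const { return size_; }           // the 'Size' column
    std::size_t Committed() const { return committed_; } // the 'Committed' column

private:
    unsigned char* base_ = nullptr;
    std::size_t size_ = 0;
    std::size_t committed_ = 0;
};
```

A region created with a large size and followed by a single Commit(1) reports the full reservation from Size() but only one page from Committed(), which is the same shape as the numbers VMMap shows for the managed heaps.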

If you’re interested, the rules for heap or more specifically segment sizes are helpfully explained in the Microsoft Docs, but simply put, it varies depending on the GC mode (Workstation v Server), whether the process is 32/64-bit and ‘Number of CPUs’.

Internal CLR ‘Heap’ memory

However the part that I’m going to look at for the rest of this post is the memory that is allocated by the CLR itself, that is unmanaged memory that it uses for all its internal data structures.

But if we just look at the VMMap UI view, it doesn’t really tell us that much!

However, using the excellent PerfView tool we can capture the full call-stack of any memory allocations, that is any calls to VirtualAlloc() or RtlAllocateHeap() (obviously these functions only apply when running the CoreCLR on Windows). If we do this, PerfView gives us the following data (yes, it’s not pretty, but it’s very powerful!!)

So let’s explore this data in more detail.

Notable memory allocations

There are a few places where the CLR allocates significant chunks of memory up-front and then uses them through its lifetime, they are listed below:

GC related allocations (see gc.cpp)
- Mark List - 1,052,672 Bytes (1,028 K) in WKS::make_mark_list(..), used during the ‘mark’ phase of the GC, see Back To Basics: Mark and Sweep Garbage Collection
- Card Table - 397,312 Bytes (388 K) in WKS::gc_heap::make_card_table(..), see Marking the ‘Card Table’
- Overall Heap Creation/Allocation - 204,800 Bytes (200 K) in WKS::gc_heap::make_gc_heap(..)
- S.O.H Segment creation - 65,536 Bytes (64 K) in WKS::gc_heap::allocate(..), triggered by the first object allocation
- L.O.H Segment creation - 65,536 Bytes (64 K) in WKS::gc_heap::allocate_large_object(..), triggered by the first ‘large’ object allocation
- Handle Table - 20,480 Bytes (20 K) in HndCreateHandleTable(..)
Stress Log - 4,194,304 Bytes (4,096 K) in StressLog::Initialize(..). Only if the ‘stress log’ is activated, see this comment for more info
‘Watson’ error reporting - 65,536 Bytes (64 K) in EEStartupHelper routine
Virtual Call Stub Manager - 36,864 Bytes (36 K) in VirtualCallStubManager::InitStatic(), which in turn creates the DispatchCache. See ‘Virtual Stub Dispatch’ in the BOTR for more info
Debugger Heap and Control-Block - 28,672 Bytes (28 K) (only if debugging support is needed) in DebuggerHeap::Init(..) and DebuggerRCThread::Init(..), both called via InitializeDebugger(..)

Execution Engine Heaps

However another technique that it uses is to allocate ‘heaps’, often 64K at a time, and then perform individual allocations within the heaps as needed. These heaps are split up into individual use-cases, the most common being for ‘frequently accessed’ data and its counterpart, data that is ‘rarely accessed’, see the explanation from this comment in loaderallocator.hpp for more. This is done to ensure that the CLR retains control over any memory allocations and can therefore prevent ‘fragmentation’.

These heaps are together known as ‘Loader Heaps’ as explained in Drill Into .NET Framework Internals to See How the CLR Creates Runtime Objects (wayback machine version):

LoaderHeaps LoaderHeaps are meant for loading various runtime CLR artifacts and optimization artifacts that live for the lifetime of the domain. These heaps grow by predictable chunks to minimize fragmentation. LoaderHeaps are different from the garbage collector (GC) Heap (or multiple heaps in case of a symmetric multiprocessor or SMP) in that the GC Heap hosts object instances while LoaderHeaps hold together the type system. Frequently accessed artifacts like MethodTables, MethodDescs, FieldDescs, and Interface Maps get allocated on a HighFrequencyHeap, while less frequently accessed data structures, such as EEClass and ClassLoader and its lookup tables, get allocated on a LowFrequencyHeap. The StubHeap hosts stubs that facilitate code access security (CAS), COM wrapper calls, and P/Invoke.

One of the main places you see this high/low-frequency of access is in the heart of the Type system, where different data items are either classified as ‘hot’ (high-frequency) or ‘cold’ (low-frequency), from the ‘Key Data Structures’ section of the BOTR page on ‘Type Loader Design’:

EEClass

MethodTable data are split into “hot” and “cold” structures to improve working set and cache utilization. MethodTable itself is meant to only store “hot” data that are needed in program steady state. EEClass stores “cold” data that are typically only needed by type loading, JITing or reflection. Each MethodTable points to one EEClass.
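As a rough illustration of the hot/cold split, the C++ sketch below hands out memory from two separate bump-pointer heaps. All names (ChunkHeap, HotData, ColdData, LoaderHeapsSketch) are hypothetical, the 64K chunk size simply follows the figure mentioned above, and a real Loader Heap would grow by another chunk instead of failing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump-pointer heap over a single fixed chunk (64K by default).
class ChunkHeap {
public:
    explicit ChunkHeap(std::size_t chunkSize = 64 * 1024) : chunk_(chunkSize) {}

    // Hands out the next 8-byte aligned slice; there is no per-item free.
    void* Alloc(std::size_t bytes) {
        const std::size_t aligned = (bytes + 7) & ~static_cast<std::size_t>(7);
        if (aligned > chunk_.size() - used_) return nullptr; // chunk exhausted
        void* result = chunk_.data() + used_;
        used_ += aligned;
        return result;
    }

    std::size_t Used() const { return used_; }

private:
    std::vector<unsigned char> chunk_;
    std::size_t used_ = 0;
};

struct HotData  { void* slots[4]; };      // stand-in for MethodTable-style data
struct ColdData { char metadata[200]; };  // stand-in for EEClass-style data

// Separate heaps keep hot items next to each other, with no cold data between them.
struct LoaderHeapsSketch {
    ChunkHeap highFrequency;
    ChunkHeap lowFrequency;
};
```

If two HotData items are allocated with a ColdData allocation in between, the two hot items still end up adjacent in the high-frequency heap, which is the working-set and cache benefit the quote describes.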

Further to this, listed below are some specific examples of when each heap type is used:

List of all Low-Frequency Heap usages
- EEClass::operator new (the ‘cold’ scenario above)
- MscorlibBinder::AttachModule(..)
- EETypeHashTable::Create(..)
- COMNlsHashProvider::InitializeDefaultSeed()
- ClassLoader::CreateTypeHandleForTypeKey(..) (when creating function pointers)
List of all High-Frequency Heap usages
- MethodTableBuilder::AllocateNewMT(..) (the ‘hot’ scenario mentioned above)
- ArrayClass::GenerateArrayAccessorCallSig(..)
- ClassLoader::CreateTypeHandleForNonCanonicalGenericInstantiation(..)
- ECall::GetFCallImpl(..)
- ComPlusCall::PopulateComPlusCallMethodDesc(..)
List of all Stub Heap usages
- MethodDesc::DoPrestub(..) (triggers JIT-ting of a method)
- UMEntryThunkCache::GetUMEntryThunk(..) (a DLL Import callback)
- ComCall::CreateGenericComCallStub(..)
- MakeUnboxingStubWorker(..)
List of all Precode Heap usages
List of all Executable Heap usages
- GenerateInitPInvokeFrameHelper(..)
- JIT_TrialAlloc::GenBox(..) (x86 JIT)
- From comment on GetExecutableHeap() ‘The executable heap is intended to only be used by the global loader allocator.’

All the general ‘Loader Heaps’ listed above are allocated in the LoaderAllocator::Init(..) function (link to actual code), the executable and stub heap have the ‘executable’ flag set, all the rest don’t. The size of these heaps is configured in this code, they ‘reserve’ different amounts up front, but they all have a ‘commit’ size that is equivalent to one OS ‘page’.

In addition to the ‘general’ heaps, there are some others that are specifically used by the Virtual Stub Dispatch mechanism, they are known as the indcell_heap, cache_entry_heap, lookup_heap, dispatch_heap and resolve_heap, they’re allocated in this code, using the specified commit/reserve sizes.

Finally, if you’re interested in the mechanics of how the heaps actually work take a look at LoaderHeap.cpp.

JIT Memory Usage

Last, but by no means least, there is one other component in the CLR that extensively allocates memory and that is the JIT. It does so in 2 main scenarios:

‘Transient’ or temporary memory needed when it’s doing the job of converting IL code into machine code
‘Permanent’ memory used when it needs to emit the ‘machine code’ for a method

‘Transient’ Memory

This is needed by the JIT when it is doing the job of converting IL code into machine code for the current CPU architecture. This memory is only needed whilst the JIT is running and can be re-used/discarded later, it is used to hold the internal JIT data structures (e.g. Compiler, BasicBlock, GenTreeStmt, etc).

For example, take a look at the following code from Compiler::fgValueNumber():

...
 // Allocate the value number store.
assert(fgVNPassesCompleted > 0 || vnStore == nullptr);
if (fgVNPassesCompleted == 0)
{
    CompAllocator* allocator = new (this, CMK_ValueNumber) CompAllocator(this, CMK_ValueNumber);
    vnStore                  = new (this, CMK_ValueNumber) ValueNumStore(this, allocator);
}
...

The line vnStore = new (this, CMK_ValueNumber) ... ends up calling the specialised new operator defined in compiler.hpp (code shown below), which as per the comment, uses a custom ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp

/*****************************************************************************
 *  operator new
 *
 *  Note that compGetMem is an arena allocator that returns memory that is
 *  not zero-initialized and can contain data from a prior allocation lifetime.
 *  it also requires that 'sz' be aligned to a multiple of sizeof(int)
 */

inline void* __cdecl operator new(size_t sz, Compiler* context, CompMemKind cmk)
{
    sz = AlignUp(sz, sizeof(int));
    assert(sz != 0 && (sz & (sizeof(int) - 1)) == 0);
    return context->compGetMem(sz, cmk);
}

This technique (of overriding the new operator) is used in lots of places throughout the CLR, for instance there is a generic one implemented in the CLR Host.
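For readers who haven’t seen this pattern before, here is a small stand-alone C++ sketch of it. The Arena class and the Node type are hypothetical and far simpler than the JIT’s allocator; the point is only to show how an operator new that takes an extra argument routes new (arena) T(..) into an arena, and how everything is released in one go.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical fixed-size arena; not the JIT's ArenaAllocator.
class Arena {
public:
    void* Allocate(std::size_t sz) {
        // Round up so that every allocation stays suitably aligned
        const std::size_t align = alignof(std::max_align_t);
        sz = (sz + align - 1) / align * align;
        if (sz > sizeof(buffer_) - used_) throw std::bad_alloc();
        void* p = buffer_ + used_;
        used_ += sz;
        return p;
    }

    // Individual objects are never freed, the whole arena is reset at once
    void Reset() { used_ = 0; }
    std::size_t Used() const { return used_; }

private:
    alignas(std::max_align_t) unsigned char buffer_[4096];
    std::size_t used_ = 0;
};

// Extra-argument form of operator new, selected by 'new (arena) T(...)'
inline void* operator new(std::size_t sz, Arena& arena) { return arena.Allocate(sz); }

// Matching delete, only invoked if a constructor throws during such a 'new'
inline void operator delete(void*, Arena&) noexcept {}

struct Node {
    int   value;
    Node* next;
};
```

With this in place, Node* n = new (arena) Node{1, nullptr}; takes its memory from the arena rather than the general-purpose heap, and arena.Reset() discards all the nodes together, which suits data that is only needed while one method is being compiled. Note that Reset() does not run destructors, so the sketch is only appropriate for trivially destructible types.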

‘Permanent’ Memory

The last type of memory that the JIT uses is ‘permanent’ memory to store the JITted machine code, this is done via calls to Compiler::compGetMem(..), starting from Compiler::compCompile(..) via the call-stack shown below. Note that as before this uses a custom ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp

+ clrjit!ClrAllocInProcessHeap
 + clrjit!ArenaAllocator::allocateHostMemory
  + clrjit!ArenaAllocator::allocateNewPage
   + clrjit!ArenaAllocator::allocateMemory
    + clrjit!Compiler::compGetMem
     + clrjit!emitter::emitGetMem
      + clrjit!emitter::emitAllocInstr
       + clrjit!emitter::emitNewInstrTiny
        + clrjit!emitter::emitIns_R_R
         + clrjit!emitter::emitInsBinary
          + clrjit!CodeGen::genCodeForStoreLclVar
           + clrjit!CodeGen::genCodeForTreeNode
            + clrjit!CodeGen::genCodeForBBlist
             + clrjit!CodeGen::genGenerateCode
              + clrjit!Compiler::compCompile

Real-world example

Finally, to prove that this investigation matches with more real-world scenarios, we can see similar memory usage breakdowns in this GitHub issue: [Question] Reduce memory consumption of CoreCLR

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

Typical profile of CoreCLR’s memory on the GUI applications is the following:

Mapped assembly images - 4.2 megabytes (50%)

JIT-compiler’s memory - 1.7 megabytes (20%)

Execution engine - about 1 megabyte (11%)

Code heap - about 1 megabyte (11%)

Type information - about 0.5 megabyte (6%)

Objects heap - about 0.2 megabyte (2%)

Discuss this post on HackerNews

How the .NET Runtime loads a Type

2017-06-15T00:00:00+00:00

It is something we take for granted every time we run a .NET program, but it turns out that loading a Type or class is a fairly complex process.

So how does the .NET Runtime (CLR) actually load a Type?

If you want the tl;dr it’s done carefully, cautiously and step-by-step

Ensuring Type Safety

One of the key requirements of a ‘Managed Runtime’ is providing Type Safety, but what does it actually mean? From the MSDN page on Type Safety and Security

Type-safe code accesses only the memory locations it is authorized to access. (For this discussion, type safety specifically refers to memory type safety and should not be confused with type safety in a broader respect.) For example, type-safe code cannot read values from another object’s private fields. It accesses types only in well-defined, allowable ways.

So in effect, the CLR has to ensure your Types/Classes are well-behaved and following the rules.

Compiler prevents you from creating an ‘abstract’ class

But let’s look at a more concrete example, using the C# code below

public abstract class AbstractClass
{
    public AbstractClass() {  }
}

public class NormalClass : AbstractClass 
{  
    public NormalClass() {  }
}

public static void Main(string[] args)
{
    var test = new AbstractClass();
}

The compiler quite rightly refuses to compile this and gives the following error, because abstract classes can’t be created, you can only inherit from them.

error CS0144: Cannot create an instance of the abstract class or interface 
        'ConsoleApplication.AbstractClass'

So that’s all well and good, but the CLR can’t rely on all code being created via a well-behaved compiler, or in fact via a compiler at all. So it has to check for and prevent any attempt to create an abstract class.

Writing IL code by hand

One way to circumvent the compiler is to write IL code by hand using the IL Assembler tool (ILAsm) which will do almost no checks on the validity of the IL you give it.

For instance the IL below is the equivalent of writing var test = new AbstractClass(); (if the C# compiler would let us):

.method public hidebysig static void Main(string[] args) cil managed
{
    .entrypoint
    .maxstack 1
    .locals init (
        [0] class ConsoleApplication.NormalClass class2)
	
    // System.InvalidOperationException: Instances of abstract classes cannot be created.
    newobj instance void ConsoleApplication.AbstractClass::.ctor()
		
    stloc.0 
    ldloc.0 
    callvirt instance class [mscorlib]System.Type [mscorlib]System.Object::GetType()
    callvirt instance string [mscorlib]System.Reflection.MemberInfo::get_Name()
    call void [mscorlib]Internal.Console::WriteLine(string)
    ret 
}

Fortunately the CLR has got this covered and will throw an InvalidOperationException when you execute the code. This is due to this check which is hit when the JIT compiles the newobj IL instruction.

Creating Types at run-time

One other way that you can attempt to create an abstract class is at run-time, using reflection (thanks to this blog post for giving me some tips on other ways of creating Types).

This is shown in the code below:

var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);

// System.MissingMethodException: Cannot create an abstract class.
var abstractInstance = Activator.CreateInstance(abstractType);

The compiler is completely happy with this, it doesn’t do anything to prevent or warn you and nor should it. However when you run the code, it will throw an exception, strangely enough a MissingMethodException this time, but it does the job!

The call stack is below:

Activator CreateInstance(..) (C# code)
RtType CreateInstanceSlow(..) (C# code)
RuntimeHandles CreateInstance(..) (extern call)
RuntimeTypeHandle::CreateInstance(..) (C++ implementation)
The actual check that throws a MissingMethodException

One final way (unless I’ve missed some out?) is to use GetUninitializedObject(..) in the FormatterServices class like so:

public static object CreateInstance(Type type)
{
    var constructor = type.GetConstructor(new Type[0]);
    if (constructor == null && !type.IsValueType)
    {
        throw new NotSupportedException(
            "Type '" + type.FullName + "' doesn't have a parameterless constructor");
    }

    var emptyInstance = FormatterServices.GetUninitializedObject(type);
	
    if (constructor == null)
        return null;
		
    return constructor.Invoke(emptyInstance, new object[0]) ?? emptyInstance;
}

var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);

// System.MemberAccessException: Cannot create an abstract class.
var abstractInstance = CreateInstance(abstractType);

Again the run-time stops you from doing this, however this time it decides to throw a MemberAccessException?

This happens via the following call stack:

FormatterServices GetUninitializedObject(..) (C# code)
FormatterServices nativeGetUninitializedObject(..) (extern call)
ReflectionSerialization::GetUninitializedObject(..) (C++ implementation)
Actual check that throws a MemberAccessException

Further Type-Safety Checks

These checks are just one example of what the runtime has to validate when creating types, there are many more things it has to deal with. For instance you can’t:

instantiate an interface
create a Function Pointer type
load a type with invalid IL
box a type containing stack pointers
load a type if any of its generic argument types failed to load
create a subclass of an Array
create virtual, static methods
have methods in an enum
have a class with a method name that is too long (1024 characters if you’re wondering)
and many, many more (for instance, search classcompat.cpp for BuildMethodTableThrowException and methodtablebuilder.cpp for ThrowTypeLoadException)

Loading Types ‘step-by-step’

So we’ve seen that the CLR has to do multiple checks when it’s loading types, but why does it have to load them ‘step-by-step’?

Well in a nutshell, it’s because of circular references and recursion, particularly when dealing with generic types. If we take the code below from section ‘2.1 Load Levels’ in Type Loader Design (BotR):

class A<T> : C<B<T>>
{ }

class B<T> : C<A<T>>
{ }

class C<T>
{ }

These are valid types and class A depends on class B and vice versa. So we can’t load A until we know that B is valid, but we can’t load B, until we’re sure that A is valid, a classic deadlock!!

How does the run-time get round this? Well, from the same BotR page:

The loader initially creates the structure(s) representing the type and initializes them with data that can be obtained without loading other types. When this “no-dependencies” work is done, the structure(s) can be referred from other places, usually by sticking pointers to them into another structures. After that the loader progresses in incremental steps and fills the structure(s) with more and more information until it finally arrives at a fully loaded type. In the above example, the base types of A and B will be approximated by something that does not include the other type, and substituted by the real thing later.

(there is also some more info here)

So it loads types in stages, step-by-step, ensuring each dependent type has reached the same stage before continuing. These ‘Class Load’ stages are shown in the image below and explained in detail in this very helpful source-code comment (Yay for Open-Sourcing the CoreCLR!!)

The different levels are handled in the ClassLoader::DoIncrementalLoad(..) method, which contains the switch statement that deals with them all in turn.
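The idea of raising mutually dependent types one level at a time can be sketched in a few lines of C++. This is a toy model under several assumptions: PhasedLoader, TypeEntry and the reduced set of levels are hypothetical, dependencies are plain names rather than generic instantiations, and no real validation happens at each level.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A reduced, hypothetical set of levels (the CLR has more, see the list further down)
enum class LoadLevel { Begin, ApproxParents, ExactParents, Loaded };

struct TypeEntry {
    std::vector<std::string> dependsOn;
    LoadLevel level = LoadLevel::Begin; // every declared type starts as a placeholder
};

class PhasedLoader {
public:
    void Declare(const std::string& name, std::vector<std::string> deps) {
        types_[name].dependsOn = std::move(deps);
    }

    // Raise 'name' and everything it (transitively) depends on to 'target',
    // one level at a time, so no type gets ahead of its dependencies.
    void Load(const std::string& name, LoadLevel target) {
        std::vector<std::string> closure;
        Collect(name, closure);
        for (int next = 1; next <= static_cast<int>(target); ++next) {
            for (const std::string& typeName : closure) {
                TypeEntry& entry = types_.at(typeName);
                if (static_cast<int>(entry.level) < next)
                    entry.level = static_cast<LoadLevel>(next);
            }
        }
    }

    LoadLevel LevelOf(const std::string& name) const { return types_.at(name).level; }

private:
    // A type is recorded BEFORE its dependencies are visited, so a cycle
    // such as A -> B -> A terminates instead of recursing forever.
    void Collect(const std::string& name, std::vector<std::string>& out) {
        if (std::find(out.begin(), out.end(), name) != out.end()) return;
        out.push_back(name);
        for (const std::string& dep : types_.at(name).dependsOn) Collect(dep, out);
    }

    std::map<std::string, TypeEntry> types_;
};
```

Declaring A as depending on C and B, B as depending on C and A, and then calling Load("A", LoadLevel::Loaded) brings all three types to Loaded without deadlocking, because each level only needs the other types to exist at the previous level.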

However this is part of a bigger process, which controls loading an entire file, also known as a Module or Assembly in .NET terminology. The entire process for that is handled by another dispatch loop (switch statement), that works with the FileLoadLevel enum (definition). So in reality the whole process for loading an Assembly looks like this (the loading of one or more Types happens as sub-steps once the Module has reached the FILE_LOADED stage)

FILE_LOAD_CREATE - DomainFile ctor()
FILE_LOAD_BEGIN - Begin()
FILE_LOAD_FIND_NATIVE_IMAGE - FindNativeImage()
FILE_LOAD_VERIFY_NATIVE_IMAGE_DEPENDENCIES - VerifyNativeImageDependencies()
FILE_LOAD_ALLOCATE - Allocate()
FILE_LOAD_ADD_DEPENDENCIES - AddDependencies()
FILE_LOAD_PRE_LOADLIBRARY - PreLoadLibrary()
FILE_LOAD_LOADLIBRARY - LoadLibrary()
FILE_LOAD_POST_LOADLIBRARY - PostLoadLibrary()
FILE_LOAD_EAGER_FIXUPS - EagerFixups()
FILE_LOAD_VTABLE_FIXUPS - VtableFixups()
FILE_LOAD_DELIVER_EVENTS - DeliverSyncEvents()
FILE_LOADED - FinishLoad()
1. CLASS_LOAD_BEGIN
2. CLASS_LOAD_UNRESTOREDTYPEKEY
3. CLASS_LOAD_UNRESTORED
4. CLASS_LOAD_APPROXPARENTS
5. CLASS_LOAD_EXACTPARENTS
6. CLASS_DEPENDENCIES_LOADED
7. CLASS_LOADED
FILE_LOAD_VERIFY_EXECUTION - VerifyExecution()
FILE_ACTIVE - Activate()
- calls MethodTable::CheckRunClassInitThrowing() and Module::ExpandAll() which trigger/run the static constructors of all the classes in the file/module
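The shape of such a dispatch loop is easy to show in isolation. The C++ sketch below is an assumption-laden miniature: it models only four of the levels listed above, FileSketch, RunStep and LoadTo are hypothetical names, and each step just records which function would have been called.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A small subset of the levels listed above, in the same order
enum class FileLevel { Create, Begin, Allocate, Loaded, Active };

struct FileSketch {
    FileLevel level = FileLevel::Create;
    std::vector<std::string> log; // which step ran, in order
};

// Performs the work for exactly one level and reports success or failure.
bool RunStep(FileSketch& file, FileLevel next) {
    switch (next) {
        case FileLevel::Begin:    file.log.push_back("Begin()");      return true;
        case FileLevel::Allocate: file.log.push_back("Allocate()");   return true;
        case FileLevel::Loaded:   file.log.push_back("FinishLoad()"); return true;
        case FileLevel::Active:   file.log.push_back("Activate()");   return true;
        case FileLevel::Create:   break; // done by the constructor, never a target step
    }
    return false;
}

// The dispatch loop: advance one level at a time until the target is reached.
bool LoadTo(FileSketch& file, FileLevel target) {
    while (file.level < target) {
        const FileLevel next = static_cast<FileLevel>(static_cast<int>(file.level) + 1);
        if (!RunStep(file, next)) return false; // stay at the last level that succeeded
        file.level = next;
    }
    return true;
}
```

Because the current level is stored on the file, a caller can ask for FileLevel::Loaded first and FileLevel::Active later, and the second call only performs the remaining step. That matches the log further down, where the load to LOADED and the load to ACTIVE show up as two separate ‘Load initiated’ blocks.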

We can see this in action if we build a Debug version of the CoreCLR and enable the relevant configuration knobs. For a simple ‘Hello World’ program we get the log output shown below, where LOADER: messages correspond to FILE_LOAD_XXX stages and PHASEDLOAD: messages indicate which CLASS_LOAD_XXX step we are on.

You can also see some of the other events that happen at the same time, these include creation of static variables (STATICS:), thread-statics (THREAD STATICS:) and PreStubWorker which indicates methods being prepared for the JITter.

-------------------------------------------------------------------------------------------------------
This is NOT the full output, it's only the parts that reference 'Program.exe' and its modules/classes
-------------------------------------------------------------------------------------------------------

PEImage: Opened HMODULE C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
StoreFile: Add cached entry (000007FE65174540) with PEFile 000000000040D6E0
Assembly C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe: bits=0x2
LOADER: 439e30:***Program*	>>>Load initiated, LOADED/LOADED
LOADER: 0000000000439E30:***Program*	   loading at level BEGIN
LOADER: 0000000000439E30:***Program*	   loading at level FIND_NATIVE_IMAGE
LOADER: 0000000000439E30:***Program*	   loading at level VERIFY_NATIVE_IMAGE_DEPENDENCIES
LOADER: 0000000000439E30:***Program*	   loading at level ALLOCATE
STATICS: Allocating statics for module Program
Loaded pModule: "C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe".
Module Program: bits=0x2
STATICS: Allocating 72 bytes for precomputed statics in module C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe in LoaderAllocator 000000000043AA18
StoreFile (StoreAssembly): Add cached entry (000007FE65174F28) with PEFile 000000000040D6E0Completed Load Level ALLOCATE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program*	   loading at level ADD_DEPENDENCIES
Completed Load Level ADD_DEPENDENCIES for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program*	   loading at level PRE_LOADLIBRARY
LOADER: 0000000000439E30:***Program*	   loading at level LOADLIBRARY
LOADER: 0000000000439E30:***Program*	   loading at level POST_LOADLIBRARY
LOADER: 0000000000439E30:***Program*	   loading at level EAGER_FIXUPS
LOADER: 0000000000439E30:***Program*	   loading at level VTABLE FIXUPS
LOADER: 0000000000439E30:***Program*	   loading at level DELIVER_EVENTS
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
D::LA: Load Assembly Asy:0x000000000040D8C0 AD:0x0000000000439E30 which:C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
Completed Load Level DELIVER_EVENTS for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program*	   loading at level LOADED
Completed Load Level LOADED for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program*	<<<Load completed, LOADED
In PreStubWorker for System.Environment::SetCommandLineArgs
Prestubworker: method 000007FEC2AE1160M
DoRunClassInit: Request to init 000007FEC3BACCF8T in appdomain 0000000000439E30
RunClassInit: Calling class contructor for type 000007FEC3BACCF8T
In PreStubWorker for System.Environment::.cctor
Prestubworker: method 000007FEC2AE1B10M
DoRunClassInit: Request to init 000007FEC3BACCF8T in appdomain 0000000000439E30
DoRunClassInit: returning SUCCESS for init 000007FEC3BACCF8T in appdomain 0000000000439E30
RunClassInit: Returned Successfully from class contructor for type 000007FEC3BACCF8T
DoRunClassInit: returning SUCCESS for init 000007FEC3BACCF8T in appdomain 0000000000439E30
PHASEDLOAD: LoadTypeHandleForTypeKey for type ConsoleApplication.Program to level LOADED
PHASEDLOAD: table contains:
LoadTypeHandle: Loading Class from Module 000007FE65174718 token 2000002
PHASEDLOAD: Creating loading entry for type ConsoleApplication.Program
PHASEDLOAD: About to do incremental load of type ConsoleApplication.Program (0000000000000000) from level BEGIN
Looking up System.Object by name.
Loading class "ConsoleApplication.Program" from module "C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe" in domain 0x0000000000439E30 
SD: MT::MethodIterator created for System.Object.
EEC::IMD: pNewMD:0x65175178 for tok:0x6000001 (ConsoleApplication.Program::.cctor)
EEC::IMD: pNewMD:0x651751a8 for tok:0x6000002 (ConsoleApplication.Program::.ctor)
EEC::IMD: pNewMD:0x651751d8 for tok:0x6000003 (ConsoleApplication.Program::Main)
STATICS: Placing statics for ConsoleApplication.Program
STATICS: Field placed at non GC offset 0x38
Offset of staticCounter1: 56
STATICS: Field placed at non GC offset 0x40
Offset of staticCounter2: 64
STATICS: Static field bytes needed (0 is normal for non dynamic case)0
STATICS: Placing ThreadStatics for ConsoleApplication.Program
THREAD STATICS: Field placed at non GC offset 0x20
Offset of threadStaticCounter1: 32
THREAD STATICS: Field placed at non GC offset 0x28
Offset of threadStaticCounter2: 40
STATICS: ThreadStatic field bytes needed (0 is normal for non dynamic case)0
CLASSLOADER: AppDomainAgileAttribute for ConsoleApplication.Program is 0
MethodTableBuilder: finished method table for module 000007FE65174718 token 2000002 = 000007FE65175230T 
PHASEDLOAD: About to do incremental load of type ConsoleApplication.Program (000007FE65175230) from level APPROXPARENTS
Notify: 000007FE65175230 ConsoleApplication.Program
Successfully loaded class ConsoleApplication.Program
PHASEDLOAD: Completed full dependency load of type (000007FE65175230)+ConsoleApplication.Program
PHASEDLOAD: Completed full dependency load of type (000007FE65175230)+ConsoleApplication.Program
LOADER: 439e30:***Program*	>>>Load initiated, ACTIVE/ACTIVE
LOADER: 0000000000439E30:***Program*	   loading at level VERIFY_EXECUTION
LOADER: 0000000000439E30:***Program*	   loading at level ACTIVE
Completed Load Level ACTIVE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program*	<<<Load completed, ACTIVE
In PreStubWorker for ConsoleApplication.Program::Main
Prestubworker: method 000007FE651751D8M
    In PreStubWorker, calling MakeJitWorker
CallCompileMethodWithSEHWrapper called...
D::gV: cVars=0, extendOthers=1
Looking up System.Console by name.
SD: MT::MethodIterator created for System.Console.
JitComplete completed successfully
Got through CallCompile MethodWithSEHWrapper
MethodDesc::MakeJitWorker finished. Stub is 000007fe`652d0480 
DoRunClassInit: Request to init 000007FE65175230T in appdomain 0000000000439E30
RunClassInit: Calling class contructor for type 000007FE65175230T
In PreStubWorker for ConsoleApplication.Program::.cctor
Prestubworker: method 000007FE65175178M
    In PreStubWorker, calling MakeJitWorker
CallCompileMethodWithSEHWrapper called...
D::gV: cVars=0, extendOthers=1
JitComplete completed successfully
Got through CallCompile MethodWithSEHWrapper
MethodDesc::MakeJitWorker finished. Stub is 000007fe`652d04c0

So there you have it, the CLR loads your classes/Types carefully, cautiously and step-by-step!!

Discuss this post on HackerNews and /r/programming

As always, here are some more links if you’d like to find out further information:

Type Loader Design (BotR)
Type System Overview (BotR)
JIT compiler and type constructors (.cctors) (i.e. ‘When do class constructors (.cctor) get run’?)
Why Do Initializers Run In The Opposite Order As Constructors? Part Two
Disallow statics of spans and class instance members of span (PR)
Span: Add tests to verify type loader checks for ref-like types #8516
Back to Basics: When does a .NET Assembly Dependency get loaded

The post How the .NET Runtime loads a Type first appeared on my blog Performance is a Feature!

Lowering in the C# Compiler (and what happens when you misuse it)

2017-05-25T00:00:00+00:00

Turns out that what I’d always thought of as “Compiler magic” or “Syntactic sugar” is actually known by the technical term ‘Lowering’ and the C# compiler (a.k.a Roslyn) uses it extensively.

But what is it? Well this quote from So You Want To Write Your Own Language? gives us some idea:

Lowering One semantic technique that is obvious in hindsight (but took Andrei Alexandrescu to point out to me) is called “lowering.” It consists of, internally, rewriting more complex semantic constructs in terms of simpler ones. For example, while loops and foreach loops can be rewritten in terms of for loops. Then, the rest of the code only has to deal with for loops. This turned out to uncover a couple of latent bugs in how while loops were implemented in D, and so was a nice win. It’s also used to rewrite scope guard statements in terms of try-finally statements, etc. Every case where this can be found in the semantic processing will be win for the implementation.

– by Walter Bright (author of the D programming language)

But if you’re still not sure what it means, have a read of Eric Lippert’s post on the subject, Lowering in language design, which contains this quote:

A common technique along the way though is to have the compiler “lower” from high-level language features to low-level language features in the same language.
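To show the idea in its smallest form, here is a toy C++ sketch of the while-to-for rewrite that Walter Bright mentions. It has nothing to do with how Roslyn represents code: Loop, Lower and Emit are hypothetical, and the loop parts are kept as plain strings purely for illustration.

```cpp
#include <cassert>
#include <string>

enum class Kind { While, For };

// A toy 'syntax node' for a loop
struct Loop {
    Kind        kind;
    std::string init; // only meaningful for For
    std::string cond;
    std::string step; // only meaningful for For
    std::string body;
};

// Lowering: rewrite the higher-level construct in terms of the simpler one.
Loop Lower(const Loop& loop) {
    if (loop.kind == Kind::While) {
        // 'while (cond) body' is just a 'for' with no initialiser and no step
        return Loop{Kind::For, "", loop.cond, "", loop.body};
    }
    return loop;
}

// Later stages only need to understand For loops.
std::string Emit(const Loop& loop) {
    assert(loop.kind == Kind::For);
    return "for (" + loop.init + "; " + loop.cond + "; " + loop.step + ") { " + loop.body + " }";
}
```

Lowering a While node with the condition i < 10 and the body i++; and passing the result to Emit produces for (; i < 10; ) { i++; }, so the emitter never needs a While case of its own.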

What does ‘Lowering’ look like?

The C# compiler has used lowering for a while. One of the oldest and most recognisable examples is when this code:

using System.Collections.Generic;
public class C {
    public IEnumerable<int> M() 
    {
        foreach (var value in new [] { 1, 2, 3, 4, 5 })
        {
            yield return value;
        }
    }
}

is turned into this:

public class C
{
    [CompilerGenerated]
    private sealed class <M>d__0 : IEnumerable<int>, IEnumerable, IEnumerator<int>, IDisposable, IEnumerator
    {
        private int <>1__state;
        private int <>2__current;
        private int <>l__initialThreadId;
        public C <>4__this;
        private int[] <>s__1;
        private int <>s__2;
        private int <value>5__3;
        int IEnumerator<int>.Current
        {
            [DebuggerHidden]
            get
            {
                return this.<>2__current;
            }
        }
        object IEnumerator.Current
        {
            [DebuggerHidden]
            get
            {
                return this.<>2__current;
            }
        }
        [DebuggerHidden]
        public <M>d__0(int <>1__state)
        {
            this.<>1__state = <>1__state;
            this.<>l__initialThreadId = Environment.CurrentManagedThreadId;
        }
        [DebuggerHidden]
        void IDisposable.Dispose()
        {
        }
        bool IEnumerator.MoveNext()
        {
            int num = this.<>1__state;
            if (num != 0)
            {
                if (num != 1)
                {
                    return false;
                }
                this.<>1__state = -1;
                this.<>s__2++;
            }
            else
            {
                this.<>1__state = -1;
                this.<>s__1 = new int[] { 1, 2, 3, 4, 5 };
                this.<>s__2 = 0;
            }
            if (this.<>s__2 >= this.<>s__1.Length)
            {
                this.<>s__1 = null;
                return false;
            }
            this.<value>5__3 = this.<>s__1[this.<>s__2];
            this.<>2__current = this.<value>5__3;
            this.<>1__state = 1;
            return true;
        }
        [DebuggerHidden]
        void IEnumerator.Reset()
        {
            throw new NotSupportedException();
        }
        [DebuggerHidden]
        IEnumerator<int> IEnumerable<int>.GetEnumerator()
        {
            C.<M>d__0 <M>d__;
            if (this.<>1__state == -2 && this.<>l__initialThreadId == Environment.CurrentManagedThreadId)
            {
                this.<>1__state = 0;
                <M>d__ = this;
            }
            else
            {
                <M>d__ = new C.<M>d__0(0);
                <M>d__.<>4__this = this.<>4__this;
            }
            return <M>d__;
        }
        [DebuggerHidden]
        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.System.Collections.Generic.IEnumerable<System.Int32>.GetEnumerator();
        }
    }
    [IteratorStateMachine(typeof(C.<M>d__0))]
    public IEnumerable<int> M()
    {
        C.<M>d__0 expr_07 = new C.<M>d__0(-2);
        expr_07.<>4__this = this;
        return expr_07;
    }
}

Yikes, I’m glad we don’t have to write that code ourselves!! There’s an entire state-machine in there, built to allow our original code to be halted/resumed each time round the loop (at the ‘yield’ statement).
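
For comparison, here is a much simpler hand-written enumerator that follows the same idea: a state field records where we got to, so each MoveNext() call can resume from there. It is only a sketch under my own naming (ArrayStateMachine, Drain); it skips the thread-id check and the IEnumerable half of the generated class:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public sealed class ArrayStateMachine : IEnumerator<int>
{
    private readonly int[] _source;
    private int _state;   // 0 = not started, 1 = suspended after a 'yield', -1 = finished
    private int _index;

    public ArrayStateMachine(int[] source) { _source = source; }

    public int Current { get; private set; }
    object IEnumerator.Current => Current;

    public bool MoveNext()
    {
        switch (_state)
        {
            case 0:            // first call: start the loop
                _index = 0;
                _state = 1;
                break;
            case 1:            // resuming after a 'yield return': advance the loop
                _index++;
                break;
            default:           // already finished
                return false;
        }
        if (_index >= _source.Length)
        {
            _state = -1;
            return false;
        }
        Current = _source[_index];   // the value handed back by 'yield return'
        return true;
    }

    public void Reset() => throw new NotSupportedException();
    public void Dispose() { }

    // Helper that pulls every value out of the state machine.
    public static List<int> Drain(int[] source)
    {
        var results = new List<int>();
        using (var machine = new ArrayStateMachine(source))
        {
            while (machine.MoveNext()) results.Add(machine.Current);
        }
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Drain(new[] { 1, 2, 3, 4, 5 }))); // prints "1,2,3,4,5"
    }
}
```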

The C# compiler and ‘Lowering’

But it turns out that the Roslyn compiler does a lot more ‘lowering’ than you might think. If you take a look at the code under ‘/src/Compilers/CSharp/Portable/Lowering’ (VB.NET equivalent here), you see the following folders:

These folders correspond to some C# language features you might be familiar with, such as ‘lambdas’, i.e. x => x.Name > 5, ‘iterators’ used by yield (above) and the async keyword.

However if we look a bit deeper, under the ‘LocalRewriter’ folder we can see lots more scenarios that we might never have considered ‘lowering’, such as:

So a big thank-you is due to all the past and present C# language developers and designers, who did all this work for us. Imagine if C# didn’t have all these high-level features: we’d be stuck writing them by hand.

It would be like writing Java :-)

What happens when you misuse it

But of course the real fun part is ‘misusing’ or outright ‘abusing’ the compiler. So I set up a little Twitter competition to see just how much ‘lowering’ we could get the compiler to do for us (i.e. the highest ratio of ‘output’ lines of code to ‘input’ lines).

It had the following rules (see this gist for more info):

You can have as many lines as you want within method M()
No single line can be longer than 100 chars
To get your score, divide the ‘# of expanded lines’ by the ‘# of original line(s)’
1. Based on the default output formatting of https://sharplab.io, no re-formatting allowed!!
2. But you can format the input however you want, i.e. make use of the full 100 chars
Must compile with no warnings on https://sharplab.io (allows C# 7 features)
1. But doesn’t have to do anything sensible when run
You cannot modify the code that is already there, i.e. public class C {} and public void M()
1. Cannot just add async to public void M(), that’s too easy!!
You can add new using ... declarations, these do not count towards the line count

For instance with the following code (interactive version available on sharplab.io):

using System;
public class C {
    public void M() {
        Func<string> test = () => "blah"?.ToString();
    }
}

This counts as 1 line of original code (only code inside method M() is counted)

This expands to 23 lines (again, only lines of code inside the braces ({, }) of class C are counted).

Giving a total score of 23 (23 / 1)

....
public class C
{
    [CompilerGenerated]
    [Serializable]
    private sealed class <>c
    {
        public static readonly C.<>c <>9;
        public static Func<string> <>9__0_0;
        static <>c()
        {
            // Note: this type is marked as 'beforefieldinit'.
            C.<>c.<>9 = new C.<>c();
        }
        internal string <M>b__0_0()
        {
            return "blah";
        }
    }
    public void M()
    {
        if (C.<>c.<>9__0_0 == null)
        {
            C.<>c.<>9__0_0 = new Func<string>(C.<>c.<>9.<M>b__0_0);
        }
    }
}
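
The generated names above (<>c, <>9__0_0 and so on) aren’t legal in hand-written C#, so here is a rough equivalent using ordinary identifiers. The names LoweredLambdaSketch, Closure, Instance, CachedDelegate and Body are my own, and unlike the original M() this version returns the delegate so the caching is observable:

```csharp
using System;

public class LoweredLambdaSketch
{
    // Plays the role of the compiler-generated '<>c' class.
    private sealed class Closure
    {
        public static readonly Closure Instance = new Closure();  // like '<>9'
        public static Func<string> CachedDelegate;                 // like '<>9__0_0'

        internal string Body() => "blah";                          // like '<M>b__0_0'
    }

    public Func<string> M()
    {
        // Allocate the delegate once, then reuse it on every later call.
        if (Closure.CachedDelegate == null)
        {
            Closure.CachedDelegate = new Func<string>(Closure.Instance.Body);
        }
        return Closure.CachedDelegate;
    }

    public static void Main()
    {
        var c = new LoweredLambdaSketch();
        Console.WriteLine(c.M()());                        // prints "blah"
        Console.WriteLine(ReferenceEquals(c.M(), c.M()));  // prints "True"
    }
}
```

The point of the cached static field is that a lambda which captures nothing only ever needs one delegate instance, however many times M() runs.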

Results

First place went to the following entry from Schabse Laks, which contains 9 lines of code inside the M() method:

using System.Linq;
using Y = System.Collections.Generic.IEnumerable<dynamic>;

public class C {
    public void M() {
((Y)null).Select(async x => await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await x.x()());
    }
}

This expands to an impressive 7964 lines of code (yep you read that right!!) for a score of 885 (7964 / 9). The main trick he figured out was that adding more lines to the input increased the score, i.e. it scales superlinearly. Although if you take things too far the compiler bails out with a pretty impressive error message:

error CS8078: An expression is too long or complex to compile

Here are the Top 6 results:

Submitter	Entry	Score
Schabse Laks	link	885 (7964 / 9)
Andrey Dyatlov	link	778 (778 / 1)
alrz	link	755 (755 / 1)
Andy Gocke *	link	633 (633 / 1)
Jared Parsons *	link	461 (461 / 1)
Jonathan Chambers	link	384 (384 / 1)

* = member of the Roslyn compiler team (they’re not disqualified, but maybe they should have some kind of handicap applied to ‘even out’ the playing field?)

Honourable mentions

However there were some other entries that whilst they didn’t make it into the Top 6, are still worth a mention due to the ingenuity involved:

Uncovering a compiler bug, kudos to @a_tessenr
- GitHub bug report and fix in the compiler that was done within a few hours!!
Hitting an internal compiler limit, nice work by @Schabse
The most elegant attempt featuring a Y combinator by @NickPalladinos
Using VB.NET (hint: it didn’t end well!!), but still a valiant attempt by @AdamSpeight2008
The most aesthetically pleasing entry by @leppie

Discuss this post on HackerNews, /r/programming or /r/csharp (whichever takes your fancy!!)

The post Lowering in the C# Compiler (and what happens when you misuse it) first appeared on my blog Performance is a Feature!

Adding a new Bytecode Instruction to the CLR

2017-05-19T00:00:00+00:00

Now that the CoreCLR is open-source we can do fun things, for instance find out if it’s possible to add a new IL (Intermediate Language) instruction to the runtime.

TL;DR it turns out that it’s easier than you might think!! Here are the steps you need to go through:

Step 0 - Introduction and Background
Step 1 - Add the new IL instruction to the runtime
Step 2 - Make the Interpreter work
Step 3 - Ensure the JIT can recognise the new op-code
Step 4 - Runtime code generation via Reflection.Emit
Step 5 - Future Improvements

Update: turns out that I wasn’t the only person to have this idea, see Beachhead implements new opcode on CLR JIT for another implementation by Kouji Matsui.

Step 0

But first a bit of background information. Adding a new IL instruction to the CLR is a pretty rare event; the last time it was done for real was in .NET 2.0, when support for generics was added. This is part of the reason why .NET code has had good backwards-compatibility, from Backward compatibility and the .NET Framework 4.5:

The .NET Framework 4.5 and its point releases (4.5.1, 4.5.2, 4.6, 4.6.1, 4.6.2, and 4.7) are backward-compatible with apps that were built with earlier versions of the .NET Framework. In other words, apps and components built with previous versions will work without modification on the .NET Framework 4.5.

Side note: The .NET framework did break backwards compatibility when moving from 1.0 to 2.0, precisely so that support for generics could be added deep into the runtime, i.e. with support in the IL. Java took a different decision, I guess because it had been around longer, so breaking backwards-compatibility was a bigger issue. See the excellent blog post Comparing Java and C# Generics for more info.

Step 1

For this exercise I plan to add a new IL instruction (op-code) to the CoreCLR runtime and because I’m a raving narcissist (not really, see below) I’m going to name it after myself. So let me introduce the matt IL instruction, that you can use like so:

.method private hidebysig static int32 TestMattOpCodeMethod(int32 x, int32 y) 
        cil managed noinlining
{
    .maxstack 2
    ldarg.0
    ldarg.1
    matt  // yay, my name as an IL op-code!!!!
    ret
}

But because I’m actually a bit British (i.e. I don’t like to ‘blow my own trumpet’), I’m going to make the matt op-code almost completely pointless: it’s going to do exactly the same thing as calling Math.Max(x, y), i.e. just return the larger of the 2 numbers.

The other reason for naming it matt is that I’d really like someone to make a version of the C# (Roslyn) compiler that allows you to write code like this:

Console.WriteLine("{0} m@ {1} = {2}", 1, 7, 1 m@ 7); // prints '1 m@ 7 = 7'

I definitely want the m@ operator to be a thing (pronounced ‘matt’, not ‘m-at’), maybe the other ‘Matt Warren’ who works at Microsoft on the C# Language Design Team can help out!! Seriously though, if anyone reading this would like to write a similar blog post, showing how you’d add the m@ operator to the Roslyn compiler, please let me know, I’d love to read it.

Update: Thanks to Marcin Juraszek (@mmjuraszek) you can now use the m@ in a C# program, see Adding Matt operator to Roslyn - Syntax, Lexer and Parser, Adding Matt operator to Roslyn - Binder and Adding Matt operator to Roslyn - Emitter for the full details.

Now we’ve defined the op-code, the first step is to ensure that the run-time and tooling can recognise it. In particular we need the IL Assembler (a.k.a ilasm) to be able to take the IL code above (TestMattOpCodeMethod(..)) and produce a .NET executable.

As the .NET runtime source code is nicely structured (+1 to the runtime devs), to make this possible we only need to make changes in opcode.def:

--- a/src/inc/opcode.def
+++ b/src/inc/opcode.def
@@ -154,7 +154,7 @@ OPDEF(CEE_NEWOBJ,                     "newobj",           VarPop,             Pu
 OPDEF(CEE_CASTCLASS,                  "castclass",        PopRef,             PushRef,     InlineType,         IObjModel,   1,  0xFF,    0x74,    NEXT)
 OPDEF(CEE_ISINST,                     "isinst",           PopRef,             PushI,       InlineType,         IObjModel,   1,  0xFF,    0x75,    NEXT)
 OPDEF(CEE_CONV_R_UN,                  "conv.r.un",        Pop1,               PushR8,      InlineNone,         IPrimitive,  1,  0xFF,    0x76,    NEXT)
-OPDEF(CEE_UNUSED58,                   "unused",           Pop0,               Push0,       InlineNone,         IPrimitive,  1,  0xFF,    0x77,    NEXT)
+OPDEF(CEE_MATT,                       "matt",             Pop1+Pop1,          Push1,       InlineNone,         IPrimitive,  1,  0xFF,    0x77,    NEXT)
 OPDEF(CEE_UNUSED1,                    "unused",           Pop0,               Push0,       InlineNone,         IPrimitive,  1,  0xFF,    0x78,    NEXT)
 OPDEF(CEE_UNBOX,                      "unbox",            PopRef,             PushI,       InlineType,         IPrimitive,  1,  0xFF,    0x79,    NEXT)
 OPDEF(CEE_THROW,                      "throw",            PopRef,             Push0,       InlineNone,         IObjModel,   1,  0xFF,    0x7A,    THROW)

I just picked the first available unused slot and added matt in there. It’s defined as Pop1+Pop1 because it takes 2 values from the stack as input and Push1 because after it has executed, a single result is pushed back onto the stack.
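
To see what ‘Pop1+Pop1, Push1’ means in practice, here is a toy stack machine written in C#. It is purely an illustration of the stack behaviour and has nothing to do with the real CLR implementation; the Op enum and Run method are names I made up:

```csharp
using System;
using System.Collections.Generic;

public static class StackMachineSketch
{
    public enum Op { LdArg0, LdArg1, Matt, Ret }

    public static int Run(Op[] program, int arg0, int arg1)
    {
        var stack = new Stack<int>();
        foreach (var op in program)
        {
            switch (op)
            {
                case Op.LdArg0: stack.Push(arg0); break;
                case Op.LdArg1: stack.Push(arg1); break;
                case Op.Matt:
                    int right = stack.Pop();                    // Pop1
                    int left = stack.Pop();                     // Pop1
                    stack.Push(left > right ? left : right);    // Push1
                    break;
                case Op.Ret:
                    return stack.Pop();
            }
        }
        throw new InvalidOperationException("Program ended without 'ret'.");
    }

    public static void Main()
    {
        // Same shape as TestMattOpCodeMethod: ldarg.0, ldarg.1, matt, ret
        var program = new[] { Op.LdArg0, Op.LdArg1, Op.Matt, Op.Ret };
        Console.WriteLine(Run(program, 1, 7));   // prints "7"
        Console.WriteLine(Run(program, 12, 9));  // prints "12"
    }
}
```

Two values go in, one comes out, so the stack never needs to hold more than 2 items, which matches the .maxstack 2 in the IL method.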

Note: all the changes I made are available in one-place on GitHub if you’d rather look at them like that.

Once this change was done, ilasm will successfully assemble the test code file HelloWorld.il that contains TestMattOpCodeMethod(..) as shown above:

λ ilasm /EXE /OUTPUT=HelloWorld.exe -NOLOGO HelloWorld.il

Assembling 'HelloWorld.il'  to EXE --> 'HelloWorld.exe'
Source file is ANSI

Assembled method HelloWorld::Main
Assembled method HelloWorld::TestMattOpCodeMethod

Creating PE file

Emitting classes:
Class 1:        HelloWorld

Emitting fields and methods:
Global
Class 1 Methods: 2;
Resolving local member refs: 1 -> 1 defs, 0 refs, 0 unresolved

Emitting events and properties:
Global
Class 1
Resolving local member refs: 0 -> 0 defs, 0 refs, 0 unresolved
Writing PE file
Operation completed successfully

Step 2

However at this point the matt op-code isn’t actually executed; at runtime the CoreCLR just throws an exception because it doesn’t know what to do with it. As a first (simpler) step, I just wanted to make the .NET Interpreter work, so I made the following changes to wire it up:

--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -2726,6 +2726,9 @@ void Interpreter::ExecuteMethod(ARG_SLOT* retVal, __out bool* pDoJmpCall, __out
         case CEE_REM_UN:
             BinaryIntOp<BIO_RemUn>();
             break;
+        case CEE_MATT:
+            BinaryArithOp<BA_Matt>();
+            break;
         case CEE_AND:
             BinaryIntOp<BIO_And>();
             break;

--- a/src/vm/interpreter.hpp
+++ b/src/vm/interpreter.hpp
@@ -298,10 +298,14 @@ void Interpreter::BinaryArithOpWork(T val1, T val2)
         {
             res = val1 / val2;
         }
-        else 
+        else if (op == BA_Rem)
         {
             res = RemFunc(val1, val2);
         }
+        else if (op == BA_Matt)
+        {
+            res = MattFunc(val1, val2);
+        }
     }

and then I added the methods that would actually implement the interpreted code:

--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -10801,6 +10804,26 @@ double Interpreter::RemFunc(double v1, double v2)
     return fmod(v1, v2);
 }
 
+INT32 Interpreter::MattFunc(INT32 v1, INT32 v2)
+{
+	return v1 > v2 ? v1 : v2;
+}
+
+INT64 Interpreter::MattFunc(INT64 v1, INT64 v2)
+{
+	return v1 > v2 ? v1 : v2;
+}
+
+float Interpreter::MattFunc(float v1, float v2)
+{
+	return v1 > v2 ? v1 : v2;
+}
+
+double Interpreter::MattFunc(double v1, double v2)
+{
+	return v1 > v2 ? v1 : v2;
+}

So fairly straight-forward and the bonus is that at this point the matt operator is fully operational, you can actually write IL using it and it will run (interpreted only).
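
One detail worth knowing: for the integer overloads `v1 > v2 ? v1 : v2` matches Math.Max, but for float and double the two differ when NaN is involved, because any comparison with NaN is false whereas Math.Max is documented to return NaN. The sketch below mirrors the interpreter’s expression in C# (the Matt helper is my own stand-in, not part of the runtime):

```csharp
using System;

public static class MattVersusMathMax
{
    // Mirrors the interpreter's 'v1 > v2 ? v1 : v2' for doubles.
    public static double Matt(double v1, double v2) => v1 > v2 ? v1 : v2;

    public static void Main()
    {
        // Ordinary values: both agree.
        Console.WriteLine(Matt(1.0, 7.0) == Math.Max(1.0, 7.0));        // prints "True"

        // NaN as the first argument: Math.Max returns NaN...
        Console.WriteLine(double.IsNaN(Math.Max(double.NaN, 1.0)));     // prints "True"

        // ...but 'NaN > 1.0' is false, so the ternary returns 1.0.
        Console.WriteLine(double.IsNaN(Matt(double.NaN, 1.0)));         // prints "False"
    }
}
```

For a joke op-code that hardly matters, but it’s the kind of edge case a real instruction would have to specify.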

Step 3

However not everyone wants to re-compile the CoreCLR just to enable the Interpreter, so I want to also make it work for real via the Just-in-Time (JIT) compiler.

The full changes to make this work were spread across multiple files, but were mostly housekeeping so I won’t include them all here, check-out the full diff if you’re interested. But the significant parts are below:

--- a/src/jit/importer.cpp
+++ b/src/jit/importer.cpp
@@ -11112,6 +11112,10 @@ void Compiler::impImportBlockCode(BasicBlock* block)
                 oper = GT_UMOD;
                 goto MATH_MAYBE_CALL_NO_OVF;
 
+            case CEE_MATT:
+                oper = GT_MATT;
+                goto MATH_MAYBE_CALL_NO_OVF;
+
             MATH_MAYBE_CALL_NO_OVF:
                 ovfl = false;
             MATH_MAYBE_CALL_OVF:

--- a/src/vm/jithelpers.cpp
+++ b/src/vm/jithelpers.cpp
@@ -341,6 +341,14 @@ HCIMPL2(UINT32, JIT_UMod, UINT32 dividend, UINT32 divisor)
 HCIMPLEND
 
 /*********************************************************************/
+HCIMPL2(INT32, JIT_Matt, INT32 x, INT32 y)
+{
+    FCALL_CONTRACT;
+    return x > y ? x : y;
+}
+HCIMPLEND
+
+/*********************************************************************/
 HCIMPL2_VV(INT64, JIT_LDiv, INT64 dividend, INT64 divisor)
 {
     FCALL_CONTRACT;

In summary, these changes mean that during the JIT’s ‘Morph phase’ the IL containing the matt op code is converted from:

fgMorphTree BB01, stmt 1 (before)
       [000004] ------------             ▌  return    int   
       [000002] ------------             │  ┌──▌  lclVar    int    V01 arg1        
       [000003] ------------             └──▌  m@        int   
       [000001] ------------                └──▌  lclVar    int    V00 arg0               

into this:

fgMorphTree BB01, stmt 1 (after)
       [000004] --C--+------             ▌  return    int   
       [000003] --C--+------             └──▌  call help int    HELPER.CORINFO_HELP_MATT
       [000001] -----+------ arg0 in rcx    ├──▌  lclVar    int    V00 arg0         
       [000002] -----+------ arg1 in rdx    └──▌  lclVar    int    V01 arg1                 

Note the call to HELPER.CORINFO_HELP_MATT

When this is finally compiled into assembly code it ends up looking like so:

// Assembly listing for method HelloWorld:TestMattOpCodeMethod(int,int):int             
// Emitting BLENDED_CODE for X64 CPU with AVX                                           
// optimized code                                                                       
// rsp based frame                                                                      
// partially interruptible                                                              
// Final local variable assignments                                                     
//                                                                                      
//  V00 arg0         [V00,T00] (  3,  3   )     int  ->  rcx                            
//  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx                            
//  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+0x00]                     
//                                                                                      
// Lcl frame size = 40                                    
                                                                                       
G_M9261_IG01:                                                                          
       4883EC28             sub      rsp, 40                                           
                                                                                       
G_M9261_IG02:                                                                          
       E8976FEB5E           call     CORINFO_HELP_MATT                                 
       90                   nop                                                        
                                                                                       
G_M9261_IG03:                                                                          
       4883C428             add      rsp, 40                                           
       C3                   ret                                                        

I’m not entirely sure why there is a nop instruction in there? But it works, which is the main thing!!

Step 4

In the CLR you can also dynamically emit code at runtime using the methods that sit under the ‘System.Reflection.Emit’ namespace, so the last task is to add the OpCodes.Matt field and have it emit the correct values for the matt op-code.

--- a/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
+++ b/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
@@ -139,6 +139,7 @@ internal enum OpCodeValues
         Castclass = 0x74,
         Isinst = 0x75,
         Conv_R_Un = 0x76,
+        Matt = 0x77,
         Unbox = 0x79,
         Throw = 0x7a,
         Ldfld = 0x7b,
@@ -1450,6 +1451,16 @@ private OpCodes()
             (0 << OpCode.StackChangeShift)
         );
 
+        public static readonly OpCode Matt = new OpCode(OpCodeValues.Matt,
+            ((int)OperandType.InlineNone) |
+            ((int)FlowControl.Next << OpCode.FlowControlShift) |
+            ((int)OpCodeType.Primitive << OpCode.OpCodeTypeShift) |
+            ((int)StackBehaviour.Pop1_pop1 << OpCode.StackBehaviourPopShift) |
+            ((int)StackBehaviour.Push1 << OpCode.StackBehaviourPushShift) |
+            (1 << OpCode.SizeShift) |
+            (-1 << OpCode.StackChangeShift)
+        );
+
         public static readonly OpCode Unbox = new OpCode(OpCodeValues.Unbox,
             ((int)OperandType.InlineType) |
             ((int)FlowControl.Next << OpCode.FlowControlShift) |

This lets us write the code shown below, which emits, compiles and then executes the matt op-code:

DynamicMethod method = new DynamicMethod(
		"TestMattOpCode", 
		returnType: typeof(int),
		parameterTypes: new [] { typeof(int), typeof(int) }, 
		m: typeof(TestClass).Module);

// Emit the IL
var generator = method.GetILGenerator();
generator.Emit(OpCodes.Ldarg_0);
generator.Emit(OpCodes.Ldarg_1);
generator.Emit(OpCodes.Matt); // Use the new 'matt' IL OpCode
generator.Emit(OpCodes.Ret);

// Compile the IL into a delegate (uses the JITter under-the-hood)
var mattOpCodeInvoker = 
    (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));

// prints "1 m@ 7 = 7"
Console.WriteLine("{0} m@ {1} = {2} (via IL Emit)", 1, 7, mattOpCodeInvoker(1, 7));
   
// prints "12 m@ 9 = 12"
Console.WriteLine("{0} m@ {1} = {2} (via IL Emit)", 12, 9, mattOpCodeInvoker(12, 9)); 
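
The flags packed into the OpCode constructor in the diff above are exposed again as public properties on System.Reflection.Emit.OpCode. OpCodes.Matt only exists on the modified runtime, so this sketch inspects the built-in OpCodes.Add instead (my choice, because it has the same ‘pop 2, push 1’ shape) along with OpCodes.Unbox, which sits next to matt in the table:

```csharp
using System;
using System.Reflection.Emit;

public static class OpCodeInfoSketch
{
    // Formats the metadata that the OpCode constructor packs into bit flags.
    public static string Describe(OpCode op) =>
        $"{op.Name}: value=0x{op.Value:X2}, size={op.Size}, " +
        $"pop={op.StackBehaviourPop}, push={op.StackBehaviourPush}, operand={op.OperandType}";

    public static void Main()
    {
        Console.WriteLine(Describe(OpCodes.Add));
        Console.WriteLine(Describe(OpCodes.Unbox));
    }
}
```

On a runtime built with the changes from this post, passing OpCodes.Matt to Describe should report the 0x77 value and the Pop1_pop1 / Push1 behaviour defined in the diff.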

Step 5

Finally, you may have noticed that I cheated a little bit in Step 3 when I made changes to the JIT. Even though what I did works, it is not the most efficient way due to the extra method call to CORINFO_HELP_MATT. Also the JIT generally doesn’t use helper functions in this way, instead preferring to emit assembly code directly.

As a future exercise for anyone who has read this far (any takers?), it would be nice if the JIT emitted more efficient code. For instance if you write C# code like this (which does the same thing as the matt op-code):

private static int MaxMethod(int x, int y)
{
    return x > y ? x : y;
}

It’s turned into the following IL by the C# compiler:

IL to import:
IL_0000  02                ldarg.0     
IL_0001  03                ldarg.1     
IL_0002  30 02             bgt.s        2 (IL_0006)
IL_0004  03                ldarg.1     
IL_0005  2a                ret         
IL_0006  02                ldarg.0     
IL_0007  2a                ret         

Then when the JIT runs it’s processed as 3 basic-blocks (BB01, BB02 and BB03):

Importing BB01 (PC=000) of 'TestNamespace.TestClass:MaxMethod(int,int):int'
    [ 0]   0 (0x000) ldarg.0
    [ 1]   1 (0x001) ldarg.1
    [ 2]   2 (0x002) bgt.s
           [000005] ------------             ▌  stmtExpr  void  (IL 0x000...  ???)
           [000004] ------------             └──▌  jmpTrue   void  
           [000002] ------------                │  ┌──▌  lclVar    int    V01 arg1         
           [000003] ------------                └──▌  >         int   
           [000001] ------------                   └──▌  lclVar    int    V00 arg0         

Importing BB03 (PC=006) of 'TestNamespace.TestClass:MaxMethod(int,int):int'
    [ 0]   6 (0x006) ldarg.0
    [ 1]   7 (0x007) ret
           [000009] ------------             ▌  stmtExpr  void  (IL 0x006...  ???)
           [000008] ------------             └──▌  return    int   
           [000007] ------------                └──▌  lclVar    int    V00 arg0         

Importing BB02 (PC=004) of 'TestNamespace.TestClass:MaxMethod(int,int):int'
    [ 0]   4 (0x004) ldarg.1
    [ 1]   5 (0x005) ret
           [000013] ------------             ▌  stmtExpr  void  (IL 0x004...  ???)
           [000012] ------------             └──▌  return    int   
           [000011] ------------                └──▌  lclVar    int    V01 arg1         

Before finally being turned into the following assembly code, which is way more efficient. It contains just a cmp, a jg and a couple of mov instructions, but crucially it’s all done in-line, it doesn’t need to call out to another method.

// Assembly listing for method TestNamespace.TestClass:MaxMethod(int,int):int
// Emitting BLENDED_CODE for X64 CPU with AVX
// optimized code
// rsp based frame
// partially interruptible
// Final local variable assignments
//
//   V00 arg0         [V00,T00] (  4,  3.50)     int  ->  rcx
//   V01 arg1         [V01,T01] (  4,  3.50)     int  ->  rdx
// # V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
//
// Lcl frame size = 0

G_M32709_IG01:

G_M32709_IG02:
       3BCA                 cmp      ecx, edx
       7F03                 jg       SHORT G_M32709_IG04
       8BC2                 mov      eax, edx

G_M32709_IG03:
       C3                   ret

G_M32709_IG04:
       8BC1                 mov      eax, ecx

G_M32709_IG05:
       C3                   ret
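
If you want to play with this on an unmodified runtime, the branch-based IL for MaxMethod can be emitted via Reflection.Emit using only existing op-codes. This is a sketch of that baseline, not part of the original changes; the class and method names (StockMaxEmitSketch, BuildMax) are my own:

```csharp
using System;
using System.Reflection.Emit;

public static class StockMaxEmitSketch
{
    public static Func<int, int, int> BuildMax()
    {
        var method = new DynamicMethod(
            "MaxViaStockOpCodes",
            typeof(int),
            new[] { typeof(int), typeof(int) });

        ILGenerator il = method.GetILGenerator();
        Label returnFirst = il.DefineLabel();

        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Bgt_S, returnFirst);  // if (arg0 > arg1) goto returnFirst
        il.Emit(OpCodes.Ldarg_1);             // otherwise return arg1
        il.Emit(OpCodes.Ret);
        il.MarkLabel(returnFirst);
        il.Emit(OpCodes.Ldarg_0);             // return arg0
        il.Emit(OpCodes.Ret);

        // Compile the IL into a delegate (uses the JIT under the hood)
        return (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));
    }

    public static void Main()
    {
        var max = BuildMax();
        Console.WriteLine("{0} max {1} = {2}", 1, 7, max(1, 7));     // prints "1 max 7 = 7"
        Console.WriteLine("{0} max {1} = {2}", 12, 9, max(12, 9));   // prints "12 max 9 = 12"
    }
}
```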

Disclaimer/Credit

I got the idea for doing this from the Appendix of the excellent book Shared Source CLI Essentials - Amazon, you can also download a copy of the 2nd edition if you don’t want to purchase the print one.

In Appendix B the authors of the book reproduced the work that Peter Drayton did to add an Exponentiation op-code to the SSCLI, which inspired this entire post, so thanks for that!!

Discuss this post on HackerNews and /r/programming

The post Adding a new Bytecode Instruction to the CLR first appeared on my blog Performance is a Feature!

Arrays and the CLR - a Very Special Relationship

2017-05-08T00:00:00+00:00

A while ago I wrote about the ‘special relationship’ that exists between Strings and the CLR. Well, it turns out that Arrays and the CLR have an even deeper one, the type of closeness where you hold hands on your first meeting.

Fundamental to the Common Language Runtime (CLR)

Arrays are such a fundamental part of the CLR that they are included in the ECMA specification, to make it clear that the runtime has to implement them:

In addition, there are several IL (Intermediate Language) instructions that specifically deal with arrays:

newarr <etype>
- Create a new array with elements of type etype.
ldelem.ref
- Load the element at index onto the top of the stack as an O. The type of the O is the same as the element type of the array pushed on the CIL stack.
stelem <typeTok>
- Replace array element at index with the value on the stack (also stelem.i, stelem.i1, stelem.i2, stelem.r4 etc)
ldlen
- Push the length (of type native unsigned int) of array on the stack.
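
As a rough guide to where these instructions come from, the C# below is annotated with the array-related IL that the compiler typically emits for each statement. The annotations are based on my understanding of the usual output, so check with a tool such as ILSpy or sharplab.io if you need the exact IL for your compiler version:

```csharp
using System;

public static class ArrayIlSketch
{
    public static int Demo()
    {
        string[] names = new string[3];      // newarr     (element type: string)
        names[0] = "a";                      // stelem.ref (store a reference at an index)
        string first = names[0];             // ldelem.ref (load a reference from an index)
        return names.Length + first.Length;  // ldlen      (for names.Length only)
    }

    public static void Main()
    {
        Console.WriteLine(Demo()); // prints "4" (array length 3 + string length 1)
    }
}
```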

This makes sense because arrays are the building blocks of so many other data types; you want them to be available, well defined and efficient in a modern high-level language like C#. Without arrays you can’t have lists, dictionaries, queues, stacks, trees, etc. They’re all built on top of arrays, which provide low-level access to contiguous pieces of memory in a type-safe way.

Memory and Type Safety

This memory and type-safety is important because without it .NET couldn’t be described as a ‘managed runtime’ and you’d be left having to deal with the types of issues you get when you are writing code in a more low-level language.

More specifically, the CLR provides the following protections when you are using arrays (from the section on Memory and Type Safety in the BOTR ‘Intro to the CLR’ page):

While a GC is necessary to ensure memory safety, it is not sufficient. The GC will not prevent the program from indexing off the end of an array or accessing a field off the end of an object (possible if you compute the field’s address using a base and offset computation). However, if we do prevent these cases, then we can indeed make it impossible for a programmer to create memory-unsafe programs.

While the common intermediate language (CIL) does have operators that can fetch and set arbitrary memory (and thus violate memory safety), it also has the following memory-safe operators and the CLR strongly encourages their use in most programming:

Field-fetch operators (LDFLD, STFLD, LDFLDA) that fetch (read), set and take the address of a field by name.

Array-fetch operators (LDELEM, STELEM, LDELEMA) that fetch, set and take the address of an array element by index. All arrays include a tag specifying their length. This facilitates an automatic bounds check before each access.

Also, from the section on Verifiable Code - Enforcing Memory and Type Safety in the same BOTR page:

In practice, the number of run-time checks needed is actually very small. They include the following operations:

Casting a pointer to a base type to be a pointer to a derived type (the opposite direction can be checked statically)

Array bounds checks (just as we saw for memory safety)

Assigning an element in an array of pointers to a new (pointer) value. This particular check is only required because CLR arrays have liberal casting rules (more on that later…)
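
Two of these run-time checks are easy to observe from C#. The sketch below (class and method names are mine) triggers the array bounds check and the element-type check that exists because of array covariance:

```csharp
using System;

public static class ArraySafetySketch
{
    // Bounds check: reading one past the end throws instead of touching other memory.
    public static string TryReadPastEnd()
    {
        int[] numbers = { 1, 2, 3 };
        int index = numbers.Length;
        try { return numbers[index].ToString(); }
        catch (IndexOutOfRangeException) { return "IndexOutOfRangeException"; }
    }

    // Element-type check: a string[] can be viewed as an object[] (the 'liberal
    // casting rules'), so every store has to be verified at run-time.
    public static string TryStoreWrongType()
    {
        object[] objects = new string[1];
        try { objects[0] = 42; return "stored"; }
        catch (ArrayTypeMismatchException) { return "ArrayTypeMismatchException"; }
    }

    public static void Main()
    {
        Console.WriteLine(TryReadPastEnd());     // prints "IndexOutOfRangeException"
        Console.WriteLine(TryStoreWrongType());  // prints "ArrayTypeMismatchException"
    }
}
```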

However you don’t get this protection for free, there’s a cost to pay:

Note that the need to do these checks places requirements on the runtime. In particular:

All memory in the GC heap must be tagged with its type (so the casting operator can be implemented). This type information must be available at runtime, and it must be rich enough to determine if casts are valid (e.g., the runtime needs to know the inheritance hierarchy). In fact, the first field in every object on the GC heap points to a runtime data structure that represents its type.

All arrays must also have their size (for bounds checking).

Arrays must have complete type information about their element type.

Implementation Details

It turns out that large parts of the internal implementation of arrays are best described as magic; this Stack Overflow comment from Marc Gravell sums it up nicely:

Arrays are basically voodoo. Because they pre-date generics, yet must allow on-the-fly type-creation (even in .NET 1.0), they are implemented using tricks, hacks, and sleight of hand.

Yep that’s right, arrays were parametrised (i.e. generic) before generics even existed. That means you could create arrays such as int[] and string[], long before you were able to write List<int> or List<string>, which only became possible in .NET 2.0.

Special helper classes

All this magic or sleight of hand is made possible by 2 things:

The CLR breaking all the usual type-safety rules
A special array helper class called SZArrayHelper

But first the why: why were all these tricks needed? From .NET Arrays, IList<T>, Generic Algorithms, and what about STL?:

When we were designing our generic collections classes, one of the things that bothered me was how to write a generic algorithm that would work on both arrays and collections. To drive generic programming, of course we must make arrays and generic collections as seamless as possible. It felt that there should be a simple solution to this problem that meant you shouldn’t have to write the same code twice, once taking an IList<T> and again taking a T[]. The solution that dawned on me was that arrays needed to implement our generic IList. We made arrays in V1 implement the non-generic IList, which was rather simple due to the lack of strong typing with IList and our base class for all arrays (System.Array). What we needed was to do the same thing in a strongly typed way for IList<T>.

But it was only done for the common case, i.e. ‘single dimensional’ arrays:

There were some restrictions here though – we didn’t want to support multidimensional arrays since IList<T> only provides single dimensional accesses. Also, arrays with non-zero lower bounds are rather strange, and probably wouldn’t mesh well with IList<T>, where most people may iterate from 0 to the return from the Count property on that IList. So, instead of making System.Array implement IList<T>, we made T[] implement IList<T>. Here, T[] means a single dimensional array with 0 as its lower bound (often called an SZArray internally, but I think Brad wanted to promote the term ‘vector’ publically at one point in time), and the element type is T. So Int32[] implements IList<Int32>, and String[] implements IList<String>.

Also, this comment from the array source code sheds some further light on the reasons:

//----------------------------------------------------------------------------------
// Calls to (IList<T>)(array).Meth are actually implemented by SZArrayHelper.Meth<T>
// This workaround exists for two reasons:
//
//    - For working set reasons, we don't want insert these methods in the array 
//      hierachy in the normal way.
//    - For platform and devtime reasons, we still want to use the C# compiler to 
//      generate the method bodies.
//
// (Though it's questionable whether any devtime was saved.)
//
// ....
//----------------------------------------------------------------------------------

So it was done for convenience and efficiency, as they didn’t want every instance of System.Array to carry around all the code for the IEnumerable<T> and IList<T> implementations.

This mapping takes place via a call to GetActualImplementationForArrayGenericIListOrIReadOnlyListMethod(..), which wins the prize for the best method name in the CoreCLR source!! It’s responsible for wiring up the corresponding method from the SZArrayHelper class, i.e. IList<T>.Count -> SZArrayHelper.Count<T>, or if the method is part of the IEnumerator<T> interface, the SZGenericArrayEnumerator<T> is used.

But this has the potential to cause security holes, as it breaks the normal C# type system guarantees, specifically regarding the this pointer. To illustrate the problem, here’s the source code of the Count property; note the call to JitHelpers.UnsafeCast<T[]>:

internal int get_Count<T>()
{
    //! Warning: "this" is an array, not an SZArrayHelper. See comments above
    //! or you may introduce a security hole!
    T[] _this = JitHelpers.UnsafeCast<T[]>(this);
    return _this.Length;
}

Yikes, it has to remap this to be able to call Length on the correct object!!

And just in case those comments aren’t enough, there is a very strongly worded comment at the top of the class that further spells out the risks!!

Generally all this magic is hidden from you, but occasionally it leaks out. For instance if you run the code below, SZArrayHelper will show up in the StackTrace and TargetSite properties of the NotSupportedException:

try {
    int[] someInts = { 1, 2, 3, 4 };
    IList<int> collection = someInts;
    // Throws NotSupportedException 'Collection is read-only'
    collection.Clear();
} catch (NotSupportedException nsEx) {
    Console.WriteLine("{0} - {1}", nsEx.TargetSite.DeclaringType, nsEx.TargetSite);
    Console.WriteLine(nsEx.StackTrace);
}

Removing Bounds Checks

The runtime also provides support for arrays in more conventional ways, the first of which is related to performance. Array bounds checks are all well and good when providing memory-safety, but they have a cost, so where possible the JIT removes any checks that it knows are redundant.

It does this by calculating the range of values that a for loop accesses and comparing those to the actual length of the array. If it determines that there is never an attempt to access an item outside the permissible bounds of the array, the run-time checks are then removed.
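
As a rough illustration of that reasoning, the C++ sketch below decides whether an access a[i] inside a simple counted loop of the form for (i = start; i < end; i++) could ever fall outside the array. The LoopRange struct and the NeedsBoundsCheck function are invented for this example; the real logic in the JIT handles far more cases than this.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a loop: for (i = start; i < end; i++) use(a[i]);
struct LoopRange {
    std::int64_t start; // first value of i
    std::int64_t end;   // exclusive upper limit of i
};

// Returns true if the run-time bounds check on a[i] has to stay.
// The check is redundant only when every value i can take lies in [0, arrayLength).
bool NeedsBoundsCheck(const LoopRange& loop, std::int64_t arrayLength) {
    assert(arrayLength >= 0);
    if (loop.start >= loop.end)
        return false; // the loop body never runs, so there is no access to guard
    std::int64_t lowest = loop.start;
    std::int64_t highest = loop.end - 1;
    bool alwaysInBounds = lowest >= 0 && highest < arrayLength;
    return !alwaysInBounds;
}
```

Under these assumptions a loop that runs from 0 up to the array length needs no check, whereas one that runs a single element further keeps it.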

For more information, the links below take you to the areas of the JIT source code that deal with this:

JIT trying to remove range checks
RangeCheck::OptimizeRangeCheck(..)
- In turn calls RangeCheck::GetRange(..)
- Also calls Compiler::optRemoveRangeCheck(..) to actually remove the range-check
Really informative source code comment explaining the range check removal logic

And if you are really keen, take a look at this gist that I put together to explore the scenarios where bounds checks are ‘removed’ and ‘not removed’.

Allocating an array

Another task that the runtime helps with is allocating arrays, using hand-written assembly code so the methods are as optimised as possible, see:

Run-time treats arrays differently

Finally, because arrays are so intertwined with the CLR, there are lots of places in which they are dealt with as a special-case. For instance a search for ‘IsArray()’ in the CoreCLR source returns over 60 hits, including:

The method table for an array is built differently
- MethodTableBuilder::BuildInteropVTableForArray(..)
When you call ToString() on an array, you get special formatting, i.e. ‘System.Int32[]’ or ‘MyClass[,]’
- TypeString::AppendType(..)

So yes, it’s fair to say that arrays and the CLR have a Very Special Relationship.

The CLR Thread Pool 'Thread Injection' Algorithm

2017-04-13T00:00:00+00:00

If you’re near London at the end of April, I’ll be speaking at ProgSCon 2017 on Microsoft and Open-Source – A ‘Brave New World’. ProgSCon is a 1-day conference, with talks covering an eclectic range of topics, so you’ll learn lots!!

As part of a never-ending quest to explore the CoreCLR source code I stumbled across the intriguingly titled ‘HillClimbing.cpp’ source file. This post explains what it does and why.

What is ‘Hill Climbing’

It turns out that ‘Hill Climbing’ is a general technique, from the Wikipedia page on the Hill Climbing Algorithm:

In computer science, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution by incrementally changing a single element of the solution. If the change produces a better solution, an incremental change is made to the new solution, repeating until no further improvements can be found.
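
To make that definition concrete, here is a minimal C++ sketch of generic hill climbing over a single integer setting. It is purely illustrative: the HillClimb function, its parameters and the idea of scoring each candidate with a callback are assumptions made for this example, and it is not the controller that the .NET thread pool uses.

```cpp
#include <cassert>
#include <functional>
#include <initializer_list>

// Generic hill climbing over one integer setting in [lowest, highest].
// 'score' is whatever we want to maximise; all names here are illustrative.
int HillClimb(int start, int lowest, int highest,
              const std::function<double(int)>& score) {
    assert(lowest <= start && start <= highest);
    int current = start;
    while (true) {
        int best = current;
        // Try the two neighbouring solutions: one step down and one step up.
        for (int candidate : {current - 1, current + 1}) {
            if (candidate < lowest || candidate > highest)
                continue;
            if (score(candidate) > score(best))
                best = candidate;
        }
        if (best == current)
            return current; // no neighbour is better, so this is a (local) maximum
        current = best;     // accept the improvement and repeat
    }
}
```

Each pass either moves to a strictly better neighbour or stops, so on a bounded range the loop always terminates, although it may settle on a local rather than a global maximum.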

But in the context of the CoreCLR, ‘Hill Climbing’ (HC) is used to control the rate at which threads are added to the Thread Pool, from the MSDN page on ‘Parallel Tasks’:

Thread Injection

The .NET thread pool automatically manages the number of worker threads in the pool. It adds and removes threads according to built-in heuristics. The .NET thread pool has two main mechanisms for injecting threads: a starvation-avoidance mechanism that adds worker threads if it sees no progress being made on queued items and a hill-climbing heuristic that tries to maximize throughput while using as few threads as possible. … A goal of the hill-climbing heuristic is to improve the utilization of cores when threads are blocked by I/O or other wait conditions that stall the processor …. The .NET thread pool has an opportunity to inject threads every time a work item completes or at 500 millisecond intervals, whichever is shorter. The thread pool uses this opportunity to try adding threads (or taking them away), guided by feedback from previous changes in the thread count. If adding threads seems to be helping throughput, the thread pool adds more; otherwise, it reduces the number of worker threads. This technique is called the hill-climbing heuristic.

For more specifics on what the algorithm is doing, you can read the research paper Optimizing Concurrency Levels in the .NET ThreadPool published by Microsoft, although if you want a brief outline of what it’s trying to achieve, this summary from the paper is helpful:

In addition the controller should have:

short settling times so that cumulative throughput is maximized

minimal oscillations since changing control settings incurs overheads that reduce throughput

fast adaptation to changes in workloads and resource characteristics.

So maximise throughput, don’t add and then remove threads too fast, but still adapt quickly to changing work-loads, simple really!!

As an aside, after reading (and re-reading) the research paper I found it interesting that a considerable amount of it was dedicated to testing, as the following excerpt shows:

In fact the approach to testing was considered so important that they wrote an entire follow-up paper that discusses it, see Configuring Resource Managers Using Model Fuzzing.

Why is it needed?

Because, in short, just adding new threads doesn’t always increase throughput and ultimately having lots of threads has a cost. As this comment from Eric Eilebrecht, one of the authors of the research paper, explains:

Throttling thread creation is not only about the cost of creating a thread; it’s mainly about the cost of having a large number of running threads on an ongoing basis. For example:

More threads means more context-switching, which adds CPU overhead. With a large number of threads, this can have a significant impact.

More threads means more active stacks, which impacts data locality. The more stacks a CPU is having to juggle in its various caches, the less effective those caches are.

The advantage of more threads than logical processors is, of course, that we can keep the CPU busy if some of the threads are blocked, and so get more work done. But we need to be careful not to “overreact” to blocking, and end up hurting performance by having too many threads.

Or in other words, from Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool:

As opposed to what may be intuitive, concurrency control is about throttling and reducing the number of work items that can be run in parallel in order to improve the worker ThreadPool throughput (that is, controlling the degree of concurrency is preventing work from running).

So the algorithm was designed with all these criteria in mind and was then tested over a large range of scenarios, to ensure it actually worked! This is why it’s often said that you should just leave the .NET ThreadPool alone and not try to tinker with it. It’s been heavily tested to work across multiple situations and it was designed to adapt over time, so it should have you covered! (although of course, there are times when it doesn’t work perfectly!!)

The Algorithm in Action

As the source is now available, we can actually play with the algorithm and try it out in a few scenarios to see what it does. It needs very few dependencies and therefore all the relevant code is contained in the following files:

(For comparison, there’s an implementation of the same algorithm in the Mono source code)

I have a project up on my GitHub page that allows you to test the hill-climbing algorithm in a self-contained console app. If you’re interested you can see the changes/hacks I had to do to get it building, although in the end it was pretty simple! (Update: kudos to Christian Klutz, who ported my self-contained app to C#, nice job!!)

The algorithm is controlled via the following HillClimbing_XXX settings:

Setting	Default Value	Notes
HillClimbing_WavePeriod	4
HillClimbing_TargetSignalToNoiseRatio	300
HillClimbing_ErrorSmoothingFactor	1
HillClimbing_WaveMagnitudeMultiplier	100
HillClimbing_MaxWaveMagnitude	20
HillClimbing_WaveHistorySize	8
HillClimbing_Bias	15	The ‘cost’ of a thread. 0 means drive for increased throughput regardless of thread count; higher values bias more against higher thread counts
HillClimbing_MaxChangePerSecond	4
HillClimbing_MaxChangePerSample	20
HillClimbing_MaxSampleErrorPercent	15
HillClimbing_SampleIntervalLow	10
HillClimbing_SampleIntervalHigh	200
HillClimbing_GainExponent	200	The exponent to apply to the gain, times 100. 100 means to use linear gain, higher values will enhance large moves and damp small ones

Because I was using the code in a self-contained console app, I just hard-coded the default values into the source, but in the CLR it appears that you can modify these values at runtime.

Working with the Hill Climbing code

There are several things I discovered when implementing a simple test app that works with the algorithm:

The calculation is triggered by calling the function HillClimbingInstance.Update(currentThreadCount, sampleDuration, numCompletions, &threadAdjustmentInterval) and the return value is the new ‘maximum thread count’ that the algorithm is proposing.
It calculates the desired number of threads based on the ‘current throughput’, which is the ‘# of tasks completed’ (numCompletions) during the current time-period (sampleDuration in seconds).
It also takes the current thread count (currentThreadCount) into consideration.
The core calculations (excluding error handling and house-keeping) are only just over 100 LOC, so it’s not too hard to follow.
It works on the basis of ‘transitions’ (HillClimbingStateTransition), first Warmup, then Stabilizing and will only recommend a new value once it’s moved into the ClimbingMove state.
The real .NET Thread Pool only increases the thread-count by one thread every 500 milliseconds. It keeps doing this until the ‘# of threads’ has reached the amount that the hill-climbing algorithm suggests. See ThreadpoolMgr::ShouldAdjustMaxWorkersActive() and ThreadpoolMgr::AdjustMaxWorkersActive() for the code that handles this.
If it hasn’t got enough samples to do a ‘statistically significant’ calculation this algorithm will indicate this via the threadAdjustmentInterval variable. This means that you should not call HillClimbingInstance.Update(..) until another threadAdjustmentInterval milliseconds have elapsed. (link to source code that calculates this)
The current thread count is only decreased when threads complete their current task. At that point the current count is compared to the desired amount and if necessary a thread is ‘retired’.
The algorithm only returns values that respect the limits specified by ThreadPool.SetMinThreads(..) and ThreadPool.SetMaxThreads(..) (link to the code that handles this)
In addition, it will only recommend increasing the thread count if the CPU Utilization is below 95%
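
A few of the points in that list can be sketched in C++: throughput is completions divided by the sample duration, a proposed increase is dropped when the CPU is at or above 95%, and the result always stays inside the min/max limits. This is only a sketch under those assumptions; PoolLimits, Throughput and ApplyGuards are hypothetical names and none of this is the code from HillClimbing.cpp.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical limits, standing in for ThreadPool.SetMinThreads/SetMaxThreads.
struct PoolLimits {
    int minThreads;
    int maxThreads;
};

// 'Current throughput': work items completed per second of sample time.
double Throughput(int numCompletions, double sampleDurationSeconds) {
    assert(sampleDurationSeconds > 0.0);
    return numCompletions / sampleDurationSeconds;
}

// Applies two guards to a proposed thread count:
//  - no increase while CPU utilisation is at or above 95%
//  - the result must respect the configured minimum and maximum
int ApplyGuards(int currentThreadCount, int proposedThreadCount,
                int cpuUtilizationPercent, const PoolLimits& limits) {
    int result = proposedThreadCount;
    if (result > currentThreadCount && cpuUtilizationPercent >= 95)
        result = currentThreadCount; // CPU is saturated, so don't add threads
    return std::min(std::max(result, limits.minThreads), limits.maxThreads);
}
```

Keeping the guards separate from the proposal itself means the interesting part (deciding which direction to move in) can be swapped out without touching the safety limits.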

First let’s look at the graphs that were published in the research paper from Microsoft (Optimizing Concurrency Levels in the .NET ThreadPool):

They clearly show the thread-pool adapting the number of threads (up and down) as the throughput changes, so it appears the algorithm is doing what it promises.

Now for a similar image using the self-contained test app I wrote. Now, my test app only pretends to add/remove threads based on the results for the Hill Climbing algorithm, so it’s only an approximation of the real behaviour, but it does provide a nice way to see it in action outside of the CLR.

In this simple scenario, the work-load that we are asking the thread-pool to do is just moving up and then down:

Finally, we’ll look at what the algorithm does in a more noisy scenario, here the current ‘work load’ randomly jumps around, rather than smoothly changing:

So with a combination of a very detailed MSDN article, an easy-to-read research paper and most significantly having the source code available, we are able to get an understanding of what the .NET Thread Pool is doing ‘under-the-hood’!

References

Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool (I recommend reading this article before reading the research papers)
Optimizing Concurrency Levels in the .NET ThreadPool: A case study of controller design and implementation
- direct link to PDF file
Configuring Resource Managers Using Model Fuzzing: A Case Study of the .NET Thread Pool
- direct link to PDF file
MSDN page on ‘Parallel Tasks’ (see section on ‘Thread Injection’)
Patent US20100083272 - Managing pools of dynamic resources

The .NET IL Interpreter

2017-03-30T00:00:00+00:00

Whilst writing a previous blog post I stumbled across the .NET Interpreter, tucked away in the source code. Although, if I’d made even the smallest amount of effort to look for it, I’d have easily found it via the GitHub ‘magic’ file search:

Usage Scenarios

Before we look at how to use it and what it does, it’s worth pointing out that the Interpreter is not really meant for production code. As far as I can tell, its main purpose is to allow you to get the CLR up and running on a new CPU architecture. Without the interpreter you wouldn’t be able to test any C# code until you had a fully functioning JIT that could emit machine code for you. For instance see ‘[ARM32/Linux] Initial bring up of FEATURE_INTERPRETER’ and ‘[aarch64] Enable the interpreter on linux as well.’

Also it lacks a few key features, most notably debugging support; that is, you can’t debug through C# code that has been interpreted, although you can of course debug the interpreter itself. From ‘Tiered Compilation step 1’:

…. - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do).

You can see an example of this in ‘Interpreter: volatile ldobj appears to have incorrect semantics?’ (thanks to alexrp for telling me about this issue). There are also a fair number of TODO comments in the code, although I haven’t verified what (if any) specific C# code breaks due to the missing functionality.

However, I think another really useful scenario for the Interpreter is to help you learn about the inner workings of the CLR. It’s only 8,000 lines long, it’s all in one file and, most significantly, it’s written in C++. By contrast, the code that the CLR/JIT uses when compiling for real is spread across many files (the JIT on its own is over 200,000 L.O.C, spread across 100’s of files) and there are large amounts of hand-written raw assembly.

In theory the Interpreter should work in the same way as the full runtime, albeit not as optimised. This means that it is much simpler and those of us who aren’t CLR and/or assembly experts can have a chance of working out what’s going on!

Enabling the Interpreter

The Interpreter is disabled by default, so you have to build the CoreCLR from source to make it work (it used to be the fallback for ARM64 but that’s no longer the case), here’s the diff of the changes you need to make:

--- a/src/inc/switches.h
+++ b/src/inc/switches.h
@@ -233,5 +233,8 @@
 #define FEATURE_STACK_SAMPLING
 #endif // defined (ALLOW_SXS_JIT)

+// Let's test the .NET Interpreter!!
+#define FEATURE_INTERPRETER
+
 #endif // !defined(CROSSGEN_COMPILE)

You also need to enable some environment variables, the ones that I used are in the table below. For the full list, take a look at Host Configuration Knobs and search for ‘Interpreter’.

Name	Description
Interpret	Selectively uses the interpreter to execute the specified methods
InterpreterDoLoopMethods	If set, don’t check for loops, start by interpreting all methods
InterpreterPrintPostMortem	Prints summary information about the execution to the console
DumpInterpreterStubs	Prints all interpreter stubs that are created to the console
TraceInterpreterEntries	Logs entries to interpreted methods to the console
TraceInterpreterIL	Logs individual instructions of interpreted methods to the console
TraceInterpreterVerbose	Logs interpreter progress with detailed messages to the console
TraceInterpreterJITTransition	Logs when the interpreter determines a method should be JITted

To test out the Interpreter, I will be using the code below:

public static void Main(string[] args)
{
    var max = 1000 * 1000;
    if (args.Length > 0)
        int.TryParse(args[0], out max);
    var timer = Stopwatch.StartNew();
    for (int i = 1; i <= max; i++)
    {
        if (i % (1000 * 100) == 0)
            Console.WriteLine(string.Format("Completed {0,10:N0} iterations", i));
    }
    timer.Stop();
    Console.WriteLine(string.Format("Performed {0:N0} iterations", max));
    Console.WriteLine(string.Format("Took {0:N0} msecs", timer.ElapsedMilliseconds));
    Console.WriteLine();
}

which on my machine, gives the following results for 100,000 iterations:

Run	Compiled (msecs)	Interpreted (msecs)
1	11	4,393
2	11	4,089
3	9	4,416

So yeah, you don’t want to be using the interpreter for any performance sensitive code!!

Diagnostic Output

In addition, a diagnostic output is produced. Note, this is from a single iteration of the loop, otherwise it becomes too verbose to read.

Generating interpretation stub (# 1 = 0x1, hash = 0x91b7d02e) for ConsoleApplication.Program:Main.
Skipping ConsoleApplication.Program:.cctor
Entering method #1 (= 0x1): ConsoleApplication.Program:Main(class).
 arguments:
         0:      class: 0x0000000002C50568 (System.String[]) [...]

START 1, ConsoleApplication.Program:Main(class)
     0: nop
   0x1: call
Skipping ConsoleApplication.Stopwatch:.cctor
Skipping DomainBoundILStubClass:IL_STUB_PInvoke
Skipping ConsoleApplication.Stopwatch:StartNew
Skipping ConsoleApplication.Stopwatch:.ctor
Skipping ConsoleApplication.Stopwatch:Reset
Skipping ConsoleApplication.Stopwatch:Start
Skipping ConsoleApplication.Stopwatch:GetTimestamp
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
   0x6: stloc.0
      loc0   :      class: 0x0000000002C50580 (ConsoleApplication.Stopwatch) [...]
      loc1   :        int: 0
      loc2   :       bool: false
   0x7: ldc.i4.1
   0x8: stloc.1
      loc0   :      class: 0x0000000002C50580 (ConsoleApplication.Stopwatch) [...]
      loc1   :        int: 1
      loc2   :       bool: false
   0x9: br.s
  0x27: ldloc.1
  0x28: ldc.i4.2
  0x29: clt
  0x2b: stloc.2
      loc0   :      class: 0x0000000002C50580 (ConsoleApplication.Stopwatch) [...]
      loc1   :        int: 1
      loc2   :       bool: true
  0x2c: ldloc.2
  0x2d: brtrue.s
   0xb: nop
   0xc: ldstr
  0x11: ldloc.1
  0x12: box
  0x17: call
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
  0x1c: call
Completed          1 iterations
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
  0x21: nop
  0x22: nop
  0x23: ldloc.1
  0x24: ldc.i4.1
  0x25: add
  0x26: stloc.1
      loc0   :      class: 0x0000000002C50580 (ConsoleApplication.Stopwatch) [...]
      loc1   :        int: 2
      loc2   :       bool: true
  0x27: ldloc.1
  0x28: ldc.i4.2
  0x29: clt
  0x2b: stloc.2
      loc0   :      class: 0x0000000002C50580 (ConsoleApplication.Stopwatch) [...]
      loc1   :        int: 2
      loc2   :       bool: false
  0x2c: ldloc.2
  0x2d: brtrue.s
  0x2f: ldloc.0
  0x30: callvirt
Skipping ConsoleApplication.Stopwatch:Stop
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
  0x35: nop
  0x36: ldstr
  0x3b: ldloc.0
  0x3c: callvirt
Skipping ConsoleApplication.Stopwatch:get_ElapsedMilliseconds
Skipping ConsoleApplication.Stopwatch:GetElapsedDateTimeTicks
Skipping ConsoleApplication.Stopwatch:GetRawElapsedTicks
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
  0x41: box
  0x46: call
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
  0x4b: call
Took 33 msecs
  Returning to method ConsoleApplication.Program:Main(class), stub num 1.
  0x50: nop
  0x51: ret

So you can clearly see the interpreter in action, executing the individual IL instructions and showing the current values of any local variables as it goes along. Then, once the entire program has run, you also get some nice summary statistics (this time from a full run, with 1,000,000 iterations):

IL instruction profiling:

Instructions (24000085 total, 20000083 1-byte):
Instruction  |   execs   |       % |   cum %
-------------------------------------------
     ldloc.1 |   3000011 |  12.50% |  12.50%
         ceq |   3000001 |  12.50% |  25.00%
    ldc.i4.0 |   3000001 |  12.50% |  37.50%
         nop |   2000013 |   8.33% |  45.83%
     stloc.2 |   2000001 |   8.33% |  54.17%
      ldc.i4 |   2000001 |   8.33% |  62.50%
    brtrue.s |   2000001 |   8.33% |  70.83%
     ldloc.2 |   2000001 |   8.33% |  79.17%
    ldc.i4.1 |   1000001 |   4.17% |  83.33%
         cgt |   1000001 |   4.17% |  87.50%
     stloc.1 |   1000001 |   4.17% |  91.67%
         rem |   1000000 |   4.17% |  95.83%
         add |   1000000 |   4.17% | 100.00%
        call |        23 |   0.00% | 100.00%
       ldstr |        11 |   0.00% | 100.00%
         box |        11 |   0.00% | 100.00%
     ldloc.0 |         2 |   0.00% | 100.00%
    callvirt |         2 |   0.00% | 100.00%
        br.s |         1 |   0.00% | 100.00%
     stloc.0 |         1 |   0.00% | 100.00%
         ret |         1 |   0.00% | 100.00%                                        

Main sections of the Interpreter code

Now we’ve seen it in action, let’s take a look at the code within the Interpreter and see how it works.

Top-level dispatcher

At the heart of the Interpreter is a giant switch statement (in Interpreter::ExecuteMethod(..)), that is almost 1,200 lines long! In it you’ll find lots of code like this:

switch (*m_ILCodePtr)
{
case CEE_NOP:
    m_ILCodePtr++;
    continue;
case CEE_BREAK:     // TODO: interact with the debugger?
    m_ILCodePtr++;
    continue;
case CEE_LDARG_0:
    LdArg(0);
    break;
case CEE_LDARG_1:
    LdArg(1);
    break;
    ...
}

In total, there are 199 case statements, corresponding to all the available CLR Intermediate Language (IL) op-codes, in all their different combinations, for instance CEE_LDC_??, i.e. CEE_LDC_I4, CEE_LDC_I8, CEE_LDC_R4 and CEE_LDC_R8. The large majority of the case statements just call out to another function that does the actual work, although there are some exceptions, such as CEE_RET.
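
To show the shape of that dispatch loop in isolation, here is a toy stack-based interpreter in C++ with the same switch-inside-a-loop structure. The five TOY_* op-codes and the ExecuteToyMethod function are invented for this sketch; they are not the real CEE_* op-codes and this is not code from the CoreCLR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// A made-up, tiny instruction set (the real op-codes are the CEE_* values).
enum ToyOp : std::uint8_t { TOY_NOP, TOY_LDC_I4_1, TOY_LDC_I4_2, TOY_ADD, TOY_RET };

// Executes 'code' and returns the value on top of the evaluation stack at TOY_RET.
std::int32_t ExecuteToyMethod(const std::vector<std::uint8_t>& code) {
    std::vector<std::int32_t> stack; // the evaluation stack
    std::size_t ip = 0;              // plays the role of m_ILCodePtr
    while (ip < code.size()) {
        switch (code[ip]) {
        case TOY_NOP:
            ip++;
            continue; // already advanced, go straight to the next instruction
        case TOY_LDC_I4_1:
            stack.push_back(1);
            break;
        case TOY_LDC_I4_2:
            stack.push_back(2);
            break;
        case TOY_ADD: {
            if (stack.size() < 2) throw std::runtime_error("stack underflow");
            std::int32_t right = stack.back(); stack.pop_back();
            std::int32_t left = stack.back(); stack.pop_back();
            stack.push_back(left + right);
            break;
        }
        case TOY_RET:
            if (stack.empty()) throw std::runtime_error("nothing to return");
            return stack.back();
        default:
            throw std::runtime_error("unknown op-code");
        }
        ip++; // cases that 'break' advance the instruction pointer here
    }
    throw std::runtime_error("fell off the end of the method");
}
```

Feeding it the sequence TOY_LDC_I4_1, TOY_LDC_I4_2, TOY_ADD, TOY_RET pushes 1 and 2, adds them and returns the result from the top of the stack.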

Method calls

The other task that takes up lots of code in the interpreter is handling method calls, over 2,500 L.O.C in total! This is spread across several methods, each doing a particular part of the work:

void Interpreter::DoCallWork(..)
- CALL Calls the method indicated by the passed method descriptor
- CALLVIRT Calls a late-bound method on an object, pushing the return value onto the evaluation stack.
- Also via Interpreter::NewObj(), i.e. the NEWOBJ IL op-code
void Interpreter::CallI()
- CALLI Calls the method indicated on the evaluation stack (as a pointer to an entry point) with arguments described by a calling convention
CorJitResult Interpreter::GenerateInterpreterStub(..)
- The external entry point, i.e. the JIT inserts a stub to this method
- Also called via Interpreter::InterpretMethodBody(..)
- Actually emits assembly code!!
void InterpreterMethodInfo::InitArgInfo(..)
- Called via Interpreter::GenerateInterpreterStub(..)

In summary, this work involves dynamically generating stubs and ensuring that method arguments are in the right registers (hence the assembly code). It handles virtual methods, static and instance calls, delegates, intrinsics and probably a few other scenarios as well! In addition, if the method being called needs to be interpreted, it also has to make sure that happens.

Creating objects and arrays

The interpreter needs to handle some of the key functionality of a runtime, that is creating and initialising objects. To do this it has to call into the GC, before finally calling the constructor:

Boxing and Unboxing

Another large chunk of code is dedicated to boxing/unboxing, that is converting ‘value types’ (structs) into object references when needed. The .NET IL provides specific op-codes to handle this:

Loading and Storing data

That is, reading/writing fields in an object or elements in an array:

Other Specific IL Op Codes

There is also a significant amount of code (over 1,000 lines) that just deals with low-level operations, that is ‘comparisons’, ‘branching’ and ‘basic arithmetic’:

INT32 Interpreter::CompareOpRes(..)
- CEQ, CGT, CGT_UN, CLT & CLT_UN called via Interpreter::CompareOp()
- BEQ, BGE, BGT, BLE, BLT, BNE_UN, BGE_UN, BGT_UN, BLE_UN, BLT_UN called via Interpreter::BrOnComparison()
void Interpreter::BinaryArithOp()
- ADD, SUB, MUL, DIV and REM
- in turn calls Interpreter::BinaryArithOpWork(..)
void Interpreter::BinaryArithOvfOp()
- ADD_OVF, ADD_OVF_UN, MUL_OVF, MUL_OVF_UN, SUB_OVF, SUB_OVF_UN
- in turn calls Interpreter::BinaryArithOvfOpWork(..)
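
As an example of what an overflow-checking op-code has to do, here is a small C++ sketch of a checked 32-bit signed addition, in the spirit of ADD_OVF. The AddOvf function and the choice of exception are assumptions for illustration; I’m not claiming this is how Interpreter::BinaryArithOvfOpWork(..) is written.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Checked signed 32-bit addition (illustrative name, not a CoreCLR function).
std::int32_t AddOvf(std::int32_t left, std::int32_t right) {
    const std::int32_t maxValue = std::numeric_limits<std::int32_t>::max();
    const std::int32_t minValue = std::numeric_limits<std::int32_t>::min();
    // Test before adding, because signed overflow is undefined behaviour in C++.
    if ((right > 0 && left > maxValue - right) ||
        (right < 0 && left < minValue - right))
        throw std::overflow_error("arithmetic operation resulted in an overflow");
    return left + right;
}
```

The unchecked ADD equivalent would simply be the final line on its own, which is why the two families of op-codes are handled by separate functions.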

Working with the Garbage Collector (GC)

In addition, the interpreter has to provide the GC with the information it needs. This happens when the GC calls Interpreter::GCScanRoots(..), with additional work taking place in Interpreter::GCScanRootAtLoc(..). Very simply the interpreter has to let the GC know about any ‘root’ objects that are currently ‘live’. This includes static variables and any local variables in the function that is currently executing.

When the interpreter locates a ‘root’ object, it notifies the GC via a callback (pf(..)):

void Interpreter::GCScanRootAtLoc(Object** loc, InterpreterType it, promote_func* pf, ScanContext* sc, bool pinningRef)
{
    switch (it.ToCorInfoType())
    {
    case CORINFO_TYPE_CLASS:
    case CORINFO_TYPE_STRING:
        {
            DWORD flags = 0;
            if (pinningRef) flags |= GC_CALL_PINNED;
            (*pf)(loc, sc, flags);
        }
        break;
    ....
    }
}
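
The same callback pattern can be shown in a self-contained C++ sketch: walk the local variable slots of a frame and report each one that holds an object reference. Everything here (ToyObject, Slot, ScanFrameRoots, CollectRoot) is a hypothetical stand-in rather than a CoreCLR type, and skipping null references is a simplification I’ve assumed for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; none of these are CoreCLR types.
struct ToyObject { int payload; };
enum class SlotKind { Int, ObjectRef };

struct Slot {
    SlotKind kind;
    ToyObject* ref; // only meaningful when kind == ObjectRef
    int value;      // only meaningful when kind == Int
};

// Callback the "GC" hands over, similar in spirit to promote_func.
using ReportRootFn = void (*)(ToyObject** location, void* context);

// Reports every local that holds a non-null object reference.
// The address of the slot is passed, so a collector could update it in place.
void ScanFrameRoots(std::vector<Slot>& locals, ReportRootFn report, void* context) {
    assert(report != nullptr);
    for (Slot& slot : locals) {
        if (slot.kind == SlotKind::ObjectRef && slot.ref != nullptr)
            report(&slot.ref, context);
    }
}

// Example callback: records each reported location in a vector.
void CollectRoot(ToyObject** location, void* context) {
    static_cast<std::vector<ToyObject**>*>(context)->push_back(location);
}
```

Calling ScanFrameRoots with CollectRoot and a pointer to a std::vector<ToyObject**> gathers the addresses of all the reference slots, while integer slots are ignored.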

Integration with the Virtual Machine (VM)

Finally, whilst the Interpreter is fairly self-contained, there are times where it needs to work with the rest of the runtime:

The Run-time is responsible for starting and stopping the interpreter
The JIT wires up interpreter stubs or uses them as a fall-back if JIT compilation fails. In addition the JIT ‘pre-stubs’ allow for interpreted methods when calling the JIT itself and when the ‘pre-stub’ is executed
Stack-walking takes account of interpreter frames, by utilising InterpreterFrame data structures
When looking up the MethodDesc for a given code address, the interpreter stubs are accounted for

The post The .NET IL Interpreter first appeared on my blog Performance is a Feature!

A Hitchhikers Guide to the CoreCLR Source Code

2017-03-23T00:00:00+00:00

photo by Alan O’Rourke

Just over 2 years ago Microsoft open-sourced the entire .NET framework; this post attempts to provide a ‘Hitchhikers Guide’ to the source-code found in the CoreCLR GitHub repository.

To make it easier for you to get to the information you’re interested in, this post is split into several parts:

Overall Stats
‘Top 10’ lists
High-level Overview
Deep Dive into Individual Areas
All the rest

It’s worth pointing out that .NET Developers have provided 2 excellent glossaries, the CoreCLR one and the CoreFX one, so if you come across any unfamiliar terms or abbreviations, check these first. Also there is extensive documentation available and if you are interested in the low-level details I really recommend checking out the ‘Book of the Runtime’ (BotR).

Overall Stats

If you take a look at the repository on GitHub, it shows the following stats for the entire repo

But most of the C# code is test code, so if we just look under /src (i.e. ignore any code under /tests) there is the following mix of source file types (i.e. no ‘.txt’, ‘.dat’, etc):

  - 2,012 .cpp
  - 1,183 .h
  - 956 .cs
  - 113 .inl
  - 98 .hpp
  - 51 .S
  - 43 .py
  - 42 .asm
  - 24 .idl
  - 20 .c

So by far the majority of the code is written in C++, but there is still also a fair amount of C# code (all under ‘mscorlib’). Clearly there are low-level parts of the CLR that have to be written in C++ or Assembly code because they need to be ‘close to the metal’ or have high performance, but it’s interesting that there are large parts of the runtime written in managed code itself.
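
If you want to produce this kind of per-extension breakdown yourself, the counting step is simple. The sketch below is only an illustration (ExtensionOf and CountByExtension are made-up helper names, and it works on a list of paths you supply rather than walking the repository); it is not the tooling that was used to generate the numbers in this post.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Returns the extension of a path including the dot (e.g. ".cpp"),
// or an empty string when the file name has no extension.
std::string ExtensionOf(const std::string& path)
{
    const std::size_t slash = path.find_last_of("/\\");
    const std::size_t dot   = path.find_last_of('.');
    if (dot == std::string::npos) return "";
    // A dot that appears before the last separator belongs to a folder name.
    if (slash != std::string::npos && dot < slash) return "";
    return path.substr(dot);
}

// Tallies how many of the given paths have each extension.
std::map<std::string, int> CountByExtension(const std::vector<std::string>& paths)
{
    std::map<std::string, int> counts;
    for (const std::string& p : paths) ++counts[ExtensionOf(p)];
    return counts;
}
```

Feeding it every path under /src (however you choose to enumerate them) and sorting the resulting map by count would give a list in the same shape as the one above.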

Note: All stats/lists in the post were calculated using commit 51a6b5c from the 9th March 2017.

Compared to ‘Rotor’

As a comparison, here’s what the stats for ‘Rotor’, the Shared Source CLI, looked like back in October 2002. Rotor was ‘Shared Source’, not truly ‘Open Source’, so it didn’t have the same community involvement as the CoreCLR.

Note: SSCLI aka ‘Rotor’ includes the fx or base class libraries (BCL), but the CoreCLR doesn’t as they are now hosted separately in the CoreFX GitHub repository

For reference, the equivalent stats for the CoreCLR source in March 2017 look like this:

Packaged as 61.2 MB .zip archive
- Over 10.8 million lines of code (2.6 million of source code, under \src)
- 24,485 Files (7,466 source)
  - 6,626 C# (956 source)
  - 2,074 C and C++
  - 3,701 IL
  - 93 Assembler
  - 43 Python
  - 6 Perl
Over 8.2 million lines of test code
Build output expands to over 1.2 GB with tests
- Product binaries 342 MB
- Test binaries 909 MB

Top 10 lists

These lists are mostly just for fun, but they do give some insights into the code-base and how it’s structured.

Top 10 Largest Files

You might have heard about the mammoth source file that is gc.cpp, which is so large that GitHub refuses to display it.

But it turns out it’s not the only large file in the source, there are also several files in the JIT that are around 20K LOC. However it seems that all the large files are C++ source code, so if you’re only interested in C# code, you don’t have to worry!!

File	# Lines of Code	Type	Location
gc.cpp	37,037	.cpp	\src\gc\
flowgraph.cpp	24,875	.cpp	\src\jit\
codegenlegacy.cpp	21,727	.cpp	\src\jit\
importer.cpp	18,680	.cpp	\src\jit\
morph.cpp	18,381	.cpp	\src\jit\
isolationpriv.h	18,263	.h	\src\inc\
cordebug.h	18,111	.h	\src\pal\prebuilt\inc\
gentree.cpp	17,177	.cpp	\src\jit\
debugger.cpp	16,975	.cpp	\src\debug\ee\

Top 10 Longest Methods

The large methods aren’t actually that hard to find, because they all have #pragma warning(disable:21000) before them, to keep the compiler happy! There are ~40 large methods in total, here’s the ‘Top 10’:

Method	# Lines of Code
MarshalInfo::MarshalInfo(Module* pModule,	1,507
void gc_heap::plan_phase (int condemned_gen_number)	1,505
void CordbProcess::DispatchRCEvent()	1,351
void DbgTransportSession::TransportWorker()	1,238
LPCSTR Exception::GetHRSymbolicName(HRESULT hr)	1,216
BOOL Disassemble(IMDInternalImport pImport, BYTE ILHeader,…	1,081
bool Debugger::HandleIPCEvent(DebuggerIPCEvent * pEvent)	1,050
void LazyMachState::unwindLazyState(LazyMachState* baseState…	901
VOID ParseNativeType(Module* pModule,	886
VOID StubLinkerCPU::EmitArrayOpStub(const ArrayOpScript* pAr…	839

Top 10 files with the Most Commits

Finally, let’s look at which files have been changed the most since the initial commit on GitHub back in January 2015 (ignoring ‘merge’ commits):

File	# Commits
src\jit\morph.cpp	237
src\jit\compiler.h	231
src\jit\importer.cpp	196
src\jit\codegenxarch.cpp	190
src\jit\flowgraph.cpp	171
src\jit\compiler.cpp	161
src\jit\gentree.cpp	157
src\jit\lower.cpp	147
src\jit\gentree.h	137
src\pal\inc\pal.h	136

High-level Overview

Next we’ll take a look at how the source code is structured and what the main components are.

They say “A picture is worth a thousand words”, so below is a treemap with the source code files grouped by colour into the top-level sections they fall under. You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)

Notes and Observations

The ‘# Commits’ only represent the commits made on GitHub, in the 2 1/2 years since the CoreCLR was open-sourced. So they are skewed to the recent work and don’t represent changes made over the entire history of the CLR. However it’s interesting to see which components have had more ‘churn’ in the last few years (e.g. ‘jit’) and which have been left alone (e.g. ‘pal’)
From the number of LOC/files it’s clear to see what the significant components are within the CoreCLR source, e.g. ‘vm’, ‘jit’, ‘pal’ & ‘mscorlib’ (these are covered in detail in the next part of this post)
In the ‘VM’ section it’s interesting to see how much code is generic (~650K LOC) and how much is per-CPU architecture (25K LOC for ‘i386’, 16K for ‘amd64’, 14K for ‘arm’ and 7K for ‘arm64’). This suggests that the code is nicely organised so that the per-architecture work is minimised and cleanly separated out.
It’s surprising (to me) that the ‘GC’ section is as small as it is. I always thought of the GC as a very complex component, but there is way more code in the ‘debugger’ and the ‘pal’.
Likewise, I never really appreciated the complexity of the ‘JIT’, it’s the 2nd largest component, comprising over 370K LOC.

If you’re interested, the raw numbers for the code under ‘/src’ are available in this gist and for the code under ‘/tests/src’ in this gist.

Deep Dive into Individual Areas

As the source code is well organised, the top-level folders (under /src) correspond to the logical components within the CoreCLR. We’ll start off by looking at the most significant components, i.e. the ‘Debugger’, ‘Garbage Collector’ (GC), ‘Just-in-Time compiler’ (JIT), ‘mscorlib’ (all the C# code), ‘Platform Adaptation Layer’ (PAL) and the CLR ‘Virtual Machine’ (VM).

mscorlib

The ‘mscorlib’ folder contains all the C# code within the CoreCLR, so it’s the place that most C# developers would start looking if they wanted to contribute. For this reason it deserves its own treemap, so we can see how it’s structured:

So by far the bulk of the code is at the ‘top-level’, i.e. directly in the ‘System’ namespace. This contains the fundamental types that have to exist for the CLR to run, such as:

AppDomain, WeakReference, Type,
Array, Delegate, Object, String
Boolean, Byte, Char, Int16, Int32, etc
Tuple, Span, ArraySegment, Attribute, DateTime

Where possible the CoreCLR is written in C#, because of the benefits that ‘managed code’ brings, so there is a significant amount of code within the ‘mscorlib’ section. Note that anything under here is not externally exposed: when you write C# code that runs against the CoreCLR, you actually access everything through the CoreFX, which then type-forwards to the CoreCLR where appropriate.

I don’t know the rules for what lives in CoreCLR v CoreFX, but based on what I’ve read on various GitHub issues, it seems that over time, more and more code is moving from CoreCLR -> CoreFX.

However the managed C# code is often deeply entwined with unmanaged C++, for instance several types are implemented across multiple files, e.g.

Arrays - Array.cs, array.cpp, array.h
Assemblies - Assembly.cs, assembly.cpp, assembly.hpp

From what I understand this is done for performance reasons: any code that is perf sensitive will end up being implemented in C++ (or even Assembly), unless the JIT can suitably optimise the C# code.

Code shared with CoreRT

Recently there has been a significant amount of work done to move more and more code over into the ‘shared partition’. This is the area of the CoreCLR source code that is shared with CoreRT (‘the .NET Core runtime optimized for AOT compilation’). Because certain classes are implemented in both runtimes, they’ve ensured that the work isn’t duplicated and any fixes are shared in both locations. You can see how this works by looking at the links below:

CoreCLR
CoreRT

Other parts of mscorlib

All the other sections of mscorlib line up with namespaces available in the .NET runtime and contain functionality that most C# devs will have used at one time or another. The largest ones in there are shown below (click to go directly to the source code):

System.Reflection and System.Reflection.Emit
- FieldInfo, PropertyInfo, MethodInfo, AssemblyBuilder, TypeBuilder, MethodBuilder, ILGenerator
System.Globalization
- CultureInfo, CalendarInfo, DateTimeParse, JulianCalendar, HebrewCalendar
System.Threading and System.Threading.Tasks
- Thread, Timer, Semaphore, Mutex, AsyncLocal<T>, Task, Task<T>, CancellationToken
System.Runtime.CompilerServices and System.Runtime.InteropServices
- Unsafe, [CallerFilePath], [CallerLineNumber], [CallerMemberName], GCHandle, [LayoutKind], [MarshalAs(..)], [StructLayout(LayoutKind ..)]
System.Diagnostics
- Assert, Debugger, StackTrace
System.Text
- StringBuilder, ASCIIEncoding, UTF8Encoding, UnicodeEncoding
System.Collections
- ArrayList, Hashtable
System.Collections.Generic
- Dictionary<T,U>, List<T>
System.IO
- Stream, MemoryStream, File, TextReader, TextWriter

vm (Virtual Machine)

The VM, not surprisingly, is the largest component of the CoreCLR, with over 640K L.O.C spread across 576 files, and it contains the guts of the runtime. The bulk of the code is OS and CPU independent and written in C++, however there is also a significant amount of architecture-specific assembly code, see the section ‘CPU Architecture-specific code’ for more info.

The VM contains the main start-up routine of the entire runtime EEStartupHelper() in ceemain.cpp, see ‘The 68 things the CLR does before executing a single line of your code’ for all the details. In addition it provides the following functionality:

Type System
- method.cpp, class.cpp, typedesc.cpp
Loading types/classes
- ceeload.cpp, methodtable.cpp and methodtablebuilder.cpp
Threading
- threads.cpp, threadstatics.cpp, threadsuspend.cpp and win32threadpool.cpp
Exception Handling and Stack Walking
- exceptionhandling.cpp, excep.cpp, stackwalk.cpp, frames.cpp
Fundamental Types
- object.cpp, array.cpp, appdomain.cpp, safehandle.cpp
Generics
- generics.cpp and genericdict.cpp
An entire Interpreter (yes .NET can run interpreted!!)
- interpreter.cpp and interpreter.hpp
Function calling mechanisms (see BotR for more info)
- ecall.cpp, fcall.cpp and qcall.cpp
Stubs (used for virtual dispatch and delegates amongst other things)
- stubs.cpp, prestub.cpp, stubgen.cpp, stubhelpers.cpp, stubmgr.cpp, virtualcallstub.cpp
Event Tracing
- eventtrace.cpp, eventreporter.cpp, eventstore.cpp and nativeeventsource.cpp
Profiler
- profiler.cpp, profilermetadataemitvalidator.cpp, profattach.cpp and profdetach.cpp
P/Invoke
- dllimport.cpp, dllimportcallback.cpp and marshalnative.cpp
Reflection
- reflectioninvocation.cpp, dispatchinfo.cpp and invokeutil.cpp

CPU Architecture-specific code

All the architecture-specific code is kept separately in several sub-folders: amd64, arm, arm64 and i386. For example here are the various implementations of the WriteBarrier function used by the GC:

amd64 (.asm), there is also a .S version
arm
arm64
i386
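
If you’ve not come across a write barrier before, the general idea is that every store of an object reference also records where that store happened, so that a generational GC can later find references without scanning the whole heap. The sketch below shows one common form of this, ‘card marking’, in plain C++. It is a deliberately simplified illustration under my own assumptions (ToyHeap, WriteRef and the card layout are all invented names), not a translation of the hand-written assembly versions linked above, which do additional checks and are tuned per CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A toy "heap" of reference slots, divided into fixed-size cards.
class ToyHeap
{
public:
    ToyHeap(std::size_t slotCount, std::size_t slotsPerCard)
        : slots_(slotCount, nullptr),
          slotsPerCard_(slotsPerCard),
          cards_((slotCount + slotsPerCard - 1) / slotsPerCard, 0)
    {
    }

    // The write barrier: do the store, then mark the card covering the slot
    // as dirty so a later collection knows to look there.
    void WriteRef(std::size_t slotIndex, const void* value)
    {
        slots_[slotIndex] = value;
        if (value != nullptr)               // storing null creates no reference
            cards_[slotIndex / slotsPerCard_] = 0xFF;
    }

    bool IsCardDirty(std::size_t card) const { return cards_[card] != 0; }
    std::size_t CardCount() const { return cards_.size(); }

private:
    std::vector<const void*>  slots_;
    std::size_t               slotsPerCard_;
    std::vector<std::uint8_t> cards_;       // one byte per card
};
```

Because this code runs on every reference store, it is easy to see why the real implementations are written in assembly and specialised for each architecture.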

jit (Just-in-Time compiler)

Before we look at the actual source code, it’s worth looking at the different ‘flavours’ of the JIT that are available:

Fortunately one of the Microsoft developers has clarified which one should be used:

Here’s my guidance on how non-MS contributors should think about contributing to the JIT: If you want to help advance the state of the production code-generators for .NET, then contribute to the new RyuJIT x86/ARM32 backend. This is our long term direction. If instead your interest is around getting the .NET Core runtime working on x86 or ARM32 platforms to do other things, by all means use and contribute bug fixes if necessary to the LEGACY_BACKEND paths in the RyuJIT code base today to unblock yourself. We do run testing on these paths today in our internal testing infrastructure and will do our best to avoid regressing it until we can replace it with something better. We just want to make sure that there will be no surprises or hard feelings for when the time comes to remove them from the code-base.

JIT Phases

The JIT has almost 90 source files, but fortunately they correspond to the different phases it goes through, so it’s not too hard to find your way around. Using the table from ‘Phases of RyuJIT’, I added the right-hand column so you can jump to the relevant source file(s):

Phase	IR Transformations	File
Pre-import	`Compiler->lvaTable` created and filled in for each user argument and variable. BasicBlock list initialized.	compiler.hpp
Importation	`GenTree` nodes created and linked in to Statements, and Statements into BasicBlocks. Inlining candidates identified.	importer.cpp
Inlining	The IR for inlined methods is incorporated into the flowgraph.	inline.cpp and inlinepolicy.cpp
Struct Promotion	New lclVars are created for each field of a promoted struct.	morph.cpp
Mark Address-Exposed Locals	lclVars with references occurring in an address-taken context are marked. This must be kept up-to-date.	compiler.hpp
Morph Blocks	Performs localized transformations, including mandatory normalization as well as simple optimizations.	morph.cpp
Eliminate Qmarks	All `GT_QMARK` nodes are eliminated, other than simple ones that do not require control flow.	compiler.cpp
Flowgraph Analysis	`BasicBlock` predecessors are computed, and must be kept valid. Loops are identified, and normalized, cloned and/or unrolled.	flowgraph.cpp
Normalize IR for Optimization	lclVar reference counts are set, and must be kept valid. Evaluation order of `GenTree` nodes (`gtNext`/`gtPrev`) is determined, and must be kept valid.	compiler.cpp and lclvars.cpp
SSA and Value Numbering Optimizations	Computes liveness (`bbLiveIn` and `bbLiveOut` on `BasicBlocks`), and dominators. Builds SSA for tracked lclVars. Computes value numbers.	liveness.cpp
Loop Invariant Code Hoisting	Hoists expressions out of loops.	optimizer.cpp
Copy Propagation	Copy propagation based on value numbers.	copyprop.cpp
Common Subexpression Elimination (CSE)	Elimination of redundant subexpressions based on value numbers.	optcse.cpp
Assertion Propagation	Utilizes value numbers to propagate and transform based on properties such as non-nullness.	assertionprop.cpp
Range analysis	Eliminate array index range checks based on value numbers and assertions	rangecheck.cpp
Rationalization	Flowgraph order changes from `FGOrderTree` to `FGOrderLinear`. All `GT_COMMA`, `GT_ASG` and `GT_ADDR` nodes are transformed.	rationalize.cpp
Lowering	Register requirements are fully specified (`gtLsraInfo`). All control flow is explicit.	lower.cpp, lowerarm.cpp, lowerarm64.cpp and lowerxarch.cpp
Register allocation	Registers are assigned (`gtRegNum` and/or `gtRsvdRegs`), and the number of spill temps calculated.	regalloc.cpp and register_arg_convention.cpp
Code Generation	Determines frame layout. Generates code for each `BasicBlock`. Generates prolog & epilog code for the method. Emit EH, GC and Debug info.	codegenarm.cpp, codegenarm64.cpp, codegencommon.cpp, codegenlegacy.cpp, codegenlinear.cpp and codegenxarch.cpp
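
Several of the phases in the table lean on ‘value numbers’, so it’s worth seeing the idea in miniature. The sketch below is a textbook-style local value numbering pass over a single straight-line block, written under my own assumptions (Expr and FindRedundant are invented names, and it ignores things like commutativity, side-effects and control flow). It is not how RyuJIT implements value numbering or CSE, it just shows why giving identical computations the same number makes redundant ones easy to spot.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One three-address statement: dest = lhs op rhs
struct Expr
{
    char        op;
    std::string lhs;
    std::string rhs;
    std::string dest;
};

// Returns the indices of statements that recompute a value which an earlier
// statement in the same block has already produced.
std::vector<int> FindRedundant(const std::vector<Expr>& block)
{
    std::map<std::string, int> varVN;                   // variable -> value number
    std::map<std::tuple<char, int, int>, int> exprVN;   // (op, vn, vn) -> value number
    int nextVN = 0;

    // Look up a variable's value number, handing out a fresh one on first use.
    auto vnOf = [&](const std::string& v) {
        auto it = varVN.find(v);
        if (it == varVN.end()) it = varVN.emplace(v, nextVN++).first;
        return it->second;
    };

    std::vector<int> redundant;
    for (int i = 0; i < static_cast<int>(block.size()); ++i)
    {
        const Expr& e = block[i];
        const int l = vnOf(e.lhs);
        const int r = vnOf(e.rhs);
        const auto key = std::make_tuple(e.op, l, r);

        auto it = exprVN.find(key);
        if (it != exprVN.end())
        {
            redundant.push_back(i);      // same op on the same values: reuse it
            varVN[e.dest] = it->second;
        }
        else
        {
            const int vn = nextVN++;     // a genuinely new value
            exprVN.emplace(key, vn);
            varVN[e.dest] = vn;
        }
    }
    return redundant;
}
```

For the block `t1 = a + b; t2 = a + b; a = t1 * c; t3 = a + b` only the second statement is reported, because re-assigning `a` gives it a new value number and so the final `a + b` is a different computation.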

pal (Platform Adaptation Layer)

The PAL provides an OS-independent layer that gives access to common low-level functionality.

As .NET was originally written to run on Windows, all the APIs look very similar to the Win32 APIs. However for non-Windows platforms they are actually implemented using the functionality available on that OS. For example this is what PAL code to read/write a file looks like:

int main(int argc, char *argv[])
{
  WCHAR  src[4] = {'f', 'o', 'o', '\0'};
  WCHAR dest[4] = {'b', 'a', 'r', '\0'};
  WCHAR  dir[5] = {'/', 't', 'm', 'p', '\0'};
  HANDLE h;
  unsigned int b;

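  // The PAL must be initialised before any other PAL API is called
  PAL_Initialize(argc, (const char**)argv);
  SetCurrentDirectoryW(dir);
  // Create 'foo', write to it, copy it to 'bar', then delete the original
  h =  CreateFileW(src, GENERIC_WRITE, FILE_SHARE_READ, NULL, CREATE_NEW, 0, NULL);
  WriteFile(h, "Testing\n", 8, &b, NULL);
  CloseHandle(h);
  CopyFileW(src, dest, FALSE);
  DeleteFileW(src);
  // Shut the PAL down again before exiting
  PAL_Terminate();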
  return 0;
}
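
To give a feel for what ‘implemented using the functionality available on that OS’ means, here is a small sketch of the adapter idea: functions that keep a Win32-like shape (a BOOL-style return value, a ‘fail if exists’ flag) but are built on something else underneath. To keep it portable I’ve used the C++ standard library rather than real system calls, and all the names (TOY_BOOL, Toy_CopyFile, Toy_DeleteFile) are invented for this example, so treat it as an illustration of the pattern and not as the PAL’s actual implementation.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>

// Win32-style boolean result, as a plain int (hypothetical names).
using TOY_BOOL = int;
constexpr TOY_BOOL TOY_TRUE = 1, TOY_FALSE = 0;

// Shaped like DeleteFile: non-zero on success, zero on failure.
TOY_BOOL Toy_DeleteFile(const char* path)
{
    return std::remove(path) == 0 ? TOY_TRUE : TOY_FALSE;
}

// Shaped like CopyFile: refuses to overwrite when failIfExists is set.
TOY_BOOL Toy_CopyFile(const char* src, const char* dest, TOY_BOOL failIfExists)
{
    if (failIfExists)
    {
        std::ifstream probe(dest);
        if (probe.good()) return TOY_FALSE;   // destination already exists
    }

    std::ifstream in(src, std::ios::binary);
    if (!in) return TOY_FALSE;                // source could not be opened

    std::ofstream out(dest, std::ios::binary | std::ios::trunc);
    if (!out) return TOY_FALSE;               // destination could not be created

    // Stream the whole file across. Note: this simple approach reports
    // failure for an empty source file, which is fine for a sketch.
    out << in.rdbuf();
    return out.good() ? TOY_TRUE : TOY_FALSE;
}
```

The calling code only ever sees the Win32-shaped signatures, which is the property that lets the rest of the runtime stay largely unaware of the OS it’s running on.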

The PAL does contain some per-CPU assembly code, but it’s only for very low-level functionality, for instance the DebugBreak function has a separate implementation for each CPU architecture.

gc (Garbage Collector)

The GC is clearly a very complex piece of code, lying right at the heart of the CLR, so for more information about what it does I recommend reading the BotR entry on ‘Garbage Collection Design’ and if you’re interested I’ve also written several blog posts looking at its functionality.

However from a source code point-of-view the GC is pretty simple, it’s spread across just 19 .cpp files, but the bulk of the work is in gc.cpp (raw version) all ~37K L.O.C of it!!

If you want to get deeper into the GC code (warning, it’s pretty dense), a good way to start is to search for the occurrences of various ETW events that are fired as the GC moves through the phases outlined in the BotR post above, these events are listed below:

FireEtwGCTriggered(..)
FireEtwGCAllocationTick_V1(..)
FireEtwGCFullNotify_V1(..)
FireEtwGCJoin_V2(..)
FireEtwGCMarkWithType(..)
FireEtwGCPerHeapHistory_V3(..)
FireEtwGCGlobalHeapHistory_V2(..)
FireEtwGCCreateSegment_V1(..)
FireEtwGCFreeSegment_V1(..)
FireEtwBGCAllocWaitBegin(..)
FireEtwBGCAllocWaitEnd(..)
FireEtwBGCDrainMark(..)
FireEtwBGCRevisit(..)
FireEtwBGCOverflow(..)
FireEtwPinPlugAtGCTime(..)
FireEtwGCCreateConcurrentThread_V1(..)
FireEtwGCTerminateConcurrentThread_V1(..)

But the GC doesn’t work in isolation, it also requires help from the Execution Engine (EE). This is done via the GCToEEInterface which is implemented in gcenv.ee.cpp.

Local GC and GC Sample

Finally, there are 2 other ways you can get into the GC code and understand what it does.

Firstly there is a GC sample that lets you use the full GC independent of the rest of the runtime. It shows you how to ‘create type layout information in format that the GC expects’, ‘implement fast object allocator and write barrier’ and ‘allocate objects and work with GC handles’, all in under 250 LOC!!

Also worth mentioning is the ‘Local GC’ project, which is an ongoing effort to decouple the GC from the rest of the runtime; they even have a dashboard so you can track its progress. Currently the GC code is too intertwined with the runtime and vice versa, so ‘Local GC’ is aiming to break that link by providing a set of clear interfaces, GCToOSInterface and GCToEEInterface. This will help with the CoreCLR cross-platform efforts, making the GC easier to port to new OSes.
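
The benefit of putting an interface between the GC and the rest of the runtime is easier to see with a toy example. In the sketch below the ‘GC’ only ever talks to an abstract class, so it can be driven by a fake runtime with no real execution engine behind it. The names and the three methods (IToyGCToEE, SuspendExecution, ResumeExecution, ScanRoots) are my own simplification for illustration and should not be read as the actual members of GCToEEInterface.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified view of what a GC needs from its host.
class IToyGCToEE
{
public:
    virtual ~IToyGCToEE() = default;
    virtual void SuspendExecution() = 0;
    virtual void ResumeExecution() = 0;
    // The host calls 'report' once for every root it knows about.
    virtual void ScanRoots(const std::function<void(int rootId)>& report) = 0;
};

// The "GC" depends only on the interface, never on a concrete runtime.
class ToyGC
{
public:
    explicit ToyGC(IToyGCToEE& ee) : ee_(ee) {}

    // Returns the ids of everything reported as live.
    std::vector<int> Collect()
    {
        std::vector<int> live;
        ee_.SuspendExecution();                              // stop the world
        ee_.ScanRoots([&](int id) { live.push_back(id); });  // gather roots
        ee_.ResumeExecution();                               // and carry on
        return live;
    }

private:
    IToyGCToEE& ee_;
};

// A stand-in host, enough to exercise the GC on its own.
class FakeRuntime : public IToyGCToEE
{
public:
    std::vector<std::string> log;

    void SuspendExecution() override { log.push_back("suspend"); }
    void ResumeExecution() override { log.push_back("resume"); }
    void ScanRoots(const std::function<void(int)>& report) override
    {
        log.push_back("scan");
        report(1);
        report(7);
    }
};
```

Swapping FakeRuntime for a different implementation requires no change to ToyGC, which is the same property that makes a decoupled GC easier to test and to port.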

debug

The CLR is a ‘managed runtime’ and one of the significant components it provides is an advanced debugging experience, via Visual Studio or WinDBG. This debugging experience is very complex and I’m not going to go into it in detail here, however if you want to learn more I recommend you read ‘Data Access Component (DAC) Notes’.

But what does the source look like, how is it laid out? Well, there are several main sub-components under the top-level /debug folder:

daccess - this provides the ‘Data Access Component’ (DAC) functionality as outlined in the BotR page linked to above. The DAC is an abstraction layer over the internal structures in the runtime, which the debugger uses to inspect objects/classes
di - this contains the exposed APIs (or entry points) of the debugger, implemented by CoreCLRCreateCordbObject(..) in cordb.cpp
ee - the section of debugger that works with the Execution Engine (EE) to do things like stack-walking
inc - all the interface (.h) files that the debugger components implement

All the rest

As well as the main components, there are various other top-level folders in the source, the full list is below:

binder
- The ‘binder’ is responsible for loading assemblies within a .NET program (except the mscorlib binder which is elsewhere). The ‘binder’ comprises low-level code that controls Assemblies, Application Contexts and the all-important Fusion Log for diagnosing why assemblies aren’t loading!
classlibnative
- Code for native implementations of many of the core data types in the CoreCLR, e.g. Arrays, System.Object, String, decimal, float and double.
- Also includes all the native methods exposed in the ‘System.Environment’ namespace, e.g. Environment.ProcessorCount, Environment.TickCount, Environment.GetCommandLineArgs(), Environment.FailFast(), etc
coreclr
- Contains the different tools that can ‘host’ or run the CLR, e.g. corerun, coreconsole or unixcorerun. See How the dotnet CLI tooling runs your code for more info on how these tools work.
corefx
- Several classes under the ‘System.Globalization’ namespace have native implementations, in here you will find the code for Calendar Data, Locales, Text Normalisation and Time Zone information.
dlls
- Wrapper code and build files that control how the various dlls are built. For instance mscoree is the main Execution Engine (EE) and contains the CoreCLR DLL Entrypoint and CoreCLR build definition, likewise mscorrc includes the resource file that houses all the CoreCLR error messages.
gcdump and gcinfo
- Code that will write out the GCInfo that is produced by the JIT to help the GC do its job. This GCInfo includes information about the ‘liveness’ of variables within a section of code and whether the method is fully or partially interruptible, which enables the EE to suspend methods when the GC is working.
ilasm
- IL (Intermediate Language) Assembler is a tool for converting IL code into a .NET executable, see the MSDN page for more info and usage examples.
ildasm
- Tool for disassembling a .NET executable into the corresponding IL source code, again, see the MSDN page for info and usage examples.
inc
- Header files that define the ‘interfaces’ between the sub-components that make up the CoreCLR. For example corjit.h covers all communication between the Execution Engine (EE) and the JIT, that is ‘EE -> JIT’ and corinfo.h is the interface going the other way, i.e. ‘JIT -> EE’
ipcman
- Code that enables the ‘Inter-Process Communication’ (IPC) used in .NET (mostly legacy and probably not cross-platform)
md
- The MetaData (md) code provides the ability to gather information about methods, classes, types and assemblies and is what makes Reflection possible.
nativeresources
- A simple tool that is responsible for converting/extracting resources from a Windows Resource File.
palrt
- The PAL (Platform Adaptation Layer) Run-Time, contains specific parts of the PAL layer.
scripts
- Several Python scripts for auto-generating various files in the source (e.g. ETW events).
strongname
- The code for handling ‘strong-naming’, including the signing keys used by the CoreCLR itself.
ToolBox
- Contains 2 stand-alone tools
  - SOS (son-of-strike) the CLR debugging extension that enables reporting of .NET specific information when using WinDBG
  - SuperPMI which enables testing of the JIT without requiring the full Execution Engine (EE)
tools
- Several cmd-line tools that can be used in conjunction with the CoreCLR, e.g. ‘Runtime Meta Data Dump Utility’ and ‘Native Image Generator’ (also known as ‘crossgen’)
unwinder
- Provides the low-level functionality to make it possible for the debugger and exception handling components to walk or unwind the stack. This is done via 2 functions, GetModuleBase(..) and GetFunctionEntry(..) which are implemented in CPU architecture-specific code, see amd64, arm, arm64 and i386
utilcode
- Shared utility code that is used by the VM, Debugger and JIT
zap
- ‘ZAP’ is the original code name for NGen (Native Image Generator), a tool that creates native images from .NET IL code.

If you’ve read this far ‘So long and thanks for all the fish’ (YouTube)


The 68 things the CLR does before executing a single line of your code (*)

2017-02-07T00:00:00+00:00

Because the CLR is a managed environment there are several components within the runtime that need to be initialised before any of your code can be executed. This post will take a look at the EE (Execution Engine) start-up routine and examine the initialisation process in detail.

(*) 68 is only a rough guide, it depends on which version of the runtime you are using, which features are enabled and a few other things

‘Hello World’

Imagine you have the simplest possible C# program, what has to happen before the CLR prints ‘Hello World’ out to the console?

using System;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

The code path into the EE (Execution Engine)

When a .NET executable runs, control gets into the EE via the following code path:

_CorExeMain() (the external entry point)
- call to _CorExeMainInternal()
_CorExeMainInternal()
- call to EnsureEEStarted()
EnsureEEStarted()
- call to EEStartup()
EEStartup()
- call to EEStartupHelper()
EEStartupHelper()
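
The shape of that call chain, in particular the ‘ensure started’ step in the middle, is a common one-time initialisation pattern. The sketch below mirrors it with made-up functions in a toy namespace. It is an assumption-laden illustration (a single mutex and a flag) and not the locking scheme the real EnsureEEStarted() uses, but it shows why the expensive start-up work only happens once no matter how many times the entry point is reached.

```cpp
#include <cassert>
#include <mutex>

namespace toy
{
    int        g_startupRuns = 0;   // how many times the real work has run
    bool       g_started     = false;
    std::mutex g_startupLock;

    // Stands in for the long list of initialisation calls.
    void StartupHelper() { ++g_startupRuns; }

    void Startup() { StartupHelper(); }

    // Runs Startup() the first time it is called and is a no-op afterwards.
    bool EnsureStarted()
    {
        std::lock_guard<std::mutex> hold(g_startupLock);
        if (!g_started)
        {
            Startup();
            g_started = true;
        }
        return g_started;
    }

    // The entry points simply forward down the chain.
    void ExeMainInternal()
    {
        EnsureStarted();
        // ... a real host would now go on to run the managed Main method
    }

    void ExeMain() { ExeMainInternal(); }
}
```

Calling toy::ExeMain() several times leaves g_startupRuns at 1, because every call after the first finds the flag already set.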

(if you’re interested in what happens before this, i.e. how a CLR Host can start-up the runtime, see my previous post ‘How the dotnet CLI tooling runs your code’)

And so we end up in EEStartupHelper(), which at a high-level does the following (from a comment in ceemain.cpp):

EEStartup is responsible for all the one time initialization of the runtime.
Some of the highlights of what it does include

Creates the default and shared, appdomains.

Loads mscorlib.dll and loads up the fundamental types (System.Object …)

The main phases in EE (Execution Engine) start-up routine

But let’s look at what it does in detail, the lists below contain all the individual function calls made from EEStartupHelper() (~500 L.O.C). To make them easier to understand, we’ll split them up into separate phases:

Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run
Phase 2 - Initialise the core, low-level components
Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging
Phase 4 - Start the main components, i.e. Garbage Collector (GC), AppDomains, Security
Phase 5 - Final setup and then notify other components that the EE has started

Note some items in the lists below are only included if a particular feature is defined at build-time, these are indicated by the inclusion of an ifdef statement. Also note that the links take you to the code for the function being called, not the line of code within EEStartupHelper().
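
If you’ve not seen build-time feature switches before, the following sketch shows the general mechanism: one long start-up function, with some steps wrapped in #ifdef so they simply don’t exist in builds where the feature is turned off. The macro names (TOY_FEATURE_EVENT_TRACE, TOY_FEATURE_INTERPRETER) and the step names are hypothetical and chosen only to echo the style of the lists below; this is not code from ceemain.cpp.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical feature switch. In a real build this would come from the
// build system, it is defined here so the example is self-contained.
#define TOY_FEATURE_EVENT_TRACE
// TOY_FEATURE_INTERPRETER is deliberately left undefined.

// Runs the start-up steps in order and returns a log of what was executed.
std::vector<std::string> RunStartup()
{
    std::vector<std::string> log;

    // Early phase: infrastructure everything else depends on
    log.push_back("InitStrings");
    log.push_back("InitConfig");

#ifdef TOY_FEATURE_EVENT_TRACE
    log.push_back("InitEventTracing");   // compiled in: the macro is defined
#endif

#ifdef TOY_FEATURE_INTERPRETER
    log.push_back("InitInterpreter");    // compiled out: the macro is not defined
#endif

    // Later phase: the main components
    log.push_back("InitGC");
    return log;
}
```

This is also why the ‘68’ in the title is only a rough guide: the number of steps that actually run depends on which switches were defined when the runtime was built.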

Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run

Wire-up console handling - SetConsoleCtrlHandler(..) (#ifndef FEATURE_PAL)
Initialise the internal SString class (everything uses strings!) - SString::Startup()
Make sure the configuration is set-up, so settings that control run-time options can be accessed - EEConfig::Setup() and InitializeHostConfigFile() (#if !defined(CROSSGEN_COMPILE))
Initialize Numa and CPU group information - NumaNodeInfo::InitNumaNodeInfo() and CPUGroupInfo::EnsureInitialized() (#ifndef CROSSGEN_COMPILE)
Initialize global configuration settings based on startup flags - InitializeStartupFlags()
Set-up the Thread Manager that gives the runtime access to the OS threading functionality (StartThread(), Join(), SetThreadPriority() etc) - InitThreadManager()
Initialize Event Tracing (ETW) and fire off the CLR startup events - InitializeEventTracing() and ETWFireEvent(EEStartupStart_V1) (#ifdef FEATURE_EVENT_TRACE)
Set-up the GS Cookie (Buffer Security Check) to help prevent buffer overruns - InitGSCookie()
Create the data-structures needed to hold the ‘frames’ used for stack-traces - Frame::Init()
Ensure initialization of Apphacks environment variables - GetGlobalCompatibilityFlags() (#ifndef FEATURE_CORECLR)
Create the diagnostic and performance logs used by the runtime - InitializeLogging() (#ifdef LOGGING) and PerfLog::PerfLogInitialize() (#ifdef ENABLE_PERF_LOG)

Phase 2 - Initialise the core, low-level components

Write to the log ===================EEStartup Starting===================
Ensure that the Runtime Library functions (that interact with ntdll.dll) are enabled - EnsureRtlFunctions() (#ifndef FEATURE_PAL)
Set-up the global store for events (mutexes, semaphores) used for synchronisation within the runtime - InitEventStore()
Create the Assembly Binding logging mechanism a.k.a Fusion - InitializeFusion() (#ifdef FEATURE_FUSION)
Then initialize the actual Assembly Binder infrastructure - CCoreCLRBinderHelper::Init() which in turn calls AssemblyBinder::Startup() (#ifdef FEATURE_FUSION is NOT defined)
Set-up the heuristics used to control Monitors, Crsts, and SimpleRWLocks - InitializeSpinConstants()
Initialize the InterProcess Communication with COM (IPC) - InitializeIPCManager() (#ifdef FEATURE_IPCMAN)
Set-up and enable Performance Counters - PerfCounters::Init() (#ifdef ENABLE_PERF_COUNTERS)
Set-up the CLR interpreter - Interpreter::Initialize() (#ifdef FEATURE_INTERPRETER), turns out that the CLR has a mode where your code is interpreted instead of compiled!
Initialise the stubs that are used by the CLR for calling methods and triggering the JIT - StubManager::InitializeStubManagers(), also Stub::Init() and StubLinkerCPU::Init()
Set up the core handle map, used to load assemblies into memory - PEImage::Startup()
Startup the access checks options, used for granting/denying security demands on method calls - AccessCheckOptions::Startup()
Startup the mscorlib binder (used for loading “known” types from mscorlib.dll) - MscorlibBinder::Startup()
Initialize remoting, which allows out-of-process communication - CRemotingServices::Initialize() (#ifdef FEATURE_REMOTING)
Set-up the data structures used by the GC for weak, strong and no-pin references - Ref_Initialize()
Set-up the contexts used to proxy method calls across App Domains - Context::Initialize()
Wire-up events that allow the EE to synchronise shut-down - g_pEEShutDownEvent->CreateManualEvent(FALSE)
Initialise the process-wide data structures used for reader-writer lock implementation - CRWLock::ProcessInit() (#ifdef FEATURE_RWLOCK)
Initialize the debugger manager - CCLRDebugManager::ProcessInit() (#ifdef FEATURE_INCLUDE_ALL_INTERFACES)
Initialize the CLR Security Attribute Manager - CCLRSecurityAttributeManager::ProcessInit() (#ifdef FEATURE_IPCMAN)
Set-up the manager for Virtual call stubs - VirtualCallStubManager::InitStatic()
Initialise the lock that the GC uses when controlling memory pressure - GCInterface::m_MemoryPressureLock.Init(CrstGCMemoryPressure)
Initialize Assembly Usage Logger - InitAssemblyUsageLogManager() (#ifndef FEATURE_CORECLR)

Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging

Set-up the App Domains used by the CLR - SystemDomain::Attach() (also creates the DefaultDomain and the SharedDomain by calling SystemDomain::CreateDefaultDomain() and SharedDomain::Attach())
Start up the ECall interface, a private native calling interface used within the CLR - ECall::Init()
Set-up the caches for the stubs used by delegates - COMDelegate::Init()
Set-up all the global/static variables used by the EE itself - ExecutionManager::Init()
Initialise Watson, for windows error reporting - InitializeWatson(fFlags) (#ifndef FEATURE_PAL)
Initialize the debugging services, this must be done before any EE thread objects are created, and before any classes or modules are loaded - InitializeDebugger() (#ifdef DEBUGGING_SUPPORTED)
Activate the Managed Debugging Assistants that the CLR provides - ManagedDebuggingAssistants::EEStartupActivation() (#ifdef MDA_SUPPORTED)
Initialise the Profiling API - ProfilingAPIUtility::InitializeProfiling() (#ifdef PROFILING_SUPPORTED)
Initialise the exception handling mechanism - InitializeExceptionHandling()
Install the CLR global exception filter - InstallUnhandledExceptionFilter()
Ensure that the initial runtime thread is created - SetupThread() in turn calls SetupThread(..)
Initialise the PreStub manager (PreStubs trigger the JIT) - InitPreStubManager() and the corresponding helpers StubHelpers::Init()
Initialise the COM Interop layer - InitializeComInterop() (#ifdef FEATURE_COMINTEROP)
Initialise NDirect method calls (lazy binding of unmanaged P/Invoke targets) - NDirect::Init()
Set-up the JIT Helper functions, so they are in place before the execution manager runs - InitJITHelpers1() and InitJITHelpers2()
Initialise and set-up the SyncBlock cache - SyncBlockCache::Attach() and SyncBlockCache::Start()
Create the cache used when walking/unwinding the stack - StackwalkCache::Init()

Phase 4 - Start the main components, i.e. Garbage Collector (GC), AppDomains, Security

Start up security system, that handles Code Access Security (CAS) - Security::Start() which in turn calls SecurityPolicy::Start()
Wire-up an event to allow synchronisation of AppDomain unloads - AppDomain::CreateADUnloadStartEvent()
Initialise the ‘Stack Probes’ used to set up stack guards - InitStackProbes() (#ifdef FEATURE_STACK_PROBE)
Initialise the GC and create the heaps that it uses - InitializeGarbageCollector()
Initialise the tables used to hold the locations of pinned objects - InitializePinHandleTable()
Inform the debugger about the DefaultDomain, so it can interact with it - SystemDomain::System()->PublishAppDomainAndInformDebugger(..) (#ifdef DEBUGGING_SUPPORTED)
Initialise the existing OOB Assembly List (no idea?) - ExistingOobAssemblyList::Init() (#ifndef FEATURE_CORECLR)
Actually initialise the System Domain (which contains mscorlib), so that it can start executing - SystemDomain::System()->Init()

Phase 5 - Final setup and then notify other components that the EE has started

Tell the profiler we’ve started up - SystemDomain::NotifyProfilerStartup() (#ifdef PROFILING_SUPPORTED)
Pre-create a thread to handle AppDomain unloads - AppDomain::CreateADUnloadWorker() (#ifndef CROSSGEN_COMPILE)
Set a flag to confirm that ‘initialisation’ of the EE succeeded - g_fEEInit = false
Load the System Assemblies (‘mscorlib’) into the Default Domain - SystemDomain::System()->DefaultDomain()->LoadSystemAssemblies()
Set-up all the shared static variables (and String.Empty) in the Default Domain - SystemDomain::System()->DefaultDomain()->SetupSharedStatics(), they are all contained in the internal class SharedStatics.cs
Set-up the stack sampler feature, that identifies ‘hot’ methods in your code - StackSampler::Init() (#ifdef FEATURE_STACK_SAMPLING)
Perform any once-only SafeHandle initialization - SafeHandle::Init() (#ifndef CROSSGEN_COMPILE)
Set flags to indicate that the CLR has successfully started - g_fEEStarted = TRUE, g_EEStartupStatus = S_OK and hr = S_OK
Write to the log ===================EEStartup Completed===================

Once this is all done, the CLR is now ready to execute your code!!

Executing your code

Your code will be executed (after first being ‘JITted’) via the following code flow:

CorHost2::ExecuteAssembly()
- calling ExecuteMainMethod()
Assembly::ExecuteMainMethod()
- calling RunMain()
RunMain() (in assembly.cpp)
- eventually calling into your main() method
- full explanation of the ‘call’ process

Discuss this post on Hacker News and /r/programming

Further information

The CLR provides a huge amount of log information if you create a debug build and then enable the right environment variables. The links below take you to the various logs produced when running a simple ‘hello world’ program (shown at the top of this post); they give you a pretty good idea of the different things that the CLR is doing behind-the-scenes.

All Classes Loaded
All Methods JITted
Entire log (warning ~68K lines long!!)
Log produced during EEStartupHelper() only (only ~48K lines!!)
AppDomain log
Class Loader log
Class loader log for ConsoleApplication only
Code Sharing log
Core Debugging log
Exception Handling log
JIT log
Loader log

The post The 68 things the CLR does before executing a single line of your code (*) first appeared on my blog Performance is a Feature!

How do .NET delegates work?

2017-01-25T00:00:00+00:00

Delegates are a fundamental part of the .NET runtime and whilst you rarely create them directly, they are there under-the-hood every time you use a lambda in LINQ (=>) or a Func<T>/Action<T> to make your code more functional. But how do they actually work and what’s going on in the CLR when you use them?

IL of delegates and/or lambdas

Let’s start with a small code sample like this:

public delegate string SimpleDelegate(int x);

class DelegateTest
{
    // field read by InstanceMethod(..) below
    string name;

    static int Main()
    {
        // create an instance of the class
        DelegateTest instance = new DelegateTest();
        instance.name = "My instance";

        // create a delegate
        SimpleDelegate d1 = new SimpleDelegate(instance.InstanceMethod);

        // call 'InstanceMethod' via the delegate (compiler turns this into 'd1.Invoke(5)')
        string result = d1(5); // returns "My instance: 5"

        return 0;
    }

    string InstanceMethod(int i)
    {
        return string.Format("{0}: {1}", name, i);
    }
}

If you were to take a look at the IL of the SimpleDelegate class, the ctor and Invoke methods look like so:

[MethodImpl(0, MethodCodeType=MethodCodeType.Runtime)]
public SimpleDelegate(object @object, IntPtr method);

[MethodImpl(0, MethodCodeType=MethodCodeType.Runtime)]
public virtual string Invoke(int x);

It turns out that this behaviour is mandated by the spec, from ECMA 335 Standard - Common Language Infrastructure (CLI):

So the internal implementation of a delegate, the part responsible for calling a method, is created by the runtime. This is because the runtime needs complete control over those methods: delegates are a fundamental part of the CLR, so any security issues, performance overhead or other inefficiencies would be a big problem.

Methods that are created in this way are technically known as EEImpl methods (i.e. implemented by the ‘Execution Engine’), from the ‘Book of the Runtime’ (BOTR) section ‘Method Descriptor - Kinds of MethodDescs’:

EEImpl Delegate methods whose implementation is provided by the runtime (Invoke, BeginInvoke, EndInvoke). See ECMA 335 Partition II - Delegates.

There’s also more information available in these two excellent articles .NET Type Internals - From a Microsoft CLR Perspective (section on ‘Delegates’) and Understanding .NET Delegates and Events, By Practice (section on ‘Internal Delegates Representation’)
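If you want to see the ‘runtime-provided’ flag for yourself, reflection will show it. The snippet below is only a minimal sketch (the RuntimeImplCheck class and IsRuntimeImplemented(..) helper are names I made up for illustration); it reads the implementation flags via MethodBase.GetMethodImplementationFlags() and checks whether the code type is MethodImplAttributes.Runtime:

```csharp
using System;
using System.Reflection;

public delegate string SimpleDelegate(int x);

public static class RuntimeImplCheck
{
    // True when the metadata says the method body is supplied by the runtime
    // (hypothetical helper, not part of any framework API)
    public static bool IsRuntimeImplemented(MethodBase method)
    {
        MethodImplAttributes flags = method.GetMethodImplementationFlags();
        return (flags & MethodImplAttributes.CodeTypeMask) == MethodImplAttributes.Runtime;
    }

    public static void Main()
    {
        MethodInfo invoke = typeof(SimpleDelegate).GetMethod("Invoke");
        ConstructorInfo ctor = typeof(SimpleDelegate).GetConstructors()[0];
        MethodInfo ordinary = typeof(object).GetMethod("ToString");

        Console.WriteLine("Invoke:   " + IsRuntimeImplemented(invoke));
        Console.WriteLine(".ctor:    " + IsRuntimeImplemented(ctor));
        // An ordinary method has an IL body, so it should not be flagged
        Console.WriteLine("ToString: " + IsRuntimeImplemented(ordinary));
    }
}
```

The delegate's Invoke method and constructor should report as runtime-implemented, whereas Object.ToString() should not, which matches the MethodCodeType.Runtime attribute shown in the decompiled signatures above.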

How the runtime creates delegates

Inlining of delegate ctors

So we’ve seen that the runtime has responsibility for creating the bodies of delegate methods, but how is this done? It starts by wiring up the delegate constructor (ctor), as per the BOTR page on ‘method descriptors’:

FCall Internal methods implemented in unmanaged code. These are methods marked with MethodImplAttribute(MethodImplOptions.InternalCall) attribute, delegate constructors and tlbimp constructors.

At runtime this happens when the JIT compiles a method that contains IL code for creating a delegate. In Compiler::fgOptimizeDelegateConstructor(..), the JIT firstly obtains a reference to the correct delegate ctor, which in the simple case is CtorOpened(Object target, IntPtr methodPtr, IntPtr shuffleThunk) (link to C# code), before finally wiring up the ctor, inlining it if possible for maximum performance.

Creation of the delegate Invoke() method

But what’s more interesting is the process that happens when creating the Invoke() method, using a technique involving ‘stubs’ of code (raw-assembly) that know how to locate the information about the target method and can jump control to it. These ‘stubs’ are actually used in a wide-variety of scenarios, for instance during Virtual Method Dispatch and also by the JITter (when a method is first called it hits a ‘pre-code stub’ that causes the method to be JITted, the ‘stub’ is then replaced by a call to the JITted ‘native code’).

In the particular case of delegates, these stubs are referred to as ‘shuffle thunks’. This is because part of the work they have to do is ‘shuffle’ the arguments that are passed into the Invoke() method, so that they are in the correct place (stack/register) by the time the ‘target’ method is called.

To understand what’s going on, it’s helpful to look at the following diagram taken from the BOTR page on Method Descriptors and Precode stubs. The ‘shuffle thunks’ we are discussing are a particular case of a ‘stub’ and sit in the corresponding box in the diagram:

How ‘shuffle thunks’ are set-up

So let’s look at the code flow for the delegate we created in the sample at the beginning of this post, specifically an ‘open’ delegate, calling an instance method (if you are wondering about the difference between open and closed delegates, have a read of ‘Open Delegates vs. Closed Delegates’).

We start off in the impImportCall() method, deep inside the .NET JIT, which is triggered when a ‘call’ op-code for a delegate is encountered. It then goes through the following functions:

Compiler::impImportCall(..)
Compiler::fgOptimizeDelegateConstructor(..)
COMDelegate::GetDelegateCtor(..)
COMDelegate::SetupShuffleThunk
StubCacheBase::Canonicalize(..)
ShuffleThunkCache::CompileStub()
EmitShuffleThunk (specific assembly code for different CPU architectures)
- arm
- arm64
- i386

Below is the code from the arm64 version (chosen because it’s the shortest one of the three!). You can see that it emits assembly code to fetch the real target address from MethodPtrAux, loops through the method arguments and puts them in the correct register (i.e. ‘shuffles’ them into place) and finally emits a tail-call jump to the target method associated with the delegate.

VOID StubLinkerCPU::EmitShuffleThunk(ShuffleEntry *pShuffleEntryArray)
{
  // On entry x0 holds the delegate instance. Look up the real target address stored in the MethodPtrAux
  // field and save it in x9. Tailcall to the target method after re-arranging the arguments
  // ldr x9, [x0, #offsetof(DelegateObject, _methodPtrAux)]
  EmitLoadStoreRegImm(eLOAD, IntReg(9), IntReg(0), DelegateObject::GetOffsetOfMethodPtrAux());
  //add x11, x0, DelegateObject::GetOffsetOfMethodPtrAux() - load the indirection cell into x11 used by ResolveWorkerAsmStub
  EmitAddImm(IntReg(11), IntReg(0), DelegateObject::GetOffsetOfMethodPtrAux());

  for (ShuffleEntry* pEntry = pShuffleEntryArray; pEntry->srcofs != ShuffleEntry::SENTINEL; pEntry++)
  {
    if (pEntry->srcofs & ShuffleEntry::REGMASK)
    {
      // If source is present in register then destination must also be a register
      _ASSERTE(pEntry->dstofs & ShuffleEntry::REGMASK);

      EmitMovReg(IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), IntReg(pEntry->srcofs & ShuffleEntry::OFSMASK));
    }
    else if (pEntry->dstofs & ShuffleEntry::REGMASK)
    {
      // source must be on the stack
      _ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));

      EmitLoadStoreRegImm(eLOAD, IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), RegSp, pEntry->srcofs * sizeof(void*));
    }
    else
    {
      // source must be on the stack
      _ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));

      // dest must be on the stack
      _ASSERTE(!(pEntry->dstofs & ShuffleEntry::REGMASK));

      EmitLoadStoreRegImm(eLOAD, IntReg(8), RegSp, pEntry->srcofs * sizeof(void*));
      EmitLoadStoreRegImm(eSTORE, IntReg(8), RegSp, pEntry->dstofs * sizeof(void*));
    }
  }

  // Tailcall to target
  // br x9
  EmitJumpRegister(IntReg(9));
}

Other functions that call SetupShuffleThunk(..)

The other places in code that also emit these ‘shuffle thunks’ are listed below. They are used in the various scenarios where a delegate is explicitly created, e.g. via Delegate.CreateDelegate(..).

COMDelegate::BindToMethod(..) - actual call to SetupShuffleThunk(..)
COMDelegate::DelegateConstruct(..) (ECall impl) - actual call to SetupShuffleThunk(..)
COMDelegate::GetDelegateCtor(..) - actual call to SetupShuffleThunk(..)

Different types of delegates

Now that we’ve looked at how one type of delegate works (#2 ‘Instance open non-virt’ in the table below), it will be helpful to see the other different types that the runtime deals with. From the very informative DELEGATE KINDS TABLE in the CLR source:

#	delegate type	_target	_methodPtr	_methodPtrAux
1	Instance closed	‘this’ ptr	target method	null
2	Instance open non-virt	delegate	shuffle thunk	target method
3	Instance open virtual	delegate	Virtual-stub dispatch	method id
4	Static closed	first arg	target method	null
5	Static closed (special sig)	delegate	specialSig thunk	target method
6	Static opened	delegate	shuffle thunk	target method
7	Secure	delegate	call thunk	MethodDesc (frame)

Note: The columns map to the internal fields of a delegate (from System.Delegate)

So we’ve (deliberately) looked at the simple case, but the more complex scenarios all work along similar lines, just using different and more stubs/thunks as needed e.g. ‘virtual-stub dispatch’ or ‘call thunk’.
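To make the first two rows of the table a bit more concrete, here is a small sketch that builds a closed and an open instance delegate over the same method using Delegate.CreateDelegate(..). The Greeter and OpenClosedDemo types are invented for this example; only the public Delegate APIs are real. It shows what you can observe from managed code, not how the thunks themselves are wired up:

```csharp
using System;
using System.Reflection;

public class Greeter
{
    public string Name;

    public string InstanceMethod(int i)
    {
        return string.Format("{0}: {1}", Name, i);
    }
}

public static class OpenClosedDemo
{
    static readonly MethodInfo Method = typeof(Greeter).GetMethod("InstanceMethod");

    // Closed: the 'this' pointer is captured inside the delegate
    public static Func<int, string> CreateClosed(Greeter instance)
    {
        return (Func<int, string>)Delegate.CreateDelegate(
            typeof(Func<int, string>), instance, Method);
    }

    // Open: 'this' is supplied as the first argument on every call
    public static Func<Greeter, int, string> CreateOpen()
    {
        return (Func<Greeter, int, string>)Delegate.CreateDelegate(
            typeof(Func<Greeter, int, string>), null, Method);
    }

    public static void Main()
    {
        var instance = new Greeter { Name = "My instance" };
        Func<int, string> closed = CreateClosed(instance);
        Func<Greeter, int, string> open = CreateOpen();

        Console.WriteLine(closed(5));
        Console.WriteLine(open(instance, 5));

        // The public Target property exposes the bound instance for a closed
        // delegate and null for an open one
        Console.WriteLine(ReferenceEquals(closed.Target, instance));
        Console.WriteLine(open.Target == null);
    }
}
```

Both calls end up in the same InstanceMethod(..); the difference is where the ‘this’ argument comes from, and moving arguments into the right place is exactly the job the ‘shuffle thunk’ does in the open case.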

Delegates are special!!

As well as being responsible for creating delegates, the runtime also treats delegates specially, to enforce security and/or type-safety. You can see how this is implemented in the links below:

In MethodTableBuilder.cpp:

In ClassCompat.cpp:

Discuss this post in /r/programming and /r/csharp

Analysing Pause times in the .NET GC

2017-01-13T00:00:00+00:00

Over the last few months there have been several blog posts looking at GC pauses in different programming languages or runtimes. It all started with a post looking at the latency of the Haskell GC, next came a follow-up that compared Haskell, OCaml and Racket, followed by Go GC in Theory and Practice, before a final post looking at the situation in Erlang.

After reading all these posts I wanted to see how the .NET GC compares to the other runtime implementations.

The posts above all use a similar test program to exercise the GC, based on the message-bus scenario that Pusher initially described. Fortunately Franck Jeannin had already started work on a .NET version, so this blog post will make use of that.

At the heart of the test is the following code:

for (var i = 0; i < msgCount; i++)
{
    var sw = Stopwatch.StartNew();
    pushMessage(array, i);
    sw.Stop();
    if (sw.Elapsed > worst)
    {
        worst = sw.Elapsed;
    }
}

private static unsafe void pushMessage(byte[][] array, int id)
{
    array[id % windowSize] = createMessage(id);               
}

The full code is available

So we are creating a ‘message’ (that is actually a byte[1024]) and then putting it into a data structure (byte[][]). This is repeated 10 million times (msgCount), but at any one time there are only 200,000 (windowSize) messages in memory, because we overwrite old ‘messages’ as we go along.

We are timing how long it takes to add the message to the array, which should be a very quick operation. It’s not guaranteed that this time will always equate to GC pauses, but it’s pretty likely. However we can also double check the actual GC pause times by using the excellent PerfView tool, to give us more confidence.
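For anyone who wants to try this without grabbing the full code, here is a cut-down, self-contained sketch of the same loop. It is not the actual benchmark: the CreateMessage(..) body is my own stand-in (the real createMessage(..) isn't shown above) and the PauseSketch class name is made up, but msgCount and windowSize have the same meaning as in the post:

```csharp
using System;
using System.Diagnostics;

public static class PauseSketch
{
    // Hypothetical stand-in for the post's createMessage(..): a 1 KB payload
    public static byte[] CreateMessage(int id)
    {
        var message = new byte[1024];
        for (int i = 0; i < message.Length; i++)
        {
            message[i] = (byte)id;
        }
        return message;
    }

    // Returns the longest time taken by any single 'push'
    public static TimeSpan Run(int msgCount, int windowSize)
    {
        var array = new byte[windowSize][];
        TimeSpan worst = TimeSpan.Zero;

        for (int i = 0; i < msgCount; i++)
        {
            var sw = Stopwatch.StartNew();
            array[i % windowSize] = CreateMessage(i); // overwrites the oldest message
            sw.Stop();
            if (sw.Elapsed > worst)
            {
                worst = sw.Elapsed;
            }
        }

        GC.KeepAlive(array);
        return worst;
    }

    public static void Main()
    {
        // Values from the post; reduce them for a quicker run
        TimeSpan worst = Run(10000000, 200000);
        Console.WriteLine("Worst push time: " + worst.TotalMilliseconds + " ms");
    }
}
```

Note that this only records the single worst time; the results shown later in the post track the whole distribution of times, which tells you a lot more than one maximum.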

Workstation GC vs. Server GC

Unlike the Java GC that is very configurable, the .NET GC really only gives you a few options:

Workstation
Server
Concurrent/Background

So we will be comparing the Server and Workstation modes, but as we want to reduce pauses we are going to always leave Concurrent/Background mode enabled.
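On the .NET Framework these modes are normally selected with the gcServer and gcConcurrent elements in the runtime section of the application's config file. Because it's easy to get that wrong, it can be worth asking the runtime which mode it actually ended up in. The sketch below (the GcModeInfo class is just an illustrative name) uses the System.Runtime.GCSettings class to do that:

```csharp
using System;
using System.Runtime;

public static class GcModeInfo
{
    // Summarises the GC flavour the current process is running with
    public static string Describe()
    {
        string mode = GCSettings.IsServerGC ? "Server" : "Workstation";
        return mode + " GC, latency mode: " + GCSettings.LatencyMode;
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

As far as I understand it, a latency mode of Interactive indicates that concurrent/background collection is enabled and Batch indicates that it is disabled, so this gives a quick sanity check before running any comparison.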

As outlined in the excellent post Understanding different GC modes with Concurrency Visualizer, the 2 modes are optimised for different things (emphasis mine):

Workstation GC is designed for desktop applications to minimize the time spent in GC. In this case GC will happen more frequently but with shorter pauses in application threads. Server GC is optimized for application throughput in favor of longer GC pauses. Memory consumption will be higher, but application can process greater volume of data without triggering garbage collection.

Therefore Workstation mode should give us shorter pauses than Server mode and the results bear this out. Below is a graph of the pause times at different percentiles, recorded by HdrHistogram.NET (click for full-size image):

Note that the X-axis scale is logarithmic. The Workstation (WKS) pauses start increasing at the 99.99%’ile, whereas the Server (SVR) pauses only start at the 99.9999%’ile, although they have a larger maximum.

Another way of looking at the results is the table below. Here we can clearly see that Workstation has a lot more GC pauses, although the max is smaller. But more significantly the total GC pause time is much higher and as a result the overall/elapsed time is roughly three times as long (WKS v. SVR).

Workstation GC (Concurrent) vs. Server GC (Background) (On .NET 4.6 - Array tests - all times in milliseconds)

GC Mode	Max GC Pause	# GC Pauses	Total GC Pause Time	Elapsed Time	Peak Working Set (MB)
Workstation - 1	28.0	1,797	10,266.2	21,688.3	550.37
Workstation - 2	23.2	1,796	9,756.6	21,018.2	543.50
Workstation - 3	19.3	1,800	9,676.0	21,114.6	531.24
Server - 1	104.6	7	646.4	7,062.2	2,086.39
Server - 2	107.2	7	664.8	7,096.6	2,092.65
Server - 3	106.2	6	558.4	7,023.6	2,058.12

Therefore if you only care about reducing the maximum pause time then Workstation mode is a suitable option, but you will experience more GC pauses overall and so the throughput of your application will be reduced. In addition, the working set is higher for Server mode as it allocates 1 heap per CPU.

Fortunately in .NET we have the choice of which mode we want to use. According to the fantastic article Modern garbage collection, the Go runtime has optimised for pause time only:

The reality is that Go’s GC does not really implement any new ideas or research. As their announcement admits, it is a straightforward concurrent mark/sweep collector based on ideas from the 1970s. It is notable only because it has been designed to optimise for pause times at the cost of absolutely every other desirable characteristic in a GC. Go’s tech talks and marketing materials don’t seem to mention any of these tradeoffs, leaving developers unfamiliar with garbage collection technologies to assume that no such tradeoffs exist, and by implication, that Go’s competitors are just badly engineered piles of junk.

Max GC Pause Time compared to Amount of Live Objects

To investigate things further, let’s look at how the maximum pause times vary with the number of live objects. If you refer back to the sample code, we will still be allocating 10,000,000 messages (msgCount), but we will vary the amount that are kept around at any one time by changing the windowSize value. Here are the results (click for full-size image):

So you can clearly see that the max pause time is proportional (linearly?) to the amount of live objects, i.e. the amount of objects that survive the GC. Why is this the case? Well, to get a bit more info we will again use PerfView to help us. If you compare the 2 tables below, you can see that the ‘Promoted MB’ is drastically different: a lot more memory is promoted when we have a larger windowSize, so the GC has more work to do and as a result the ‘Pause MSec’ times go up.

GC Events by Time - windowSize = 100,000
All times are in msec.
GC Index	Gen	Pause MSec	Gen0 Alloc MB	Peak MB	After MB	Promoted MB	Gen0 MB	Gen1 MB	Gen2 MB	LOH MB
2	1N	39.443	1,516.354	1,516.354	108.647	104.831	0.000	107.200	0.031	1.415
3	0N	38.516	1,651.466	0.000	215.847	104.800	0.000	214.400	0.031	1.415
4	1N	42.732	1,693.908	1,909.754	108.647	104.800	0.000	107.200	0.031	1.415
5	0N	35.067	1,701.012	1,809.658	215.847	104.800	0.000	214.400	0.031	1.415
6	1N	54.424	1,727.380	1,943.226	108.647	104.800	0.000	107.200	0.031	1.415
7	0N	35.208	1,603.832	1,712.479	215.847	104.800	0.000	214.400	0.031	1.415

Full PerfView output

GC Events by Time - windowSize = 400,000
All times are in msec.
GC Index	Gen	Pause MSec	Gen0 Alloc MB	Peak MB	After MB	Promoted MB	Gen0 MB	Gen1 MB	Gen2 MB	LOH MB
2	0N	10.319	76.170	76.170	76.133	68.983	0.000	72.318	0.000	3.815
3	1N	47.192	666.089	0.000	708.556	419.231	0.000	704.016	0.725	3.815
4	0N	145.347	1,023.369	1,731.925	868.610	419.200	0.000	864.070	0.725	3.815
5	1N	190.736	1,278.314	2,146.923	433.340	419.200	0.000	428.800	0.725	3.815
6	0N	150.689	1,235.161	1,668.501	862.140	419.200	0.000	857.600	0.725	3.815
7	1N	214.465	1,493.290	2,355.430	433.340	419.200	0.000	428.800	0.725	3.815
8	0N	148.816	1,055.470	1,488.810	862.140	419.200	0.000	857.600	0.725	3.815
9	1N	225.881	1,543.345	2,405.485	433.340	419.200	0.000	428.800	0.725	3.815
10	0N	148.292	1,077.176	1,510.516	862.140	419.200	0.000	857.600	0.725	3.815
11	1N	225.917	1,610.319	2,472.459	433.340	419.200	0.000	428.800	0.725	3.815

Full PerfView output
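As a quick sanity check on those ‘Promoted MB’ figures, we can divide the steady-state value from each table by the number of live messages. This is my own back-of-the-envelope arithmetic rather than anything PerfView reports, and it assumes that PerfView's ‘MB’ means 1,000,000 bytes; the PromotedBytes class is just a throwaway name:

```csharp
using System;

public static class PromotedBytes
{
    // 'Promoted MB' (assumed to be units of 1,000,000 bytes) per live message
    public static decimal BytesPerMessage(decimal promotedMb, int windowSize)
    {
        return promotedMb * 1000000m / windowSize;
    }

    public static void Main()
    {
        // Steady-state 'Promoted MB' values taken from the two tables above
        Console.WriteLine(BytesPerMessage(104.8m, 100000));
        Console.WriteLine(BytesPerMessage(419.2m, 400000));

        // Ratio of promoted memory between the two runs
        Console.WriteLine(419.2m / 104.8m);
    }
}
```

Under that assumption both runs work out at 1,048 bytes promoted per message, which would be consistent with the 1,024 byte payload plus a small per-array overhead (that interpretation is mine). It also shows that 4 times as many live messages means 4 times as much promoted memory, which fits with the pause times growing along with windowSize.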

Going ‘off-heap’

Finally, if we really want to eradicate GC pauses in .NET, we can go off-heap. To do that we can write unsafe code like this:

var dest = array[id % windowSize];
IntPtr unmanagedPointer = Marshal.AllocHGlobal(dest.Length);
byte* bytePtr = (byte *) unmanagedPointer;

// Get the raw data into the bytePtr (byte *) 
// in reality this would come from elsewhere, e.g. a network packet
// but for the test we'll just cheat and populate it in a loop
for (int i = 0; i < dest.Length; ++i)
{
    *(bytePtr + i) = (byte)id;
}

// Copy the unmanaged byte array (byte*) into the managed one (byte[])
Marshal.Copy(unmanagedPointer, dest, 0, dest.Length);

Marshal.FreeHGlobal(unmanagedPointer);

Note: I wouldn’t recommend this option unless you have first profiled and determined that GC pauses are a problem, it’s called unsafe for a reason.

But as the graph shows, it clearly works (the off-heap values are there, honest!!). But it’s not that surprising, we are giving the GC nothing to do (because off-heap memory isn’t tracked by the GC), we get no GC pauses!
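If you do go down this route, it's worth making sure the unmanaged memory is always released, even when something throws part-way through. Here is a minimal sketch of the same idea wrapped in try/finally. It swaps the raw byte* writes for Marshal.WriteByte(..) so that it compiles without the unsafe switch; the OffHeapSketch class and FillViaUnmanaged(..) method are names invented for this example:

```csharp
using System;
using System.Runtime.InteropServices;

public static class OffHeapSketch
{
    // Populates 'dest' by way of a temporary unmanaged buffer
    public static void FillViaUnmanaged(byte[] dest, int id)
    {
        IntPtr unmanagedPointer = Marshal.AllocHGlobal(dest.Length);
        try
        {
            // Stand-in for data arriving from elsewhere, e.g. a network packet
            for (int i = 0; i < dest.Length; ++i)
            {
                Marshal.WriteByte(unmanagedPointer, i, (byte)id);
            }

            // Copy the unmanaged bytes into the managed array
            Marshal.Copy(unmanagedPointer, dest, 0, dest.Length);
        }
        finally
        {
            // The GC knows nothing about this memory, so we must free it ourselves
            Marshal.FreeHGlobal(unmanagedPointer);
        }
    }

    public static void Main()
    {
        var message = new byte[1024];
        FillViaUnmanaged(message, 42);
        Console.WriteLine(message[0] + " ... " + message[message.Length - 1]);
    }
}
```

Going through Marshal.WriteByte(..) one byte at a time will be slower than the pointer version, so treat this as an illustration of the clean-up pattern rather than something to benchmark.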

To finish let’s get a final word from Maoni Stephens, the main GC dev on the .NET runtime, from GC ETW events – 2 – Maoni’s WebLog:

It doesn’t even mean for the longest individual GC pauses you should always look at full GCs because full GCs can be done concurrently, which means you could have gen2 GCs whose pauses are shorter than ephemeral GCs. And even if full GCs did have longest individual pauses, it still doesn’t necessarily mean you should only look at them because you might be doing these GCs very infrequently, and ephemeral GCs actually contribute to most of the GC pause time if the total GC pauses are your problem.

Note: Ephemeral generations and segments - Because objects in generations 0 and 1 are short-lived, these generations are known as the ephemeral generations.

So if GC pause times are a genuine issue in your application, make sure you analyse them correctly!

Discuss this post in /r/csharp, /r/programming and Hacker News

The post Analysing Pause times in the .NET GC first appeared on my blog Performance is a Feature!

Why Exceptions should be Exceptional

2016-12-20T00:00:00+00:00

According to the NASA ‘Near Earth Object Program’ asteroid ‘101955 Bennu (1999 RQ36)’ has a Cumulative Impact Probability of 3.7e-04, i.e. there is a 1 in 2,700 (0.0370%) chance of Earth impact, but more reassuringly there is a 99.9630% chance the asteroid will miss the Earth completely!

But how does this relate to exceptions in the .NET runtime? Well, let’s take a look at the official .NET Framework Design Guidelines for Throwing Exceptions (which are based on the excellent book Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries)

So exceptions should be exceptional, unusual or rare, much like an asteroid strike!!

.NET Framework TryXXX() Pattern

In .NET, the recommended way to avoid exceptions in normal code flow is to use the TryXXX() pattern. As pointed out in the guideline section on Exceptions and Performance, rather than writing code like this, which has to catch the exception when the input string isn’t a valid integer:

try
{
    int result = int.Parse("IANAN");
    Console.WriteLine(result);
}
catch (FormatException fEx)
{
    Console.WriteLine(fEx);
}

You should instead use the TryXXX API, in the following pattern:

int result;
if (int.TryParse("IANAN", out result))
{
    // SUCCESS!!
    Console.WriteLine(result);
}
else
{
    // FAIL!!
}

Fortunately large parts of the .NET runtime use this pattern for non-exceptional events, such as parsing a string, creating a URL or adding an item to a Concurrent Dictionary.
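To show what the other two cases look like in practice, here is a short sketch using Uri.TryCreate(..) and ConcurrentDictionary.TryAdd(..). The TryPatternExamples class and IsAbsoluteUrl(..) helper are made-up names; the framework calls are the real ones:

```csharp
using System;
using System.Collections.Concurrent;

public static class TryPatternExamples
{
    // Creating a URL: no exception is thrown for malformed input
    public static bool IsAbsoluteUrl(string text)
    {
        Uri uri;
        return Uri.TryCreate(text, UriKind.Absolute, out uri);
    }

    public static void Main()
    {
        Console.WriteLine(IsAbsoluteUrl("http://example.com/"));
        Console.WriteLine(IsAbsoluteUrl("not a url"));

        // Adding to a ConcurrentDictionary: a duplicate key is reported
        // via the return value rather than via an exception
        var cache = new ConcurrentDictionary<string, int>();
        Console.WriteLine(cache.TryAdd("answer", 42));
        Console.WriteLine(cache.TryAdd("answer", 43));

        int value;
        if (cache.TryGetValue("answer", out value))
        {
            Console.WriteLine(value);
        }
    }
}
```

In each case the ‘failure’ is an expected outcome that the caller deals with via a bool, so no exception machinery needs to get involved.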

The performance costs of exceptions

So onto the performance costs, I was inspired to write this post after reading this tweet from Clemens Vasters:

I also copied/borrowed a large amount of ideas from the excellent post ‘The Exceptional Performance of Lil’ Exception’ by Java performance guru Aleksey Shipilëv (this post is in essence the .NET version of his post, which focuses exclusively on exceptions in the JVM)

So let’s start with the full results (click for full-size image):

(Full Benchmark Code and Results)

Rare exceptions v Error Code Handling

Up front I want to be clear that nothing in this post is meant to contradict the best-practices outlined in the .NET Framework Guidelines (above), in fact I hope that it actually backs them up!

Method	Mean	StdErr	StdDev	Scaled
ErrorCodeWithReturnValue	1.4472 ns	0.0088 ns	0.0341 ns	1.00
RareExceptionStackTrace	22.0401 ns	0.0292 ns	0.1132 ns	15.24
RareExceptionMediumStackTrace	61.8835 ns	0.0609 ns	0.2279 ns	42.78
RareExceptionDeepStackTrace	115.3692 ns	0.1795 ns	0.6953 ns	79.76

Here we can see that as long as you follow the guidance and ‘DO NOT use exceptions for the normal flow of control’ then they are actually not that costly. I mean yes, they’re 15 times slower than using error codes, but we’re only talking about 22 nanoseconds, i.e. 22 billionths of a second, you have to be throwing exceptions frequently for it to be noticeable. For reference, here’s what the code for the first 2 results looks like:

public struct ResultAndErrorCode<T>
{
    public T Result;
    public int ErrorCode;
}

[Benchmark(Baseline = true)]
public ResultAndErrorCode<string> ErrorCodeWithReturnValue()
{
    var result = new ResultAndErrorCode<string>();
    result.Result = null;
    result.ErrorCode = 5;
    return result;
}

[Benchmark]
public string RareExceptionStackTrace()
{
    try
    {
        RareLevel20(); // start all the way down
        return null; //Prevent Error CS0161: not all code paths return a value

    }
    catch (InvalidOperationException ioex)
    {
        // Force collection of a full StackTrace
        return ioex.StackTrace;
    }
}

Where the ‘RareLevelXX()’ functions look like this (i.e. will only trigger an exception once for every 2,700 times it’s called):

[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel1() { RareLevel2(); }
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel2() { RareLevel3(); }
... // several layers left out!!
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel19() { RareLevel20(); }
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel20()
{
    counter++;
    // will *rarely* happen (1 in 2700)
    if (counter % chanceOfAsteroidHit == 1) 
        throw new InvalidOperationException("Deep Stack Trace - Rarely triggered");            
}

Therefore RareExceptionMediumStackTrace() just calls RareLevel10() to get a medium stack trace and RareExceptionDeepStackTrace() calls RareLevel1() which triggers the full/deep one (the full benchmark code is available).

Stack traces

Now that we’ve seen the cost of throwing exceptions rarely, we’re going to look at the effect the stack trace depth has on performance. Here are the full, raw results:

Method	Mean	StdErr	StdDev	Gen 0	Allocated
Exception-Message	9,187.9417 ns	13.4824 ns	48.6117 ns	-	148 B
Exception-TryCatch	9,253.0215 ns	13.2496 ns	51.3154 ns	-	148 B
ExceptionMedium-Message	14,911.7999 ns	20.2448 ns	78.4078 ns	-	916 B
ExceptionMedium-TryCatch	15,158.0940 ns	147.4210 ns	737.1049 ns	-	916 B
ExceptionDeep-Message	19,166.3524 ns	30.0539 ns	116.3984 ns	-	916 B
ExceptionDeep-TryCatch	19,581.6743 ns	208.3895 ns	833.5579 ns	-	916 B
CachedException-StackTrace	29,354.9344 ns	34.8932 ns	135.1407 ns	-	1.82 kB
Exception-StackTrace	30,178.7152 ns	41.0362 ns	158.9327 ns	-	1.93 kB
ExceptionMedium-StackTrace	100,121.7951 ns	129.0631 ns	499.8591 ns	0.1953	15.71 kB
ExceptionDeep-StackTrace	154,569.3454 ns	205.2174 ns	794.8034 ns	3.6133	27.42 kB

Note: in these tests we are triggering an exception every time a method is called; they aren’t the rare cases that we measured previously.

Exception handling without collecting the full StackTrace

First we are going to look at the results measuring the scenario where we don’t explicitly collect the StackTrace after the exception is caught, so the benchmark code looks like this:

[Benchmark]
public string ExceptionMessage()
{
    try
    {
        Level20(); // start *all* the way down the stack
        return null; //Prevent Error CS0161: not all code paths return a value
    }
    catch (InvalidOperationException ioex)
    {
        // Only get the simple message from the Exception 
        // (don't trigger a StackTrace collection)
        return ioex.Message;
    }
}

In the following graphs, shallow stack traces are in blue bars, medium in orange and deep stacks are shown in green

So we clearly see there is an extra cost for exception handling that increases the deeper the stack trace goes. This is because when an exception is thrown the runtime needs to search up the stack until it hits a method that can handle it. The further it has to look up the stack, the more work it has to do.

Exception handling including collection of the full StackTrace

Now for the final results, in which we explicitly ask the run-time to (lazily) fetch the full stack trace, by accessing the StackTrace property. The code looks like this:

[Benchmark]
public string ExceptionStackTrace()
{
    try
    {
        Level20(); // start *all* the way down the stack
        return null; //Prevent Error CS0161: not all code paths return a value
    }
    catch (InvalidOperationException ioex)
    {
        // Force collection of a full StackTrace
        return ioex.StackTrace;
    }
}

Finally we see that fetching the entire stack trace (via StackTrace) dominates the performance of just handling the exception (i.e. only accessing the exception message). But again, the deeper the stack trace, the higher the cost.
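To get a feel for how much extra data a deeper stack produces, the sketch below counts the frames recorded in an exception thrown from different depths. It isn't part of the benchmark code: the StackDepthSketch class and its methods are invented for illustration and it uses a single recursive method rather than the separate LevelXX() methods:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class StackDepthSketch
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int ThrowAtDepth(int depth)
    {
        if (depth == 0)
            throw new InvalidOperationException("Thrown at the bottom of the stack");

        // The '1 +' means this isn't a tail call, so every level keeps its own frame
        return 1 + ThrowAtDepth(depth - 1);
    }

    // Number of frames recorded between the throw and the catch
    public static int FramesCaptured(int depth)
    {
        try
        {
            ThrowAtDepth(depth);
            return 0; // never reached, ThrowAtDepth(..) always throws
        }
        catch (InvalidOperationException ioex)
        {
            return new StackTrace(ioex).FrameCount;
        }
    }

    public static void Main()
    {
        Console.WriteLine("shallow: " + FramesCaptured(1));
        Console.WriteLine("deep:    " + FramesCaptured(20));
    }
}
```

Every one of those frames has to be walked, recorded and (if you ask for the StackTrace string) formatted, which goes some way to explaining why the ‘Allocated’ column grows so much for the deeper stacks.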

So thank goodness we’re in the .NET world, where huge stack traces are rare. Over in Java-land they have to deal with nonsense like this (click to see the full-res version!!):

Conclusion

Rare or Exceptional exceptions are not hugely expensive and they should always be the preferred way of error handling in .NET
If you have code that is expected to fail often (such as parsing a string into an integer), use the TryXXX() pattern
The deeper the stack trace, the more work that has to be done, so the more overhead there is when catching/handling exceptions
This is even more true if you are also fetching the entire stack trace, via the StackTrace property. So if you don’t need it, don’t fetch it.

Discuss this post in /r/programming and /r/csharp

The stack trace of a StackTrace!!

The full call-stack that the CLR goes through when fetching the data for the Exception StackTrace property

The post Why Exceptions should be Exceptional first appeared on my blog Performance is a Feature!

Why is reflection slow?

2016-12-14T00:00:00+00:00

It’s common knowledge that reflection in .NET is slow, but why is that the case? This post aims to figure that out by looking at what reflection does under-the-hood.

CLR Type System Design Goals

But first it’s worth pointing out that part of the reason reflection isn’t fast is that it was never designed to have high-performance as one of its goals, from Type System Overview - ‘Design Goals and Non-goals’:

Goals

Accessing information needed at runtime from executing (non-reflection) code is very fast.

Accessing information needed at compilation time for generating code is straightforward.

The garbage collector/stackwalker is able to access necessary information without taking locks, or allocating memory.

Minimal amounts of types are loaded at a time.

Minimal amounts of a given type are loaded at type load time.

Type system data structures must be storable in NGEN images.

Non-Goals

All information in the metadata is directly reflected in the CLR data structures.

All uses of reflection are fast.

and along the same lines, from Type Loader Design - ‘Key Data Structures’:

EEClass

MethodTable data are split into “hot” and “cold” structures to improve working set and cache utilization. MethodTable itself is meant to only store “hot” data that are needed in program steady state. EEClass stores “cold” data that are typically only needed by type loading, JITing or reflection. Each MethodTable points to one EEClass.

How does Reflection work?

So we know that ensuring reflection was fast was not a design goal, but what is it doing that takes the extra time?

Well, there are several things happening. To illustrate this, let’s look at the managed and unmanaged code call-stack that a reflection call goes through.

System.Reflection.RuntimeMethodInfo.Invoke(..) - source code link
- calling System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(..)
System.RuntimeMethodHandle.PerformSecurityCheck(..) - link
- calling System.GC.KeepAlive(..)
System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(..) - link
- calling stub for System.RuntimeMethodHandle.InvokeMethod(..)
stub for System.RuntimeMethodHandle.InvokeMethod(..) - link

Even if you don’t click the links and look at the individual C#/cpp methods, you can intuitively tell that there’s a lot of code being executed along the way. But to give you an example, the final method, where the bulk of the work is done, System.RuntimeMethodHandle.InvokeMethod is over 400 LOC!

That is a nice overview, but what is it specifically doing?

Fetching the Method information

Before you can invoke a field/property/method via reflection you have to get the FieldInfo/PropertyInfo/MethodInfo handle for it, using code like this:

Type t = typeof(Person);      
FieldInfo m = t.GetField("Name");

As shown in the previous section there’s a cost to this, because the relevant meta-data has to be fetched, parsed, etc. Interestingly enough the runtime helps us by keeping an internal cache of all the fields/properties/methods. This cache is implemented by the RuntimeTypeCache class and one example of its usage is in the RuntimeMethodInfo class.

You can see the cache in action by running the code in this gist, which appropriately enough uses reflection to inspect the runtime internals!

Before you have done any reflection to obtain a FieldInfo, the code in the gist will print this:

  Type: ReflectionOverhead.Program
  Reflection Type: System.RuntimeType (BaseType: System.Reflection.TypeInfo)
  m_fieldInfoCache is null, cache has not been initialised yet

But once you’ve fetched even just one field, then the following will be printed:

  Type: ReflectionOverhead.Program
  Reflection Type: System.RuntimeType (BaseType: System.Reflection.TypeInfo)
  RuntimeTypeCache: System.RuntimeType+RuntimeTypeCache, 
  m_cacheComplete = True, 4 items in cache
    [0] - Int32 TestField1 - Private
    [1] - System.String TestField2 - Private
    [2] - Int32 <TestProperty1>k__BackingField - Private
    [3] - System.String TestField3 - Private, Static

where ReflectionOverhead.Program looks like this:

class Program
{
    private int TestField1;
    private string TestField2;
    private static string TestField3;

    private int TestProperty1 { get; set; }
}

This means that repeated calls to GetField or GetFields are cheaper as the runtime only has to filter the pre-existing list that’s already been created. The same applies to GetMethod and GetProperty, when you call them the first time the MethodInfo or PropertyInfo cache is built.

Argument Validation and Error Handling

But once you’ve obtained the MethodInfo, there’s still a lot of work to be done when you call Invoke on it. Imagine you wrote some code like this:

PropertyInfo stringLengthField = 
    typeof(string).GetProperty("Length", 
        BindingFlags.Instance | BindingFlags.Public);
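// Uri has no parameterless constructor; any non-string target shows the problem
var length = stringLengthField.GetGetMethod().Invoke(new Uri("http://example.com"), new object[0]);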

If you run it you would get the following exception:

System.Reflection.TargetException: Object does not match target type.
   at System.Reflection.RuntimeMethodInfo.CheckConsistency(..)
   at System.Reflection.RuntimeMethodInfo.InvokeArgumentsCheck(..)
   at System.Reflection.RuntimeMethodInfo.Invoke(..)
   at System.Reflection.RuntimePropertyInfo.GetValue(..)

This is because we have obtained the PropertyInfo for the Length property on the String class, but invoked it with a Uri object, which is clearly the wrong type!

In addition to this, there also has to be validation of any arguments you pass through to the method you are invoking. To make argument passing work, reflection APIs take a parameter that is an array of objects, one per argument. So if you are using reflection to call the method Add(int x, int y), you would invoke it by calling methodInfo.Invoke(.., new object[] { 5, 6 }). At run-time, checks need to be carried out on the number and types of the values passed in, in this case to ensure that there are 2 and that they are both ints. One down-side of all this work is that it often involves boxing, which has an additional cost, but hopefully this will be minimised in the future.
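
The checks described above are easy to observe from the outside. The following is a small sketch, assuming a hypothetical Calculator.Add(int, int) method; the Calculator and InvokeValidationExample names are invented for this illustration. It shows the exceptions that MethodInfo.Invoke raises when the argument count or an argument type is wrong.

```csharp
using System;
using System.Reflection;

public static class Calculator
{
    public static int Add(int x, int y) => x + y;
}

public static class InvokeValidationExample
{
    public static string Describe(object[] args)
    {
        MethodInfo add = typeof(Calculator).GetMethod("Add");
        try
        {
            // Add is static, so the target instance is null.
            // The ints in 'args' are boxed because the array holds objects.
            object result = add.Invoke(null, args);
            return "ok: " + result;
        }
        catch (TargetParameterCountException)
        {
            return "wrong number of arguments";
        }
        catch (ArgumentException)
        {
            return "wrong argument type";
        }
    }

    public static void Main()
    {
        Console.WriteLine(Describe(new object[] { 5, 6 }));     // ok: 11
        Console.WriteLine(Describe(new object[] { 5 }));        // wrong number of arguments
        Console.WriteLine(Describe(new object[] { 5, "six" })); // wrong argument type
    }
}
```

None of these failures can be caught by the compiler, which is why the runtime has to repeat the validation on every call.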

Security Checks

The other main task that is happening along the way is multiple security checks. For instance, it turns out that you aren’t allowed to use reflection to call just any method you feel like. There are some restricted or ‘Dangerous Methods’, that can only be called by trusted .NET framework code. In addition to a black-list, there are also dynamic security checks depending on the current Code Access Security permissions that have to be checked during invocation.

How much does Reflection cost?

So now that we know what reflection is doing behind-the-scenes, it’s a good time to look at what it costs us. Please note that these benchmarks compare reading/writing a property directly versus via reflection. In .NET, properties are actually a pair of Get/Set methods that the compiler generates for us, however when the property has just a simple backing field the .NET JIT inlines the method call for performance reasons. This means that using reflection to access a property will show reflection in the worst possible light, but it was chosen as it’s the most common use-case, showing up in ORMs, Json serialisation/deserialisation libraries and object mapping tools.

Below are the raw results as they are displayed by BenchmarkDotNet, followed by the same results displayed in 2 separate tables. (full Benchmark code is available)

Reading a Property (‘Get’)

Method	Mean	StdErr	Scaled	Bytes Allocated/Op
GetViaProperty	0.2159 ns	0.0047 ns	1.00	0.00
GetViaDelegate	1.8903 ns	0.0082 ns	8.82	0.00
GetViaILEmit	2.9236 ns	0.0067 ns	13.64	0.00
GetViaCompiledExpressionTrees	12.3623 ns	0.0200 ns	57.65	0.00
GetViaFastMember	35.9199 ns	0.0528 ns	167.52	0.00
GetViaReflectionWithCaching	125.3878 ns	0.2017 ns	584.78	0.00
GetViaReflection	197.9258 ns	0.2704 ns	923.08	0.01
GetViaDelegateDynamicInvoke	842.9131 ns	1.2649 ns	3,931.17	419.04

Writing a Property (‘Set’)

Method	Mean	StdErr	Scaled	Bytes Allocated/Op
SetViaProperty	1.4043 ns	0.0200 ns	6.55	0.00
SetViaDelegate	2.8215 ns	0.0078 ns	13.16	0.00
SetViaILEmit	2.8226 ns	0.0061 ns	13.16	0.00
SetViaCompiledExpressionTrees	10.7329 ns	0.0221 ns	50.06	0.00
SetViaFastMember	36.6210 ns	0.0393 ns	170.79	0.00
SetViaReflectionWithCaching	214.4321 ns	0.3122 ns	1,000.07	98.49
SetViaReflection	287.1039 ns	0.3288 ns	1,338.99	115.63
SetViaDelegateDynamicInvoke	922.4618 ns	2.9192 ns	4,302.17	390.99

So we can clearly see that regular reflection code (GetViaReflection and SetViaReflection) is considerably slower than accessing the property directly (GetViaProperty and SetViaProperty). But what about the other results? Let’s explore those in more detail.

Setup

First we start with a TestClass that looks like this:

public class TestClass
{
    public TestClass(String data)
    {
        Data = data;
    }

    private string data;
    private string Data
    {
        get { return data; }
        set { data = value; }
    }
}

and the following common code, that all the options can make use of:

// Setup code, done only once 
TestClass testClass = new TestClass("A String");
Type @class = testClass.GetType();
BindingFlags bindingFlags = BindingFlags.Instance | 
                           BindingFlags.NonPublic | 
                           BindingFlags.Public;

Regular Reflection

First we use regular reflection code, which acts as our starting point and the ‘worst case’:

[Benchmark]
public string GetViaReflection()
{
    PropertyInfo property = @class.GetProperty("Data", bindingFlags);
    return (string)property.GetValue(testClass, null);
}

Option 1 - Cache PropertyInfo

Next up, we can gain a small speed boost by keeping a reference to the PropertyInfo, rather than fetching it each time. But we’re still much slower than accessing the property directly, which demonstrates that there is a considerable cost in the ‘invocation’ part of reflection.

// Setup code, done only once
PropertyInfo cachedPropertyInfo = @class.GetProperty("Data", bindingFlags);

[Benchmark]
public string GetViaReflectionWithCaching()
{    
    return (string)cachedPropertyInfo.GetValue(testClass, null);
}

Option 2 - Use FastMember

Here we make use of Marc Gravell’s excellent Fast Member library, which as you can see is very simple to use!

// Setup code, done only once
TypeAccessor accessor = TypeAccessor.Create(@class, allowNonPublicAccessors: true);

[Benchmark]
public string GetViaFastMember()
{
    return (string)accessor[testClass, "Data"];
}

Note that it’s doing something slightly different to the other options. It creates a TypeAccessor that allows access to all the Properties on a type, not just one. But the downside is that, as a result, it takes longer to run. This is because internally it first has to get the delegate for the Property you requested (in this case ‘Data’), before fetching its value. However this overhead is pretty small, FastMember is still way faster than Reflection and it’s very easy to use, so I recommend you take a look at it first.

This option and all subsequent ones convert the reflection code into a delegate that can be directly invoked without the overhead of reflection every time, hence the speed boost!

Although it’s worth pointing out that the creation of a delegate has a cost (see ‘Further Reading’ for more info). So in short, the speed boost is because we are doing the expensive work once (security checks, etc) and storing a strongly typed delegate that we can use again and again with little overhead. You wouldn’t use these techniques if you were doing reflection once, but if you’re only doing it once it wouldn’t be a performance bottleneck, so you wouldn’t care if it was slow!

The reason that reading a property via a delegate isn’t as fast as reading it directly is because the .NET JIT won’t inline a delegate method call like it will do with a Property access. So with a delegate we have to pay the cost of a method call, which direct access doesn’t.

Option 3 - Create a Delegate

In this option we use the CreateDelegate function to turn our PropertyInfo into a regular delegate:

// Setup code, done only once
PropertyInfo property = @class.GetProperty("Data", bindingFlags);
Func<TestClass, string> getDelegate = 
    (Func<TestClass, string>)Delegate.CreateDelegate(
             typeof(Func<TestClass, string>), 
             property.GetGetMethod(nonPublic: true));

[Benchmark]
public string GetViaDelegate()
{
    return getDelegate(testClass);
}

The drawback is that you need to know the concrete type at compile-time, i.e. the Func<TestClass, string> part in the code above (no, you can’t use Func<object, string>; if you do it’ll throw an exception!). In the majority of situations when you are doing reflection you don’t have this luxury, otherwise you wouldn’t be using reflection in the first place, so it’s not a complete solution.
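
As a rough sketch of both points, the code below builds a setter delegate in the same way as the getter and then checks what happens when the delegate type uses object instead of the concrete type. The DelegateExample class and its method names are hypothetical, and the TestClass here is a shortened stand-in for the one in the ‘Setup’ section (it uses an auto-property).

```csharp
using System;
using System.Reflection;

public class TestClass
{
    public TestClass(string data) { Data = data; }
    private string Data { get; set; }
}

public static class DelegateExample
{
    private const BindingFlags Flags =
        BindingFlags.Instance | BindingFlags.NonPublic | BindingFlags.Public;

    private static PropertyInfo DataProperty =>
        typeof(TestClass).GetProperty("Data", Flags);

    // Open-instance delegate: the first parameter is the 'this' argument.
    public static Func<TestClass, string> CreateGetter()
    {
        return (Func<TestClass, string>)Delegate.CreateDelegate(
            typeof(Func<TestClass, string>),
            DataProperty.GetGetMethod(nonPublic: true));
    }

    // The 'Set' equivalent, built from the property's set method.
    public static Action<TestClass, string> CreateSetter()
    {
        return (Action<TestClass, string>)Delegate.CreateDelegate(
            typeof(Action<TestClass, string>),
            DataProperty.GetSetMethod(nonPublic: true));
    }

    // Returns false because 'object' cannot stand in for 'TestClass' here.
    public static bool CanBindWithObjectParameter()
    {
        try
        {
            Delegate.CreateDelegate(
                typeof(Func<object, string>),
                DataProperty.GetGetMethod(nonPublic: true));
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        var instance = new TestClass("A String");
        Func<TestClass, string> getter = CreateGetter();
        Action<TestClass, string> setter = CreateSetter();

        setter(instance, "Changed");
        Console.WriteLine(getter(instance));              // Changed
        Console.WriteLine(CanBindWithObjectParameter());  // False
    }
}
```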

For a very interesting/mind-bending way to get round this, see the MagicMethodHelper code in the fantastic blog post from Jon Skeet ‘Making Reflection fly and exploring delegates’ or read on for Options 4 or 5 below.

Option 4 - Compiled Expression Trees

Here we generate a delegate, but the difference is that we can pass in an object, so we get round the limitation of ‘Option 3’. We make use of the .NET Expression tree API that allows dynamic code generation:

// Setup code, done only once
PropertyInfo property = @class.GetProperty("Data", bindingFlags);
ParameterExpression instance = Expression.Parameter(typeof(object), "instance");
UnaryExpression instanceCast = 
    !property.DeclaringType.IsValueType ? 
        Expression.TypeAs(instance, property.DeclaringType) : 
        Expression.Convert(instance, property.DeclaringType);
Func<object, object> GetDelegate = 
    Expression.Lambda<Func<object, object>>(
        Expression.TypeAs(
            Expression.Call(instanceCast, property.GetGetMethod(nonPublic: true)),
            typeof(object)), 
        instance)
    .Compile();

[Benchmark]
public string GetViaCompiledExpressionTrees()
{
    return (string)GetDelegate(testClass);
}

Full code for the Expression based approach is available in the blog post Faster Reflection using Expression Trees
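
The same technique works for the ‘Set’ side. What follows is a minimal sketch rather than the code from the benchmarks: ExpressionSetterExample, CreateSetter and the Sample class are names invented here, and the sketch assumes the declaring type is a reference type (for a struct, the setter would modify an unboxed copy).

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public class Sample
{
    private string Data { get; set; }

    // Public helper so the effect of the generated setter can be observed.
    public string Read() => Data;
}

public static class ExpressionSetterExample
{
    public static Action<object, object> CreateSetter(PropertyInfo property)
    {
        ParameterExpression instance = Expression.Parameter(typeof(object), "instance");
        ParameterExpression value = Expression.Parameter(typeof(object), "value");

        // ((DeclaringType)instance).set_Property((PropertyType)value)
        MethodCallExpression call = Expression.Call(
            Expression.Convert(instance, property.DeclaringType),
            property.GetSetMethod(nonPublic: true),
            Expression.Convert(value, property.PropertyType));

        return Expression
            .Lambda<Action<object, object>>(call, instance, value)
            .Compile();
    }

    public static void Main()
    {
        PropertyInfo property = typeof(Sample).GetProperty(
            "Data", BindingFlags.Instance | BindingFlags.NonPublic);
        Action<object, object> setter = CreateSetter(property);

        var sample = new Sample();
        setter(sample, "Set via expression tree");
        Console.WriteLine(sample.Read()); // Set via expression tree
    }
}
```

As with the getter, the Compile() call is the expensive step, so the resulting delegate should be created once and reused.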

Option 5 - Dynamic code-gen with IL Emit

Finally we come to the lowest-level approach, emitting raw IL, although ‘with great power, comes great responsibility’:

// Setup code, done only once
PropertyInfo property = @class.GetProperty("Data", bindingFlags);
Emit<Func<object, string>> getterEmiter = Emit<Func<object, string>>
    .NewDynamicMethod("GetTestClassDataProperty")
    .LoadArgument(0)
    .CastClass(@class)
    .Call(property.GetGetMethod(nonPublic: true))
    .Return();
Func<object, string> getter = getterEmiter.CreateDelegate();

[Benchmark]
public string GetViaILEmit()
{
    return getter(testClass);
}

Using Expression trees (as shown in Option 4) doesn’t give you as much flexibility as emitting IL codes directly, although it does prevent you from emitting invalid code! Because of this, if you ever find yourself needing to emit IL I really recommend using the excellent Sigil library, as it gives better error messages when you get things wrong!
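
For comparison, this is roughly what the same getter looks like without Sigil, using System.Reflection.Emit.DynamicMethod directly. It is a sketch and not the benchmarked code: RawEmitExample is a made-up name, the TestClass is a shortened stand-in for the one in ‘Setup’, and no validation is done on the emitted IL, so a mistake here only surfaces when the delegate is created or run.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public class TestClass
{
    public TestClass(string data) { Data = data; }
    private string Data { get; set; }
}

public static class RawEmitExample
{
    public static Func<object, string> CreateGetter()
    {
        PropertyInfo property = typeof(TestClass).GetProperty(
            "Data", BindingFlags.Instance | BindingFlags.NonPublic);
        MethodInfo getMethod = property.GetGetMethod(nonPublic: true);

        // Associate the method with TestClass and skip visibility checks,
        // so the private getter can be called.
        var method = new DynamicMethod(
            "GetTestClassDataProperty",
            typeof(string),
            new[] { typeof(object) },
            typeof(TestClass),
            skipVisibility: true);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);                       // load the object argument
        il.Emit(OpCodes.Castclass, typeof(TestClass));  // cast it to TestClass
        il.Emit(OpCodes.Callvirt, getMethod);           // call get_Data (with null check)
        il.Emit(OpCodes.Ret);                           // return the string

        return (Func<object, string>)method.CreateDelegate(typeof(Func<object, string>));
    }

    public static void Main()
    {
        Func<object, string> getter = CreateGetter();
        Console.WriteLine(getter(new TestClass("A String"))); // A String
    }
}
```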

Conclusion

The take-away is that if (and only if) you find yourself with a performance issue when using reflection, there are several different ways you can make it faster. These speed gains are achieved by getting a delegate that allows you to access the Property/Field/Method directly, without all the overhead of going via reflection every time.

Discuss this post in /r/programming and /r/csharp

Research papers in the .NET source

2016-12-12T00:00:00+00:00

This post is completely inspired by (or ‘copied from’ depending on your point of view) a recent post titled JAVA PAPERS (also see the HackerNews discussion). However, instead of looking at Java and the JVM, I’ll be looking at references to research papers in the .NET language, runtime and compiler source code.

If I’ve missed any that you know of, please leave a comment below!

Note: I’ve deliberately left out links to specifications, standards documents or RFC’s, instead concentrating only on Research Papers.

‘Left Leaning Red Black trees’ by Robert Sedgewick - CoreCLR source reference

Abstract The red-black tree model for implementing balanced search trees, introduced by Guibas and Sedgewick thirty years ago, is now found throughout our computational infrastructure. Red-black trees are described in standard textbooks and are the underlying data structure for symbol-table implementations within C++, Java, Python, BSD Unix, and many other modern systems. However, many of these implementations have sacrificed some of the original design goals (primarily in order to develop an effective implementation of the delete operation, which was incompletely specified in the original paper), so a new look is worthwhile. In this paper, we describe a new variant of red-black trees that meets many of the original design goals and leads to substantially simpler code for insert/delete, less than one-fourth as much code as in implementations in common use.

‘Hopscotch Hashing’ by Maurice Herlihy, Nir Shavit, and Moran Tzafrir - CoreCLR source reference

Abstract We present a new class of resizable sequential and concurrent hash map algorithms directed at both uni-processor and multicore machines. The new hopscotch algorithms are based on a novel hopscotch multi-phased probing and displacement technique that has the flavors of chaining, cuckoo hashing, and linear probing, all put together, yet avoids the limitations and overheads of these former approaches. The resulting algorithms provide tables with very low synchronization overheads and high cache hit ratios. In a series of benchmarks on a state-of-the-art 64-way Niagara II multi-core machine, a concurrent version of hopscotch proves to be highly scalable, delivering in some cases 2 or even 3 times the throughput of today’s most efficient concurrent hash algorithm, Lea’s ConcurrentHashMap from java.concurr.util. Moreover, in tests on both Intel and Sun uni-processor machines, a sequential version of hopscotch consistently outperforms the most effective sequential hash table algorithms including cuckoo hashing and bounded linear probing. The most interesting feature of the new class of hopscotch algorithms is that they continue to deliver good performance when the hash table is more than 90% full, increasing their advantage over other algorithms as the table density grows.

‘Automatic Construction of Inlining Heuristics using Machine Learning’ by Kulkarni, Cavazos, Wimmer, and Simon. - CoreCLR source reference

Abstract Method inlining is considered to be one of the most important optimizations in a compiler. However, a poor inlining heuristic can lead to significant degradation of a program’s running time. Therefore, it is important that an inliner has an effective heuristic that controls whether a method is inlined or not. An important component of any inlining heuristic are the features that characterize the inlining decision. These features often correspond to the caller method and the callee methods. However, it is not always apparent what the most important features are for this problem or the relative importance of these features. Compiler writers developing inlining heuristics may exclude critical information that can be obtained during each inlining decision. In this paper, we use a machine learning technique, namely neuro-evolution [18], to automatically induce effective inlining heuristics from a set of features deemed to be useful for inlining. Our learning technique is able to induce novel heuristics that significantly out-perform manually-constructed inlining heuristics. We evaluate the heuristic constructed by our neuro-evolutionary technique within the highly tuned Java HotSpot server compiler and the Maxine VM C1X compiler, and we are able to obtain speedups of up to 89% and 114%, respectively. In addition, we obtain an average speedup of almost 9% and 11% for the Java HotSpot VM and Maxine VM, respectively. However, the output of neuro-evolution, a neural network, is not human readable. We show how to construct more concise and readable heuristics in the form of decision trees that perform as well as our neuro-evolutionary approach.

‘A Theory of Objects’ by Luca Cardelli & Martín Abadi - CoreCLR source reference

Abstract Procedural languages are generally well understood. Their foundations have been cast in calculi that prove useful in matters of implementation and semantics. So far, an analogous understanding has not emerged for object-oriented languages. In this book the authors take a novel approach to the understanding of object-oriented languages by introducing object calculi and developing a theory of objects around them. The book covers both the semantics of objects and their typing rules, and explains a range of object-oriented concepts, such as self, dynamic dispatch, classes, inheritance, prototyping, subtyping, covariance and contravariance, and method specialization. Researchers and graduate students will find this an important development of the underpinnings of object-oriented programming.

‘Optimized Interval Splitting in a Linear Scan Register Allocator’ by Wimmer, C. and Mössenböck, D. - CoreCLR source reference

Abstract We present an optimized implementation of the linear scan register allocation algorithm for Sun Microsystems’ Java HotSpot™ client compiler. Linear scan register allocation is especially suitable for just-in-time compilers because it is faster than the common graph-coloring approach and yields results of nearly the same quality. Our allocator improves the basic linear scan algorithm by adding more advanced optimizations: It makes use of lifetime holes, splits intervals if the register pressure is too high, and models register constraints of the target architecture with fixed intervals. Three additional optimizations move split positions out of loops, remove register-to-register moves and eliminate unnecessary spill stores. Interval splitting is based on use positions, which also capture the kind of use and whether an operand is needed in a register or not. This avoids the reservation of a scratch register. Benchmark results prove the efficiency of the linear scan algorithm: While the compilation speed is equal to the old local register allocator that is part of the Sun JDK 5.0, integer benchmarks execute about 15% faster. Floating-point benchmarks show the high impact of the Intel SSE2 extensions on the speed of numeric Java applications: With the new SSE2 support enabled, SPECjvm98 executes 25% faster compared with the current Sun JDK 5.0.

‘Extensible pattern matching via a lightweight language extension’ by Don Syme, Gregory Neverov, James Margetson - Roslyn source reference

Abstract Pattern matching of algebraic data types (ADTs) is a standard feature in typed functional programming languages, but it is well known that it interacts poorly with abstraction. While several partial solutions to this problem have been proposed, few have been implemented or used. This paper describes an extension to the .NET language F# called active patterns, which supports pattern matching over abstract representations of generic heterogeneous data such as XML and term structures, including where these are represented via object models in other .NET languages. Our design is the first to incorporate both ad hoc pattern matching functions for partial decompositions and “views” for total decompositions, and yet remains a simple and lightweight extension. We give a description of the language extension along with numerous motivating examples. Finally we describe how this feature would interact with other reasonable and related language extensions: existential types quantified at data discrimination tags, GADTs, and monadic generalizations of pattern matching.

‘Some approaches to best-match file searching’ by W. A. Burkhard & R. M. Keller - Roslyn source reference

Abstract The problem of searching the set of keys in a file to find a key which is closest to a given query key is discussed. After “closest,” in terms of a metric on the key space, is suitably defined, three file structures are presented together with their corresponding search algorithms, which are intended to reduce the number of comparisons required to achieve the desired result. These methods are derived using certain inequalities satisfied by metrics and by graph-theoretic concepts. Some empirical results are presented which compare the efficiency of the methods.

For reference, the links below take you straight to the GitHub searches, so you can take a look yourself:

Research produced by work on the .NET Runtime or Compiler

But what about the other way round, are there instances of work being done in .NET that is then turned into a research paper? Well, it turns out there are; the first example I came across was from a tweet by Joe Duffy:

(As an aside, I recommend checking out Joe Duffy’s blog, it contains lots of information about Midori the research project to build a managed OS!)

‘Applying Control Theory in the Real World - Experience With Building a Controller for the .NET Thread Pool’ by Joseph L. Hellerstein, Vance Morrison, Eric Eilebrecht

Abstract There has been considerable interest in using control theory to build web servers, database managers, and other systems. We claim that the potential value of using control theory cannot be realized in practice without a methodology that addresses controller design, testing, and tuning. Based on our experience with building a controller for the .NET thread pool, we develop a methodology that: (a) designs for extensibility to integrate diverse control techniques, (b) scales the test infrastructure to enable running a large number of test cases, (c) constructs test cases for which the ideal controller performance is known a priori so that the outcomes of test cases can be readily assessed, and (d) tunes controller parameters to achieve good results for multiple performance metrics. We conclude by discussing how our methodology can be extended, especially to designing controllers for distributed systems.

‘Uniqueness and Reference Immutability for Safe Parallelism’ by Colin S. Gordon, Matthew Parkinson, Jared Parsons, Aleks Bromfield & Joe Duffy (alternative link)

Abstract A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system’s flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.

‘Design and Implementation of Generics for the .NET Common Language Runtime’ by Andrew Kennedy, Don Syme

Abstract The Microsoft .NET Common Language Runtime provides a shared type system, intermediate language and dynamic execution environment for the implementation and inter-operation of multiple source languages. In this paper we extend it with direct support for parametric polymorphism (also known as generics), describing the design through examples written in an extended version of the C# programming language, and explaining aspects of implementation by reference to a prototype extension to the runtime. Our design is very expressive, supporting parameterized types, polymorphic static, instance and virtual methods, “F-bounded” type parameters, instantiation at pointer and value types, polymorphic recursion, and exact run-time types. The implementation takes advantage of the dynamic nature of the runtime, performing just-in-time type specialization, representation-based code sharing and novel techniques for efficient creation and use of run-time types. Early performance results are encouraging and suggest that programmers will not need to pay an overhead for using generics, achieving performance almost matching hand-specialized code.

‘Securing the .NET Programming Model (Industrial Application)’ by Andrew Kennedy

Abstract The security of the .NET programming model is studied from the standpoint of fully abstract compilation of C#. A number of failures of full abstraction are identified, and fixes described. The most serious problems have recently been fixed for version 2.0 of the .NET Common Language Runtime.

‘A Study of Concurrent Real-Time Garbage Collectors’ by Filip Pizlo, Erez Petrank & Bjarne Steensgaard (this features work done as part of Midori)

Abstract Concurrent garbage collection is highly attractive for real-time systems, because offloading the collection effort from the executing threads allows faster response, allowing for extremely short deadlines at the microseconds level. Concurrent collectors also offer much better scalability over incremental collectors. The main problem with concurrent real-time collectors is their complexity. The first concurrent real-time garbage collector that can support fine synchronization, STOPLESS, has recently been presented by Pizlo et al. In this paper, we propose two additional (and different) algorithms for concurrent real-time garbage collection: CLOVER and CHICKEN. Both collectors obtain reduced complexity over the first collector STOPLESS, but need to trade a benefit for it. We study the algorithmic strengths and weaknesses of CLOVER and CHICKEN and compare them to STOPLESS. Finally, we have implemented all three collectors on the Bartok compiler and runtime for C# and we present measurements to compare their efficiency and responsiveness.

‘STOPLESS: A Real-Time Garbage Collector for Multiprocessors’ by Filip Pizlo, Daniel Frampton, Erez Petrank, Bjarne Steensgaard

Abstract We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that supports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors either restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting modern parallel platforms. STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C# and measurements demonstrate high responsiveness (a factor of a 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.

Finally, a full list of MS Research publications related to ‘programming languages and software engineering’ is available if you want to explore more of this research yourself.

Discuss this post on Hacker News

Open Source .NET – 2 years later

2016-11-23T00:00:00+00:00

A little over 2 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as Scott Hanselman said in his recent Connect keynote, the community has been contributing in a significant way:

You can see some more detail on this number in the talk ‘What’s New in the .NET Platform’ by Scott Hunter:

This post aims to give more context to those numbers and allow you to explore patterns and trends across different repositories.

Repository activity over time

First we are going to see an overview of the level of activity in each repo, by looking at the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (Yay sparklines FTW!!)

Note: Numbers in black are from the most recent month, with red showing the lowest and green the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.

The main trend I see across all repos is that there’s a sustained level of activity for the entire 2 years; things didn’t start with a bang and then tail off. In addition, many (but not all) repos have a trend of increased activity month-by-month. For instance the PR’s in CoreFX or the Issues in Visual Studio Code (vscode) are clear examples of this, their best months have been the most recent.

Finally one interesting ‘story’ that jumps out of this data is the contrasting levels of activity (PR’s) across the dnx, cli and msbuild repositories, as highlighted in the image below:

If you don’t know the full story, initially all the cmd-line tooling was known as dnx, but in RC2 was migrated to .NET Core CLI. You can see this on the chart, activity in the dnx repo decreased at the same time that work in cli ramped up.

Following that, in May this year, the whole idea of having ‘project.json’ files was abandoned in favour of sticking with ‘msbuild’. You can see this change happen towards the right of the chart, where there is a marked increase in the msbuild repo activity as any improvements that had been done in cli were ported over.

Methodology - Community v. Microsoft

But the main question I want to answer is:

How much Community involvement has there been since Microsoft open sourced large parts of the .NET framework?

(See my previous post to see how things looked after one year)

To do this we need to look at who opened the Issue or created the Pull Request (PR) and specifically if they worked for Microsoft or not. This is possible because (almost) all Microsoft employees have indicated where they work on their GitHub profile, for instance:

There are some notable exceptions, e.g. @shanselman clearly works at Microsoft, but it’s easy enough to allow for cases like this. Before you ask, I only analysed this data, I did not keep a copy of it stored in MongoDB to sell to recruiters!!

Overall Participation - Community v. Microsoft

This data represents the total participation from the last 2 years, i.e. November 2014 to October 2016. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal it’s the simplest way to get an idea of the Microsoft/Community split.
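To make the counting rule concrete, here is a minimal sketch of how such a split could be calculated. It is not the code used to produce the charts; the `ParticipationSplit` class, the `IsMicrosoft` rule and the `knownExceptions` list are all hypothetical names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParticipationSplit
{
    // Hypothetical rule: a contributor counts as Microsoft if the company field
    // on their profile mentions Microsoft, or their login is in a hand-maintained
    // list of known exceptions (profiles that don't state an employer).
    public static bool IsMicrosoft(string login, string company, ISet<string> knownExceptions)
    {
        if (knownExceptions.Contains(login))
            return true;
        return company != null &&
               company.IndexOf("Microsoft", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Every Issue or PR counts once, whatever its size.
    // Returns the Community share as a percentage between 0 and 100.
    public static double CommunityPercentage(IEnumerable<bool> openedByMicrosoft)
    {
        var flags = openedByMicrosoft.ToList();
        if (flags.Count == 0)
            return 0.0;
        int community = flags.Count(isMicrosoft => !isMicrosoft);
        return 100.0 * community / flags.Count;
    }
}
```

So if 3 out of 4 PRs in a repo were opened by Microsoft employees, `CommunityPercentage` would report 25 for that repo, regardless of how big each PR was.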

Note: You can hover over the bars to get the actual numbers, rather than percentages.

Issues: Microsoft Community

Pull Requests: Microsoft Community

The general pattern these graphs show is that the Community is more likely to open an Issue than submit a PR, which I guess isn’t that surprising given the relative amount of work involved. However it’s clear that the Community is still contributing a considerable amount of work, for instance if you look at the CoreCLR repo it only has 21% of PRs from the Community, but this still accounts for almost 900!

There are a few interesting cases that jump out here, for instance Roslyn gets 35% of its issues from the Community, but only 6% of its PR’s, so clearly getting code into the compiler is a tough task. Likewise it doesn’t seem like the Community is that interested in submitting code to msbuild, although it does have my favourite PR ever:

Participation over time - Community v. Microsoft

Finally we can see the ‘per-month’ data from the last 2 years, i.e. November 2014 to October 2016.

Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs is re-scaled, so the maximum value will change each time.

Issues: Microsoft Community

Pull Requests: Microsoft Community

Whilst not every repo is growing month-by-month, the majority are and those that aren’t at least show sustained contributions across 2 years.

Summary

I think that it’s clear to see that the Community has got on-board with the new Open-Source Microsoft, producing a sustained level of contributions over the last 2 years, let’s hope it continues!

Discuss this post in /r/programming

The post Open Source .NET – 2 years later first appeared on my blog Performance is a Feature!

How does the 'fixed' keyword work?

2016-10-26T00:00:00+00:00

Well it turns out that it’s a really nice example of collaboration between the main parts of the .NET runtime, here’s a list of all the components involved:

Compiler
JITter
CLR
Garbage Collector (GC)

Now you could argue that all of these are required to execute any C# code, but what’s interesting about the fixed keyword is that they all have a specific part to play.

Compiler

To start with let’s look at one of the most basic scenarios for using the fixed keyword, directly accessing the contents of a C# string, (taken from a Roslyn unit test)

using System;
unsafe class C
{
    static unsafe void Main()
    {
        fixed (char* p = "hello")
        {
            Console.WriteLine(*p);
        }
    }
}

Which the compiler then turns into the following IL:

// Code size       34 (0x22)
.maxstack  2
.locals init (char* V_0, //p
              pinned string V_1)
IL_0000:  nop
IL_0001:  ldstr "hello"
IL_0006:  stloc.1
IL_0007:  ldloc.1
IL_0008:  conv.i
IL_0009:  stloc.0
IL_000a:  ldloc.0
IL_000b:  brfalse.s  IL_0015
IL_000d:  ldloc.0
IL_000e:  call "int System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData.get"
IL_0013:  add
IL_0014:  stloc.0
IL_0015:  nop
IL_0016:  ldloc.0
IL_0017:  ldind.u2
IL_0018:  call "void System.Console.WriteLine(char)"
IL_001d:  nop
IL_001e:  nop
IL_001f:  ldnull
IL_0020:  stloc.1
IL_0021:  ret

Note the pinned string V_1 that the compiler has created for us, it’s made a hidden local variable that holds a reference to the object we are using in the fixed statement, which in this case is the string “hello”. The purpose of this pinned local variable will be explained in a moment.

It’s also emitted a call to the OffsetToStringData getter method (from System.Runtime.CompilerServices.RuntimeHelpers), which we will cover in more detail when we discuss the CLR’s role.

However, as an aside the compiler is also performing an optimisation for us. Normally it would wrap the fixed statement in a finally block to ensure the pinned local variable is nulled out after control leaves the scope. But in this case it has determined that it can leave out the finally statement entirely, from LocalRewriter_FixedStatement.cs in the Roslyn source:

// In principle, the cleanup code (i.e. nulling out the pinned variables) is always
// in a finally block.  However, we can optimize finally away (keeping the cleanup
// code) in cases where both of the following are true:
//   1) there are no branches out of the fixed statement; and
//   2) the fixed statement is not in a try block (syntactic or synthesized).
if (IsInTryBlock(node) || HasGotoOut(rewrittenBody))
{
...
}

What is this pinned identifier?

Let’s start by looking at the authoritative source, from Standard ECMA-335 Common Language Infrastructure (CLI)

II.7.1.2 pinned The signature encoding for pinned shall appear only in signatures that describe local variables (§II.15.4.1.3). While a method with a pinned local variable is executing, the VES shall not relocate the object to which the local refers. That is, if the implementation of the CLI uses a garbage collector that moves objects, the collector shall not move objects that are referenced by an active pinned local variable.

[Rationale: If unmanaged pointers are used to dereference managed objects, these objects shall be pinned. This happens, for example, when a managed object is passed to a method designed to operate with unmanaged data. end rationale]

VES = Virtual Execution System CLI = Common Language Infrastructure CTS = Common Type System

But if you prefer an explanation in more human readable form (i.e. not from a spec), then this extract from .Net IL Assembler Paperback by Serge Lidin is helpful:

(Also available on Google Books)

CLR

Arguably the CLR has the easiest job to do (if you accept that it exists as a separate component from the JIT and GC), its job is to provide the offset of the raw string data via the OffsetToStringData method that is emitted by the compiler.

Now you might be thinking that this method does some complex calculations to determine the exact offset, but nope, it’s hard-coded!! (I told you that Strings and the CLR have a Special Relationship):

public static int OffsetToStringData
{
    // This offset is baked in by string indexer intrinsic, so there is no harm
    // in getting it baked in here as well.
    [System.Runtime.Versioning.NonVersionable] 
    get {
        // Number of bytes from the address pointed to by a reference to
        // a String to the first 16-bit character in the String.  Skip 
        // over the MethodTable pointer, & String length.  Of course, the 
        // String reference points to the memory after the sync block, so 
        // don't count that. 
        // This property allows C#'s fixed statement to work on Strings.
        // On 64 bit platforms, this should be 12 (8+4) and on 32 bit 8 (4+4).
#if BIT64
        return 12;
#else // 32
        return 8;
#endif // BIT64
    }
}
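If you want to sanity-check those hard-coded values on your own machine, a small sketch like the one below compares the property against "pointer size + 4-byte length field". The `StringOffsetDemo` class is a name made up for this example, and the comparison assumes the CLR/CoreCLR string layout described in the comment above; other runtimes may lay strings out differently and newer runtimes flag the property as obsolete.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class StringOffsetDemo
{
    // MethodTable pointer (pointer-sized) plus the 4-byte String length field,
    // i.e. the layout described in the comment on OffsetToStringData.
    public static int ExpectedOffset()
    {
        return IntPtr.Size + sizeof(int);
    }

    public static void Main()
    {
        // Newer runtimes mark this property as obsolete, so silence that warning here.
#pragma warning disable CS0618
        int reported = RuntimeHelpers.OffsetToStringData;
#pragma warning restore CS0618

        Console.WriteLine("Pointer size:     {0} bytes", IntPtr.Size);
        Console.WriteLine("Expected offset:  {0}", ExpectedOffset());
        Console.WriteLine("Runtime reports:  {0}", reported);
    }
}
```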

JITter

For the fixed keyword to work the role of the JITter is to provide information to the GC/Runtime about the lifetimes of variables within a method and in particular if they are pinned locals. It does this via the GCInfo data it creates for every method:

To see this in action we have to enable the correct magic flags and then we will see the following:

Compiling    0 ConsoleApplication.Program::Main, IL size = 30, hsh=0x8d66958e
; Assembly listing for method ConsoleApplication.Program:Main(ref)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,   0  )     ref  ->  zero-ref   
;  V01 loc0         [V01,T00] (  5,   4  )    long  ->  rcx        
;  V02 loc1         [V02    ] (  3,   3  )     ref  ->  [rsp+0x20]   must-init pinned
;  V03 tmp0         [V03,T01] (  2,   4  )    long  ->  rcx        
;  V04 OutArgs      [V04    ] (  1,   1  )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40

G_M27250_IG01:
000000 4883EC28             sub      rsp, 40
000004 33C0                 xor      rax, rax
000006 4889442420           mov      qword ptr [rsp+20H], rax

G_M27250_IG02:
00000B 488B0C256830B412     mov      rcx, gword ptr [12B43068H]      'hello'
000013 48894C2420           mov      gword ptr [rsp+20H], rcx
000018 488B4C2420           mov      rcx, gword ptr [rsp+20H]
00001D 4885C9               test     rcx, rcx
000020 7404                 je       SHORT G_M27250_IG03
000022 4883C10C             add      rcx, 12

G_M27250_IG03:
000026 0FB709               movzx    rcx, word  ptr [rcx]
000029 E842FCFFFF           call     System.Console:WriteLine(char)
00002E 33C0                 xor      rax, rax
000030 4889442420           mov      gword ptr [rsp+20H], rax

G_M27250_IG04:
000035 4883C428             add      rsp, 40
000039 C3                   ret      

; Total bytes of code 58, prolog size 11 for method ConsoleApplication.Program:Main(ref)
; ============================================================
Set code length to 58.
Set Outgoing stack arg area size to 32.
Stack slot id for offset 32 (0x20) (sp) (pinned, untracked) = 0.
Defining 1 call sites:
    Offset 0x29, size 5.

See how in the section titled “Final local variable assignments” it has indicated that the V02 loc1 variable is must-init pinned and then down at the bottom it has this text:

Stack slot id for offset 32 (0x20) (sp) (pinned, untracked) = 0.

Aside: The JIT has also done some extra work for us and optimised away the call to OffsetToStringData by inlining it as the assembly code add rcx, 12. On a slightly related note, previously the fixed keyword prevented a method from being inlined, but recently that changed, see Support inlining method with pinned locals for the full details.

Garbage Collector

Finally we come to the GC which has an important “role to play”, or “not to play” depending on which way you look at it.

In effect the GC has to get out of the way and leave the pinned local variable alone for the life-time of the method. Normally the GC is concerned about which objects are live or dead so that it knows what it has to clean up. But with pinned objects it has to go one step further, not only must it not clean up the object, but it must not move it around. Generally the GC likes to relocate objects around during the Compact Phase to make memory allocations cheap, but pinning prevents that as the object is being accessed via a pointer and therefore its memory address has to remain the same.

There is a great visual explanation of what that looks like from the excellent presentation CLR: Garbage Collection Inside Out by Maoni Stephens (click for full-sized version):

Note how the pinned blocks (marked with a ‘P’) have remained where they are, forcing the Gen 0/1/2 segments to start at awkward locations. This is why pinning too many objects and keeping them pinned for too long can cause GC overhead, it has to perform extra book-keeping and work around them.

In reality, when using the fixed keyword, your object will only remain pinned for a short period of time, i.e. until control leaves the scope. But if you are pinning objects via the GCHandle class then the lifetime could be longer.
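For comparison, here is a minimal sketch of pinning via GCHandle rather than fixed. The `PinnedBufferDemo` class and its method are invented for this example. The point to notice is that the object stays pinned until Free() is called, so the lifetime is entirely under your control, which is why it’s easy to keep things pinned for longer than you intended.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinnedBufferDemo
{
    // Pins the array, reads its first element through the raw address
    // and then un-pins it again.
    public static byte ReadFirstByteViaPointer(byte[] buffer)
    {
        if (buffer == null || buffer.Length == 0)
            throw new ArgumentException("Buffer must contain at least one byte", "buffer");

        // From here until Free() the GC is not allowed to move 'buffer'
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr address = handle.AddrOfPinnedObject();
            return Marshal.ReadByte(address);
        }
        finally
        {
            // Forgetting this keeps the object pinned for as long as the handle lives
            handle.Free();
        }
    }
}
```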

So to finish, let’s get the final word on pinning from Maoni Stephens, from Using GC Efficiently – Part 3 (read the blog post for more details):

When you do need to pin, here are some things to keep in mind:

Pinning for a short time is cheap.

Pinning an older object is not as harmful as pinning a young object.

Creating pinned buffers that stay together instead of scattered around. This way you create fewer holes.

Summary

So that’s it, simple really!!

All the main parts of the .NET runtime do their bit and we get to use a handy feature that lets us drop-down and perform some bare-metal coding!!

Discuss this post in /r/programming

Adding a verb to the dotnet CLI tooling

2016-10-03T00:00:00+00:00

The dotnet CLI tooling comes with several built-in cmds such as build, run and test, but it turns out it’s possible to add your own verb to that list.

Arbitrary cmds

From Intro to .NET Core CLI - Design

The way the dotnet driver finds the command it is instructed to run using dotnet {command} is via a convention; any executable that is placed in the PATH and is named dotnet-{command} will be available to the driver. For example, when you install the CLI toolchain there will be an executable called dotnet-build in your PATH; when you run dotnet build, the driver will run the dotnet-build executable. All of the arguments following the command are passed to the command being invoked. So, in the invocation of dotnet build --native, the --native switch will be passed to dotnet-build executable that will do some action based on it (in this case, produce a single native binary).

This is also the basics of the current extensibility model of the toolchain. Any executable found in the PATH named in this way, that is as dotnet-{command}, will be invoked by the dotnet driver.

Fun fact: This means that it’s actually possible to make a dotnet go command! You just need to make a copy of go.exe and rename it to dotnet-go.exe

Yay dotnet go (I know, completely useless, but fun none-the-less)!!

(and yes before you ask, you can also make dotnet dotnet work, but please don’t do that!!)
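To show how little is needed, here is a minimal sketch of a custom command. The name dotnet-hello is made up for this example; the only thing that matters (per the convention quoted above) is that the compiled executable is called dotnet-{command} and can be found in the PATH.

```csharp
using System;

public static class Program
{
    // Builds the message separately so the behaviour is easy to check
    public static string Describe(string[] args)
    {
        return args.Length == 0
            ? "dotnet-hello invoked with no arguments"
            : "dotnet-hello invoked with: " + string.Join(" ", args);
    }

    public static int Main(string[] args)
    {
        // Everything after the command name is passed straight through by the driver
        Console.WriteLine(Describe(args));
        return 0;
    }
}
```

Assuming the executable ends up in the PATH as dotnet-hello, running dotnet hello --name World should hand the two arguments --name and World to Main.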

With regards to documentation, there’s further information in the ‘Adding a Command’ section of the Developer Guide. Also the source code of the dotnet test command is a really useful reference and helped me out several times.

Before I go any further I just want to acknowledge the 2 blog posts listed below. They show you how to build a custom command that compresses all the images in the current directory and how to make it available to the dotnet tooling as a NuGet package:

However they don’t explain how to interact with the current project or access its output. This is what I wanted to do, so this post will pick up where those posts left off.

Information about the current Project

Any effective dotnet verb needs to know about the project it is running in and helpfully those kind developers at Microsoft have created some useful classes that will parse and examine a project.json file (available in the Microsoft.DotNet.ProjectModel NuGet package). It’s pretty simple to work with, just a few lines of code and you’re able to access the entire Project model:

Project project;
var currentDirectory = Directory.GetCurrentDirectory();
if (ProjectReader.TryGetProject(currentDirectory, out project))
{
    if (project.Files.SourceFiles.Any())
    {
        Console.WriteLine("Files:");
        foreach (var file in project.Files.SourceFiles)
            Console.WriteLine("  {0}", file.Replace(currentDirectory, ""));
    }
    if (project.Dependencies.Any())
    {
        Console.WriteLine("Dependencies:");
        foreach (var dependency in project.Dependencies)
        {
            Console.WriteLine("  {0} - Line:{1}, Column:{2}",
                    dependency.SourceFilePath.Replace(currentDirectory, ""),
                    dependency.SourceLine,
                    dependency.SourceColumn);
        }
    }
    ...
}

Building a Project

In addition to knowing about the current project, we need to ensure it successfully builds before we can do anything else with it. Fortunately this is also simple thanks to the Microsoft.DotNet.Cli.Utils NuGet package (along with further help from Microsoft.DotNet.ProjectModel which provides the BuildWorkspace):

// Create a workspace
var workspace = new BuildWorkspace(ProjectReaderSettings.ReadFromEnvironment());

// Fetch the ProjectContexts
var projectPath = project.ProjectFilePath;
var runtimeIdentifiers = 
    RuntimeEnvironmentRidExtensions.GetAllCandidateRuntimeIdentifiers();
var projectContexts = workspace.GetProjectContextCollection(projectPath)
       .EnsureValid(projectPath)
       .FrameworkOnlyContexts
       .Select(c => workspace.GetRuntimeContext(c, runtimeIdentifiers))
       .ToList();

// Setup the build arguments
var projectContextToBuild = projectContexts.First();
var cmdArgs = new List<string>
{
    projectPath,
    "--configuration", "Release",
    "--framework", projectContextToBuild.TargetFramework.ToString()
};

// Build!!
Console.WriteLine("Building Project for {0}", projectContextToBuild.RuntimeIdentifier);
var result = Command.CreateDotNet("build", cmdArgs).Execute();
Console.WriteLine("Build {0}", result.ExitCode == 0 ? "SUCCEEDED" : "FAILED");

When this runs you get the familiar dotnet build output if it successfully builds or any error/diagnostic messages if not.

Integrating with BenchmarkDotNet

Now that we know the project has produced an .exe or .dll, we can finally wire-up BenchmarkDotNet and get it to execute the benchmarks for us:

try
{
    Console.WriteLine("Running BenchmarkDotNet");
    var benchmarkAssemblyPath = 
        projectContextToBuild.GetOutputPaths(config).RuntimeFiles.Assembly;
    var benchmarkAssembly = 
        AssemblyLoadContext.Default.LoadFromAssemblyPath(benchmarkAssemblyPath);
    Console.WriteLine("Successfully loaded: {0}\n", benchmarkAssembly);
    var switcher = new BenchmarkSwitcher(benchmarkAssembly);
    var summary = switcher.Run(args);
}
catch (Exception ex)
{
    Console.WriteLine("Error running BenchmarkDotNet");
    Console.WriteLine(ex);
}

Because BenchmarkDotNet is a command-line tool we don’t actually need to do much work. It’s just a case of creating a BenchmarkSwitcher, giving it a reference to the dll that contains the benchmarks and then passing in the command line arguments. BenchmarkDotNet will then do the rest of the work for us!

However if you need to parse command line arguments yourself I’d recommend re-using the existing helper classes as they make life much easier and will ensure that your tool fits in with the dotnet tooling ethos.

The final result

Finally, to test it out, we’ll use a simple test app from the BenchmarkDotNet Getting Started Guide, with the following in the project.json file (note the added tools section):

{
  "version": "1.0.0-*",
  "buildOptions": {
    "emitEntryPoint": true
  },
  "dependencies": {
    "Microsoft.NETCore.App": {
      "type": "platform",
      "version": "1.0.0-rc2-3002702"
    },
    "BenchmarkDotNet": "0.9.9"
  },
  "frameworks": {
    "netcoreapp1.0": {
      "imports": "dnxcore50"
    }
  },
  "tools": {
    "BenchmarkCommand": "1.0.0"
  }
}

Then after doing a dotnet restore, we can finally run our new dotnet benchmark command:

λ dotnet benchmark --class Md5VsSha256
Building Project - BenchmarkCommandTest
Project BenchmarkCommandTest (.NETCoreApp,Version=v1.0) will be compiled because expected outputs are missing
Compiling BenchmarkCommandTest for .NETCoreApp,Version=v1.0
Compilation succeeded.
    0 Warning(s)
    0 Error(s)
Time elapsed 00:00:00.9760886

Build SUCCEEDED

Running BenchmarkDotNet
C:\Projects\BenchmarkCommandTest\bin\Release\netcoreapp1.0\BenchmarkCommandTest.dll 
Successfully loaded: BenchmarkCommandTest, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

Target type: Md5VsSha256
// ***** BenchmarkRunner: Start   *****
// Found benchmarks:
//   Md5VsSha256_Sha256
//   Md5VsSha256_Md5
// Validating benchmarks:
// **************************
// Benchmark: Md5VsSha256_Sha256
// *** Generate ***
// Result = Success
// BinariesDirectoryPath = C:\Projects\BDN.Auto\binaries
// *** Build ***
// Result = Success
// *** Execute ***
// Launch: 1
// Benchmark Process Environment Information:
// CLR=CORE, Arch=64-bit ? [RyuJIT]
// GC=Concurrent Workstation

...

If you’ve used BenchmarkDotNet before you’ll recognise its output, if not its output is all the lines starting with //. A final note, currently the Console colours from the command aren’t displayed, but that should be fixed sometime soon, which is great because BenchmarkDotNet looks way better in full-colour!!

Discuss this post in /r/csharp

The post Adding a verb to the dotnet CLI tooling first appeared on my blog Performance is a Feature!

Optimising LINQ

2016-09-29T00:00:00+00:00

What’s the problem with LINQ?

As outlined by Joe Duffy, LINQ introduces inefficiencies in the form of hidden allocations, from The ‘premature optimization is evil’ myth:

To take an example of a technology that I am quite supportive of, but that makes writing inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies are introduced by this code?
int[] Scale(int[] inputs, int lo, int hi, int c) {
   var results = from x in inputs
                 where (x >= lo) && (x <= hi)
                 select (x * c);
   return results.ToArray();
}

Good question, who knows, probably only Jon Skeet can tell just by looking at the code!! So to fully understand the problem we need to take a look at what the compiler is doing for us behind-the-scenes, the code above ends up looking something like this:

private int[] Scale(int[] inputs, int lo, int hi, int c)
{
    <>c__DisplayClass0_0 CS<>8__locals0;
    CS<>8__locals0 = new <>c__DisplayClass0_0();
    CS<>8__locals0.lo = lo;
    CS<>8__locals0.hi = hi;
    CS<>8__locals0.c = c;
    return inputs
        .Where<int>(new Func<int, bool>(CS<>8__locals0.<Scale>b__0))
        .Select<int, int>(new Func<int, int>(CS<>8__locals0.<Scale>b__1))
        .ToArray<int>();
}

[CompilerGenerated]
private sealed class c__DisplayClass0_0
{
    public int c;
    public int hi;
    public int lo;

    internal bool <Scale>b__0(int x)
    {
        return ((x >= this.lo) && (x <= this.hi));
    }

    internal int <Scale>b__1(int x)
    {
        return (x * this.c);
    }
}

As you can see we have an extra class allocated and some Func's to perform the actual logic. But this doesn’t even account for the overhead of the ToArray() call, using iterators and calling LINQ methods via dynamic dispatch. As an aside, if you are interested in finding out more about closures it’s worth reading Jon Skeet’s excellent blog post “The Beauty of Closures”.

So there’s a lot going on behind the scenes, but it is actually possible to be shown these hidden allocations directly in Visual Studio. If you install the excellent Heap Allocation Viewer plugin for ReSharper, you will get the following tool-tip right in the IDE:

As useful as it is though, I wouldn’t recommend turning this on all the time as seeing all those red lines under your code tends to make you a bit paranoid!!

Aside: If you don’t have ReSharper, there is a Roslyn based Heap Allocation Analyser available that provides similar functionality.

Now before we look at some ways you can reduce the impact of LINQ, it’s worth pointing out that LINQ itself does some pretty neat tricks (HT to Oren Novotny for pointing this out to me). For instance the common pattern of having a Where(..) followed by a Select(..) is optimised so that only a single iterator is used, not two as you would expect. Likewise two Select(..) statements in a row are combined, so that only one iterator is needed.

A note on micro-optimisations

Whenever I write a post like this I inevitably get comments complaining that it’s a “premature optimisation” or something similar. So this time I just want to add the following caveat:

I am not in any way advocating that LINQ is a bad thing, I think it’s a fantastic feature of the C# language!

Also:

Please do not re-write any of your code based purely on the results of some micro-benchmarks!

As I explain in one of my talks, you should always profile first and then benchmark. If you do it the other way round there is a temptation to optimise where it’s not needed.

Performance is a feature! - London .NET User Group from Matt Warren

Having said all that, the C# Compiler (Roslyn) coding guidelines do actually state the following:

Avoid allocations in compiler hot paths:

Avoid LINQ.

Avoid using foreach over collections that do not have a struct enumerator.

Consider using an object pool. There are many usages of object pools in the compiler to see an example.

Which is slightly ironic considering this advice comes from the same people who conceived and designed LINQ in the first place! But as outlined in the excellent talk “Essential Truths Everyone Should Know about Performance in a Large Managed Codebase”, they found LINQ has a noticeable cost.

Note: Hot paths are another way of talking about the critical 3% from the famous Donald Knuth quote:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

RoslynLinqRewrite and LinqOptimizer

Now clearly we could manually re-write any LINQ statement into an iterative version if we were concerned about performance, but wouldn’t it be much nicer if there were tools that could do the hard work for us? Well it turns out there are!
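For reference, this is roughly what a hand-written iterative version of the earlier Scale method could look like. It’s only a sketch (the `ManualScale` class name is made up) and it still allocates a List and the final array, but there’s no closure class, no delegates and no iterators involved.

```csharp
using System;
using System.Collections.Generic;

public static class ManualScale
{
    // Same behaviour as the LINQ version: keep values in [lo, hi] and multiply by c
    public static int[] Scale(int[] inputs, int lo, int hi, int c)
    {
        var results = new List<int>();
        for (int i = 0; i < inputs.Length; i++)
        {
            int x = inputs[i];
            if (x >= lo && x <= hi)
                results.Add(x * c);
        }
        return results.ToArray();
    }
}
```

It’s not hard to write, but it is more code to read and maintain than the three-line query, which is exactly the trade-off these tools are trying to remove.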

First up is RoslynLinqRewrite, as per the project page:

This tool compiles C# code by first rewriting the syntax trees of LINQ expressions using plain procedural code, minimizing allocations and dynamic dispatch.

Also available is the Nessos LinqOptimizer which is:

An automatic query optimizer-compiler for Sequential and Parallel LINQ. LinqOptimizer compiles declarative LINQ queries into fast loop-based imperative code. The compiled code has fewer virtual calls and heap allocations, better data locality and speedups of up to 15x (Check the Performance page).

At a high-level, the main differences between them are:

RoslynLinqRewrite
- works at compile time (but prevents incremental compilation of your project)
- no code changes, except if you want to opt out via [NoLinqRewrite]
LinqOptimizer
- works at run-time
- forces you to add AsQueryExpr().Run() to LINQ methods
- optimises Parallel LINQ

In the rest of the post we will look at the tools in more detail and analyse their performance.

Comparison of LINQ support

Obviously before choosing either tool you want to be sure that it’s actually going to optimise the LINQ statements you have in your code base. However neither tool supports the whole range of available LINQ Query Expressions, as the chart below illustrates:

Method	RoslynLinqRewrite	LinqOptimizer	Both?
Select	✓	✓	Yes
Where	✓	✓	Yes
ToList	✓	✓	Yes
ToArray	✓	✓	Yes
Count	✓	✓	Yes
ForEach	✓	✓	Yes
Reverse	✓	✗
Cast	✓	✗
OfType	✓	✗
First/FirstOrDefault	✓	✗
Single/SingleOrDefault	✓	✗
Last/LastOrDefault	✓	✗
ToDictionary	✓	✗
LongCount	✓	✗
Any	✓	✗
All	✓	✗
ElementAt/ElementAtOrDefault	✓	✗
Contains	✓	✗
Aggregate	✗	✓
Sum	✗	✓
SelectMany	✗	✓
Take/TakeWhile	✗	✓
Skip/SkipWhile	✗	✓
GroupBy	✗	✓
OrderBy/OrderByDescending	✗	✓
ThenBy/ThenByDescending	✗	✓
Total	22	18	6

Performance Results

Finally we get to the main point of this blog post, how do the different tools perform, do they achieve their stated goals of optimising LINQ queries and reducing allocations?

Let’s start with a very common scenario, using LINQ to filter and map a sequence of numbers, i.e. in C#:

var results = items.Where(i => i % 10 == 0)
                   .Select(i => i + 5);

We will compare the LINQ code above with the 2 optimised versions, plus an iterative form that will serve as our baseline. Here are the results:

(Full benchmark code)

The first thing that jumps out is that the LinqOptimizer version is allocating a lot of memory compared to the others. To see why this is happening we need to look at the code it generates, which looks something like this:

IEnumerable<int> LinqOptimizer(int [] input)
{
    var collector = new Nessos.LinqOptimizer.Core.ArrayCollector<int>();
    for (int counter = 0; counter < input.Length; counter++)
    {
        var i = input[counter];
        if (i % 10 == 0)
        {
            var result = i + 5;
            collector.Add(result);
        }
    }
    return collector;
}

The issue is that by default, ArrayCollector allocates an int[1024] as its backing storage, hence the excessive allocations!

By contrast RoslynLinqRewrite optimises the code like so:

IEnumerable<int> RoslynLinqRewriteWhereSelect_ProceduralLinq1(int[] _linqitems)
{
    if (_linqitems == null)
        throw new System.ArgumentNullException();
    for (int _index = 0; _index < _linqitems.Length; _index++)
    {
        var _linqitem = _linqitems[_index];
        if (_linqitem % 10 == 0)
        {
            var _linqitem1 = _linqitem + 5;
            yield return _linqitem1;
        }
    }
}

Which is much more sensible! By using the yield keyword it gets the compiler to do the hard work and so doesn’t have to allocate a temporary list to store the results in. This means that it is streaming the values, in the same way the original LINQ code does.
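If you want to see the streaming behaviour for yourself, the sketch below adds a counter to an iterator of the same shape. The `StreamingDemo` class and `ItemsExamined` field are invented for this example, they aren’t part of either tool.

```csharp
using System;
using System.Collections.Generic;

public static class StreamingDemo
{
    // Counts how many input items the iterator has looked at so far
    public static int ItemsExamined;

    public static IEnumerable<int> FilterAndMap(int[] items)
    {
        for (int index = 0; index < items.Length; index++)
        {
            ItemsExamined++;
            int item = items[index];
            if (item % 10 == 0)
                yield return item + 5;
        }
    }

    public static void Main()
    {
        var input = new[] { 10, 11, 20, 21, 30 };

        var query = FilterAndMap(input);   // nothing has executed yet
        Console.WriteLine(ItemsExamined);  // prints 0

        foreach (var value in query)
        {
            Console.WriteLine(value);      // prints 15
            break;                         // stop after the first result
        }
        Console.WriteLine(ItemsExamined);  // prints 1
    }
}
```

Only one input item is examined before the first result is handed back, whereas a version that collects everything into a buffer first has to walk the entire input before returning anything.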

Lastly we’ll look at one more example, this time using a Count() expression, i.e.

items.Where(i => i % 10 == 0)
     .Count();

Here we can clearly see that both tools significantly reduce the allocations compared to the original LINQ code:

(Full benchmark code)

Future options

However even though using RoslynLinqRewrite or LinqOptimizer is pretty painless, we still have to install a 3rd party library into our project.

Wouldn’t it be even nicer if the .NET compiler, JITter and/or runtime did all the optimisations for us?

Well it’s certainly possible, as Joe Duffy explains in his QCon New York talk and work has already started so maybe we won’t have to wait too long!!

Discuss this post in /r/programming

Compact strings in the CLR

2016-09-19T00:00:00+00:00

In the CLR strings are stored as a sequence of UTF-16 code units, i.e. an array of char items. So if we have the string ‘testing’, in memory it looks like this:

But look at all those zeros, wouldn’t it be more efficient if it could be stored like this instead?
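If you want to see the two layouts for yourself, the following sketch dumps the bytes of a string in both forms (StringBytesDemo is a hypothetical helper; it only shows the character data, not the object header or length field that the CLR also stores):

```csharp
using System;
using System.Text;

public static class StringBytesDemo
{
    // Character data as little-endian UTF-16, two bytes per character.
    public static string Utf16Hex(string s) =>
        BitConverter.ToString(Encoding.Unicode.GetBytes(s));

    // Character data as ASCII, one byte per character.
    public static string AsciiHex(string s) =>
        BitConverter.ToString(Encoding.ASCII.GetBytes(s));
}
```

For ‘testing’, Utf16Hex gives 74-00-65-00-73-00-74-00-69-00-6E-00-67-00 (14 bytes), whereas AsciiHex gives 74-65-73-74-69-6E-67 (7 bytes).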

Now this is a contrived example, clearly not all strings are simple ASCII text that can be compacted this way. Also, even though I’m an English speaker, I’m well aware that there are other languages with character sets that can only be expressed in Unicode. However it turns out that even in a fully internationalised modern web-application, there are still a large number of strings that could be expressed as ASCII, such as:

Urls - Percent-encoding
Http Headers - RFC 7230 3.2.4. Field Parsing

So there is still an overall memory saving if the CLR provided an implementation that stored some strings in a more compact encoding that only takes 1 byte per character (ASCII or even ISO-8859-1 (Latin-1)) and the rest as Unicode (2 bytes per character).

Aside: If you are wondering “Why does C# use UTF-16 for strings?” Eric Lippert has a great post on this exact subject and Jon Skeet has something interesting to say about the subject in “Of Memory and Strings”

Real-world data

In theory this is all well and good, but what about in practice, what about a real-world example?

Well, Nick Craver, a developer at Stack Overflow, was kind enough to run my Heap Analyser tool on one of their memory dumps:

.NET Memory Dump Heap Analyser - created by Matt Warren - github.com/mattwarren

Found CLR Version: v4.6.1055.00

...

Overall 30,703,367 "System.String" objects take up 4,320,235,704 bytes (4,120.10 MB)
Of this underlying byte arrays (as Unicode) take up 3,521,948,162 bytes (3,358.79 MB)
Remaining data (object headers, other fields, etc) is 798,287,542 bytes (761.31 MB), at 26 bytes per object

Actual Encoding that the "System.String" could be stored as (with corresponding data size)
    3,347,868,352 bytes are ASCII
        5,078,902 bytes are ISO-8859-1 (Latin-1)
      169,000,908 bytes are Unicode (UTF-16)
Total: 3,521,948,162 bytes (expected: 3,521,948,162)

Compression Summary:
    1,676,473,627 bytes Compressed (to ISO-8859-1 (Latin-1))
      169,000,908 bytes Uncompressed (as Unicode/UTF-16)
       30,703,367 bytes EXTRA to enable compression (one byte field, per "System.String" object)
Total: 1,876,177,902 bytes, compared to 3,521,948,162 before compression

(The full output is available)

Here we can see that there are over 30 million strings in memory, taking up 4,120 MB out of a total heap size of 13,232 MB (just over 30%).

Furthermore we can see that the raw data used by the strings (excluding the CLR Object headers) takes up 3,358 MB when encoded as Unicode. However if the relevant strings were compacted to ASCII/Latin-1, only 1,789 MB would be needed to store them, a pretty impressive saving!
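The figures in the “Compression Summary” can be re-derived from the per-encoding sizes in the output. The sketch below just repeats that arithmetic (the CompressionSummary class is hypothetical, the constants are copied from the tool output):

```csharp
using System;

public static class CompressionSummary
{
    // Figures copied from the Heap Analyser output (sizes as UTF-16).
    public const long AsciiBytes = 3_347_868_352;
    public const long Latin1Bytes = 5_078_902;
    public const long UnicodeBytes = 169_000_908;
    public const long StringCount = 30_703_367;

    // ASCII and Latin-1 data halves in size: one byte per character instead of two.
    public static long CompressedBytes => (AsciiBytes + Latin1Bytes) / 2;

    // Compacted data + untouched UTF-16 data + one extra flag byte per string.
    public static long TotalAfter => CompressedBytes + UnicodeBytes + StringCount;

    public static double ToMB(long bytes) => bytes / (1024.0 * 1024.0);
}
```

CompressedBytes comes out at 1,676,473,627 and TotalAfter at 1,876,177,902 bytes, which is roughly 1,789 MB.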

A proposal for compact strings in the CLR

I learnt about the idea of “Compact Strings” when reading about how they were implemented in Java and so I put together a proposal for an implementation in the CLR (isn’t .NET OSS Great!!).

Turns out that Vance Morrison (Performance Architect on the .NET Runtime Team) has been thinking about the same idea for quite a while:

To answer @mattwarren question on whether changing the internal representation of a string has been considered before, the short answer is YES. In fact it has been a pet desire of mine for probably over a decade now.

He also confirmed that they’ve done their homework and found that a significant number of strings could be compacted:

What was clear now and has held true for quite sometime is that: Typical apps have 20% of their GC heap as strings. Most of the 16 bit characters have 0 in their upper byte. Thus you can save 10% of typical heaps by encoding in various ways that eliminate these pointless upper bytes.

It’s worth reading his entire response if you are interested in the full details of the proposal, including the trade-offs, benefits and drawbacks.

Implementation details

At a high-level the proposal would allow strings to be stored in 2 formats:

Regular - i.e. Unicode encoded, as they are currently stored by the CLR
Compact - ASCII, ISO-8859-1 (Latin-1) or even another format

When you create a string, the constructor would determine the most efficient encoding and encode the data in that format. The format used would then be stored in a field, so that the encoding is always known (CLR strings are immutable). That means that each method within the string class can use this field to determine how it operates, for instance the pseudo-code for the Equals method is shown below:

public bool Equals(string other)
{
    if (this.type != other.type)
       return false;
    if (type == ASCII)
        return StringASCII.Equals(this, other);
    else 
        return StringLatinUTF16.Equals(this, other);
} 

This shows a nice property of having strings in two formats; some operations can be short-circuited, because we know that strings stored in different encodings won’t be the same.
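As a rough illustration of the idea (not the proposed CLR implementation), here is a small managed sketch. CompactString, its fields and ContentEquals are all hypothetical names, and it assumes the compact format is Latin-1, i.e. every character is at most 0xFF:

```csharp
using System;
using System.Linq;
using System.Text;

public sealed class CompactString
{
    private readonly bool _isCompact;   // the one-byte "format" field
    private readonly byte[] _data;      // 1 byte per char when compact, else UTF-16

    public CompactString(string value)
    {
        // Pick the most compact format once, at construction time.
        _isCompact = value.All(c => c <= 0xFF);
        _data = _isCompact
            ? value.Select(c => (byte)c).ToArray()
            : Encoding.Unicode.GetBytes(value);
    }

    public int ByteLength => _data.Length;

    public bool ContentEquals(CompactString other)
    {
        if (other is null) return false;
        // Different formats can never hold the same text, because the
        // constructor always chooses the compact format when it can.
        if (_isCompact != other._isCompact) return false;
        if (_data.Length != other._data.Length) return false;
        for (int i = 0; i < _data.Length; i++)
            if (_data[i] != other._data[i]) return false;
        return true;
    }
}
```

Note that the short-circuit is only valid because the format is chosen deterministically: if a compactable string could ever be stored in the regular format, two equal strings might end up with different formats.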

Advantages

less overall memory usage (as-per @davidfowl “At the top of every ASP.NET profile… strings!”)
strings become more cache-friendly, which may give better performance

Disadvantages

Makes some operations slower due to the extra if (type == ...) check needed
Breaks the fixed keyword, as well as COM and P/Invoke interop that relies on the current string layout/format
If very few strings in the application can be compacted, this will have an overhead for no gain

Next steps

In his reply Vance Morrison highlighted that solving the issue with the fixed keyword was a first step, because that has a hard dependency on the current string layout. Once that’s done the real work of making large, sweeping changes to the CLR can be done:

The main challenge is dealing with fixed, but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of ‘fixed’ in corefx).

Thus it IS doable, but it is at least moderately expensive (man months), and the payoff is non-trivial but not huge.

So stay tuned, one day we might have a more compact, more efficient implementation of strings in the CLR, yay!!

Subverting .NET Type Safety with 'System.Runtime.CompilerServices.Unsafe'

2016-09-14T00:00:00+00:00

In which we use `System.Runtime.CompilerServices.Unsafe` a generic API (“type-safe” but still “unsafe”) and mess with the C# Type System!

The post covers the following topics:

What it is and why it’s useful
How it works
Code samples
Tricks you can do with it
Using it safely

What it is and why it’s useful

The XML documentation comments for System.Runtime.CompilerServices.Unsafe state that it:

Contains generic, low-level functionality for manipulating pointers.

But we can get a better understanding of what it is by looking at the actual API definition from the current NuGet package (4.0.0):

// Contains generic, low-level functionality for manipulating pointers.
public static class Unsafe
{
    // Casts the given object to the specified type.
    public static T As<T>(object o) where T : class;

    // Returns a pointer to the given by-ref parameter.    
    public static void* AsPointer<T>(ref T value);

    // Copies a value of type T to the given location.    
    public static void Copy<T>(void* destination, ref T source);

    // Copies a value of type T to the given location.
    public static void Copy<T>(ref T destination, void* source);

    // Copies bytes from the source address to the destination address.
    public static void CopyBlock(void* destination, void* source, uint byteCount);

    // Initializes a block of memory at the given location with a given initial value.    
    public static void InitBlock(void* startAddress, byte value, uint byteCount);

    // Reads a value of type T from the given location.
    public static T Read<T>(void* source);
    
    // Returns the size of an object of the given type parameter.    
    public static int SizeOf<T>();

    // Writes a value of type T to the given location.
    public static void Write<T>(void* destination, T value);
}

Note: I edited the XML doc-comments for brevity, the full versions are available in the source. There are also some additional methods that have been added to the API, but to make use of them you have to use a version of the C# compiler with support for ref returns and locals.

However this doesn’t really tell us why it’s useful, to get some background on that we can look at the GitHub issue “Provide a generic API to read from and write to a pointer”:

So at a high-level the goals of the System.Runtime.CompilerServices.Unsafe library are to:

Provide a safer way of writing low-level unsafe code
- Without this library you have to resort to fixed and pointer manipulation, which can be error prone
Allow access to functionality that can’t be expressed in C#, but is possible in IL
- For instance Unsafe.SizeOf<T>() allows access to the sizeof IL Opcode
Save developers from having to repeatedly write the same unsafe code
- There are already code-bases making use of it, including Kestrel, the high-performance web server based on libuv.

It’s also worth pointing out that the library is primarily for use with a Value Type (int, float, etc) rather than a class or Reference type. You can use it with classes, however you have to pin them first, so they don’t move about in memory whilst you are working with the pointer.
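As a small taste of the API, the sketch below uses two of its members. It assumes a runtime or package version where Unsafe.SizeOf<T>() and the ref-based Unsafe.As<TFrom, TTo>(ref TFrom) are available (the latter is one of the later additions that need ref returns); UnsafeBasics is a hypothetical wrapper class:

```csharp
using System;
using System.Runtime.CompilerServices;

public static class UnsafeBasics
{
    // Size in bytes of any T, via the IL 'sizeof' opcode.
    public static int SizeOf<T>() => Unsafe.SizeOf<T>();

    // Reinterprets the bits of an int as a uint, without any conversion.
    public static uint ReinterpretAsUInt32(int value) =>
        Unsafe.As<int, uint>(ref value);
}
```

For example UnsafeBasics.SizeOf<int>() returns 4 and UnsafeBasics.SizeOf<Guid>() returns 16, while ReinterpretAsUInt32(-1) gives uint.MaxValue because all 32 bits are set.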

Update: It was pointed out to me that Niels wrote an initial implementation of this library in a separate project, before Microsoft made their own version.

How it works

Because the library allows access to functionality that can’t be expressed in C#, it has to be written in raw IL, which is then compiled by a custom build-step. As an example we will look at the AsPointer method, which has the following signature:

public static void* AsPointer<T>(ref T value)

The IL for this is shown below, note how the ref keyword becomes & in IL and <T> is expressed as !!T:

.method public hidebysig static void* AsPointer<T>(!!T& 'value') cil managed aggressiveinlining
{
    .custom instance void System.Runtime.Versioning.NonVersionableAttribute::.ctor() = ( 01 00 00 00 )
    .maxstack 1
    ldarg.0
    conv.u
    ret
} // end of method Unsafe::AsPointer

Here we can see that it’s making use of the conv.u IL instruction. For reference, the explanation of this, along with some of the other op codes used by the library, is shown below:

Conv_U - Converts the value on top of the evaluation stack to unsigned native int, and extends it to native int.
Ldobj - Copies the value type object pointed to by an address to the top of the evaluation stack.
Stobj - Copies a value of a specified type from the evaluation stack into a supplied memory address.

After searching around I found several other places in the .NET Runtime that make use of raw IL in this way:

Code samples

There’s a nice set of unit tests that show the main use-cases for the library, for instance here is how to use Unsafe.Write(..) to directly change the value of an int via a pointer.

[Fact]
public static unsafe void WriteInt32()
{
    int value = 10;
    int* address = (int*)Unsafe.AsPointer(ref value);
    int expected = 20;
    Unsafe.Write(address, expected);

    Assert.Equal(expected, value);
    Assert.Equal(expected, *address);
    Assert.Equal(expected, Unsafe.Read<int>(address));
}

You can write something similar by manipulating pointers directly, but it’s not as straightforward (unless you are familiar with C or C++):

int value = 10;
int* ptr = &value;
*ptr = 30;
Console.WriteLine(value); // prints "30"

For a more real-world use case, the code below shows how you can access a KeyValuePair<DateTime, decimal> directly as a byte [] (taken from a GitHub discussion):

var dt = new KeyValuePair<DateTime, decimal>[2];
ref byte asRefByte = ref Unsafe.As<KeyValuePair<DateTime, decimal>, byte>(ref dt[0]);
fixed (byte * ptr = &asRefByte)
{
    // Treat the KeyValuePair<DateTime, decimal> as if it were a byte []
    ...
}

(this example is based on the StackOverflow question: “Get unsafe pointer to array of KeyValuePair<DateTime,decimal> in C#”)

Tricks you can do with it

Despite providing you with a nice strongly-typed API, you still have to mark your code as unsafe, which is a bit of a give-away that you can use it to do things that normal C# can’t!

Breaking immutability

Strings in C# are immutable and the runtime goes to great lengths to ensure you can’t bypass this behaviour. However under-the-hood the String data is just bytes which can be manipulated, indeed the runtime does this manipulation itself inside the StringBuilder class.

So using Unsafe.Write(..) we can modify the contents of a String - yay!! However it needs to be pointed out that this code will potentially break the behaviour of the String class in many subtle ways, so don’t ever use it in a real application!!

var text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

Console.WriteLine("String Length {0}", text.Length); // prints 26
Console.WriteLine("Text: \"{0}\"", text); // "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

var pinnedText = GCHandle.Alloc(text, GCHandleType.Pinned);
char* textAddress = (char*)pinnedText.AddrOfPinnedObject().ToPointer();

// Make an immutable string think that it is shorter than it actually is!!!
Unsafe.Write(textAddress - 2, 5);

Console.WriteLine("String Length {0}", text.Length); // prints 5
Console.WriteLine("Text: \"{0}\"", text); // prints "ABCDE"

// change the 2nd character 'B' to '@'
Unsafe.Write(textAddress + 1, '@');

Console.WriteLine("Text: \"{0}\"", text); // prints "A@CDE"

pinnedText.Free();

Messing with the CLR type-system

But we can go even further than that and do a really nasty trick to completely defeat the CLR type-system. This code is horrible and could potentially break the CLR in several ways, so as before don’t ever use it in a real application!!

int intValue = 5;
float floatValue = 5.0f;
object boxedInt = (object)intValue, boxedFloat = (object)floatValue;

var pinnedFloat = GCHandle.Alloc(boxedFloat, GCHandleType.Pinned);
var pinnedInt = GCHandle.Alloc(boxedInt, GCHandleType.Pinned);

int* floatAddress = (int*)pinnedFloat.AddrOfPinnedObject().ToPointer();
int* intAddress = (int*)pinnedInt.AddrOfPinnedObject().ToPointer();

Console.WriteLine("Type: {0}, Value: {1}", boxedInt.GetType().FullName, boxedInt);

// Make an int think it's a float!!!
int floatType = Unsafe.Read<int>(floatAddress - 1);
Unsafe.Write(intAddress - 1, floatType);

Console.WriteLine("Type: {0}, Value: {1}", boxedInt.GetType().FullName, boxedInt);

pinnedFloat.Free();
pinnedInt.Free();

Which prints out:

Type: System.Int32, Value: 5

Type: System.Single, Value: 7.006492E-45

Yep, we’ve managed to convince an int (Int32) type that it’s actually a float (Single) and behave like one instead!!

This works by overwriting the Method Table pointer for the int, with the same value as the float one. So when it looks up its type or prints out its value, it uses the float methods instead! Thanks to @Porges for the example that motivated this, his code does the same thing using fixed instead.
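The odd-looking value 7.006492E-45 is simply what you get when the 32 bits of the integer 5 are read as an IEEE 754 single. You can reproduce that part safely, without touching any Method Tables, as in this sketch (BitPatternDemo is a hypothetical helper and it assumes a runtime that has BitConverter.Int32BitsToSingle):

```csharp
using System;

public static class BitPatternDemo
{
    // Reinterprets the 32 bits of an int as an IEEE 754 single.
    public static float IntBitsAsFloat(int bits) =>
        BitConverter.Int32BitsToSingle(bits);
}
```

The bit pattern 0x00000005 is a denormal, equal to 5 times float.Epsilon (the smallest positive float, roughly 1.401298E-45), whereas the real 5.0f has the bit pattern 0x40A00000.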

Using it safely

Despite the library requiring you to annotate your code with unsafe, there are still some safe or maybe more accurately safer ways to use it!

Fortunately one of the main .NET runtime developers provided a nice list of what you can and can’t do:

But as with all unsafe code, you’re asking the runtime to let you do things that you are normally prevented from doing, things that it normally saves you from, so you have to be careful!

Discuss this post in /r/csharp or /r/programming

The post Subverting .NET Type Safety with 'System.Runtime.CompilerServices.Unsafe' first appeared on my blog Performance is a Feature!

Analysing .NET Memory Dumps with CLR MD

2016-09-06T00:00:00+00:00

If you’ve ever spent time debugging .NET memory dumps in WinDBG you will be familiar with the commands shown below, which aren’t always the most straight-forward to work with!

However back in May 2013 Microsoft released the CLR MD library, describing it as:

… a set of advanced APIs for programmatically inspecting a crash dump of a .NET program much in the same way as the SOS Debugging Extensions (SOS). It allows you to write automated crash analysis for your applications and automate many common debugger tasks.

This post explores some of the things you can achieve by instead using CLR MD, a C# library which is now available as a NuGet Package. If you’re interested the full source code for all the examples is available.

Getting started with CLR MD

This post isn’t meant to serve as a Getting Started guide, there’s already a great set of Tutorials linked from the project README that serve that purpose:

Getting Started - A brief introduction to the API and how to create a CLRRuntime instance.
The CLRRuntime Object - Basic operations like enumerating AppDomains, Threads, the Finalizer Queue, etc.
Walking the Heap - Walking objects on the GC heap, working with types in CLR MD.
Types and Fields in CLRMD - More information about dealing with types and fields in CLRMD.
Machine Code in CLRMD - Getting access to the native code produced by the JIT or NGEN

However we will be looking at what else CLR MD allows you to achieve.

Detailed GC Heap Information

I’ve previously written about the Garbage Collectors, so the first thing that we’ll do is see what GC related information we can obtain. The .NET GC creates 1 or more Heaps, depending on the number of CPU cores available and the mode it is running in (Server/Workstation). These heaps are in turn made up of several Segments, for the different Generations (Gen0/Ephemeral, Gen1, Gen2 and Large). Finally it’s worth pointing out that the GC initially Reserves the memory it wants, but only Commits it when it actually needs to. So using the code shown here, we can iterate through the different GC Heaps, printing out the information about their individual Segments as we go:

Analysing String usage

But knowing what’s inside those heaps is more useful, as David Fowler nicely summed up in a tweet, strings often significantly contribute to memory usage:

Now we could analyse the memory dump to produce a list of the most frequently occurring strings, as Nick Craver did with a memory dump from the App Pool of a Stack Overflow server (click for larger image):

However we’re going to look more closely at the actual contents of the string and in particular analyse what the underlying encoding is, i.e. ASCII, ISO-8859-1 (Latin-1) or Unicode.

By default the .NET string Encoder, instead of giving an error, replaces any characters it can’t convert with ‘�’ (which is known as the Unicode Replacement Character). So we will need to force it to throw an exception. This means we can detect the most compact encoding possible, by trying to convert the raw string data to ASCII, ISO-8859-1 (Latin-1) and then Unicode (sequence of UTF-16 code units) in turn. To see this in action, below is the code from the IsASCII(..) function:

private static Encoding asciiEncoder = Encoding.GetEncoding(
        Encoding.ASCII.EncodingName, 
        EncoderFallback.ExceptionFallback, 
        DecoderFallback.ExceptionFallback);
   
private static bool IsASCII(string text, out byte[] textAsBytes)
{
    var unicodeBytes = Encoding.Unicode.GetBytes(text);
    try
    {
        textAsBytes = Encoding.Convert(Encoding.Unicode, asciiEncoder, unicodeBytes);
        return true;
    }
    catch (EncoderFallbackException /*efEx*/)
    {
        textAsBytes = null;
        return false;
    }
}

Next we run this on a memory dump of Visual Studio with the HeapStringAnalyser source code solution loaded and get the following output:

The most interesting part is reproduced below:

Overall 145,872 "System.String" objects take up 12,391,286 bytes (11.82 MB)
Of this underlying byte arrays (as Unicode) take up 10,349,078 bytes (9.87 MB)
Remaining data (object headers, other fields, etc) are 2,042,208 bytes (1.95 MB), at 14 bytes per object

Actual Encoding that the "System.String" could be stored as (with corresponding data size)
       10,339,638 bytes ( 145,505 strings) as ASCII
            3,370 bytes (      65 strings) as ISO-8859-1 (Latin-1)
            6,070 bytes (     302 strings) as Unicode
Total: 10,349,078 bytes

So in this case we can see that out of the 145,872 string objects in memory, 145,505 of them could actually be stored as ASCII, a further 65 as ISO-8859-1 (Latin-1) and only 302 need the full Unicode encoding.
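The three-way classification can be sketched as follows. This is not the HeapStringAnalyser code, just a minimal illustration of the “try each encoding in turn” approach, and EncodingClassifier is a hypothetical name. It assumes the runtime supports the "iso-8859-1" encoding name without registering extra providers:

```csharp
using System;
using System.Text;

public static class EncodingClassifier
{
    // Encoders that throw instead of silently substituting a replacement character.
    private static readonly Encoding StrictAscii = Encoding.GetEncoding(
        "us-ascii", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    private static readonly Encoding StrictLatin1 = Encoding.GetEncoding(
        "iso-8859-1", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

    private static bool CanEncode(Encoding encoding, string text)
    {
        try
        {
            encoding.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    // Tries the most compact encoding first and falls back to UTF-16.
    public static string MostCompactEncoding(string text)
    {
        if (CanEncode(StrictAscii, text)) return "ASCII";
        if (CanEncode(StrictLatin1, text)) return "ISO-8859-1 (Latin-1)";
        return "Unicode (UTF-16)";
    }
}
```

Using exceptions for control flow like this is slow, which is fine for an offline analysis tool, but a character-by-character range check would be a better fit for anything on a hot path.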

Additional resources

Hopefully this post has demonstrated that CLR MD is a powerful tool. If you want to find out more, please refer to the links below:

Traversing the GC Heap with ClrMd
msos - Command-line environment a-la WinDbg for executing SOS commands without having SOS available
.NET Crash Dump and Live Process Inspection
ClrMD.Extensions
Get most duplicated strings from a heap dump using ClrMD
Dumpty - A Dump tool for .Net.
How to properly work with non-primitive ClrInstanceField values using ClrMD?

The post Analysing .NET Memory Dumps with CLR MD first appeared on my blog Performance is a Feature!

Analysing Optimisations in the Wire Serialiser

2016-08-23T00:00:00+00:00

Recently Roger Johansson wrote a post titled Wire – Writing one of the fastest .NET serializers, describing the optimisations that were implemented to make Wire as fast as possible. He also followed up that post with a set of benchmarks, showing how Wire compared to other .NET serialisers:

Using BenchmarkDotNet, this post will analyse the individual optimisations and show how much faster each change is. For reference, the full list of optimisations in the original blog post is:

Looking up value serializers by type
Looking up types when deserializing
Byte buffers, allocations and GC
Clever allocations
Boxing, Unboxing and Virtual calls
Fast creation of empty objects

Looking up value serializers by type

This optimisation changed code like this:

public ValueSerializer GetSerializerByType(Type type)
{
  ValueSerializer serializer;
 
  if (_serializers.TryGetValue(type, out serializer))
    return serializer;
 
  //more code to build custom type serializers.. ignore for now.
}

into this:

public ValueSerializer GetSerializerByType(Type type)
{
  if (ReferenceEquals(type.GetTypeInfo().Assembly, ReflectionEx.CoreAssembly))
  {
    if (type == TypeEx.StringType) //we simply keep a reference to each primitive type
      return StringSerializer.Instance;
 
    if (type == TypeEx.Int32Type)
      return Int32Serializer.Instance;
 
    if (type == TypeEx.Int64Type)
      return Int64Serializer.Instance;
    ...
}

So it has replaced a dictionary lookup with an if statement. In addition it is caching the Type instance of known types, rather than calculating them every time. As you can see the optimisation pays off in some circumstances but not in others, so it’s not a clear win. It depends on where the type is in the list of if statements. If it’s near the beginning (e.g. System.String) it’ll be quicker than if it’s near the end (e.g. System.Byte[]), which makes sense as all the other comparisons have to be done first.

Full benchmark code and results
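To show the trade-off in isolation, here is a cut-down sketch of the pattern. SerializerLookup is hypothetical and returns serializer names as strings, where the real code returns ValueSerializer instances:

```csharp
using System;
using System.Collections.Generic;

public static class SerializerLookup
{
    // Type instances cached once, rather than evaluating typeof(..) per call.
    private static readonly Type StringType = typeof(string);
    private static readonly Type Int32Type = typeof(int);

    // Fallback for everything that is not on the fast path.
    private static readonly Dictionary<Type, string> Others = new Dictionary<Type, string>
    {
        { typeof(Guid), "GuidSerializer" },
        { typeof(DateTime), "DateTimeSerializer" },
    };

    // Returns a serializer name (a stand-in for a real serializer instance).
    public static string GetSerializerName(Type type)
    {
        // Fast path: reference comparisons, ordered most-common first.
        if (type == StringType) return "StringSerializer";
        if (type == Int32Type) return "Int32Serializer";

        // Slow path: hash the Type and probe the dictionary.
        return Others.TryGetValue(type, out var name) ? name : null;
    }
}
```

Each extra if on the fast path adds a comparison for every type that comes after it, so the ordering only pays off when the most frequently serialised types are listed first.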

Looking up types when deserializing

The 2nd optimisation works by removing all unnecessary memory allocations. It did this by:

Using a custom struct (value type) rather than a class
Pre-calculating a hash code once, rather than each time a comparison is needed.
Doing string comparisons with raw byte [], rather than deserialising to a string

Full benchmark code and results

Note: these results nicely demonstrate how BenchmarkDotNet can show you memory allocations as well as the time taken.

Interestingly they hadn’t actually removed all memory allocations as the comparisons between OptimisedLookup and OptimisedLookupCustomComparer show. To fix this I sent a PR which removes unnecessary boxing, by using a Custom Comparer rather than the default struct comparer.
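The combination of a struct key, a pre-calculated hash code and a custom comparer can be sketched like this. ByteArrayKey and ByteArrayKeyComparer are hypothetical names rather than Wire’s types, and the FNV-1a hash is just an assumption for the example:

```csharp
using System;
using System.Collections.Generic;

// A value-type key that caches its hash code at construction time.
public readonly struct ByteArrayKey
{
    public readonly byte[] Bytes;
    public readonly int Hash;

    public ByteArrayKey(byte[] bytes)
    {
        Bytes = bytes;
        // Simple FNV-1a hash, computed once (the choice of hash is an assumption).
        unchecked
        {
            uint h = 2166136261u;
            foreach (var b in bytes) h = (h ^ b) * 16777619u;
            Hash = (int)h;
        }
    }
}

// Without a custom comparer, a struct that doesn't implement IEquatable<T>
// falls back to Object.Equals(object), which boxes on each comparison.
public sealed class ByteArrayKeyComparer : IEqualityComparer<ByteArrayKey>
{
    public static readonly ByteArrayKeyComparer Instance = new ByteArrayKeyComparer();

    public bool Equals(ByteArrayKey x, ByteArrayKey y)
    {
        // Cheap checks first, then compare the raw bytes (no string is created).
        if (x.Hash != y.Hash || x.Bytes.Length != y.Bytes.Length) return false;
        for (int i = 0; i < x.Bytes.Length; i++)
            if (x.Bytes[i] != y.Bytes[i]) return false;
        return true;
    }

    public int GetHashCode(ByteArrayKey key) => key.Hash;
}
```

The comparer is then passed to the dictionary constructor, e.g. new Dictionary<ByteArrayKey, Type>(ByteArrayKeyComparer.Instance), so lookups never go through the boxing default.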

Byte buffers, allocations and GC

Again, removing unnecessary memory allocations was key in this optimisation, most of which can be seen in the NoAllocBitConverter. Clearly serialisation spends a lot of time converting from the in-memory representation of an object to the serialised version, i.e. a byte []. So several tricks were employed to ensure that temporary memory allocations were either removed completely or if that wasn’t possible, they were done by re-using a buffer from a pool rather than allocating a new one each time (see “Buffer recycling”).

Full benchmark code and results
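For a flavour of what buffer re-use looks like, here is a sketch that writes an int to a stream without allocating a new byte [] per call. It is not Wire’s NoAllocBitConverter: it swaps in ArrayPool<byte> and BinaryPrimitives from the framework (assuming a runtime that has them), and PooledWriter is a hypothetical name:

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;

public static class PooledWriter
{
    // Writes an int to the stream using a rented buffer instead of a new array.
    public static void WriteInt32(Stream stream, int value)
    {
        // Rent may hand back a larger array than requested; only 4 bytes are used.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(sizeof(int));
        try
        {
            BinaryPrimitives.WriteInt32LittleEndian(buffer, value);
            stream.Write(buffer, 0, sizeof(int));
        }
        finally
        {
            // Always hand the buffer back so the next caller can re-use it.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Compare this with stream.Write(BitConverter.GetBytes(value), 0, 4), which allocates a fresh 4-byte array on every call.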

Clever allocations

This optimisation is perhaps the most interesting, because it’s implemented by creating a custom data structure, tailored to the specific needs of Wire. So, rather than using the default .NET dictionary, they implemented FastTypeUShortDictionary. In essence this data structure optimises for having only 1 item, but falls back to a regular dictionary when it grows larger. To see this in action, here is the code from the TryGetValue(..) method:

public bool TryGetValue(Type key, out ushort value)
{
    switch (_length)
    {
        case 0:
            value = 0;
            return false;
        case 1:
            if (key == _firstType)
            {
                value = _firstValue;
                return true;
            }
            value = 0;
            return false;
        default:
            return _all.TryGetValue(key, out value);
    }
}

Like we’ve seen before, the performance gains aren’t clear-cut. For instance it depends on whether FastTypeUShortDictionary contains the item you are looking for (Hit v Miss), but generally it is faster:

Full benchmark code and results

Boxing, Unboxing and Virtual calls

This optimisation is based on the widely used trick that I imagine almost all .NET serialisers employ. For a serialiser to be generic, it has to be able to handle any type of object that is passed to it. Therefore the first thing it does is use reflection to find the public fields/properties of that object, so that it knows the data it has to serialise. Doing reflection like this time-and-time again gets expensive, so the way to get round it is to do reflection once and then use dynamic code generation to compile a delegate that you can then call again and again.

If you are interested in how to implement this, see the Wire compiler source or this Stack Overflow question. As shown in the results below, compiling code dynamically is much faster than reflection and only a little bit slower than if you read/write the property directly in C# code:

Full benchmark code and results
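To make the “reflect once, then compile a delegate” idea concrete, here is a minimal sketch using Expression trees. It is not Wire’s compiler, GetterCompiler is a hypothetical helper and it only handles public properties:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public static class GetterCompiler
{
    // Builds and compiles: (object target) => (object)((T)target).Property
    public static Func<object, object> Compile(Type type, string propertyName)
    {
        // Reflection happens here, once, when the delegate is built.
        PropertyInfo property = type.GetProperty(propertyName);
        ParameterExpression target = Expression.Parameter(typeof(object), "target");
        Expression body = Expression.Convert(
            Expression.Property(Expression.Convert(target, type), property),
            typeof(object));
        return Expression.Lambda<Func<object, object>>(body, target).Compile();
    }
}
```

The expensive part is the call to Compile(), so the resulting delegate should be cached per type and property, after which each call costs little more than a direct property access plus the boxing of the result.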

Fast creation of empty objects

The final optimisation trick used is also based on dynamic code creation, but this time it is purely dealing with creating empty objects. Again this is something that a serialiser does many times, so any optimisations or savings are worth it.

Basically the benchmark is comparing code like this:

FormatterServices.GetUninitializedObject(type);

with dynamically generated code, based on Expression trees:

var newExpression = ExpressionEx.GetNewExpression(typeToUse);
Func<TestClass> optimisation = Expression.Lambda<Func<TestClass>>(newExpression).Compile();

However this trick only works if the constructor of the type being created is empty, otherwise it has to fall back to the slow version. But as shown in the results below, we can see that the optimisation is a clear win and worth implementing:

Full benchmark code and results
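The two approaches being compared can be sketched side by side. ObjectFactory is a hypothetical helper, and it uses RuntimeHelpers.GetUninitializedObject (assuming a runtime that has it) in place of the FormatterServices call shown earlier:

```csharp
using System;
using System.Linq.Expressions;
using System.Runtime.CompilerServices;

public static class ObjectFactory
{
    // Slow path: allocates the object without running any constructor.
    public static object CreateUninitialized(Type type) =>
        RuntimeHelpers.GetUninitializedObject(type);

    // Fast path: compile "() => new T()" once, then re-use the delegate.
    // Requires a public parameterless constructor.
    public static Func<object> CompileConstructor(Type type) =>
        Expression.Lambda<Func<object>>(
            Expression.Convert(Expression.New(type), typeof(object))).Compile();
}
```

Unlike the uninitialized version, the compiled delegate runs the constructor, which is why the two are only interchangeable when the constructor does nothing.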

Summary

So it’s obvious that Roger Johansson and Szymon Kulec (who also contributed performance improvements) know their optimisations and as a result they have steadily made the Wire serialiser faster, which makes it an interesting project to learn from.

The post Analysing Optimisations in the Wire Serialiser first appeared on my blog Performance is a Feature!

Preventing .NET Garbage Collections with the TryStartNoGCRegion API

2016-08-16T00:00:00+00:00

Pauses are a known problem in runtimes that have a Garbage Collector (GC), such as Java or .NET. GC Pauses can last several milliseconds, during which your application is blocked or suspended. One way you can alleviate the pauses is to modify your code so that it doesn’t allocate, i.e. so the GC has nothing to do. But this can require lots of work and you really have to understand the runtime as many allocations are hidden.

Another technique is to temporarily suspend the GC, during a critical region of your code where you don’t want any pauses and then start it up again afterwards. This is exactly what the TryStartNoGCRegion API (added in .NET 4.6) allows you to do.

From the MSDN docs:

Attempts to disallow garbage collection during the execution of a critical path if a specified amount of memory is available.

TryStartNoGCRegion in Action

To see how the API works, I ran some simple tests using the .NET GC Workstation mode, on a 32-bit CPU. The tests simply call TryStartNoGCRegion and then verify how much memory can be allocated before a Collection happens. The code is available if you want to try it out for yourself.

Test 1: Regular allocation, `TryStartNoGCRegion` not called

You can see that a garbage collection happens after the 2nd allocation (indicated by “**”):

Prevent GC: False, Over Allocate: False
Allocated:   3.00 MB, Mode:  Interactive, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:   6.00 MB, Mode:  Interactive, Gen0: 1, Gen1: 1, Gen2: 1, **
Allocated:   9.00 MB, Mode:  Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Allocated:  12.00 MB, Mode:  Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Allocated:  15.00 MB, Mode:  Interactive, Gen0: 1, Gen1: 1, Gen2: 1,

Test 2: `TryStartNoGCRegion(..)` with size set to 15MB

Here we see that despite allocating the same amount as in the first test, no garbage collections are triggered during the run.

Prevent GC: True, Over Allocate: False
TryStartNoGCRegion: Size=15 MB (15,360 K or 15,728,640 bytes) SUCCEEDED
Allocated:   3.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:   6.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:   9.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:  12.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:  15.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,

Test 3: `TryStartNoGCRegion(..)` size of 15MB, but allocating more than 15MB

Finally we see that once we’ve allocated more than the size we asked for, the mode switches from NoGCRegion to Interactive and garbage collections can now happen.

Prevent GC: True, Over Allocate: True
TryStartNoGCRegion: Size=15 MB (15,360 K or 15,728,640 bytes) SUCCEEDED
Allocated:   3.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:   6.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:   9.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:  12.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:  15.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:  18.00 MB, Mode:   NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated:  21.00 MB, Mode:  Interactive, Gen0: 1, Gen1: 1, Gen2: 1, **
Allocated:  24.00 MB, Mode:  Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Allocated:  27.00 MB, Mode:  Interactive, Gen0: 2, Gen1: 2, Gen2: 2, **
Allocated:  30.00 MB, Mode:  Interactive, Gen0: 2, Gen1: 2, Gen2: 2,

So this shows that at least in the simple test we’ve done, the API works as advertised. As long as you don’t subsequently allocate more memory than you asked for, no Garbage Collections will take place.
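The general shape of the code you need around a critical region is sketched here. This is not the test harness used for the results in this post, and NoGcRegionDemo is a hypothetical helper:

```csharp
using System;
using System.Runtime;

public static class NoGcRegionDemo
{
    // Runs 'work' inside a no-GC region of 'totalSize' bytes, if one can be
    // started. Returns the number of Gen0 collections that happened during work.
    public static int Run(long totalSize, Action work)
    {
        bool started = GC.TryStartNoGCRegion(totalSize);
        int before = GC.CollectionCount(0);
        try
        {
            work();
            return GC.CollectionCount(0) - before;
        }
        finally
        {
            // The region ends by itself if 'work' over-allocates, and
            // EndNoGCRegion throws if no region is in progress, so check first.
            if (started && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
        }
    }
}
```

Checking GCSettings.LatencyMode before calling GC.EndNoGCRegion() matters because, as Test 3 shows, the runtime silently leaves the region once you allocate more than you asked for.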

Object Size

However, there are a few caveats when using TryStartNoGCRegion. The first is that you are required to know up-front the total size, in bytes, of the objects you will be allocating. As we’ve seen previously, if you allocate more than totalSize bytes, the No GC Region will no longer be active and it will then be possible for garbage collections to happen.

It’s not straightforward to get the size of an object in .NET; it’s a managed runtime and it tries its best to hide that sort of detail from you. To further complicate matters, it varies depending on the CPU architecture and even the version of the runtime.

But you do have a few options:

1. Guess?!
2. Search on Stack Overflow
3. Start up WinDbg and use the !objsize command on a memory dump of your process
4. Get an estimate using the technique that Jon Skeet proposes
5. Use DotNetEx, which relies on inspecting the internal fields of the CLR object

Personally I would go with a variation of 3), use WinDbg, but automate it using the excellent CLRMD C# library.
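If you just want a rough number, the estimation approach (the fourth option above) can be approximated by measuring the heap before and after allocating a large batch of instances. This is my own minimal sketch of that idea, not the exact code being referred to; `ObjectSizeEstimator` and `EstimateBytesPerInstance` are hypothetical names, and the result is an approximation that also includes any objects the factory allocates internally.

```csharp
using System;

public static class ObjectSizeEstimator
{
    // Estimates the average heap cost of one instance by allocating
    // 'count' of them and dividing the growth in GC heap size.
    public static long EstimateBytesPerInstance(Func<object> factory, int count = 100000)
    {
        // Allocate the holding array first so it is not part of the delta.
        var keepAlive = new object[count];

        long before = GC.GetTotalMemory(forceFullCollection: true);
        for (int i = 0; i < count; i++)
            keepAlive[i] = factory();
        long after = GC.GetTotalMemory(forceFullCollection: true);

        // Make sure the instances are still reachable when we measure.
        GC.KeepAlive(keepAlive);
        return (after - before) / count;
    }
}
```

For example, `ObjectSizeEstimator.EstimateBytesPerInstance(() => new object())` gives an estimate for an empty object, which will differ between 32-bit and 64-bit processes.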

Segment Size

Update: It turns out that I completely missed the section on segment sizes on the MSDN page, thanks to Maoni for pointing this out to me. In the section on “Generations” there is the following chart (which fortunately correlates with my findings below):

However even when you know how many bytes will be allocated within the No GC Region, you still need to ensure that it’s less than the maximum amount allowed, because if you specify a value that is too large an ArgumentOutOfRangeException is thrown. From the MSDN docs (emphasis mine):

The amount of memory in bytes to allocate without triggering a garbage collection. It must be less than or equal to the size of an ephemeral segment. For information on the size of an ephemeral segment, see the “Ephemeral generations and segments” section in the Fundamentals of Garbage Collection article.

~~However if you visit the linked article on GC Fundamentals, it has no exact figure for the size of an ephemeral segment, it does however have this stark warning:~~

Important The size of segments allocated by the garbage collector is implementation-specific and is subject to change at any time, including in periodic updates. Your app should never make assumptions about or depend on a particular segment size, nor should it attempt to configure the amount of memory available for segment allocations.

~~Excellent, that’s very helpful!?~~

So let me get this straight: to prevent TryStartNoGCRegion from throwing an exception, we have to pass in a totalSize value that isn’t larger than the size of an ephemeral segment, but we’re not allowed to know the actual size of an ephemeral segment, in case we assume too much!!

~~So where does that leave us?~~

Well fortunately it’s possible to figure out the size of an ephemeral or Small Object Heap (SOH) segment using either VMMap, or the previously mentioned CLRMD library (code sample available).

Here are the results I got with the .NET Framework 4.6.1, running on a 4 Core (HT) - Intel® Core™ i7-4800MQ, i.e. Environment.ProcessorCount = 8. If you click on the links for each row heading, you can see the full breakdown as reported by VMMap.

GC Mode	CPU Arch	SOH Segment	LOH Segment	Initial GC Size	Largest No GC Region `totalSize` value
Workstation	32-bit	16 MB	16 MB	32 MB	16 MB
Workstation	64-bit	256 MB	128 MB	384 MB	244 MB
Server	32-bit	32 MB	16 MB	384 MB	256 MB
Server	64-bit	2,048 MB	256 MB	18,423 MB	16,384 MB

The final column is the largest totalSize value that can be passed into TryStartNoGCRegion(long totalSize); this was found by experimentation/trial-and-error.
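The trial-and-error can be automated with a binary search. The following is a sketch rather than the code I actually used: `NoGcRegionProbe` and its methods are hypothetical names, and it assumes that if a given totalSize is accepted then every smaller value is accepted too. Be aware that each probe can trigger a full collection, and probing very large values may reserve or commit a lot of memory.

```csharp
using System;
using System.Runtime;

public static class NoGcRegionProbe
{
    // Returns the largest totalSize (up to upperBoundBytes) that
    // TryStartNoGCRegion accepts, or 0 if none is accepted.
    public static long FindLargestTotalSize(long upperBoundBytes)
    {
        long lo = 0;                 // largest size known to work (0 = none yet)
        long hi = upperBoundBytes;   // largest size still worth trying
        while (lo < hi)
        {
            long mid = lo + (hi - lo + 1) / 2;   // always >= 1
            if (CanStart(mid))
                lo = mid;
            else
                hi = mid - 1;
        }
        return lo;
    }

    private static bool CanStart(long totalSize)
    {
        try
        {
            if (!GC.TryStartNoGCRegion(totalSize))
                return false;
            // Leave the region straight away so the next probe can start one.
            if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                GC.EndNoGCRegion();
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            // Thrown when totalSize is larger than the runtime allows.
            return false;
        }
    }
}
```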

Note: The main difference between Server and Workstation is that in Workstation mode there is only one heap, whereas in Server mode there is one heap per logical CPU.

TryStartNoGCRegion under-the-hood

What’s nice is that the entire feature is in a single GitHub commit, so it’s easy to see what code changes were made:

Around half of the files modified (listed below) are the changes needed to set up the plumbing and error handling involved in adding an API to the System.GC class. They also give an interesting overview of what’s involved in having the external C# code talk to the internal C++ code in the CLR (click on a link to go directly to the diff):

The rest of the changes are where the actual work takes place, with all the significant heavy-lifting happening in gc.cpp:

TryStartNoGCRegion Implementation

When you call TryStartNoGCRegion the following things happen:

The maximum required heap sizes are calculated based on the totalSize parameter passed in. These calculations take place in gc_heap::prepare_for_no_gc_region
If the current heaps aren’t large enough to accommodate the new value, they are re-sized. To achieve this a full collection is triggered (see GCHeap::StartNoGCRegion)

Note: Due to the way the GC uses segments, it won’t always allocate memory. It will however ensure that it reserves the maximum amount of memory required, so that it can be committed when actually needed.

Then, the next time the GC wants to perform a collection, it checks:

1. Is the current mode set to No GC Region?
- By checking gc_heap::settings.pause_mode == pause_no_gc, relevant code here
2. Can we stay in the No GC Region mode?
- This is done by calling gc_heap::should_proceed_for_no_gc(), which performs a sanity-check to ensure that we haven’t allocated more than the # of bytes we asked for when TryStartNoGCRegion was set up

If 1) and 2) are both true then a collection does not take place because the GC knows that it has already reserved enough memory to fulfil future allocations, so it doesn’t need to clean up any existing garbage to make space.
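The two checks can be modelled in a few lines of C#. To be clear, this is a toy model of the decision described above and not a translation of the real code in gc.cpp; `NoGcRegionModel`, `PauseMode` and their members are all hypothetical names.

```csharp
using System;

public enum PauseMode { Interactive, NoGC }

public sealed class NoGcRegionModel
{
    private readonly long _budgetBytes;   // the totalSize that was asked for
    private long _allocatedBytes;         // bytes allocated since the region started

    public PauseMode Mode { get; private set; } = PauseMode.NoGC;

    public NoGcRegionModel(long budgetBytes)
    {
        _budgetBytes = budgetBytes;
    }

    public void Allocate(long bytes)
    {
        _allocatedBytes += bytes;
    }

    // Returns true if a collection should go ahead.
    public bool ShouldCollect()
    {
        // Check 1: are we in No GC Region mode at all?
        if (Mode != PauseMode.NoGC)
            return true;

        // Check 2: are we still within the budget we asked for?
        if (_allocatedBytes <= _budgetBytes)
            return false;

        // Budget exceeded: the region ends and collections resume.
        Mode = PauseMode.Interactive;
        return true;
    }
}
```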

GC Pauses and Safe Points

2016-08-08T00:00:00+00:00

GC pauses are a popular topic; if you do a Google search, you’ll see lots of articles explaining how to measure and, more importantly, how to reduce them. The issue is that in most runtimes that have a GC, allocating objects is a quick operation, but at some point in time the GC will need to clean up all the garbage and to do this it has to pause the entire runtime (except if you happen to be using Azul’s pauseless GC for Java).

The GC needs to pause the entire runtime so that it can move around objects as part of its compaction phase. If these objects were being referenced by code that was simultaneously executing then all sorts of bad things would happen. So the GC can only make these changes when it knows that no other code is running, hence the need to pause the entire runtime.

GC Flow

In a previous post I demonstrated how you can use ETW Events to visualise what the .NET Garbage Collector (GC) is doing. That post included the following GC flow for a Foreground/Blocking Collection (info taken from the excellent blog post by Maoni Stephens the main developer on the .NET GC):

1. GCSuspendEE_V1
2. GCSuspendEEEnd_V1 <– suspension is done
3. GCStart_V1
4. GCEnd_V1 <– actual GC is done
5. GCRestartEEBegin_V1
6. GCRestartEEEnd_V1 <– resumption is done.

This post is going to be looking at how the .NET Runtime brings all the threads in an application to a safe-point so that the GC can do its work. This corresponds to what happens between step 1) GCSuspendEE_V1 and 2) GCSuspendEEEnd_V1 in the flow above.

For some background this passage from the excellent Pro .NET Performance: Optimize Your C# Applications explains what’s going on:

Technically the GC itself doesn’t actually perform a suspension; it calls into the Execution Engine (EE) and asks it to suspend all the running threads. This suspension needs to be as quick as possible, because the time taken contributes to the overall GC pause. Therefore this Time To Safe Point (TTSP), as it’s known, needs to be minimised and the CLR does this by using several techniques.

GC suspension in Runtime code

Inside code that it controls, the runtime inserts method calls to ensure that threads can regularly poll to determine when they need to suspend. For instance take a look at the following code snippet from the IndexOfCharArray() method (which is called internally by String.IndexOfAny(..)). Notice that it contains multiple calls to the macro FC_GC_POLL_RET():

FCIMPL4(INT32, COMString::IndexOfCharArray, StringObject* thisRef, CHARArray* valueRef, INT32 startIndex, INT32 count)
{
    // <OTHER CODE REMOVED>

    // use probabilistic map, see (code:InitializeProbabilisticMap)
    int charMap[PROBABILISTICMAP_SIZE] = {0};

    InitializeProbabilisticMap(charMap, valueChars, valueLength);

    for (int i = startIndex; i < endIndex; i++) {
        WCHAR thisChar = thisChars[i];
        if (ProbablyContains(charMap, thisChar))
            if (ArrayContains(thisChars[i], valueChars, valueLength) >= 0) {
                FC_GC_POLL_RET();
                return i;
            }
    }

    FC_GC_POLL_RET();
    return -1;
}

There are lots of other places in the runtime where these calls are inserted, to ensure that a GC suspension can happen as soon as possible. However having these calls spread throughout the code has an overhead, so the runtime uses a special trick to ensure the cost is only paid when a suspension has actually been requested. From jithelp.asm you can see that the method call is re-written to a nop routine when not needed and only calls the actual JIT_PollGC() function when absolutely required:

; Normally (when we're not trying to suspend for GC), the 
; CORINFO_HELP_POLL_GC helper points to this nop routine.  When we're 
; ready to suspend for GC, we whack the Jit Helper table entry to point 
; to the real helper. When we're done with GC we whack it back.
PUBLIC @JIT_PollGC_Nop@0
@JIT_PollGC_Nop@0 PROC
ret
@JIT_PollGC_Nop@0 ENDP

However calls to FC_GC_POLL need to be carefully inserted in the correct locations: too few and the EE won’t be able to suspend quickly enough, which will cause excessive GC pauses, as this comment from one of the .NET JIT devs confirms:

GC suspension in User code

Alternatively, in code that the runtime doesn’t control things are a bit different. Here the JIT analyses the code and classifies it as either:

Partially interruptible
Fully interruptible

Partially interruptible code can only be suspended at explicit GC poll locations (i.e. FC_GC_POLL calls) or when it calls into other methods. On the other hand fully interruptible code can be interrupted or suspended at any time, as every line within the method is considered a GC safe-point.

I’m not going to talk about how the thread-hijacking mechanism works (used with fully interruptible code), as it’s a complex topic, but as always there’s an in-depth section in the BOTR that gives all the gory details. If you don’t want to read the whole thing, in summary it suspends the underlying native thread, via the Win32 SuspendThread API.

You can see some of the heuristics that the JIT uses to decide whether code is fully or partially interruptible as it seeks to find the best trade-off between code quality/size and GC suspension latency. But as a concrete example, if we take the following code that accumulates a counter in a tight loop:

public static long TestMethod()
{
    long counter = 0;
    for (int i = 0; i < 1000 * 1000; i++)
    {
        for (int j = 0; j < 2000; j++)
        {
            if (i % 10 == 0)
                counter++;
        }
    }
    Console.WriteLine("Loop exited, counter = {0:N0}", counter);
    return counter;
}

And then execute it with the JIT diagnostics turned on you get the following output, which shows that this code is classified as fully interruptible:

; Assembly listing for method ConsoleApplication.Program:TestMethod():long
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible

(full JIT diagnostic output of Fully Interruptible method)

Now, if we run the same test again, but tweak the code by adding a few Console.WriteLine(..) method calls:

public static long TestMethod()
{
    long counter = 0;
    for (int i = 0; i < 1000 * 1000; i++)
    {
        for (int j = 0; j < 2000; j++)
        {
            if (i % 10 == 0)
                counter++;
            Console.WriteLine("Inside Inner Loop, counter = {0:N0}", counter);
        }
        Console.WriteLine("After Inner Loop, counter = {0:N0}", counter);
    }
    Console.WriteLine("Thread loop exited cleanly, counter = {0:N0}", counter);
    return counter;
}

The method is then classified as Partially Interruptible, due to the additional Console.WriteLine(..) calls:

; Assembly listing for method ConsoleApplication.Program:TestMethod():long
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible

(full JIT diagnostic output of Partially Interruptible method)

Interestingly enough, there seems to be functionality that enables JIT_PollGC() calls to be inserted into user code as it is compiled by the .NET JIT; this is controlled by the GCPollType CLR Configuration flag. However by default it’s disabled and in my tests turning it on causes the CoreCLR to exit with some interesting errors. So it appears that currently, the default or supported behaviour is to use thread-hijacking on user code, rather than inserting explicit JIT_PollGC() calls.

How the dotnet CLI tooling runs your code

2016-07-04T00:00:00+00:00

Just over a week ago the official 1.0 release of .NET Core was announced, the release includes:

the .NET Core runtime, libraries and tools and the ASP.NET Core libraries.

However alongside a completely new, revamped, xplat version of the .NET runtime, the development experience has been changed, with the dotnet based tooling now available (Note: the tooling itself is currently still in preview and it’s expected to be RTM later this year)

So you can now write:

dotnet new
dotnet restore
dotnet run

and at the end you’ll get the following output:

Hello World!

It’s the dotnet CLI (Command Line Interface) tooling that is the focus of this post and more specifically how it actually runs your code, although if you want a tl;dr version see this tweet from @citizenmatt:

Traditional way of running .NET executables

As a brief reminder, .NET executables can’t be run directly (they’re just IL, not machine code), therefore the Windows OS has always needed to do a few tricks to execute them, from CLR via C#:

After Windows has examined the EXE file’s header to determine whether to create a 32-bit process, a 64-bit process, or a WoW64 process, Windows loads the x86, x64, or IA64 version of MSCorEE.dll into the process’s address space. … Then, the process’ primary thread calls a method defined inside MSCorEE.dll. This method initializes the CLR, loads the EXE assembly, and then calls its entry point method (Main). At this point, the managed application is up and running.

New way of running .NET executables

`dotnet run`

So how do things work now that we have the new CoreCLR and the CLI tooling? Firstly to understand what is going on under-the-hood, we need to set a few environment variables (COREHOST_TRACE and DOTNET_CLI_CAPTURE_TIMING) so that we get a more verbose output:

Here, amongst all the pretty ASCII-art, we can see that dotnet run actually executes the following cmd:

dotnet exec --additionalprobingpath C:\Users\matt\.nuget\packages c:\dotnet\bin\Debug\netcoreapp1.0\myapp.dll

Note: this is what happens when running a Console Application. The CLI tooling supports other scenarios, such as self-hosted web sites, which work differently.

`dotnet exec` and `corehost`

Up to this point everything was happening within managed code, however once dotnet exec is called we jump over to unmanaged code within the corehost application. In addition several other .dlls are loaded, the last of which is the CoreCLR runtime itself (click to go to the main source file for each module):

The main task that the corehost application performs is to calculate and locate all the dlls needed to run the application, along with their dependencies. The full output is available, but in summary it processes:

99 Managed dlls (“Adding runtime asset..”)
136 Native dlls (“Adding native asset..”)

There are so many individual files because the CoreCLR operates on a “pay-for-play” model, from Motivation Behind .NET Core:

By factoring the CoreFX libraries and allowing individual applications to pull in only those parts of CoreFX they require (a so-called “pay-for-play” model), server-based applications built with ASP.NET 5 can minimize their dependencies.

Finally, once all the housekeeping is done control is handed off to corehost, but not before the following properties are set to control the execution of the CoreCLR itself:

TRUSTED_PLATFORM_ASSEMBLIES =
- Paths to 235 .dlls (99 managed, 136 native), from C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702
APP_PATHS =
- c:\dotnet\bin\Debug\netcoreapp1.0
APP_NI_PATHS =
- c:\dotnet\bin\Debug\netcoreapp1.0
NATIVE_DLL_SEARCH_DIRECTORIES =
- C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702
- c:\dotnet\bin\Debug\netcoreapp1.0
PLATFORM_RESOURCE_ROOTS =
- c:\dotnet\bin\Debug\netcoreapp1.0
- C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702
AppDomainCompatSwitch =
- UseLatestBehaviorWhenTFMNotSpecified
APP_CONTEXT_BASE_DIRECTORY =
- c:\dotnet\bin\Debug\netcoreapp1.0
APP_CONTEXT_DEPS_FILES =
- c:\dotnet\bin\Debug\netcoreapp1.0\dotnet.deps.json
- C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702\Microsoft.NETCore.App.deps.json
FX_DEPS_FILE =
- C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702\Microsoft.NETCore.App.deps.json
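Properties that the host passes to the runtime can be read back from managed code via `AppContext.GetData`, so you can check what your own app was started with. A small sketch follows; `RuntimeProperties`, `SplitPathList` and `Dump` are names I made up, and it assumes that the list-style properties are separated by `Path.PathSeparator` and that the host in use actually sets the properties (otherwise `GetData` returns null).

```csharp
using System;
using System.IO;

public static class RuntimeProperties
{
    // Splits a path-list property such as TRUSTED_PLATFORM_ASSEMBLIES.
    public static string[] SplitPathList(string value, char separator)
    {
        return string.IsNullOrEmpty(value)
            ? new string[0]
            : value.Split(new[] { separator }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Dump()
    {
        // GetData returns null if the host did not set the property.
        var tpa = AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES") as string;
        var entries = SplitPathList(tpa, Path.PathSeparator);

        Console.WriteLine("TRUSTED_PLATFORM_ASSEMBLIES: {0} entries", entries.Length);
        Console.WriteLine("APP_CONTEXT_BASE_DIRECTORY: {0}",
            AppContext.GetData("APP_CONTEXT_BASE_DIRECTORY"));
    }
}
```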

Note: You can also run your app by invoking corehost.exe directly with the following command:

corehost.exe C:\dotnet\bin\Debug\netcoreapp1.0\myapp.dll

Executing a .NET Assembly

At last we get to the point at which the .NET dll/assembly is loaded and executed, via the code shown below, taken from unixinterface.cpp:

hr = host->SetStartupFlags(startupFlags);
IfFailRet(hr);

hr = host->Start();
IfFailRet(hr);

hr = host->CreateAppDomainWithManager(
    appDomainFriendlyNameW,
    // Flags:
    // APPDOMAIN_ENABLE_PLATFORM_SPECIFIC_APPS
    // - By default CoreCLR only allows platform neutral assembly to be run. To allow
    //   assemblies marked as platform specific, include this flag
    //
    // APPDOMAIN_ENABLE_PINVOKE_AND_CLASSIC_COMINTEROP
    // - Allows sandboxed applications to make P/Invoke calls and use COM interop
    //
    // APPDOMAIN_SECURITY_SANDBOXED
    // - Enables sandboxing. If not set, the app is considered full trust
    //
    // APPDOMAIN_IGNORE_UNHANDLED_EXCEPTION
    // - Prevents the application from being torn down if a managed exception is unhandled
    //
    APPDOMAIN_ENABLE_PLATFORM_SPECIFIC_APPS |
    APPDOMAIN_ENABLE_PINVOKE_AND_CLASSIC_COMINTEROP |
    APPDOMAIN_DISABLE_TRANSPARENCY_ENFORCEMENT,
    NULL, // Name of the assembly that contains the AppDomainManager implementation
    NULL, // The AppDomainManager implementation type name
    propertyCount,
    propertyKeysW,
    propertyValuesW,
    (DWORD *)domainId);

This is making use of the ICLRRuntimeHost Interface, which is part of the COM based hosting API for the CLR. Despite the file name, it is actually from the Windows version of the CLI tooling. In the xplat world of the CoreCLR the hosting API that was originally written for Unix has been replicated across all the platforms so that a common interface is available for any tools that want to use it, see the following GitHub issues for more information:

And that’s it, your .NET code is now running, simple really!!

Additional information:

The post How the dotnet CLI tooling runs your code first appeared on my blog Performance is a Feature!

Visualising the .NET Garbage Collector

2016-06-20T00:00:00+00:00

As part of an ongoing attempt to learn more about how a real-life Garbage Collector (GC) works (see part 1) and after being inspired by Julia Evans’ excellent post gzip + poetry = awesome I spent some time writing a tool to enable a live visualisation of the .NET GC in action.

The output from the tool is shown below, click to Play/Stop (direct link to gif). The full source is available if you want to take a look.

Capturing GC Events in .NET

Fortunately there is a straight-forward way to capture the raw GC related events, using the excellent TraceEvent library that provides a wrapper over the underlying ETW Events the .NET GC outputs.

It’s as simple as writing code like this:

session.Source.Clr.GCAllocationTick += allocationData =>
{
    if (ProcessIdsUsedInRuns.Contains(allocationData.ProcessID) == false)
        return;

    totalBytesAllocated += allocationData.AllocationAmount;

    Console.Write(".");
};

Here we are wiring up a callback each time a GCAllocationTick event is fired, other events that are available include GCStart, GCEnd, GCSuspendEEStart, GCRestartEEStart and many more.

As well as outputting a visualisation of the raw events, they are also aggregated so that a summary can be produced:

Memory Allocations:
        1,065,720 bytes currently allocated
    1,180,308,804 bytes have been allocated in total
GC Collections:
  16 in total (12 excluding B/G)
     2 - generation 0
     9 - generation 1
     1 - generation 2
     4 - generation 2 (B/G)
Time in GC: 1,300.1 ms (108.34 ms avg)
Time under test: 3,853 ms (33.74 % spent in GC)
Total GC Pause time: 665.9 ms
Largest GC Pause: 75.99 ms
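The derived figures in that summary are simple ratios. As a worked example, using the numbers from the output above (and noting that the average appears to be taken over the 12 non-background collections, since 1,300.1 / 12 ≈ 108.34), the calculation looks like this. `GcSummaryMath` is a hypothetical helper, not the tool's actual code.

```csharp
public static class GcSummaryMath
{
    // Average time per collection, in milliseconds.
    public static double AverageMs(double totalGcMs, int gcCount)
    {
        return totalGcMs / gcCount;
    }

    // Percentage of the elapsed test time that was spent in GC.
    public static double PercentInGc(double totalGcMs, double elapsedMs)
    {
        return 100.0 * totalGcMs / elapsedMs;
    }
}
```

So `AverageMs(1300.1, 12)` is roughly 108.34 ms and `PercentInGc(1300.1, 3853)` is roughly 33.74 %.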

GC Pauses

Most of the visualisation and summary information is relatively easy to calculate, however the timings for the GC pauses are not always straightforward. Since .NET 4.5 the Server GC has 2 main modes available: the new Background GC mode and the existing Foreground/Non-Concurrent one. The .NET Workstation GC has had a Background GC mode since .NET 4.0 and a Concurrent mode before that.

The main benefit of the Background mode is that it reduces GC pauses, or more specifically it reduces the time that the GC has to suspend all the user threads running inside the CLR. The problem with these “stop-the-world” pauses, as they are also known, is that during this time your application can’t continue with whatever it was doing and if the pauses last long enough users will notice.

As you can see in the image below (courtesy of the .NET Blog), with the newer Background mode in .NET 4.5 the time during which user threads are suspended is much smaller (the dark blue arrows). They only need to be suspended for part of the GC process, not the entire duration.

Foreground (Blocking) GC flow

So calculating the pauses for a Foreground GC (this means all Gen 0/1 GCs and full blocking GCs) is relatively straightforward, using the info from the excellent blog post by Maoni Stephens the main developer on the .NET GC:

1. GCSuspendEE_V1
2. GCSuspendEEEnd_V1 <– suspension is done
3. GCStart_V1
4. GCEnd_V1 <– actual GC is done
5. GCRestartEEBegin_V1
6. GCRestartEEEnd_V1 <– resumption is done.

So the pause is just the difference between the timestamp of the GCSuspendEEEnd_V1 event and that of the GCRestartEEEnd_V1.

Background GC flow

However for Background GC (Gen 2) it is more complicated, again from Maoni’s blog post:

1. GCSuspendEE_V1
2. GCSuspendEEEnd_V1
3. GCStart_V1 <– Background GC starts
4. GCRestartEEBegin_V1
5. GCRestartEEEnd_V1 <– done with the initial suspension
6. GCSuspendEE_V1
7. GCSuspendEEEnd_V1
8. GCRestartEEBegin_V1
9. GCRestartEEEnd_V1 <– done with Background GC’s own suspension
10. GCSuspendEE_V1
11. GCSuspendEEEnd_V1 <– suspension for Foreground GC is done
12. GCStart_V1
13. GCEnd_V1 <– Foreground GC is done
14. GCRestartEEBegin_V1
15. GCRestartEEEnd_V1 <– resumption for Foreground GC is done
16. GCEnd_V1 <– Background GC ends

It’s a bit easier to understand these steps by using an annotated version of the image from the MSDN page on GC (the numbers along the bottom correspond to the steps above)

But there are a few caveats that make it trickier to calculate the actual time:

Of course there could be more than one foreground GC, there could be 0+ between line 5) and 6), and more than one between line 9) and 16).

We may also decide to do an ephemeral GC before we start the BGC (as BGC is meant for gen2) so you might also see an ephemeral GC between line 3) and 4) – the only difference between it and a normal ephemeral GC is you wouldn’t see its own suspension and resumption events as we already suspended/resumed for BGC purpose.
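One way to cope with both flows, including any number of extra foreground GCs, is to ignore which GC a suspension belongs to and simply add up every window during which the EE was suspended. The sketch below does that using the same start and end events as the foreground calculation above (GCSuspendEEEnd_V1 to GCRestartEEEnd_V1). It is a simplified illustration, not the tool's actual code: `GcPauseCalculator` is a hypothetical name and the events are assumed to arrive as (name, timestamp) pairs in the order they were emitted.

```csharp
using System;
using System.Collections.Generic;

public static class GcPauseCalculator
{
    // events: (event name, timestamp in ms), in emission order.
    public static double TotalPauseMs(IEnumerable<(string Name, double TimestampMs)> events)
    {
        double total = 0;
        double? suspendedAt = null;

        foreach (var (name, timestamp) in events)
        {
            if (name == "GCSuspendEEEnd_V1")
            {
                // All threads are now suspended: a pause window opens.
                suspendedAt = timestamp;
            }
            else if (name == "GCRestartEEEnd_V1" && suspendedAt.HasValue)
            {
                // Threads are running again: close the window.
                total += timestamp - suspendedAt.Value;
                suspendedAt = null;
            }
        }
        return total;
    }
}
```

Because an ephemeral GC that runs inside the BGC's own suspension has no suspend/restart events of its own, it is automatically covered by the enclosing window.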

Age of Ascent - GC Pauses

Finally, if you want a more dramatic way of visualising a “Stop the World” or more accurately a “Stop the Universe” GC pause, take a look at the video below. The GC pause starts at around 7 seconds in (credit to Ben Adams and Age of Ascent)


The post Visualising the .NET Garbage Collector first appeared on my blog Performance is a Feature!

Strings and the CLR - a Special Relationship

2016-05-31T00:00:00+00:00

Strings and the Common Language Runtime (CLR) have a special relationship, but it’s a bit different (and way less political) than the UK <-> US special relationship that is often talked about.

This relationship means that Strings can do things that aren’t possible in the C# code that you and I can write and they also get a helping hand from the runtime to achieve maximum performance, which makes sense when you consider how ubiquitous they are in .NET applications.

String layout in memory

Firstly strings differ from any other data type in the CLR (other than arrays) in that their size isn’t fixed. Normally the .NET GC knows the size of an object when it’s being allocated, because it’s based on the size of the fields/properties within the object and they don’t change. However in .NET a string object doesn’t contain a pointer to the actual string data, with the data then stored elsewhere on the heap. Instead that raw data, the actual bytes that make up the text, is contained within the string object itself. That means that the memory representation of a string looks like this:

The benefit is that this gives excellent memory locality and ensures that when the CLR wants to access the raw string data it doesn’t have to do another pointer lookup. For more information, see the Stack Overflow questions “Where does .NET place the String value?” and Jon Skeet’s excellent post on strings.

Whereas if you were to implement your own string class, like so:

public class MyString
{
    int Length;
    byte [] Data;
}

It would look like this in memory:

In this case, the actual string data would be held in the byte [], located elsewhere in memory and would therefore require a pointer reference and lookup to locate it.

This is summarised nicely in the excellent BOTR, in the mscorlib section:

The managed mechanism for calling into native code must also support the special managed calling convention used by String’s constructors, where the constructor allocates the memory used by the object (instead of the typical convention where the constructor is called after the GC allocates memory).

Implemented in un-managed code

Despite the String class being a managed C# source file, large parts of it are implemented in un-managed code, that is in C++ or even Assembly. For instance there are 15 methods in String.cs that have no method body and are marked as extern with [MethodImplAttribute(MethodImplOptions.InternalCall)] applied to them. This indicates that their implementations are provided elsewhere by the runtime. Again from the mscorlib section of the BOTR (emphasis mine):

We have two techniques for calling into the CLR from managed code. FCall allows you to call directly into the CLR code, and provides a lot of flexibility in terms of manipulating objects, though it is easy to cause GC holes by not tracking object references correctly. QCall allows you to call into the CLR via the P/Invoke, and is much harder to accidentally mis-use than FCall. FCalls are identified in managed code as extern methods with the MethodImplOptions.InternalCall bit set. QCalls are static extern methods that look like regular P/Invokes, but to a library called “QCall”.

Types with a Managed/Unmanaged Duality

A consequence of Strings being implemented in unmanaged and managed code is that they have to be defined in both and those definitions must be kept in sync:

Certain managed types must have a representation available in both managed & native code. You could ask whether the canonical definition of a type is in managed code or native code within the CLR, but the answer doesn’t matter – the key thing is they must both be identical. This will allow the CLR’s native code to access fields within a managed object in a very fast, easy to use manner. There is a more complex way of using essentially the CLR’s equivalent of Reflection over MethodTables & FieldDescs to retrieve field values, but this probably doesn’t perform as well as you’d like, and it isn’t very usable. For commonly used types, it makes sense to declare a data structure in native code & attempt to keep the two in sync.

So in String.cs we can see:

//NOTE NOTE NOTE NOTE
//These fields map directly onto the fields in an EE StringObject.  
//See object.h for the layout.
[NonSerialized]private int  m_stringLength;
[NonSerialized]private char m_firstChar;

Which corresponds to the following in object.h

private:
    DWORD   m_StringLength;
    WCHAR   m_Characters[0];

Fast String Allocations

In a typical .NET program, one of the most common ways that you would allocate strings dynamically is either via StringBuilder or String.Format (which uses StringBuilder under the hood).

So you may have some code like this:

var builder = new StringBuilder();
...
builder.Append(valueX);
...
builder.Append("Some text");
...
var text = builder.ToString();

var text = string.Format("{0}, {1}", valueX, valueY);

Then, when the StringBuilder ToString() method is called, it internally calls the FastAllocateString method on the String class, which is declared like so:

[System.Security.SecurityCritical]  // auto-generated
[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal extern static String FastAllocateString(int length);

This method is marked as extern and has the [MethodImplAttribute(MethodImplOptions.InternalCall)] attribute applied and as we saw earlier this implies it will be implemented in un-managed code by the CLR. It turns out that eventually the call stack ends up in a hand-written assembly function, called AllocateStringFastMP_InlineGetThread from JitHelpers_InlineGetThread.asm

This also shows something else we talked about earlier. The assembly code is actually allocating the memory needed for the string, based on the required length that was passed in by the calling code.

LEAF_ENTRY AllocateStringFastMP_InlineGetThread, _TEXT
        ; We were passed the number of characters in ECX

        ; we need to load the method table for string from the global
        mov     r9, [g_pStringClass]

        ; Instead of doing elaborate overflow checks, we just limit the number of elements
        ; to (LARGE_OBJECT_SIZE - 256)/sizeof(WCHAR) or less.
        ; This will avoid avoid all overflow problems, as well as making sure
        ; big string objects are correctly allocated in the big object heap.

        cmp     ecx, (ASM_LARGE_OBJECT_SIZE - 256)/2
        jae     OversizedString

        mov     edx, [r9 + OFFSET__MethodTable__m_BaseSize]

        ; Calculate the final size to allocate.
        ; We need to calculate baseSize + cnt*2, 
        ; then round that up by adding 7 and anding ~7.

        lea     edx, [edx + ecx*2 + 7]
        and     edx, -8

        PATCHABLE_INLINE_GETTHREAD r11, AllocateStringFastMP_InlineGetThread__PatchTLSOffset
        mov     r10, [r11 + OFFSET__Thread__m_alloc_context__alloc_limit]
        mov     rax, [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr]

        add     rdx, rax

        cmp     rdx, r10
        ja      AllocFailed

        mov     [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr], rdx
        mov     [rax], r9

        mov     [rax + OFFSETOF__StringObject__m_StringLength], ecx

ifdef _DEBUG
        call    DEBUG_TrialAllocSetAppDomain_NoScratchArea
endif ; _DEBUG

        ret

    OversizedString:
    AllocFailed:
        jmp     FramedAllocateString
LEAF_END AllocateStringFastMP_InlineGetThread, _TEXT
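
To make the arithmetic in the assembly a little easier to follow, here is a rough C# sketch of the same two steps: rounding the requested size up to a multiple of 8, and then "bumping" a pointer inside the thread's allocation context. The class and method names are my own, purely for illustration, and the base size of 22 bytes used in the note afterwards is an assumption rather than a value read from the CLR.

```csharp
using System;

public static class StringAllocSketch
{
    // Mirrors "lea edx, [edx + ecx*2 + 7]" followed by "and edx, -8":
    // baseSize + charCount * 2, rounded up to the next multiple of 8.
    public static int AlignedSize(int baseSize, int charCount)
    {
        return (baseSize + charCount * 2 + 7) & ~7;
    }

    // Mirrors the alloc_ptr / alloc_limit comparison: if the object fits,
    // hand out the current pointer and advance it; otherwise report failure
    // (the assembly version jumps to FramedAllocateString at that point).
    public static bool TryBumpAllocate(ref long allocPtr, long allocLimit, int size, out long address)
    {
        long newPtr = allocPtr + size;
        if (newPtr > allocLimit)
        {
            address = 0;
            return false;
        }

        address = allocPtr;
        allocPtr = newPtr;
        return true;
    }
}
```

With an assumed base size of 22 bytes, AlignedSize(22, 5) gives 32, i.e. a 5 character string would round up to 32 bytes under that assumption.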

There is also a less optimised version called AllocateStringFastMP, from JitHelpers_Slow.asm. The reason for the different versions is explained in jinterfacegen.cpp, and the decision about which one to use is made at run-time, depending on the state of the Thread-local storage.

// These are the fastest(?) versions of JIT helpers as they have the code to 
// GetThread patched into them that does not make a call.
EXTERN_C Object* JIT_TrialAllocSFastMP_InlineGetThread(CORINFO_CLASS_HANDLE typeHnd_);
EXTERN_C Object* JIT_BoxFastMP_InlineGetThread (CORINFO_CLASS_HANDLE type, void* unboxedData);
EXTERN_C Object* AllocateStringFastMP_InlineGetThread (CLR_I4 cch);
EXTERN_C Object* JIT_NewArr1OBJ_MP_InlineGetThread (CORINFO_CLASS_HANDLE arrayTypeHnd_, INT_PTR size);
EXTERN_C Object* JIT_NewArr1VC_MP_InlineGetThread (CORINFO_CLASS_HANDLE arrayTypeHnd_, INT_PTR size);

// This next set is the fast version that invoke GetThread but is still faster 
// than the VM implementation (i.e. the "slow" versions).
EXTERN_C Object* JIT_TrialAllocSFastMP(CORINFO_CLASS_HANDLE typeHnd_);
EXTERN_C Object* JIT_TrialAllocSFastSP(CORINFO_CLASS_HANDLE typeHnd_);
EXTERN_C Object* JIT_BoxFastMP (CORINFO_CLASS_HANDLE type, void* unboxedData);
EXTERN_C Object* JIT_BoxFastUP (CORINFO_CLASS_HANDLE type, void* unboxedData);
EXTERN_C Object* AllocateStringFastMP (CLR_I4 cch);
EXTERN_C Object* AllocateStringFastUP (CLR_I4 cch);

Optimised String Length

The final example of the “special relationship” is shown by how the string Length property is optimised by the run-time. Finding the length of a string is a very common operation and, because .NET strings are immutable, it should also be very quick, because the value can be calculated once and then cached.

As we can see in the comment from String.cs, the CLR ensures that this is true by implementing it in such a way that the JIT can optimise for it:

// Gets the length of this string
//
/// This is a EE implemented function so that the JIT can recognise is specially
/// and eliminate checks on character fetches in a loop like:
///        for(int i = 0; i < str.Length; i++) str[i]
/// The actually code generated for this will be one instruction and will be inlined.
//
// Spec#: Add postcondition in a contract assembly.  Potential perf problem.
public extern int Length {
    [System.Security.SecuritySafeCritical]  // auto-generated
    [MethodImplAttribute(MethodImplOptions.InternalCall)]
    get;
}

This code is implemented in stringnative.cpp, which in turn calls GetStringLength:

FCIMPL1(INT32, COMString::Length, StringObject* str) {
    FCALL_CONTRACT;

    FC_GC_POLL_NOT_NEEDED();
    if (str == NULL)
        FCThrow(kNullReferenceException);

    FCUnique(0x11);
    return str->GetStringLength();
}
FCIMPLEND

Which is a simple method call that the JIT can inline:

DWORD   GetStringLength()   { LIMITED_METHOD_DAC_CONTRACT; return( m_StringLength );}
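
For reference, this is the kind of loop the String.cs comment is talking about. It's a minimal sketch with names I made up; whether the checks on each character fetch are actually removed depends on the JIT version in use, so treat it as an illustration of the pattern rather than a guarantee.

```csharp
using System;

public static class LengthLoopSketch
{
    public static int CountSpaces(string text)
    {
        int count = 0;

        // The loop bound is text.Length and the index is only ever used
        // as text[i], which is the shape described in the comment above.
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == ' ')
                count++;
        }

        return count;
    }
}
```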

Why have a special relationship?

In one word: performance. Strings are widely used in .NET programs and therefore need to be as optimised, space efficient and cache-friendly as possible. That’s why the CLR developers have gone to great lengths to make this happen, including implementing methods in assembly and ensuring that the JIT can optimise code as much as possible.

Interestingly enough, one of the .NET developers recently made a comment about this on a GitHub issue. In response to a query asking why more string functions weren’t implemented in managed code, they said:

We have looked into this in the past and moved everything that could be moved without significant perf loss. Moving more depends on having pretty good managed optimizations for all coreclr architectures. This makes sense to consider only once RyuJIT or better codegen is available for all architectures that coreclr runs on (x86, x64, arm, arm64).


The post Strings and the CLR - a Special Relationship first appeared on my blog Performance is a Feature!

Adventures in Benchmarking - Performance Golf

2016-05-16T00:00:00+00:00

Recently Nick Craver, one of the developers at Stack Overflow, has been tweeting snippets of code from their source. The other week the following code was posted:

A daily screenshot from the Stack Overflow codebase (checking strings for tokens without allocations). #StackCode pic.twitter.com/sDPqviHgD0
— Nick Craver (@Nick_Craver) April 20, 2016

This code is an optimised version of what you would normally write, specifically written to ensure that it doesn’t allocate memory. Previously Stack Overflow have encountered issues with large pauses caused by the .NET GC, so it appears that, where appropriate, they make a concerted effort to write code that doesn’t needlessly allocate.
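
To give a flavour of what “checking strings for tokens without allocations” can look like, here is a minimal sketch. To be clear, this is not the Stack Overflow code from the tweet, just one way I'd assume you could walk a delimited string without calling Split; the class name, method name and the ';' default delimiter are all hypothetical.

```csharp
using System;

public static class TokenCheckSketch
{
    // Returns true if 'token' appears as a whole, delimiter-separated entry
    // in 'value'. It only uses indexes into the original string, so no
    // substrings or arrays are created along the way.
    public static bool ContainsToken(string value, string token, char delimiter = ';')
    {
        if (string.IsNullOrEmpty(value) || string.IsNullOrEmpty(token))
            return false;

        int start = 0;
        while (start <= value.Length)
        {
            int end = value.IndexOf(delimiter, start);
            if (end < 0)
                end = value.Length;

            int length = end - start;
            if (length == token.Length &&
                string.CompareOrdinal(value, start, token, 0, length) == 0)
            {
                return true;
            }

            start = end + 1;
        }

        return false;
    }
}
```

The comparison here is ordinal and case-sensitive, which may or may not match what the original code does.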

I also have to give Nick credit for making me aware of the term “Performance Golf”, I’ve heard of Code Golf, but not the Performance version.

Aside: If you want to see the full discussion and the code for all the different entries, take a look at this gist. Also, for a really in-depth explanation of what the fastest version is actually doing, I really recommend checking out Kevin Montrose’s blog post “An Optimisation Exercise”. There are some very cool tricks in there, although by this point he is basically writing C/C++ code rather than anything you would recognise as C#!

Good Benchmarking Tools

In this post I’m not going to concentrate too much on this particular benchmark, but instead I’m going to use it as an example of what I believe a good benchmarking library should provide for you. Full disclaimer, I’m one of the authors of BenchmarkDotNet, so I admit I might be biased!

I think that a good benchmarking tool should offer the following features:

Benchmark Scaffolding
Diagnose what is going on
Consistent, Reliable and Clear Results

Benchmark Scaffolding

By using BenchmarkDotNet, or indeed any benchmarking tool, you can just get on with the business of actually writing the benchmark and not worry about any of the mechanics of accurately measuring the code. This is important because often when someone has posted an optimisation and accompanying benchmark on Stack Overflow, several of the comments then point out why their measurements are inaccurate or plain wrong.

In the case of BenchmarkDotNet, it’s as simple as adding a [Benchmark] attribute to the methods that you want to benchmark and then a few lines of code to launch the run:

[Benchmark(Baseline = true)]
public bool StringSplit()
{
    var tokens = Value.Split(delimeter);
    foreach (var token in tokens)
    {
        if (token == Match)
            return true;
    }
    return false;
}

static void Main(string[] args)
{
    var summary = BenchmarkRunner.Run<Program>();
}

It also offers a few more tools for advanced scenarios, for instance you can decorate a field/property with the [Params] attribute like so:

[Params("Foo;Bar", 
        "Foo;FooBar;Whatever", 
        "Bar;blaat;foo", 
        "blaat;foo;Bar", 
        "foo;Bar;Blaat", 
        "foo;FooBar;Blaat", 
        "Bar1;Bar2;Bar3;Bar4;Bar", 
        "Bar1;Bar2;Bar3;Bar4;NoMatch", 
        "Foo;FooBar;Whatever", 
        "Some;Other;Really;Interesting;Tokens")]     
public string Value { get; set; }

and then each benchmark will be run multiple times, with Value set to the different strings. This gives you a really easy way of trying out benchmarks across different inputs. For instance, some methods were consistently fast, whereas others performed badly on inputs that were a worst-case scenario for them.

Diagnose what is going on

If you state that the aim of optimising your code is to “check a string for tokens, without allocations”, you would really like to be able to prove whether that is true or not. I’ve previously written about how BenchmarkDotNet can give you this information and in this case we get the following results:

So you can see that the ContainTokenFransBouma benchmark isn’t allocation free, which in this scenario is a problem.

Consistent, Reliable and Clear Results

Another important aspect is that you should be able to rely on the results. Part of this is trusting the tool and hopefully people will come to trust BenchmarkDotNet over time.

Also you should be able to get clear results, so as well as providing a text-based result table that you can easily paste into a GitHub issue or Stack Overflow answer, BenchmarkDotNet will provide several graphs using the R statistics and graphing library. Sometimes a wall of text isn’t the easiest thing to interpret, but colourful graphs can help.

Here we can see that the original ContainsToken code is “slower” in some scenarios (although it’s worth pointing out that the Y-axis is in nanoseconds).

Summary

Would I recommend writing code like any of these optimisations for normal day-to-day scenarios? No.

Without exception the optimised versions of the code are less readable, harder to debug and probably contain more errors. Certainly, by the time you get to the fastest version you are no longer writing recognisable C# code, it’s basically C++/C masquerading as C#.

However, for the purposes of learning, a bit of fun or just because you like a spot of competition, then it’s fine. Just make sure you use a decent tool that lets you get on with the fun part of writing the most optimised code possible!


Coz: Finding Code that Counts with Causal Profiling - An Introduction

2016-03-30T00:00:00+00:00

A while ago I came across an interesting and very readable paper titled “COZ: Finding Code that Counts with Causal Profiling” that was presented at SOSP 2015 (and was the recipient of a Best Paper Award). This post is my attempt to provide an introduction to Causal Profiling for anyone who doesn’t want to go through the entire paper.

What is “Causal Profiling”

Here’s the explanation from the paper itself:

Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus “virtually” speeding it up.

Or if you prefer, below is an image from the paper explaining what it does:

The key part is that it tries to find the effect of speeding up a given block of code on the overall running time of the program. But being able to speed up arbitrary pieces of code is very hard and if the authors could do that, then they would be better off making lots of money selling code optimisation tools. So instead of speeding up a given piece of code, they artificially slow down all the other code that is running at the same time, which has exactly the same relative effect.

In the diagram above Coz is trying to determine the effect that optimising the code in block f would have on the overall runtime. Instead of making f run quicker, as shown in part (b), they instead make g run slower by inserting pauses, see part (c). Then Coz is able to infer that the speed-up seen in (c) will have the same relative effect if f was to run faster, therefore the “Actual Speedup” as shown in (b) is possible.

Unfortunately Coz doesn’t tell you how to speed up your code, that’s left up to you, but it does tell you which parts of the code you should focus on to get the best overall improvements. Or another way of saying it is, Coz tells you:

If you speed up a given block of code by this much, the program will run this much faster

Existing profilers

In the paper, the authors argue that existing profilers only tell you about:

Frequently executed code (# of calls)
Code that runs for a long time (% of total time)

What they don’t help you with is finding important code in parallel programs and this is the problem that Coz solves. The (contrived) example they give is:

void a() { // ~6.7 seconds
    for(volatile size_t x=0; x<2000000000; x++) {}
}

void b() { // ~6.4 seconds
    for(volatile size_t y=0; y<1900000000; y++) {}
}

int main() {
    // Spawn both threads and wait for them.
    thread a_thread(a), b_thread(b);
    a_thread.join(); b_thread.join();
}

which they state is a:

.. simple multi-threaded program that illustrates the shortcomings of existing profilers. Optimizing fa will improve performance by no more than 4.5%, while optimizing fb would have no effect on performance.

As shown in the comparison below, a regular profiler shows that fa and fb (the functions a() and b() in the listing above) both comprise similar fractions of the total runtime (55.20% and 45.19% respectively). However, a causal profiler predicts that optimising line 2 from fa will improve overall performance by 4-6%, whereas optimising fb will only improve it by < 2%.
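
The “no more than 4.5%” figure from the quote can be checked with some back-of-the-envelope arithmetic. The sketch below is my own simplification, not anything from the paper or the Coz tool: it assumes the two threads run fully in parallel, so the program finishes when the slower one does, and it uses the approximate timings from the comments in the listing.

```csharp
using System;

public static class ParallelSpeedupSketch
{
    // speedupA and speedupB are fractions between 0.0 (no change) and
    // 1.0 (the function takes no time at all).
    public static double RelativeImprovement(
        double secondsA, double secondsB, double speedupA, double speedupB)
    {
        // Both threads run concurrently, so the slower one sets the runtime.
        double before = Math.Max(secondsA, secondsB);
        double after = Math.Max(secondsA * (1.0 - speedupA), secondsB * (1.0 - speedupB));
        return (before - after) / before;
    }
}
```

Even if a() took no time at all, RelativeImprovement(6.7, 6.4, 1.0, 0.0) is (6.7 - 6.4) / 6.7, or roughly 4.5%, because b() still needs its 6.4 seconds. Making b() infinitely fast gives 0%, because a() is still the thread everyone is waiting for.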

Results

However their research was not only done on contrived programs, they also looked at several real-world projects including:

SQLite
PARSEC benchmark suite
- dedup - Next-generation compression with data deduplication
- ferret - Content similarity search server

Results taken from a presentation by Charlie Curtsinger (one of the authors of Coz) show that there are several situations where Coz identifies an area for optimisation that a conventional profiler would miss. For instance, they identified a function in SQLite that when optimised provided a 25% speed-up. However, very little time was actually spent in the function, only 0.15%, so it would not have stood out in the output from a conventional profiler.

Project	Speedup with Coz	% Runtime reported via a Profiler
SQLite	25%	0.15%
dedup	9%	14.38%
ferret	21%	0.00%

You can explore these results in the interactive viewer that has been developed alongside the tool. For instance, the image below shows the lines of code in the SQLite source base that Coz identifies as having the maximum impact, positive or negative:

Summary

It’s worth pointing out that Coz is currently a prototype causal profiler, that at the moment only runs on Linux, but doesn’t require you to modify your executable. However the ideas presented in the paper could be ported to other OSes, programming languages or runtimes. For instance work has already begun on a Go version that only required a few modifications to the runtime to get a prototype up and running.

It would be great to see something like this for .NET, any takers?

Further Information

If you want to find out any more information about Coz, here is a list of useful links:

The Coz paper “Finding Code that Counts with Causal Profiling”
Comprehensive (and more in-depth) write-up on the paper from “the morning paper” blog
Coz GitHub repository
- Getting started with Coz
- Coz profiling modes
Presentation by Charlie Curtsinger (one of the authors of Coz)
- Video
- Slides
Causal Profiling for Go is an attempt to implement Coz within the Go runtime


Adventures in Benchmarking - Method Inlining

2016-03-09T00:00:00+00:00

In a previous post I looked at how you can use BenchmarkDotNet to help diagnose why one benchmark is running slower than another. The post outlined how ETW Events are used to give you an accurate measurement of the # of Bytes allocated and the # of GC Collections per benchmark.

Inlining

In addition to memory allocation, BenchmarkDotNet can also give you information about which methods were inlined by the JITter. Inlining is the process by which code is copied from one function (the inlinee) directly into the body of another function (the inliner). The reason for this is to save the overhead of a method call and the associated work that needs to be done when control is passed from one method to another.

To see this in action we are going to run the following benchmark:

[Benchmark]
public int Calc()
{
    return WithoutStarg(0x11) + WithStarg(0x12);
}

private static int WithoutStarg(int value)
{
    return value;
}

private static int WithStarg(int value)
{
    if (value < 0)
        value = -value;
    return value;
}

BenchmarkDotNet also gives you the ability to run Benchmarks against different versions of the .NET JITter and on various CPU Platforms. So in this test we will ask it to run against the following configurations:

Legacy JIT - x86
Legacy JIT - x64

Once this is all set-up, we can run the benchmark and we get the following results:

The interesting thing to note is that Legacy JIT - x64 runs significantly faster than the x86 version, even though they are both running the same C# code (from the Calc() function above).

So now we are going to ask BenchmarkDotNet to give us the JIT inlining diagnostics. These diagnostics are available via ETW Events and are collected, parsed and displayed at the end of the output, as shown below:

Here we can see that when the x64 JITter runs, the WithStarg() function is successfully inlined into the Calc() function, whereas with the x86 version it is not. So the same code is being executed, but because the WithStarg() function is relatively simple, when it is not inlined the cost of the method call dominates and causes the Calc() function to take more time. For comparison, the WithoutStarg() function is always inlined, because it doesn’t do anything with the value that is passed into it.
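
If it helps to picture what “inlined into the Calc() function” means, here is a hand-written approximation. This is only a conceptual sketch (the class name and CalcInlinedByHand are made up, and the methods are static for brevity); the JIT works on IL and machine code rather than C# source, and it is free to simplify things much further than this.

```csharp
using System;

public static class InliningSketch
{
    public static int Calc()
    {
        return WithoutStarg(0x11) + WithStarg(0x12);
    }

    private static int WithoutStarg(int value)
    {
        return value;
    }

    private static int WithStarg(int value)
    {
        if (value < 0)
            value = -value;
        return value;
    }

    // Roughly what Calc() looks like once both calls have been replaced
    // by the bodies of the methods being called.
    public static int CalcInlinedByHand()
    {
        int first = 0x11;        // body of WithoutStarg(0x11)

        int second = 0x12;       // body of WithStarg(0x12)
        if (second < 0)
            second = -second;

        return first + second;
    }
}
```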

For a full explanation of why there is a difference in behaviour between the 2 versions of the JITter, I recommend reading Andrey Akinshin’s blog post on the subject. But in summary the x64 version is more efficient and it’s a bug/regression that the x86 version doesn’t have the same behaviour.

.NET JIT inlining rules

In this case the specific reason that the Legacy JIT - x86 gives for not inlining the WithStarg() method is:

Fail Reason: Inlinee writes to an argument.

For reference, there is a comprehensive list of JIT ETW Inlining Event Fail Reasons available on MSDN, although interestingly enough it doesn’t include this reason!

However, inlining isn’t always a win-win scenario. Because you are copying the same code to 2 locations, it can bloat the amount of memory that your program needs.

Update: A more recent list of justifications that the .NET JITter provides for not inlining a method is available, thanks to Andy Ayers from Microsoft for pointing it out to me.

So there are some rules that the .NET JITter follows when deciding whether or not to inline a method (Note this list is from 2004, so the rules may well have changed since then)

These are some of the reasons for which we won’t inline a method:

Method is marked as not inline with the CompilerServices.MethodImpl attribute.

Size of inlinee is limited to 32 bytes of IL: This is a heuristic, the rationale behind it is that usually, when you have methods bigger than that, the overhead of the call will not be as significative compared to the work the method does. Of course, as a heuristic, it fails in some situations. There have been suggestions for us adding an attribute to control these threshold. For Whidbey, that attribute has not been added (it has some very bad properties: it’s x86 JIT specific and it’s longterm value, as compilers get smarter, is dubious).

Virtual calls: We don’t inline across virtual calls. The reason for not doing this is that we don’t know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it’s not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we’re not forced to be so aggressive about optimizing this case.

Valuetypes: We have several limitations regarding value types an inlining. We take the blame here, this is a limitation of our JIT, we could do better and we know it. Unfortunately, when stack ranked against other features of Whidbey, getting some statistics on how frequently methods cannot be inlined due to this reason and considering the cost of making this area of the JIT significantly better, we decided that it made more sense for our customers to spend our time working in other optimizations or CLR features. Whidbey is better than previous versions in one case: value types that only have a pointer size int as a member, this was (relatively) not expensive to make better, and helped a lot in common value types such as pointer wrappers (IntPtr, etc).

MarshalByRef: Call targets that are in MarshalByRef classes won’t be inlined (call has to be intercepted and dispatched). We’ve got better in Whidbey for this scenario

VM restrictions: These are mostly security, the JIT must ask the VM for permission to inline a method (see CEEInfo::canInline in Rotor source to get an idea of what kind of things the VM checks for).

Complicated flowgraph: We don’t inline loops, methods with exception handling regions, etc…

If basic block that has the call is deemed as it won’t execute frequently (for example, a basic block that has a throw, or a static class constructor), inlining is much less aggressive (as the only real win we can make is code size)

Other: Exotic IL instructions, security checks that need a method frame, etc…
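
The first rule in the list, marking a method with the MethodImpl attribute, looks like this in practice. The class and method names are hypothetical. MethodImplOptions.NoInlining is the flag the rule refers to; AggressiveInlining (added in .NET 4.5) is the opposite hint, and as far as I understand it is a request rather than a guarantee.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class InliningHintsSketch
{
    // Tells the JIT it must not inline this method into its callers.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int NeverInlined(int value)
    {
        return value + 1;
    }

    // Asks the JIT to inline this method if it possibly can.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int PreferInlined(int value)
    {
        return value + 1;
    }
}
```

Both methods return the same result; the attributes only influence the code the JIT generates for the call sites.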

Summary

So we can see that BenchmarkDotNet will display multiple pieces of information that allow you to diagnose why your benchmarks take the time they do:

Amount of Bytes allocated per Benchmark
Number of GC Collections triggered (Gen 0/1/2)
Whether a method was inlined or not


Adventures in Benchmarking - Memory Allocations

2016-02-17T00:00:00+00:00

For a while now I’ve been involved in the Open Source BenchmarkDotNet library along with Andrey Akinshin the project owner. Our goal has been to produce a .NET Benchmarking library that is:

Accurate
Easy-to-use
Helpful

First and foremost we do everything we can to ensure that BenchmarkDotNet gives you accurate measurements, everything else is just “sprinkles on the sundae”. That is, without accurate measurements, a benchmarking library is pretty useless, especially one that displays results in nanoseconds.

But once point 1) has been dealt with, 2) is a bit more subjective. Using BenchmarkDotNet involves little more than adding a [Benchmark] attribute to your method and then running it as per the Step-by-step guide in the GitHub README. I’ll let you decide if that is easy-to-use or not, but again it’s something we strive for. Once you’re done with the “Getting Started” guide, there is also a complete set of Tutorial Benchmarks available, as well as some more real-world examples for you to take a look at.

Being “Helpful”

But this post isn’t going to be a general BenchmarkDotNet tutorial, instead I’m going to focus on some of the specific tools that it gives you to diagnose what is going on in a benchmark, or to put it another way, to help you answer the question “Why is Benchmark A slower than Benchmark B?”

String Concat vs StringBuilder

Let’s start with a simple benchmark:

public class Framework_StringConcatVsStringBuilder
{
  [Params(1, 2, 3, 4, 5, 10, 15, 20)]
  public int Loops;

  [Benchmark]
  public string StringConcat()
  {
    string result = string.Empty;
    for (int i = 0; i < Loops; ++i)
      result = string.Concat(result, i.ToString());
    return result;
  }

  [Benchmark]
  public string StringBuilder()
  {
    StringBuilder sb = new StringBuilder(string.Empty);
    for (int i = 0; i < Loops; ++i)
      sb.Append(i.ToString());
    return sb.ToString();
  }
}

Note: In case it’s not obvious the [Params(..)] attribute lets you run the same benchmark for a set of different input values. In this case the Loops field is set to each of the values in turn, i.e. 1, 2, 3, 4, 5, 10, 15, 20, before another instance of the benchmark is run.

If you’ve been programming in C# for long enough, you’ll no doubt have been given the guidance “use StringBuilder to concatenate strings”, but what is the actual difference?

Well, in terms of time taken there is a difference, but even with 20 loops it’s not huge. We are talking about roughly 500 ns, i.e. 0.0005 ms, so you would have to be doing it a lot to notice a slow-down.

However, this time let’s see what the results would look like if we have the BenchmarkDotNet “Garbage Collection” (GC) Diagnostics enabled:

Here we can clearly see a difference between the benchmarks. Once we get beyond 10 loops, the StringBuilder benchmark is far more efficient than StringConcat. It causes far fewer “Generation 0” collections and allocates roughly 50% fewer bytes for each Operation, i.e. each invocation of the benchmark method.

It’s worth noting that in this case, 10 loops is the break-even point. Before that point StringConcat is marginally faster and allocates less memory, but after that point StringBuilder is more efficient. The reason is that there is a memory overhead for the StringBuilder class itself, which dominates the cost when you are only appending a few short strings (as we are in this particular benchmark). Interestingly enough, the .NET Runtime developers noticed this overhead and so introduced a StringBuilder Cache, to enable re-use of existing instances, rather than allocating a new one every time.
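
If you want a quick, rough look at the allocation difference without setting up a full benchmark, something like the sketch below can help. It is not how BenchmarkDotNet measures allocations; it simply assumes a runtime recent enough to provide GC.GetAllocatedBytesForCurrentThread(), and the class and method names are my own. The exact numbers will vary by runtime, so I've deliberately not listed any.

```csharp
using System;
using System.Text;

public static class AllocationSketch
{
    // Returns the number of bytes allocated on the current thread while
    // 'action' runs. This is a coarse check, not a proper benchmark.
    public static long MeasureAllocatedBytes(Func<string> action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        string result = action();
        long after = GC.GetAllocatedBytesForCurrentThread();
        GC.KeepAlive(result);
        return after - before;
    }

    public static string Concat(int loops)
    {
        string result = string.Empty;
        for (int i = 0; i < loops; ++i)
            result = string.Concat(result, i.ToString());
        return result;
    }

    public static string Build(int loops)
    {
        var sb = new StringBuilder(string.Empty);
        for (int i = 0; i < loops; ++i)
            sb.Append(i.ToString());
        return sb.ToString();
    }
}
```

Calling MeasureAllocatedBytes(() => AllocationSketch.Concat(20)) and the same for Build(20) gives two numbers you can compare against each other.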

Dictionary vs IDictionary

But what about a less well-known example? Imagine after some re-factoring you noticed that your application was triggering a lot more Gen 0/1/2 collections (you do monitor this in your live systems, right?). After looking at the recent code commits and carrying out some profiling, you narrow the problem down to a refactoring that changed a variable declaration from Dictionary to IDictionary, i.e. exactly the type of refactoring that this Stack Overflow question is discussing.

To benchmark what’s actually going on here, we can write some code like so:

public class Framework_DictionaryVsIDictionary
{
  Dictionary<string, string> dict;
  IDictionary<string, string> idict;

  [Setup]
  public void Setup()
  {
    dict = new Dictionary<string, string>();
    idict = (IDictionary<string, string>)dict;
  }

  [Benchmark]
  public Dictionary<string, string> DictionaryEnumeration()
  {
    foreach (var item in dict)
    {
      ;
    }
    return dict;
  }

  [Benchmark]
  public IDictionary<string, string> IDictionaryEnumeration()
  {
    foreach (var item in idict)
    {
      ;
    }
    return idict;
  }
}

Note: we are deliberately not doing anything with the items inside the foreach loop because we just want to see what the difference in iteration of the 2 collections is. Also note that we are using the same underlying data structure, we are just accessing via an IDictionary cast in the 2nd benchmark.

So what results do we get:

Nice and clear: accessing the same data via the IDictionary interface causes a lot of extra allocations, roughly 22 bytes per foreach loop. This in turn triggers a lot of extra GC collections. It’s worth pointing out that when BenchmarkDotNet executes, it will run the same benchmark method, IDictionaryEnumeration() in this case, millions of times, so that we can obtain an accurate measurement. Therefore the actual # of Gen 0 collections isn’t so important; it is the relative amount compared to the DictionaryEnumeration() benchmark that should be looked at.

Now this scenario might seem a bit contrived and I have to admit that I knew the answer before I started investigating it. However, it did originate from a real-life issue, discovered by Ben Adams. For the full background take a look at the CoreCLR GitHub issue, Avoid enumeration allocation via interface, but as shown below this was identified because in Kestrel/ASP.NET the request/response headers are kept in an IDictionary data structure and so cause an additional 128 MBytes of garbage per second, when running at 1 Million requests per second.

Finally, what is the technical explanation for the additional allocations? Quoting from Stephen Toub of Microsoft:

… But when accessed via the interface, you’re using the interface method that’s typed to return IEnumerator<KeyValuePair<TKey,TValue>> rather than Dictionary<TKey, TValue>.Enumerator, so the struct gets boxed.

and then further down the same issue

Yes, the issue isn’t just enumerator allocations, it’s also interface-based dispatch. In addition to boxing the enumerator, the MoveNext and Current calls made per element go from being potentially-inlineable non-virtual calls to being interface calls.
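
The two quotes can be seen in code form below. This is a small sketch with hypothetical names: the first method just confirms that Dictionary's own enumerator is a struct, and the other two show the two foreach shapes, with comments describing what I understand the compiler to do in each case.

```csharp
using System;
using System.Collections.Generic;

public static class EnumeratorBoxingSketch
{
    // Dictionary<TKey, TValue>.Enumerator is a value type, which is why
    // enumerating the concrete type needs no heap allocation for it.
    public static bool DirectEnumeratorIsStruct()
    {
        return typeof(Dictionary<string, string>.Enumerator).IsValueType;
    }

    public static int CountDirect(Dictionary<string, string> dict)
    {
        int count = 0;
        // foreach binds to Dictionary's public GetEnumerator(), so the
        // struct enumerator is used directly.
        foreach (var item in dict)
            count++;
        return count;
    }

    public static int CountViaInterface(IDictionary<string, string> idict)
    {
        int count = 0;
        // foreach binds to the interface method, which is typed to return
        // IEnumerator<KeyValuePair<string, string>>, so the struct has to be
        // boxed and MoveNext/Current become interface calls.
        foreach (var item in idict)
            count++;
        return count;
    }
}
```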

Implementation Details

Update Feb 2017 - This section is now out-of-date as the implementation details have now changed, please see Adam Sitnik’s blog post for all the details

This is all made possible by the excellent Garbage Collection ETW Events that the .NET runtime produces. In particular the GCAllocationTick_V2 Event that is fired each time approximately 100 KB is allocated. An XML representation of a typical event is shown below; you can see that 0x1A060 or 106,592 bytes have just been allocated.

<UserData>
  <GCAllocationTick_V3 xmlns='myNs'>
    <AllocationAmount>0x1A060</AllocationAmount>
    <AllocationKind>0</AllocationKind>
    <ClrInstanceID>34</ClrInstanceID>
    <AllocationAmount64>0x1A060</AllocationAmount64>
    <TypeID>0xEE05D18</TypeID>
    <TypeName>LibGit2Sharp.Core.GitDiffFile</TypeName>
    <HeapIndex>0</HeapIndex>
    <Address>0x32056CD0</Address>
  </GCAllocationTick_V3>
</UserData> 

To collect these events BenchmarkDotNet uses the logman tool that is built into Windows. This runs in the background and collects the specified ETW events until you ask it to stop. These events are continuously written to an .etl file that can then be read by tools such as Windows Performance Analyzer. Once the ETW events have been collected, BenchmarkDotNet then parses them using the excellent TraceEvent library, using code like this:

using (var source = new ETWTraceEventSource(fileName))
{
  source.Clr.GCAllocationTick += (gcData =>
  {
    if (statsPerProcess.ContainsKey(gcData.ProcessID))
      statsPerProcess[gcData.ProcessID].AllocatedBytes += gcData.AllocationAmount64;
  });

  source.Clr.GCStart += (gcData =>
  {
    if (statsPerProcess.ContainsKey(gcData.ProcessID))
    {
      var genCounts = statsPerProcess[gcData.ProcessID].GenCounts;
      if (gcData.Depth >= 0 && gcData.Depth < genCounts.Length)
      {
        // ignore calls to GC.Collect(..) from BenchmarkDotNet itself
        if (gcData.Reason != GCReason.Induced)
          genCounts[gcData.Depth]++;
      }
    }
  });

  source.Process();
}

Hopefully this has shown you some of the power of BenchmarkDotNet. Please consider giving it a go next time you need to (micro-)benchmark some .NET code; with luck it will save you from having to hand-roll your own benchmarking code.


Technically Speaking - Anniversary Mentoring

2016-02-16T00:00:00+00:00

I’ve been reading the excellent Technically Speaking newsletter for a while now and when they announced they would be running a mentoring program, I jumped at the chance and applied straight away. The idea was that each applicant had to set themselves speaking goals or identify areas they wanted to improve and then if you were selected @techspeakdigest would set you up with a mentor.

I was fortunate enough to be chosen and assigned to Cate, one of the authors of the newsletter, who is also a prolific conference speaker. As part of the scheme I had to identify the areas that I wanted to improve during the hour-long mentoring session, which for me were:

Turning an outline into a good abstract.
Tips for getting a talk accepted via a CFP submission

I’ve previously done some talks and they seemed to be well received, but I wanted to expand the range of topics I talked about and try and speak at some other conferences.

Writing a Good Abstract

At the start of the session Cate looked through an existing submission and offered some advice, which started with the initial comment of:

Good idea, not well pitched

She then went on to offer some really great tips about what conferences were looking for and how I could develop my abstract. I’ve put the rest of my notes below and left them as I wrote them down, so they are a bit jumbled, but they reflect what happened during the conversation!

Tips for an abstract (after reading mine):

Be pragmatic, too much “one true way” can put people off. Maybe a bit too opinionated.
Don’t tie your talk to just one library, might alienate people too much.

Talk outline/structure

Explain - what does it mean to write faster code
Situate - optimisation - what is it? how do you do it? benchmark, etc
Apply - specific examples

Other suggestions

If listeners (or conference organisation committee) agree with your assumptions, they might be more likely to choose your pitch

Be careful about being too specific in the abstract
Don’t put too much in the abstract, leave some specifics out

Be compelling, but a little bit vague

1 or 2 examples of what not to do is okay, but must give them something to do afterwards, otherwise you could put them off.
Broad v. Narrow talks
- Most conferences will want “broader talks”
Bio is pitch for you
- Abstract is pitch for your talk

Finally, as well as offering general advice, Cate also took the time to help me re-write an existing abstract I’d put together. I’ve included the “before” and “after” below, so you can see the difference. Whilst it’s hard to see someone pick apart what you’ve written, I do agree that the “after” reads much better and sounds more compelling than the “before”!

Before

Microbenchmarks and Optimisations

We all want to write faster code right, but how do we know it really is faster, how do we measure it correctly?

During this talk we will look at what mistakes to avoid when benchmarking .NET code and how to do it accurately. Along the way we will also discover some surprising code optimisations and explore why they are happening

After

Where the Wild Things Are - Finding Performance Problems Before They Bite You

You don’t want to prematurely optimize, but sometimes you want to optimize, the question is - where to start? Benchmarking can help you figure out what your application is doing and where performance problems could arise - allowing you to find (and fix!) them before your customers do.

If you aren’t already benchmarking your code this talk will offer some starting points. We’ll look at how to accurately benchmark in .NET and things to avoid. Along the way we’ll also discover some surprising code optimisations!

The End Result

After the mentoring with Cate took place I was accepted to talk at ProgSCon London 2016, so obviously the tips and re-write of my abstract made a big difference!!

So thanks to Chiu-Ki Chan and Cate for producing Technically Speaking every week, it’s certainly helped me out!

Learning How Garbage Collectors Work - Part 1

2016-02-04T00:00:00+00:00

This series is an attempt to learn more about how a real-life “Garbage Collector” (GC) works internally, i.e. not so much “what it does”, but “how it does it” at a low-level. I will mostly be concentrating on the .NET GC, because I’m a .NET developer and also because it’s recently been Open Sourced so we can actually look at the code.

Note: If you do want to learn about what a GC does, I really recommend the talk Everything you need to know about .NET memory by Ben Emmett, it’s a fantastic talk that uses Lego to explain what the .NET GC does (the slides are also available).

Well, trying to understand what the .NET GC does by looking at the source was my original plan, but if you go and take a look at the code on GitHub you will be presented with the message “This file has been truncated,…”:

This is because the file is 36,915 lines long and 1.19MB in size! Now before you send a PR to Microsoft that chops it up into smaller bits, you might want to read this discussion on reorganizing gc.cpp. It turns out you are not the only one who’s had that idea and your PR will probably be rejected, for some specific reasons.

Goals of the GC

So, as I’m not going to be able to read and understand a 36 KLOC .cpp source file any time soon, instead I tried a different approach and started off by looking through the excellent Book-of-the-Runtime (BOTR) section on the “Design of the Collector”. This very helpfully lists the following goals of the .NET GC (emphasis mine):

The GC strives to manage memory extremely efficiently and require very little effort from people who write managed code. Efficient means:

GCs should occur often enough to avoid the managed heap containing a significant amount (by ratio or absolute count) of unused but allocated objects (garbage), and therefore use memory unnecessarily.

GCs should happen as infrequently as possible to avoid using otherwise useful CPU time, even though frequent GCs would result in lower memory usage.

A GC should be productive. If GC reclaims a small amount of memory, then the GC (including the associated CPU cycles) was wasted.

Each GC should be fast. Many workloads have low latency requirements.

Managed code developers shouldn’t need to know much about the GC to achieve good memory utilization (relative to their workload). – The GC should tune itself to satisfy different memory usage patterns.

So there are some interesting points in there, in particular they twice included the goal of ensuring developers don’t have to know much about the GC to make it efficient. This is probably one of the main differences between the .NET and Java GC implementations, as explained in an answer to the Stack Overflow question “.Net vs Java Garbage Collector”:

A difference between Oracle’s and Microsoft’s GC implementation ‘ethos’ is one of configurability.

Oracle provides a vast number of options (at the command line) to tweak aspects of the GC or switch it between different modes. Many options are of the -X or -XX to indicate their lack of support across different versions or vendors. The CLR by contrast provides next to no configurability; your only real option is the use of the server or client collectors which optimise for throughput versus latency respectively.

.NET GC Sample

So now we have an idea about what the goals of the GC are, let’s take a look at how it goes about things. Fortunately those nice developers at Microsoft released a GC Sample that shows you, at a basic level, how you can use the full .NET GC in your own code. After building the sample (and finding a few bugs in the process), I was able to get a simple, single-threaded Workstation GC up and running.

What’s interesting about the sample application is that it clearly shows you what actions the .NET Runtime has to perform to make the GC work. So for instance, at a high-level the runtime needs to go through the following process to allocate an object:

AllocateObject(..)
- See below for the code and explanation of the allocation process
CreateGlobalHandle(..)
- If we want to store the object in a “strong handle/reference”, as opposed to a “weak” one. In C# code this would typically be a static variable. This is what tells the GC that the object is referenced, so that it knows the object shouldn’t be cleaned up when a GC collection happens.
ErectWriteBarrier(..)
- For more information see “Marking the Card Table” below

Allocating an Object

AllocateObject(..) code from GCSample.cpp

Object * AllocateObject(MethodTable * pMT)
{
    alloc_context * acontext = GetThread()->GetAllocContext();
    Object * pObject;

    size_t size = pMT->GetBaseSize();

    uint8_t* result = acontext->alloc_ptr;
    uint8_t* advance = result + size;
    if (advance <= acontext->alloc_limit)
    {
        acontext->alloc_ptr = advance;
        pObject = (Object *)result;
    }
    else
    {
        pObject = GCHeap::GetGCHeap()->Alloc(acontext, size, 0);
        if (pObject == NULL)
            return NULL;
    }

    pObject->RawSetMethodTable(pMT);

    return pObject;
}

To understand what’s going on here, the BOTR again comes in handy as it gives us a clear overview of the process, from “Design of Allocator”:

When the GC gives out memory to the allocator, it does so in terms of allocation contexts. The size of an allocation context is defined by the allocation quantum.

Allocation contexts are smaller regions of a given heap segment that are each dedicated for use by a given thread. On a single-processor (meaning 1 logical processor) machine, a single context is used, which is the generation 0 allocation context.

The Allocation quantum is the size of memory that the allocator allocates each time it needs more memory, in order to perform object allocations within an allocation context. The allocation is typically 8k and the average size of managed objects are around 35 bytes, enabling a single allocation quantum to be used for many object allocations.

This shows how it is possible for the .NET GC to make allocating an object (or memory) such a cheap operation. Because of all the work that it has done in the background, the majority of the time an object allocation happens, it is just a case of incrementing a pointer by the number of bytes needed to hold the new object. This is what the code in the first half of the AllocateObject(..) function (above) is doing: it’s bumping up the free-space pointer (acontext->alloc_ptr) and giving out a pointer to the newly created space in memory.

It’s only when the current allocation context doesn’t have enough space that things get more complicated and potentially more expensive. At this point GCHeap::GetGCHeap()->Alloc(..) is called which may in turn trigger a GC collection before a new allocation context can be provided.
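To make the fast path and slow path concrete, here is a minimal, self-contained sketch of a bump-pointer allocator. It is not the real GC code: ToyHeap, AllocContext, Refill and slow_path_calls are hypothetical names made up for illustration, the 8K quantum is simply the “typical” figure quoted from the BOTR, and the slow path just hands out a fresh zeroed block rather than possibly triggering a collection.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 8K allocation quantum, the "typical" value quoted from the BOTR.
constexpr size_t kQuantum = 8 * 1024;

// Hypothetical, heavily simplified stand-in for the GC's alloc_context.
struct AllocContext {
    uint8_t* alloc_ptr = nullptr;   // next free byte in the current quantum
    uint8_t* alloc_limit = nullptr; // one past the end of the current quantum
};

struct ToyHeap {
    std::vector<std::vector<uint8_t>> quanta; // backing memory, zero-initialised
    size_t slow_path_calls = 0;               // how often the slow path was taken

    // Slow path: give the context a fresh, zeroed quantum.
    // (The real GC may decide to run a collection at this point.)
    void Refill(AllocContext& ctx) {
        quanta.push_back(std::vector<uint8_t>(kQuantum));
        ctx.alloc_ptr = quanta.back().data();
        ctx.alloc_limit = ctx.alloc_ptr + kQuantum;
        ++slow_path_calls;
    }

    void* Allocate(AllocContext& ctx, size_t size) {
        if (size > kQuantum)
            return nullptr; // large objects are out of scope for this sketch

        // Not enough room left in the current quantum (or no quantum yet)?
        if (size > static_cast<size_t>(ctx.alloc_limit - ctx.alloc_ptr))
            Refill(ctx);

        // Fast path: hand out the current pointer and bump it forward.
        void* result = ctx.alloc_ptr;
        ctx.alloc_ptr += size;
        return result;
    }
};
```

With 35 byte objects (the average size quoted above) a single 8K quantum serves 234 allocations before the slow path is needed again, which is why the common case is so cheap.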

Finally, it’s worth looking at the goals that the allocator was designed to achieve, again from the BOTR:

Triggering a GC when appropriate: The allocator triggers a GC when the allocation budget (a threshold set by the collector) is exceeded or when the allocator can no longer allocate on a given segment. The allocation budget and managed segments are discussed in more detail later.

Preserving object locality: Objects allocated together on the same heap segment will be stored at virtual addresses close to each other.

Efficient cache usage: The allocator allocates memory in allocation quantum units, not on an object-by-object basis. It zeroes out that much memory to warm up the CPU cache because there will be objects immediately allocated in that memory. The allocation quantum is usually 8k.

Efficient locking: The thread affinity of allocation contexts and quantums guarantee that there is only ever a single thread writing to a given allocation quantum. As a result, there is no need to lock for object allocations, as long as the current allocation context is not exhausted.

Memory integrity: The GC always zeroes out the memory for newly allocated objects to prevent object references pointing at random memory.

Keeping the heap crawlable: The allocator makes sure to make a free object out of left over memory in each allocation quantum. For example, if there is 30 bytes left in an allocation quantum and the next object is 40 bytes, the allocator will make the 30 bytes a free object and get a new allocation quantum.

One of the interesting items this highlights is an advantage of GC systems, namely that you get efficient CPU cache usage or good object locality because memory is allocated in units. This means that objects created one after the other (on the same thread), will sit next to each other in memory.

Marking the “Card Table”

The 3rd part of the process of allocating an object was a call to ErectWriteBarrier(..), which looks like this:

inline void ErectWriteBarrier(Object ** dst, Object * ref)
{
    // if the dst is outside of the heap (unboxed value classes) then we simply exit
    if (((uint8_t*)dst < g_lowest_address) || ((uint8_t*)dst >= g_highest_address))
        return;
        
    if ((uint8_t*)ref >= g_ephemeral_low && (uint8_t*)ref < g_ephemeral_high)
    {
        // volatile is used here to prevent fetch of g_card_table from being reordered 
        // with g_lowest/highest_address check above. 
        uint8_t* pCardByte = (uint8_t *)*(volatile uint8_t **)(&g_card_table) + 
                             card_byte((uint8_t *)dst);
        if(*pCardByte != 0xFF)
            *pCardByte = 0xFF;
    }
}

Now explaining what is going on here is probably an entire post on its own and fortunately other people have already done the work for me, if you are interested in finding out more take a look at the links at the end of this post.

But in summary, the card-table is an optimisation that allows the GC to collect a single Generation (e.g. Gen 0), but still know about objects that are referenced from other, older generations. For instance if you had an array, myArray = new MyClass[100] that was in Gen 1 and you wrote the following code myArray[5] = new MyClass(), the write barrier would mark the card-table to indicate that the new MyClass object was referenced by a given section of Gen 1 memory.

Then, when the GC wants to perform the mark phase for a Gen 0, in order to find all the live-objects it uses the card-table to tell it in which memory section(s) of other Generations it needs to look. This way it can find references from those older objects to the ones stored in Gen 0. This is a space/time tradeoff: the card-table represents 4KB sections of memory, so the GC still has to scan through that 4KB chunk, but it’s better than having to scan the entire contents of the Gen 1 memory when it wants to carry out a Gen 0 collection.

If it didn’t do this extra check (via the card-table), then any Gen 0 objects that were only referenced by older objects (i.e. those in Gen 1/2) would not be considered “live” and would then be collected. See the image below for what this looks like in practice:

Image taken from Back To Basics: Generational Garbage Collection
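The marking and look-up steps described above can be sketched as a toy card table. This is only an illustration under stated assumptions: ToyCardTable, MarkCard and DirtyCards are hypothetical names, each card byte is assumed to cover the 4KB figure used in this post, and the check that the stored reference points into the ephemeral range (which the real ErectWriteBarrier(..) performs) is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: one card byte covers a 4KB region, the figure used in the text.
constexpr unsigned kCardShift = 12; // 2^12 = 4096 bytes per card

struct ToyCardTable {
    uintptr_t heap_low;         // lowest address of the toy heap
    uintptr_t heap_high;        // one past the highest address of the toy heap
    std::vector<uint8_t> cards; // one byte per 4KB region, 0 means "clean"

    ToyCardTable(uintptr_t low, size_t heap_bytes)
        : heap_low(low),
          heap_high(low + heap_bytes),
          cards((heap_bytes + (size_t{1} << kCardShift) - 1) >> kCardShift) {}

    // Write-barrier step: a reference was just stored at address 'dst'.
    void MarkCard(uintptr_t dst) {
        if (dst < heap_low || dst >= heap_high)
            return; // destination is outside the heap, nothing to record
        uint8_t& card = cards[(dst - heap_low) >> kCardShift];
        if (card != 0xFF)
            card = 0xFF; // same "only write if needed" pattern as the sample
    }

    // Gen 0 collection step: which 4KB regions of older memory need scanning?
    std::vector<size_t> DirtyCards() const {
        std::vector<size_t> dirty;
        for (size_t i = 0; i < cards.size(); ++i)
            if (cards[i] != 0)
                dirty.push_back(i);
        return dirty;
    }
};
```

Two writes that land in the same 4KB region dirty only one card, so the GC scans that region once, however many references were stored into it.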

GC and Execution Engine Interaction

The final part of the GC sample that I will be looking at is the way in which the GC needs to interact with the .NET Runtime Execution Engine (EE). The EE is responsible for actually running or coordinating all the low-level things that the .NET runtime needs to do, such as creating threads and reserving memory, and so it acts as an interface to the OS, via Windows and Unix implementations.

To understand this interaction between the GC and the EE, it’s helpful to look at all the functions the GC expects the EE to make available:

void SuspendEE(GCToEEInterface::SUSPEND_REASON reason)
void RestartEE(bool bFinishedGC)
void GcScanRoots(promote_func* fn, int condemned, int max_gen, ScanContext* sc)
void GcStartWork(int condemned, int max_gen)
void AfterGcScanRoots(int condemned, int max_gen, ScanContext* sc)
void GcBeforeBGCSweepWork()
void GcDone(int condemned)
bool RefCountedHandleCallbacks(Object * pObject)
bool IsPreemptiveGCDisabled(Thread * pThread)
void EnablePreemptiveGC(Thread * pThread)
void DisablePreemptiveGC(Thread * pThread)
void SetGCSpecial(Thread * pThread)
alloc_context * GetAllocContext(Thread * pThread)
bool CatchAtSafePoint(Thread * pThread)
void AttachCurrentThread()
void GcEnumAllocContexts (enum_alloc_context_func* fn, void* param)
void SyncBlockCacheWeakPtrScan(HANDLESCANPROC, uintptr_t, uintptr_t)
void SyncBlockCacheDemote(int /*max_gen*/)
void SyncBlockCachePromotionsGranted(int /*max_gen*/)

If you want to see how the .NET Runtime performs these “tasks”, you can take a look at the real implementation. However in the GC Sample these methods are mostly stubbed out as no-ops. So that I could get an idea of the flow of the GC during a collection, I added simple print(..) statements to each one, then when I ran the GC Sample I got the following output:

SuspendEE(SUSPEND_REASON = 1)
GcEnumAllocContexts(..)
GcStartWork(condemned = 0, max_gen = 2)
GcScanRoots(condemned = 0, max_gen = 2)
AfterGcScanRoots(condemned = 0, max_gen = 2)
GcScanRoots(condemned = 0, max_gen = 2)
GcDone(condemned = 0)
RestartEE(bFinishedGC = TRUE)

Which fortunately corresponds nicely with the GC phases for WKS GC with concurrent GC off as outlined in the BOTR:

User thread runs out of allocation budget and triggers a GC.

GC calls SuspendEE to suspend managed threads.

GC decides which generation to condemn.

Mark phase runs.

Plan phase runs and decides if a compacting GC should be done.

If so relocate and compact phase runs. Otherwise, sweep phase runs.

GC calls RestartEE to resume managed threads.

User thread resumes running.

Further Information

If you want to find out any more information about Garbage Collectors, here is a list of useful links:

GC Sample Code Layout (for reference)

GC Sample Code (under \sample)

GCSample.cpp
gcenv.h
gcenv.ee.cpp
gcenv.windows.cpp
gcenv.unix.cpp

GC Sample Environment (under \env)

common.cpp
common.h
etmdummy.h
gcenv.base.h
gcenv.ee.h
gcenv.interlocked.h
gcenv.interlocked.inl
gcenv.object.h
gcenv.os.h
gcenv.structs.h
gcenv.sync.h

GC Code (top-level folder)

gc.cpp (36,911 lines long!!)
gc.h
gccommon.cpp
gcdesc.h
gcee.cpp
gceewks.cpp
gcimpl.h
gcrecord.h
gcscan.cpp
gcscan.h
gcsvr.cpp
gcwks.cpp
handletable.h
handletable.inl
handletablecache.cpp
handletablecore.cpp
handletablepriv.h
handletablescan.cpp
objecthandle.cpp
objecthandle.h

The post Learning How Garbage Collectors Work - Part 1 first appeared on my blog Performance is a Feature!

Open Source .NET – 1 year later - Now with ASP.NET

2016-01-15T00:00:00+00:00

In the previous post I looked at the community involvement in the year since Microsoft open-sourced large parts of the .NET framework.

As a follow-up I’m going to repeat that analysis, but this time focussing on the repositories that sit under the ASP.NET umbrella project:

MVC - Model view controller framework for building dynamic web sites with clean separation of concerns, including the merged MVC, Web API, and Web Pages w/ Razor.
DNX - The DNX (a .NET Execution Environment) contains the code required to bootstrap and run an application, including the compilation system, SDK tools, and the native CLR hosts.
EntityFramework - Microsoft’s recommended data access technology for new applications in .NET.
KestrelHttpServer - A web server for ASP.NET 5 based on libuv.

Methodology

In the first part I classified the Issues/PRs as Owner, Collaborator or Community. However this turned out to have some problems, as was pointed out to me in the comments. There are several people who are not Microsoft employees, but have been made “Collaborators” due to their extensive contributions to a particular repository, for instance @kangaroo and @benpye.

To address this, I decided to change to just the following 2 categories:

Microsoft
Community

This is possible because (almost) all Microsoft employees have indicated where they work on their GitHub profile, for instance:

There are some notable exceptions, e.g. @shanselman clearly works at Microsoft, but it’s easy enough to allow for cases like this.

Results

So after all this analysis, what results did I get? Well overall, the Community involvement accounts for just over 60% of the “Issues Created” and 33% of the “Merged Pull Requests (PRs)”. However the number of PRs is skewed by Entity Framework, which has a much higher involvement from Microsoft employees; if this is ignored the Community proportion of PRs increases to 44%.

Issues Created (Nov 2013 - Dec 2015)

Project	Microsoft	Community	Total
aspnet/MVC	716	1380	2096
aspnet/dnx	897	1206	2103
aspnet/EntityFramework	1066	1427	2493
aspnet/KestrelHttpServer	89	176	265
Total	2768	4189	6957

Merged Pull Requests (Nov 2013 - Dec 2015)

Project	Microsoft	Community	Total
aspnet/MVC	385	228	613
aspnet/dnx	406	368	774
aspnet/EntityFramework	937	225	1162
aspnet/KestrelHttpServer	69	88	157
Total	1797	909	2706
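If you want to check the headline percentages against the two tables above, the arithmetic is simple enough to sketch. Percent, Counts and Without are hypothetical helper names made up for this example; the numbers fed into them come straight from the tables.

```cpp
// Share of 'part' within 'total', expressed as a percentage.
double Percent(long part, long total) {
    return 100.0 * static_cast<double>(part) / static_cast<double>(total);
}

// Community and overall counts for a project (or for a whole table).
struct Counts {
    long community;
    long total;
};

// Totals with one project removed, e.g. to see the effect of ignoring
// Entity Framework on the Merged PR numbers.
Counts Without(Counts all, Counts excluded) {
    return {all.community - excluded.community, all.total - excluded.total};
}
```

For example, Percent(4189, 6957) gives the Community share of Issues, Percent(909, 2706) the share of Merged PRs, and Without({909, 2706}, {225, 1162}) leaves 684 Community PRs out of 1544 once Entity Framework is ignored.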

Note: I included the Kestrel Http Server because it is an interesting case. Currently the #1 contributor is not a Microsoft employee, it is Ben Adams, who is doing a great job of improving the memory usage and in the process helping Kestrel handle more and more requests per second.

By looking at the results over time, you can see that there is a clear and sustained Community involvement (the lighter section of the bars) over the past 2 years (Nov 2013 - Dec 2015) and it doesn’t look like it’s going to stop.

Issues Per Month - By Submitter (click for full-size image)

In addition, whilst the Community involvement is easier to see with the Issues per month, it is still visible in the Merged PRs and again it looks like it has been sustained over the 2 years.

Merged Pull Request Per Month - By Submitter (click for full-size image)

Total Number of People Contributing

It’s also interesting to look at the total number of different people who contributed to each project. By doing this you get a real sense of the size of the Community contribution: it’s not just a small number of people doing a lot of work, it’s spread across a large number of people.

This table shows the number of different GitHub users (per project) who opened an Issue or created a PR that was Merged:

Project	Microsoft	Community	Total
aspnet/MVC	39	395	434
aspnet/dnx	46	421	467
aspnet/EntityFramework	31	570	601
aspnet/KestrelHttpServer	22	95	117
Total	138	1481	1619

FSharp

In the comments of my first post, Isaac Abraham correctly pointed out:

parts of .NET have been open source for quite a bit more than a year – the F# compiler and FSharp.Core have been for quite a while now.

So, to address this, I will take a quick look at the main FSharp repositories:

As Isaac explained, their relationship is:

… visualfsharp is the Microsoft-owned repo Visual F#. The other is the community owned one. The former one feeds directly into tools like Visual F# tooling in Visual Studio etc.; the latter feeds into things like Xamarin etc. There’s a (slightly out of date) diagram that explains the relationship, and this is another useful resource http://fsharp.github.io/.

FSharp - Issues Created (Dec 2010 - Dec 2015)

Project	Microsoft	Community	Total
fsharp/fsharp	9	312	321
microsoft/visualfsharp	161	367	528
Total	170	679	849

FSharp - Merged Pull Requests (May 2011 - Dec 2015)

Project	Microsoft	Community	Total
fsharp/fsharp	27	134	161
microsoft/visualfsharp	36	33	69
Total	63	167	230

Conclusion

I think that it’s fair to say that the Community has responded to Microsoft making more and more of their code Open Source. There has been a significant number of Community contributions across several projects, over a decent amount of time. Whilst you could argue that it took Microsoft a long time to open source their code, it seems that .NET developers are happy they have done it, as shown by a sizeable Community response.

The post Open Source .NET – 1 year later - Now with ASP.NET first appeared on my blog Performance is a Feature!

Open Source .NET – 1 year later

2015-12-08T00:00:00+00:00

A little over a year ago Microsoft announced that they were open sourcing large parts of the .NET framework. At the time Scott Hanselman did a nice analysis of the source, using Microsoft Power BI. Inspired by this and now that a year has passed, I wanted to try and answer the question:

How much Community involvement has there been since Microsoft open sourced large parts of the .NET framework?

I will be looking at the 3 following projects, as they are all highly significant parts of the .NET ecosystem and are also some of the most active/starred/forked projects within the .NET Foundation:

Roslyn - The .NET Compiler Platform (“Roslyn”) provides open-source C# and Visual Basic compilers with rich code analysis APIs.
CoreCLR - the .NET Core runtime, called CoreCLR, and the base library, called mscorlib. It includes the garbage collector, JIT compiler, base .NET data types and many low-level classes.
CoreFX - the .NET Core foundational libraries, called CoreFX. It includes classes for collections, file systems, console, XML, async and many others.

Available Data

GitHub itself has some nice graphs built-in, for instance you can see the Commits per Month over an entire year:

Also you can get a nice dashboard showing the Monthly Pulse

However to answer the question above, I needed more data. Fortunately GitHub provides a really comprehensive API, which combined with the excellent Octokit.net library and the brilliant LINQPad, meant I was able to easily get all the data I needed. Here’s a sample LINQPad script if you want to start playing around with the API yourself.

However, knowing the “# of Issues” or “Merged Pull Requests” per month on its own isn’t that useful, as it doesn’t tell us anything about who created the issue or submitted the PR. Fortunately GitHub classifies users into categories, for instance in the image below from Roslyn Issue #670 we can see what type of user posted each comment, an “Owner”, “Collaborator” or blank which signifies a “Community” member, i.e. someone who (AFAICT) doesn’t work at Microsoft.

Results

So now that we can get the data we need, what results do we get?

Total Issues - By Submitter

Project	Owner	Collaborator	Community	Total
Roslyn	481	1867	1596	3944
CoreCLR	86	298	487	871
CoreFX	334	911	735	1980
Total	901	3076	2818	6795

Here you can see that the Owners and Collaborators do in some cases dominate, e.g. in Roslyn where almost 60% of the issues were opened by them. But in other cases the Community is very active, especially in CoreCLR where Community members are opening more issues than Owners/Collaborators combined. Part of the reason for this is the nature of the different repositories, CoreCLR is the most visible part of the .NET framework as it encompasses most of the libraries that .NET developers would use on a day-to-day basis, so it’s not surprising that the Community has lots of suggestions for improvements or bug fixes. In addition, the CoreCLR has been around for a much longer time and so the Community has had more time to use it and find out the parts it doesn’t like. Whereas Roslyn is a much newer project so there has been less time to use it, plus finding bugs in a compiler is by its nature harder to do.

Total Merged Pull Requests - By Submitter

Project	Owner	Collaborator	Community	Total
Roslyn	465	2093	118	2676
CoreCLR	378	567	201	1146
CoreFX	516	1409	464	2389
Total	1359	4069	783	6211

However if we look at Merged Pull Requests, we can see that the overall amount of Community contributions across the 3 projects is much lower, only accounting for roughly 12%. This however isn’t that surprising, as there’s a much higher bar for getting a pull request accepted. Firstly, if the project is using this mechanism, you have to pick an issue that is “up for grabs”, then you have to get any API changes through a review, then finally you have to meet any compatibility/performance/correctness issues that come up during the code review itself. So actually 12% is a pretty good result as there is a non-trivial amount of work involved in getting your PR merged, especially considering most Community members will be working in their spare time.

Update: I was wrong about the “up for grabs” requirement, see this comment from David Kean and this tweet for more information. “Up for grabs” is a guideline and meant to help new users, but it is not a requirement, you can submit PRs for issues that don’t have that label.

Finally if you look at the amount per/month (see the 2 graphs below, click for larger images), it’s hard to pick up any definite trends or say if the Community is definitely contributing more or less over time. But you can say that over a year the Community has consistently contributed and it doesn’t look like that contribution is going to end. It is not just an initial burst that only happened straight after the projects were open sourced, it is a sustained level of contributions over an entire year.

Issues Per Month - By Submitter

Merged Pull Request Per Month - By Submitter

Top 20 Issue Labels

The last thing that I want to do whilst I have the data is to take a look at the most popular Issue Labels and see what they tell us about the type of work that has been going on since the 3 projects were open sourced.

Here are a few observations about the results:

Having CodeGen so high on the list is not that surprising considering that RyuJIT - the next-gen .NET JIT Compiler was only released 2 years ago. However, it’s a bit worrying that there were so many issues, especially considering that some of them have severe consequences as the devs at Stack Overflow found out! (On a related note, if you want to find out lots of low-level details about what the JIT does, just take a look at all the issues that @MikeDN has commented on, unbelievably for someone with that much knowledge he doesn’t actually work on the product itself, or even another team at Microsoft!!)
It’s nice to see that all 3 projects have lots of “Up for Grabs” issues, see Roslyn, CoreCLR and CoreFX, plus the Community seems to be grabbing them back!
Finally, I love the fact that Performance and Optimisation are being taken seriously, after all Performance is a Feature!!

Discuss on /r/programming and Hacker News

The post Open Source .NET – 1 year later first appeared on my blog Performance is a Feature!

The Stack Overflow Tag Engine – Part 3

2015-10-29T00:00:00+00:00

This is part 3 of a mini-series looking at what it might take to build the Stack Overflow Tag Engine, if you haven’t read part 1 or part 2, I recommend reading them first.

Complex boolean queries

One of the most powerful features of the Stack Overflow Tag Engine is that it allows you to do complex boolean queries against multiple Tags, for instance:

A simple way of implementing this is to write code like below, which makes use of a HashSet to let us efficiently do lookups to see if a particular question should be included or excluded.

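var result = new List<Question>(pageSize);
// Ids of every question tagged with tag2, so membership checks are cheap
var andHashSet = new HashSet<int>(queryInfo[tag2]);
foreach (var id in queryInfo[tag1])
{
    if (result.Count >= pageSize)
        break;

    baseQueryCounter++;
    // Drop any question that carries one of the excluded ("ignored") tags
    if (questions[id].Tags.Any(t => tagsToExclude.Contains(t)))
    {
        excludedCounter++;
    }
    // Remove(..) returns true only if the id was in the tag2 set, i.e. the question has both tags
    else if (andHashSet.Remove(id))
    {
        if (itemsSkipped >= skip)
            result.Add(questions[id]);
        else
            itemsSkipped++;
    }
}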

The main problem is that we have to scan through all the ids for tag1 until we have enough matches, i.e. foreach (var id in queryInfo[tag1]). In addition we have to initially load up the HashSet with all the ids for tag2, so that we can check matches. So this method takes longer as we skip more and more questions, i.e. for larger values of skip or if there are a large number of tagsToExclude (i.e. “Ignored Tags”), see Part 2 for more information.

Bitmaps

So can we do any better? Well yes, there is a fairly established mechanism for doing these types of queries, known as Bitmap indexes. To use these you have to pre-calculate an index in which each bit is set to 1 to indicate a match and 0 otherwise. In our scenario this looks like this:

Then it is just a case of doing the relevant bitwise operations against the bits (a byte at a time), for example if you want to get the questions that have the C# AND Java Tags, you do the following:

for (int i = 0; i < numBits / 8; i++)
{
    // '&' on two bytes produces an int in C#, so cast back to byte
    result[i] = (byte)(bitSetCSharp[i] & bitSetJava[i]);
}
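
To make the idea a little more concrete, here is a minimal sketch of building an uncompressed Bitmap for a tag and running an AND NOT query against it. The BitmapSketch class and its method names are made up for illustration, it is not the code from the Tag Engine itself, and it assumes question ids are simply array offsets starting at 0.

```csharp
using System;
using System.Collections.Generic;

public static class BitmapSketch
{
    // Build an uncompressed bitmap: bit i is 1 when question i has the tag.
    public static byte[] Build(int numQuestions, IEnumerable<int> questionIds)
    {
        var bits = new byte[(numQuestions + 7) / 8];
        foreach (var id in questionIds)
            bits[id / 8] |= (byte)(1 << (id % 8));
        return bits;
    }

    // "left AND NOT right", processed one byte at a time.
    // Both bitmaps are assumed to be the same length.
    public static byte[] AndNot(byte[] left, byte[] right)
    {
        var result = new byte[left.Length];
        for (int i = 0; i < left.Length; i++)
            result[i] = (byte)(left[i] & ~right[i]);
        return result;
    }

    // Walk the bitmap and collect the ids whose bit is set.
    public static List<int> ToIds(byte[] bits)
    {
        var ids = new List<int>();
        for (int i = 0; i < bits.Length * 8; i++)
        {
            if ((bits[i / 8] & (1 << (i % 8))) != 0)
                ids.Add(i);
        }
        return ids;
    }
}
```

For example, if questions 1, 3 and 9 are tagged .net and question 3 is also tagged jquery, then ToIds(AndNot(dotNet, jquery)) gives back questions 1 and 9. Note that the cost depends only on the length of the Bitmaps, not on how many questions are skipped or excluded.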

The main drawback is that we have to create a Bitmap index for each tag (C#, .NET, Java, etc) for every sort order (LastActivityDate, CreationDate, Score, ViewCount, AnswerCount), so we soon use up a lot of memory. The Sept 2014 Stack Overflow dataset contains just under 8 million questions and so at 8 questions per byte, a single Bitmap needs 976KB or 0.95MB. This adds up to an impressive 149GB (0.95MB * 32,000 Tags * 5 sort orders).
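
The arithmetic behind that figure can be checked with a few lines of code. This is only a back-of-the-envelope sketch: the BitmapMemoryEstimate class is hypothetical, it uses the 7,990,787 questions reported in Part 1 and it treats 1GB as 1024^3 bytes.

```csharp
using System;

public static class BitmapMemoryEstimate
{
    // Bytes needed for one uncompressed bitmap, at one bit per question.
    public static long BytesPerBitmap(long questions)
    {
        return (questions + 7) / 8;
    }

    // Total size in GB for one bitmap per tag, per sort order.
    public static double TotalGigabytes(long questions, int tags, int sortOrders)
    {
        double totalBytes = (double)BytesPerBitmap(questions) * tags * sortOrders;
        return totalBytes / (1024.0 * 1024.0 * 1024.0);
    }
}
```

Calling TotalGigabytes(7990787, 32000, 5) comes out at a little under 149GB, which matches the rough figure above.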

Compressed Bitmaps

Fortunately there is a way to heavily compress the Bitmaps using a form of Run-length encoding. To do this I made use of the C# version of the excellent EWAH library, which is based on the research carried out in the paper Sorting improves word-aligned bitmap indexes by Daniel Lemire and others. EWAH has the added benefit that you don’t need to uncompress the Bitmap to perform the bitwise operations, they can be done in-place (for an idea of how this is done take a look at this commit where I added a single in-place AndNot function to the existing library).

However if you don’t want to read the research paper, the diagram below shows how the Bitmap is compressed into 64-bit words that have 1 or more bits set, plus runs of repeating zeros or ones. So 31 0x00 indicates that 31 instances of a 64-bit word (with all the bits set to 0) have been encoded as a single value, rather than as 31 individual words.

0 0x00
1 words
        [   0]=                   17,  2 bits set ->
        {0000000000000000000000000000000000000000000000000000000000010001}
31 0x00
1 words
        [   0]=        2199023255552,  1 bits set ->
        {0000000000000000000000100000000000000000000000000000000000000000}
18 0x01
1 words
        [   0]=                   64,  1 bits set ->
        {0000000000000000000000000000000000000000000000000000000001000000}
48 0x01
3 words
        [   0]=              1048576,  1 bits set ->
        {0000000000000000000000000000000000000000000100000000000000000000}
        [   1]=     9007199254740992,  1 bits set ->
        {0000000000100000000000000000000000000000000000000000000000000000}
        [   2]=     9007199304740992,  13 bits set ->
        {0000000000100000000000000000000000000010111110101111000010000000}
131 0x00
1 words
        [   0]=            536870912,  1 bits set ->
        {0000000000000000000000000000000000100000000000000000000000000000}
....
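
The layout in the diagram (a run of identical words, followed by the words that couldn't be compressed) can be sketched in code as shown below. To be clear, this is a deliberately simplified illustration of word-aligned run-length encoding and not the real EWAH format, which packs the run information into marker words. The Segment and WordAlignedRle names are invented for this example.

```csharp
using System;
using System.Collections.Generic;

public sealed class Segment
{
    public bool RunBit;      // the bit repeated throughout the run (all 0s or all 1s)
    public long RunLength;   // how many identical 64-bit words the run replaces
    public List<ulong> Literals = new List<ulong>(); // mixed words that follow the run
}

public static class WordAlignedRle
{
    public static List<Segment> Compress(ulong[] words)
    {
        var segments = new List<Segment>();
        Segment current = null;
        foreach (var word in words)
        {
            bool isRunWord = word == 0UL || word == ulong.MaxValue;
            if (isRunWord)
            {
                bool bit = word == ulong.MaxValue;
                // Start a new segment unless we are still extending a run of the same bit
                if (current == null || current.Literals.Count > 0 || current.RunBit != bit)
                {
                    current = new Segment { RunBit = bit };
                    segments.Add(current);
                }
                current.RunLength++;
            }
            else
            {
                // A word with a mixture of 0s and 1s is stored as-is
                if (current == null)
                {
                    current = new Segment(); // a run of length 0, as in "0 0x00" above
                    segments.Add(current);
                }
                current.Literals.Add(word);
            }
        }
        return segments;
    }
}
```

With this scheme a run of 31 empty words costs the same as a run of 1, which is why sparse Bitmaps compress so well and why bits that are spread evenly across the whole range are the worst case.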

To give an idea of the space savings that can be achieved, the table below shows the size in bytes of compressed Bitmaps that have varying numbers of individual bits set to 1 (for comparison, uncompressed Bitmaps are 1,000,000 bytes or 0.95MB).

# Bits Set	Size in Bytes
1	24
10	168
25	408
50	808
100	1,608
200	3,208
400	6,408
800	12,808
1,600	25,608
32,000	512,008
64,000	1,000,008
128,000	1,000,008

As you can see it’s not until we get close to 64,000 bits set (62,016 to be precise) that we match the size of the regular Bitmaps. Note: in these tests I was setting the bits with an evenly spaced distribution across the entire range of 8 million possible bits. The compression is also dependent on which bits are set, so this is a worst case. The more the bits are clumped together (within the same byte), the more it will be compressed.

So over the entire Stack Overflow data set of 32,000 Tags, the Bitmaps compress down to an impressive 1.17GB, compared to 149GB uncompressed!

Results

But do queries against compressed Bitmaps actually perform faster than the naive queries using HashSets (see code above)? Well yes they do and in some cases the difference is significant.

As you can see below, for AND NOT queries they are much faster, especially compared to the worst-case where the regular/naive code takes over 150 ms and the compressed Bitmap code takes ~5 ms (the x-axis is # of excluded/skipped questions and the y-axis is time in milliseconds).

For reference there are 194,384 questions tagged with .net and 528,490 tagged with jquery.

To ensure I’m being fair, I should point out that the compressed Bitmap queries are slower for OR queries, as shown below. But note the scale, they take ~5 ms compared to ~1-2 ms for the regular queries, so the compressed Bitmap queries are still fast! The nice thing about the compressed Bitmap queries is that they take the same amount of time, regardless of how many questions we skip, whereas the regular queries get slower as the # of excluded/skipped questions increases.

If you are interested the results for all the query types are available:

Future Posts

But there are still more things to implement. In future posts I hope to cover the following:

Currently my implementation doesn’t play nicely with the Garbage Collector and it does lots of allocations. I will attempt to replicate the “no-allocations” rule that Stack Overflow have after their battle with the .NET GC

How a DDOS attack on TagServer might have been caused

In October, we had a situation where a flood of crafted requests were causing high resource utilization on our Tag Engine servers, which is our internal application for associating questions and tags in a high-performance way.

The post The Stack Overflow Tag Engine – Part 3 first appeared on my blog Performance is a Feature!

The Stack Overflow Tag Engine – Part 2

2015-08-19T00:00:00+00:00

I’ve added a Resources and Speaking page to my site, check them out if you want to learn more. There’s also a video available of my NDC London 2014 talk “Performance is a Feature!”.

Recap of Stack Overflow Tag Engine

This is the long-delayed part 2 of a mini-series looking at what it might take to build the Stack Overflow Tag Engine. If you haven’t read part 1, I recommend reading it first.

Since the first part was published, Stack Overflow published a nice performance report, giving some more stats on the Tag Engine Servers. As you can see they run the Tag Engine on some pretty powerful servers, but only have a peak CPU usage of 10%, which means there’s plenty of headroom available. It’s a nice way of being able to cope with surges in demand or busy times of the day.

Ignored Tag Preferences

In part 1, I only really covered the simple things, i.e. a basic search for all the questions that contain a given tag, along with multiple sort orders (by score, view count, etc). But the real Tag Engine does much more than that, for instance:

What is he talking about here? Well any time you do a tag search, after the actual search has been done per-user exclusions can then be applied. These exclusions are configurable and allow you to set “Ignored Tags”, i.e. tags that you don’t want to see questions for. Then when you do a search, it will exclude these questions from the results.

Note: it will let you know if there were questions excluded due to your preferences, which is a pretty nice user-experience. If that happens, you get this message (it can also be configured so that matching questions are greyed out instead):

Now most people probably have just a few exclusions, maybe tens at most, but fortunately @leppie, a Stack Overflow power-user, got in touch with me and shared his list of preferences.

You’ll need to scroll across to appreciate the full extent of this list, but here are some statistics to help you:

It contains 3,753 items, of which 210 are wildcards (e.g. cocoa* or *hibernate*)

The tags and wildcards expand to 7,677 tags in total (out of a possible 30,529 tags)

There are 6,428,251 questions (out of 7,990,787) that have at least one of the 7,677 tags in them!

Wildcards

If you want to see the wildcard expansion in action you can visit the URLs below:

*java*
- [facebook-javascript-sdk] [java] [java.util.scanner] [java-7] [java-8] [javabeans] [javac] [javadoc] [java-ee] [java-ee-6] [javafx] [javafx-2] [javafx-8] [java-io] [javamail] [java-me] [javascript] [javascript-events] [javascript-objects] [java-web-start]
.net*
- [.net] [.net-1.0] [.net-1.1] [.net-2.0] [.net-3.0] [.net-3.5] [.net-4.0] [.net-4.5] [.net-4.5.2] [.net-4.6] [.net-assembly] [.net-cf-3.5] [.net-client-profile] [.net-core] [.net-framework-version] [.net-micro-framework] [.net-reflector] [.net-remoting] [.net-security] [.nettiers]

Now a simple way of doing these matches is the following, i.e. loop through the wildcards and compare each one with every single tag to see if it could be expanded to match that tag. (IsActualMatch(..) is a simple method that does a basic string StartsWith, EndsWith or Contains as appropriate)

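var expandedTags = new HashSet<string>();
foreach (var tagToExpand in wildcardsToExpand)
{
    if (IsWildCard(tagToExpand))
    {
        // Strip the '*' so we are left with just the text to match
        var rawTagPattern = tagToExpand.Replace("*", "");
        // Compare the wildcard against every single tag
        // (allTags is assumed to be a Dictionary keyed by tag name)
        foreach (var tag in allTags.Keys)
        {
            if (IsActualMatch(tag, tagToExpand, rawTagPattern))
                expandedTags.Add(tag);
        }
    }
    else if (allTags.ContainsKey(tagToExpand))
    {
        // Not a wildcard, so just include the tag itself if it exists
        expandedTags.Add(tagToExpand);
    }
}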

This works fine with a few wildcards, but it’s not very efficient. Even on a relatively small data-set containing 32,000 tags, it’s slow when comparing it to 210 wildcardsToExpand, taking over a second. After chatting to a few of the Stack Overflow developers on Twitter, they consider a Tag Engine query that takes longer than 500 milliseconds to be slow, so a second just to apply the wildcards is unacceptable.
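
For completeness, the two helper methods used in the loop above could look something like the sketch below. The signatures are taken from the calls above, but the bodies are an assumption based on the description (StartsWith, EndsWith or Contains as appropriate). It only handles a '*' at the start and/or end of the pattern, not one in the middle.

```csharp
using System;

public static class WildcardMatching
{
    public static bool IsWildCard(string tag)
    {
        return tag.Contains("*");
    }

    // tag:           the candidate tag, e.g. "javascript"
    // tagToExpand:   the pattern as the user wrote it, e.g. "*java*" or "cocoa*"
    // rawTagPattern: tagToExpand with every '*' removed, e.g. "java"
    public static bool IsActualMatch(string tag, string tagToExpand, string rawTagPattern)
    {
        bool leading = tagToExpand.StartsWith("*", StringComparison.Ordinal);
        bool trailing = tagToExpand.EndsWith("*", StringComparison.Ordinal);

        if (leading && trailing)
            return tag.Contains(rawTagPattern);                              // *java*
        if (trailing)
            return tag.StartsWith(rawTagPattern, StringComparison.Ordinal);  // cocoa*
        if (leading)
            return tag.EndsWith(rawTagPattern, StringComparison.Ordinal);    // *hibernate
        return tag == rawTagPattern;                                         // no wildcard
    }
}
```

Each individual check is cheap, the cost comes from running it for every combination of wildcard and tag, i.e. 210 wildcards x 32,000 tags is roughly 6.7 million string comparisons.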

Trigram Index

So can we do any better? Well it turns out that there is a really nice technique for doing Regular Expression Matching with a Trigram Index that is used in Google Code Search. I’m not going to explain all the details, the linked page has a very readable explanation. But basically what you do is create an inverted index of the tags and search the index instead. That way you aren’t affected so much by the number of wildcards, because you are only searching via an index rather than doing a full search that runs over the whole list of tags.

When using Trigrams, the tags are first split into 3 letter chunks. For instance the expansion for the tag javascript is shown below (‘_’ is added to denote the start/end of a word):

_ja, jav, ava, vas, asc, scr, cri, rip, ipt, pt_

Next you create an index of all the tags as trigrams and include the position of the tag they came from so that you can reference back to it later:

_ja -> { 0, 5, 6 }

jav -> { 0, 5, 12 }

ava -> { 0, 5, 6 }

va_ -> { 0, 5, 11, 13 }

_ne -> { 1, 10, 12 }

net -> { 1, 10, 12, 15 }

…

For example if you want to match any tags that contain java anywhere in the tag, i.e. a *java* wildcard query, you fetch the index values for jav and ava, which gives you (from above) these 2 matching index items:

jav -> { 0, 5, 12 }

ava -> { 0, 5, 6 }

and you now know that the tags with index 0 and 5 are the only matches because they have jav and ava (6 and 12 don’t have both)
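
The steps above can be sketched in code as follows. This is a minimal illustration rather than the implementation in the GitHub repository: the TrigramIndex class is a made-up name, it only supports *text* (contains) queries and it adds one step that isn't spelled out above, which is to confirm each candidate with a real string comparison, because a tag can contain all the trigrams without containing the text itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class TrigramIndex
{
    private readonly List<string> tags;
    private readonly Dictionary<string, HashSet<int>> index =
        new Dictionary<string, HashSet<int>>();

    public TrigramIndex(IEnumerable<string> allTags)
    {
        tags = allTags.ToList();
        for (int position = 0; position < tags.Count; position++)
        {
            // '_' marks the start and end of the tag
            foreach (var trigram in Trigrams("_" + tags[position] + "_"))
            {
                HashSet<int> positions;
                if (!index.TryGetValue(trigram, out positions))
                    index[trigram] = positions = new HashSet<int>();
                positions.Add(position);
            }
        }
    }

    public static IEnumerable<string> Trigrams(string text)
    {
        for (int i = 0; i + 3 <= text.Length; i++)
            yield return text.Substring(i, 3);
    }

    // Returns every tag that contains 'text', i.e. a *text* wildcard query
    public List<string> Contains(string text)
    {
        HashSet<int> candidates = null;
        foreach (var trigram in Trigrams(text))
        {
            HashSet<int> positions;
            if (!index.TryGetValue(trigram, out positions))
                return new List<string>(); // a required trigram doesn't exist at all
            if (candidates == null)
                candidates = new HashSet<int>(positions);
            else
                candidates.IntersectWith(positions);
        }

        IEnumerable<int> toCheck;
        if (candidates == null)
            toCheck = Enumerable.Range(0, tags.Count); // text too short for a trigram, scan everything
        else
            toCheck = candidates.OrderBy(p => p);

        // The index only narrows down the candidates, so verify each one
        return toCheck.Where(p => tags[p].Contains(text))
                      .Select(p => tags[p])
                      .ToList();
    }
}
```

The saving comes from the fact that the final string check only runs against the handful of candidates that survive the intersection, instead of against all 32,000 tags.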

Results

On my laptop I get the results shown below, where Contains is the naive way shown above and Regex is an attempt to make it faster by using compiled Regex queries (which was actually slower)

Expanded to 7,677 tags (Contains), took 721.51 ms
Expanded to 7,677 tags (Regex), took 1,218.69 ms
Expanded to 7,677 tags (Trigrams), took  54.21 ms

As you can see, the inverted index using Trigrams is a clear winner. If you are interested, the source code is available on GitHub.

In this post I showed one way that the Tag Engine could implement wildcard matching. As I don’t work at Stack Overflow there’s no way of knowing if they use the same method or not, but at the very least my method is pretty quick!

Future Posts

But there are still more things to implement. In future posts I hope to cover the following:

Complex boolean queries, i.e. questions tagged “c# OR .NET”, “.net AND (NOT jquery)” and how to make them fast
How a DDOS attack on TagServer might have been caused

In October, we had a situation where a flood of crafted requests were causing high resource utilization on our Tag Engine servers, which is our internal application for associating questions and tags in a high-performance way.

The post The Stack Overflow Tag Engine – Part 2 first appeared on my blog Performance is a Feature!

The Stack Overflow Tag Engine – Part 1

2014-11-01T00:00:00+00:00

Stack Overflow Tag Engine

I first heard about the Stack Overflow Tag engine of doom when I read about their battle with the .NET Garbage Collector. If you haven’t heard of it before I recommend reading the previous links and then this interesting case-study on technical debt.

But if you’ve ever visited Stack Overflow you will have used it, maybe without even realising. It powers the pages under stackoverflow.com/questions/tagged, for instance you can find the questions tagged .NET, C# or Java and you get a page like this (note the related tags down the right-hand side):

Tag API

As well as simple searches, you can also tailor the results with more complex queries (you may need to be logged into the site for these links to work), so you can search for:

It’s worth noting that all these searches take your personal preferences into account. So if you have asked to have any tags excluded, questions containing these tags are filtered out. You can see your preferences by going to your account page and clicking on Preferences, the Ignored Tags are then listed at the bottom of the page. Apparently some power-users on the site have 100’s of ignored tags, so dealing with these is a non-trivial problem.

Publicly available Question Data set

As I said I wanted to see what was involved in building a version of the Tag Engine. Fortunately, data from all the Stack Exchange sites is available to download. To keep things simple I just worked with the posts (not their entire history of edits), so I downloaded stackoverflow.com-Posts.7z (warning: direct link to a 5.7 GB file), which appears to contain data up to the middle of September 2014. To give an idea of what is in the data set, a typical question looks like the .xml below. For the Tag Engine we only need the items highlighted in red, because it is only providing an index into the actual questions themselves, so we ignore any content and just look at the meta-data.

Below is the output of the code that runs on start-up and processes the data. You can see there are just over 7.9 million questions in the data set, taking up just over 2GB of memory when read into a List<Question>.

Took 00:00:31.623 to DE-serialise 7,990,787 Stack Overflow Questions, used 2136.50 MB
Took 00:01:14.229 (74,229 ms) to group all the tags, used 2799.32 MB
Took 00:00:34.148 (34,148 ms) to create all the "related" tags info, used 362.57 MB
Took 00:01:31.662 (91,662 ms) to sort the 191,025 arrays
After SETUP - Using 4536.21 MB of memory in total

So it takes roughly 31 seconds to de-serialise the data from disk (yay protobuf-net!) and another 3 1/2 minutes to process and sort it. At the end we are using roughly 4.5GB of memory.

Max LastActivityDate 14/09/2014 03:07:29
Min LastActivityDate 18/08/2008 03:34:29
Max CreationDate 14/09/2014 03:06:45
Min CreationDate 31/07/2008 21:42:52
Max Score 8596 (Id 11227809)
Min Score -147
Max ViewCount 1917888 (Id 184618)
Min ViewCount 1
Max AnswerCount 518 (Id 184618)
Min AnswerCount 0

Yes that’s right, there is actually a Stack Overflow question with 1.9 million views. Not surprisingly it’s locked for editing, but it’s also considered “not constructive”! The same question also has 518 answers, the most of any on the site. And if you’re wondering, the question with the highest score has an impressive 8,596 votes and is titled Why is processing a sorted array faster than an unsorted array?

Creating an Index

So what does the index actually look like? Well it’s basically a series of sorted lists (List<int>) that contain an offset into the main List<Question> that contains all the Question data. Or in a diagram, something like this:

Note: This is very similar to the way that Lucene indexes data.

It turns out that the code to do this isn’t that complex:

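tagsByLastActivityDate = new Dictionary<string, int[]>(groupedTags.Count);
// (assumed) loop over each tag and the questions that were grouped under it
foreach (var tag in groupedTags)
{
    // start with a copy of the main array, with Id's in order, { 0, 1, 2, 3, 4, 5, ..... }
    var byLastActivityDate = tag.Value.Positions.ToArray();
    Array.Sort(byLastActivityDate, comparer.LastActivityDate);
    tagsByLastActivityDate[tag.Key] = byLastActivityDate;
}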

Where the comparer is as simple as the following (note that it is sorting the byLastActivityDate array, using the values in the questions array to determine the sort order):

public int LastActivityDate(int x, int y)
{
    if (questions[y].LastActivityDate == questions[x].LastActivityDate)
        return CompareId(x, y);
    // Compare LastActivityDate DESCENDING, i.e. most recent is first
    return questions[y].LastActivityDate.CompareTo(questions[x].LastActivityDate);
}

So once we’ve created the sorted list on the left and right of the diagram above (Last Edited and Score), we can just traverse them in order to get the indexes of the Questions. For instance if we walk through the Score array in order (1, 2, .., 7, 8), collecting the Id’s as we go, we end up with { 8, 4, 3, 5, 6, 1, 2, 7 }, which are the array indexes for the corresponding Questions. The code to do this is the following, taking account of the pageSize and skip values:

var result = queryInfo[tag]
        .Skip(skip)
        .Take(pageSize)
        .Select(i => questions[i])
        .ToList();
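
Putting the pieces together, a cut-down version of the whole idea (group by tag, sort each group, then page through it) might look like the sketch below. The TagIndexSketch class, the simplified Question type and the choice of Score as the only sort order are all assumptions made to keep the example short, the real code has one index per sort order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Question
{
    public int Id;
    public int Score;
    public string[] Tags;
}

public static class TagIndexSketch
{
    // For each tag, build an array of offsets into 'questions', sorted by Score descending
    public static Dictionary<string, int[]> BuildByScore(List<Question> questions)
    {
        var grouped = new Dictionary<string, List<int>>();
        for (int i = 0; i < questions.Count; i++)
        {
            foreach (var tag in questions[i].Tags)
            {
                List<int> positions;
                if (!grouped.TryGetValue(tag, out positions))
                    grouped[tag] = positions = new List<int>();
                positions.Add(i);
            }
        }

        var index = new Dictionary<string, int[]>(grouped.Count);
        foreach (var tag in grouped)
        {
            var byScore = tag.Value.ToArray();
            // Highest score first, ties broken by Id so the order is deterministic
            Array.Sort(byScore, (x, y) =>
            {
                int result = questions[y].Score.CompareTo(questions[x].Score);
                return result != 0 ? result : questions[x].Id.CompareTo(questions[y].Id);
            });
            index[tag.Key] = byScore;
        }
        return index;
    }

    // Page through the pre-sorted offsets, only touching the questions we return
    public static List<Question> Query(List<Question> questions,
        Dictionary<string, int[]> index, string tag, int skip, int pageSize)
    {
        int[] positions;
        if (!index.TryGetValue(tag, out positions))
            return new List<Question>();
        return positions.Skip(skip).Take(pageSize).Select(i => questions[i]).ToList();
    }
}
```

All the expensive work (grouping and sorting) happens once at start-up, so a query is little more than an array lookup followed by Skip and Take.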

Once that’s all done, I ended up with an API that you can query in the browser. Note that the timing shown is only the time taken on the server-side, but it is correct: basic queries against a single tag are lightning quick!

Next time

Now that the basic index is setup, next time I’ll be looking at how to handle:

Complex boolean queries, e.g. .net or jquery- and c#
Power users who have 100’s of excluded tags

and anything else that I come up with in the meantime.

The post The Stack Overflow Tag Engine – Part 1 first appeared on my blog Performance is a Feature!

The Art of Benchmarking (Updated 2014-09-23)

2014-09-19T00:00:00+00:00

tl;dr

Benchmarking is hard, it’s very easy to end up “not measuring, what you think you are measuring”

Update (2014-09-23): Sigh - I made a pretty big mistake in these benchmarks, fortunately Reddit user zvrba corrected me:

Yep, can’t argue with that, see Results and Resources below for the individual updates.

Intro to Benchmarks

To start with, let’s clarify what types of benchmarks we are talking about. Below is a table from the DEVOXX talk by Aleksey Shipilev, who works on the Java Micro-benchmarking Harness (JMH)

kilo: > 1000 s, Linpack
????: 1…1000 s, SPECjvm2008, SPECjbb2013
milli: 1…1000 ms, SPECjvm98, SPECjbb2005
micro: 1…1000 us, single webapp request
nano: 1…1000 ns, single operations
pico: 1…1000 ps, pipelining

He then goes on to say:

Millibenchmarks are not really hard
Microbenchmarks are challenging, but OK
Nanobenchmarks are the damned beasts!
Picobenchmarks…

This post is talking about micro and nano benchmarks, that is ones where the code we are measuring takes microseconds or nanoseconds to execute.

First attempt

Let’s start with a nice example available from Stack Overflow:

static void Profile(string description, int iterations, Action func) 
{
    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    // warm up 
    func();

    var watch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++) 
    {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine("Time Elapsed {0} ms", 
                      watch.Elapsed.TotalMilliseconds);
}

You then use it like this:

Profile("a description", how_many_iterations_to_run, () =>
{
   // ... code being profiled
});

Now there are a lot of good things that this code sample is doing:

Eliminating the overhead of the .NET GC (as much as possible), by making sure it has run before the timing takes place
Calling the function that is being profiled, outside the timing loop, so that the overhead of the .NET JIT Compiler isn’t included in the benchmark itself. The first time a function is called the JITter steps in and converts the code from IL into machine code, so that it can actually be executed by the CPU.
Using Stopwatch rather than DateTime.Now, Stopwatch is a high-precision timer with a low-overhead, DateTime.Now isn’t!
Running a lot of iterations of the code (100,000’s), to give an accurate measurement

Now far be it from me to criticise a highly voted Stack Overflow answer, but that’s exactly what I’m going to do! I should add that for a whole range of scenarios the Stack Overflow code is absolutely fine, but it does have its limitations. There are several situations where this code doesn’t work, because it fails to actually profile the code you want it to.

Baseline benchmark

But first let’s take a step back and look at the simplest possible case, with all the code inside the function. We’re going to measure the time that Math.Sqrt(..) takes to execute, nice and simple:

static void ProfileDirect(string description, int iterations) 
{
    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    // warm up
    Math.Sqrt(123.456);

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < iterations; i++)
    {
        Math.Sqrt(123.456);
    }
    watch.Stop();
    Console.WriteLine("ProfileDirect - " + description);
    Console.WriteLine(
        "{0:0.00} ms ({1:N0} ticks) (over {2:N0} iterations), {3:N0} ops/milliseconds",
        watch.ElapsedMilliseconds, watch.ElapsedTicks, iterations, 
        (double)iterations / watch.ElapsedMilliseconds);
}

And the results:

ProfileDirect - 2.00 ms (7,822 ticks) (over 10,000,000 iterations), 5,000,000 ops/millisecond

That’s 5 million operations per millisecond. I know CPUs are fast, but that seems quite high!

For reference, the assembly code that the JITter produced is below, from this you can see that there is no sqrt instruction as we’d expect there to be. So in effect we are timing an empty loop!

;   91:             var watch = new Stopwatch();
000000a6  lea         rcx,[5D3EBA90h] 
000000ad  call        000000005F6722F0 
000000b2  mov         r12,rax 
000000b5  mov         rcx,r12 
000000b8  call        000000005D284EF0 
;   92:             watch.Start();
000000bd  mov         rcx,r12 
000000c0  call        000000005D284E60 
;   93:             for (int i = 0; i < iterations; i++)
000000c5  mov         r13d,dword ptr [rbp+58h] 
000000c9  test        r13d,r13d 
000000cc  jle         00000000000000D7 
000000ce  xor         eax,eax 
000000d0  inc         eax 
000000d2  cmp         eax,r13d 
000000d5  jl          00000000000000D0 
;   97:             }
;   98:             watch.Stop();
000000d7  mov         rcx,r12 
000000da  call        000000005D32CBD0 
;   99:             Console.WriteLine(description + " (ProfileDirect)");

Note: To be able to get the optimised version of the assembly code that the JITter produces, see this MSDN page. If you just debug the code normally in Visual Studio, you only get the un-optimised code, which doesn’t help at all.

Dead-code elimination

One of the main problems with writing benchmarks is that you are often fighting against the just-in-time (JIT) compiler, which is trying to optimise the code as much as it can. One of the many things it does is remove code that it thinks is not needed, or to be more specific, code it thinks has no side-effects. This is non-trivial to do, there are some really tricky edge-cases to worry about, aside from the more obvious problem of knowing which code causes side-effects and which doesn’t. But this is exactly what is happening in the original profiling code.

Aside: For a full list of all the optimisations that the .NET JIT Compiler performs, see this very thorough SO answer.

So let’s fix the original code, by storing the result of Math.Sqrt in a variable:

private static double result;
static void ProfileDirect(string description, int iterations) 
{
    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    // warm up
    Math.Sqrt(123.456);

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < iterations; i++)
    {
        result = Math.Sqrt(123.456);
    }
    watch.Stop();
    Console.WriteLine("ProfileDirect - " + description);
    Console.WriteLine(
        "{0:0.00} ms ({1:N0} ticks) (over {2:N0} iterations), {3:N0} ops/milliseconds",
        watch.ElapsedMilliseconds, watch.ElapsedTicks, 
        iterations, (double)iterations / watch.ElapsedMilliseconds);
}

Note: result has to be a class-level field, it can’t be local to the method, i.e. double result = Math.Sqrt(123.456). This is because the JITter is clever enough to figure out that the local variable isn’t accessed outside of the method and optimise it away. Again, you are always fighting against the JITter.

So now the results look like this, which is a bit more sane!

ProfileDirectWithStore - 68.00 ms (180,801 ticks) (over 10,000,000 iterations), 147,059 ops/millisecond

Loop-unrolling

One other thing you have to look out for is whether or not the time spent running the loop is dominating the code you want to profile. In this case Math.Sqrt() ends up as a few assembly instructions, so less time is spent executing that, compared to the instructions needed to make the for (..) loop happen.

To fix this we can unroll the loop, so that we execute Math.Sqrt(..) multiple times per loop, but to compensate we run the loop fewer times. The code now looks like this:

static void ProfileDirectWithStoreUnrolledx10(string description, int iterations)
{
    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    // warm up
    var temp = Math.Sqrt(123.456);

    var watch = new Stopwatch();
    watch.Start();
    var loops = iterations / 10;
    for (int i = 0; i < loops; i++)
    {
        result = Math.Sqrt(123.456); 
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
        result = Math.Sqrt(123.456);
    }
    watch.Stop();
    Console.WriteLine("ProfileDirectWithStoreUnrolled x10 - " + description);
    Console.WriteLine(
        "{0:0.00} ms ({1:N0} ticks) (over {2:N0} iterations), {3:N0} ops/milliseconds",
        watch.ElapsedMilliseconds, watch.ElapsedTicks, iterations,
        (double)iterations / watch.ElapsedMilliseconds);
}

And now the result is:

ProfileDirectWithStoreUnrolled x10 - 47.00 ms (124,582 ticks) (over 10,000,000 iterations), 212,766 ops/millisecond

So we are now doing 212,766 ops/millisecond, compared to 147,059 when we didn’t unroll the loop. I did some further tests to see if unrolling the loop 20 or 40 times made any further difference and it did continue to get slightly faster, but the change was not significant.

Results

These results were produced by running the code in RELEASE mode and launching the application from outside Visual Studio, also the .exe’s were explicitly compiled in x86/x64 mode and optimisations were turned on. To ensure I didn’t mess up, I included some diagnostic code in the application, that prints out a message in red if anything is setup wrong. Finally these tests were run with .NET 4.5, so the results will be different under other versions, the JIT compilers have brought in more and more optimisations over time.
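
As a rough idea of what that kind of diagnostic code can look like, here is a minimal sketch. It is not the code from the benchmark project, the BenchmarkEnvironment class is a hypothetical name, and it only checks the two things that are easy to detect at runtime: an attached debugger and an assembly built with the JIT optimiser disabled.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;

public static class BenchmarkEnvironment
{
    // Returns a list of problems, an empty list means the setup looks sane
    public static List<string> FindProblems(Assembly assembly)
    {
        var problems = new List<string>();

        if (Debugger.IsAttached)
            problems.Add("A debugger is attached, so the JIT may not fully optimise the code");

        // DEBUG builds are marked with a DebuggableAttribute that disables the JIT optimiser
        var debuggable = assembly.GetCustomAttributes(typeof(DebuggableAttribute), false)
                                 .OfType<DebuggableAttribute>()
                                 .FirstOrDefault();
        if (debuggable != null && debuggable.IsJITOptimizerDisabled)
            problems.Add("The assembly was compiled with optimisations turned off");

        return problems;
    }

    public static void Report(Assembly assembly)
    {
        Console.WriteLine("Running as a {0} process",
                          Environment.Is64BitProcess ? "64-bit" : "32-bit");
        foreach (var problem in FindProblems(assembly))
        {
            Console.ForegroundColor = ConsoleColor.Red;
            Console.WriteLine(problem);
            Console.ResetColor();
        }
    }
}
```

You would call BenchmarkEnvironment.Report(typeof(Program).Assembly) once at start-up, before any of the timings are taken (Program here being whatever class contains your Main method).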

As seen in the chart below the best results for 64-bit (red) were achieved when we unrolled the loop (“ProfileDirectWithStoreUnrolled”). There are other results that were faster, but in these the actual code we wanted to profile was optimised away by the JITter (“Profile via an Action”, “ProfileDirect” and “ProfileDirectWithConsume”).

Update (2014-09-23): The correct results are in the chart below

CLR JIT Compiler - 32-bit v. 64-bit

You might have noticed that the 32-bit and 64-bit results in the graph vary per test, why is this? Well one reason is the fundamental difference between 32-bit and 64-bit, 64-bit has 8 byte pointers compared to 4 byte ones in 32-bit. But the larger difference is that in .NET there are 2 different JIT compilers, with different goals:

The .NET 64-bit JIT was originally designed to produce very efficient code throughout the long run of a server process. This differs from the .NET x86 JIT, which was optimized to produce code quickly so that the program starts up fast. Taking time to compile efficient code made sense when 64-bit was primarily for server code. But “server code” today includes web apps that have to start fast. The 64-bit JIT currently in .NET isn’t always fast to compile your code, meaning you have to rely on other technologies such as NGen or background JIT to achieve fast program startup.

However one benefit of RyuJIT (the next generation JIT Compiler) is that it’s a common code base for 32-bit and 64-bit, so when it comes out, everything may change! (BTW RyuJIT, what a great name)

For reference the assembly code that is generated in both cases is available:

32-bit version where the fsqrt instruction is used
64-bit version where the sqrtsd instruction is used

But there’s still more to do

Even though this post is over 2000 words long, it still hasn’t covered:

How you store and present the results
How users can write their own benchmarks
Multi-threaded benchmarks
Allowing state in benchmarks

And there’s even more than that to worry about, see the complete list below, taken from this discussion thread on the excellent mechanical sympathy group:

Dynamic selection of benchmarks.
Loop optimizations.
Dead-code elimination.
Constant foldings
Non-throughput measures
Synchronize iterations
Multi-threaded sharing
Multi-threaded setup/teardown
False-sharing
Asymmetric benchmarks
Inlining

Note: these are only the headings, the discussion goes into a lot of detail about how these issues are solved in JMH. But whilst the JVM and the CLR do differ in a number of ways, a lot of what is said applies to writing benchmarks for the CLR.

The summary from Aleksey sums it all up really!

The benchmarking harness business is very hard, and very non-obvious. My own experience tells me even the smartest people make horrible mistakes in them, myself included. We try to get around that by fixing more and more things in JMH as we discover more, even if that means significant API changes….

The job for a benchmark harness it to provide [a] reliable benchmarking environment …

Resources

Here’s a list of all the code samples and other data used in making this post:

The full benchmarking code Updated (2014-09-23)
Profile via an Action
Profile Direct
Profile Direct, storing the result (BROKEN)
Profile Direct, storing the result (FIXED)
Profile Direct, storing the result, unrolled 10 times
Spreadsheet of results Updated (2014-09-23)
Generated assembly code Updated (2014-09-23):
Profile via an Action
Profile Direct
Profile Direct and storing the result (BROKEN)
Profile Direct and storing the result (FIXED)

Stack Overflow - performance lessons (part 2)

2014-09-05T00:00:00+00:00

In Part 1 I looked at some of the more general performance issues that can be learnt from Stack Overflow (the team/product), in Part 2 I’m looking at some of the examples of coding performance lessons.

Please don’t take these blog posts as blanket recommendations of techniques that you should go away and apply to your code base. They are specific optimisations that you can use if you want to squeeze every last drop of performance out of your CPU.

Also, don’t optimise anything unless you have measured and profiled first, you will probably optimise the wrong thing!

Battles with the .NET Garbage Collector

I first learnt about the performance work done in Stack Overflow (the site/company), when I read the post on their battles with the .NET Garbage Collector (GC). If you haven’t read it, the short summary is that they were experiencing page load times that would suddenly spike to the 100’s of msecs, compared to the normal sub 10 msecs they were used to. After investigating for a few days they narrowed the problem down to the behaviour of the GC. GC pauses are a real issue and even the new modes available in .NET 4.5 don’t fully eliminate them, see my previous investigation for more information.

One thing to remember is that to make this all happen, they needed the following items in place:

Monitoring in production - these issues would only show up under load, once the application had been running for a while, so they would be very hard to recreate in staging or during development.
Multiple measurements - they recorded both ASP.NET and IIS web server response times and were able to cross-reference them (see image below).
Storing outliers - these spikes rarely happened so having detailed metrics was needed, averages hide too much information.
Good knowledge of the .NET GC - according to the article, it took them 3 weeks to identify and fix this issue “So Marc and I set off on a 3 week adventure to resolve the memory pressure.”

You can read all the gory details of the fix and the follow-up in the posts below, but the tl;dr is that they removed all of the work that the .NET Garbage Collector had to do, thus eliminating the pauses:

In managed code we trust, our recent battles with the .NET Garbage Collector
Assault by GC
Technical Debt, a case study : tags (a follow-up post)

Jil - A fast JSON (de)serializer, with a number of somewhat crazy optimization tricks.

But if you think that the struct based code they wrote is crazy, their JSON serialisation library, Jil, takes things to a new level. This is all in the pursuit of the maximum performance and based on their benchmarks, it seems to be working! Note: protobuf-net is a binary serialisation library that doesn’t support JSON, it’s only included as a base-line:

For instance, instead of writing code like this

public T Serialise<T>(string json, bool isJSONP)
{
  if (isJSONP)
  {
    // code to handle JSONP
  }
  else 
  {
    // code to handle regular JSON
  }
}

They write code like this, which is a classic memory/speed trade-off.

public ISerialiser GetSerialiser(bool isJSONP)
{
  if (isJSONP)
    return new SerialiserWithJSONP();
  else
    return new Serialiser();
}

public class SerialiserWithJSONP : ISerialiser
{
  public T Serialise<T>(string json)
  {
    // code to handle JSONP  
  }
}

public class Serialiser : ISerialiser
{
  public T Serialise<T>(string json)
  {
    // code to handle regular JSON
  }
}

This means that during serialisation there doesn’t need to be any “feature switches”. They just emit the different versions of the code at creation time and, based on the options you specify, hand you the correct one. Of course the classes (SerialiserWithJSONP and Serialiser in this case) are dynamically created just once and then cached for later re-use, so the cost of the dynamic code generation is only paid once.
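The "choose the implementation once, then cache it" idea can be sketched in plain C#. This is only an illustration: Jil emits the specialised code as IL at runtime, whereas the sketch below uses two hand-written classes. The names PlainSerialiser, JsonpSerialiser and SerialiserCache, the string-based Serialise signature and the fixed "callback" name are all assumptions made for the example, they are not Jil's API.

```csharp
using System;
using System.Collections.Concurrent;

public interface ISerialiser
{
    string Serialise(string value);
}

// Handles regular JSON only, so there is no per-call "isJSONP" check.
// (No escaping is done, this is an illustration rather than a real serialiser.)
public sealed class PlainSerialiser : ISerialiser
{
    public string Serialise(string value) => "\"" + value + "\"";
}

// Handles JSONP only, using a hypothetical fixed callback name
public sealed class JsonpSerialiser : ISerialiser
{
    public string Serialise(string value) => "callback(\"" + value + "\")";
}

public static class SerialiserCache
{
    private static readonly ConcurrentDictionary<bool, ISerialiser> Cache =
        new ConcurrentDictionary<bool, ISerialiser>();

    // The branch on the option only runs when an entry is created,
    // later calls get the cached instance back.
    public static ISerialiser Get(bool isJsonp) =>
        Cache.GetOrAdd(isJsonp, jsonp => jsonp
            ? (ISerialiser)new JsonpSerialiser()
            : new PlainSerialiser());
}
```

Calling SerialiserCache.Get(true) repeatedly returns the same JsonpSerialiser instance, so the hot Serialise path never has to look at the option again.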

By doing this the code plays nicely with CPU branch prediction, because it has a nice predictable pattern that the CPU can easily work with. It also has the added benefit of making the methods smaller, which may make them candidates for in-lining by the .NET JITter.

For more examples of optimisations used, see the links below

Jil - Marginal Gains.

On top of this they measure everything to ensure that the optimisations actually work! These tests are all run as unit-tests, allowing easy generation of the results, take a look at ReorderMembers for instance.

Note: All the times are in milliseconds, but timed over 1000’s of runs, not per call.

Feature name	Original	Improved	Difference
ReorderMembers	2721	2712	9
SkipNumberFormatting	166	163	3
UseCustomIntegerToString	589	339	250
SkipDateTimeMathMethods	108	100	8
UseCustomISODateFormatting	399	269	130
UseFastLists	277	267	10
UseFastArrays	486	469	17
UseFastGuids	744	304	440
AllocationlessDictionaries	134	127	7
PropagateConstants	77	35	42
AlwaysUseCharBufferForStrings	63	56	7
UseHashWhenMatchingMembers	141	131	10
DynamicDeserializer_UseFastNumberParsing	94	51	43
DynamicDeserializer_UseFastIntegerConversion	131	131	2
UseHashWhenMatchingEnums	38	10	28
UseCustomWriteIntUnrolledSigned	2182	1765	417

This is very similar to the “Marginal Gains” approach that worked so well for British Cycling in the last Olympics:

There’s fitness and conditioning, of course, but there are other things that might seem on the periphery, like sleeping in the right position, having the same pillow when you are away and training in different places. Do you really know how to clean your hands? Without leaving the bits between your fingers? If you do things like that properly, you will get ill a little bit less. “They’re tiny things but if you clump them together it makes a big difference.”

Summary

All-in-all there is a lot to be learnt from code and blog posts that have come from Stack Overflow developers, I’m glad they’ve shared everything so openly. Also by having a high-profile website running on .NET, it stops the argument that .NET is inherently slow.

The post Stack Overflow - performance lessons (part 2) first appeared on my blog Performance is a Feature!

Stack Overflow - performance lessons (part 1)

2014-09-01T00:00:00+00:00

This post is part of a semi-regular series, you can find the other entries here and here

Before diving into any of the technical or coding aspects of performance, it is really important to understand that the main lesson to take-away from Stack Overflow (the team/product) is that they take performance seriously. You can see this from the blog post that Jeff Atwood wrote, it’s a part of their culture and has been from the beginning:

But anyone can come up with a catchy line like “Performance is a Feature!!”, it only means something if you actually carry it out. Well it’s clear that Stack Overflow have done just this, not only is it a Top 100 website, but they’ve done the whole thing with very few servers and several of those are running at only 15% of their capacity, so they can scale up if needed and/or deal with large traffic bursts.

Update (2/9/2014 9:25:35 AM): Nick Craver tweeted me to say that the High Scalability post is a bad summarisation (apparently they have got things wrong before), so take what it says with a grain of salt!

Aside: If you want even more information about their set-up, I definitely recommend reading the Hacker News discussion and this post from Nick Craver, one of the Stack Overflow developers.

Interestingly they have gone for scale-up rather than scale-out, by building their own servers instead of using cloud hosting. The reason for this? Simply to get better performance!

Why do I choose to build and colocate servers? Primarily to achieve maximum performance. That’s the one thing you consistently just do not get from cloud hosting solutions unless you are willing to pay a massive premium, per month, forever: raw, unbridled performance….

Taking performance seriously

It’s also worth noting that they are even prepared to sacrifice the ability to unit test their code, because it gives them better performance.

Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you’re doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.

Now, this isn’t for everyone and even suggesting that unit testing isn’t needed or useful tends to produce strong reactions. But you can see that they are making an informed trade-off and they are prepared to go against the conventional wisdom (“write code that is unit-testing friendly”), because it gives them the extra performance they want. One caveat is that they are in a fairly unique position: they have passionate users who are willing to act as beta-testers, so having fewer unit tests might not harm them. Not everyone has that option!

To get around garbage collection problems, only one copy of a class used in templates are created and kept in a cache. Everything is measured, including GC operation, from statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.

For a more detailed discussion on why this approach to coding can make a difference to GC pressure, see here and here.

Another non-technical lesson is that Stack Overflow are committed to doing things out in the open and sharing what they create as code or lessons-learnt blog posts. Their list of open source projects includes:

MiniProfiler - which gives developers an overview of where the time is being spent when a page renders (front-end, back-end, database, etc)
Dapper - developed because Entity Framework imposed too large an overhead when materialising the results of a SQL query into POCO’s
Jil - a newly released JSON serialisation library, developed so that they can get the best possible performance. JSON parsing and serialisation must be a very common operation across their web-servers, so shaving off microseconds from the existing libraries is justified.
TagServer - a custom .NET service that was written to make the complex tag searches quicker than they would be if done directly in SQL Server.
Opserver - fully featured monitoring tool, giving their operation engineers a deep-insight into what their servers are doing in production.

All these examples show that they are not afraid to write their own tools when the existing ones aren’t up to scratch, don’t have the features they need or don’t give the performance they require.

Measure, profile and display

As shown by the development of Opserver, they care about measuring performance accurately even (or especially) in production. Take a look at the images below and you can see not only the detailed level of information they keep, but how it is displayed in a way that makes it easy to see what is going on (there are also more screenshots available).

Finally I really like their guidelines for achieving good observability in a production system. They serve as a really good check-list of things you need to do if you want to have any chance of knowing what your system is up to in production. I would imagine these steps and the resulting screens they designed into Opserver have been built up over several years of monitoring and fixing issues in the Stack Overflow sites, so they are battle-hardened!

5 Steps to Achieving Good Observability: In order to achieve good observability an SRE team (often in conduction with the rest of the organization) needs to do the following steps.

Instrument your systems by publishing metrics and events

Gather those metrics and events in a queryable data store(s)

Make that data readily accessible

Highlight metrics that are, or are trending towards abnormal or out of bounds behavior

Establish the resources to drill down into abnormal or out of bounds behavior

Next time

Next time I’ll look at some concrete examples of performance lessons for the open source projects that SO have set-up, including the crazy tricks they use in Jil, their JSON serialisation library.

The post Stack Overflow - performance lessons (part 1) first appeared on my blog Performance is a Feature!

How to mock sealed classes and static methods

2014-08-14T00:00:00+00:00

Typemock & JustMock are 2 commercially available mocking tools that let you achieve something that should be impossible. Unlike all other mocking frameworks, they let you mock sealed classes, static and non-virtual methods, but how do they do this?

Dynamic Proxies

Firstly it’s worth covering how regular mocking frameworks work with virtual methods or interfaces. Suppose you have a class you want to mock, like so:

public class TestingMocking
{
  public virtual void MockMe()
  {
    ..
  }
}

At runtime the framework will generate a mocked class like the one below. As it inherits from TestingMocking you can use it instead of your original class, but the mocked method will be called instead.

public class DynamicProxy : TestingMocking
{
  public override void MockMe()
  {
    ..
  }
}

This is achieved using the DynamicMethod class available in System.Reflection.Emit, this blog post contains a nice overview and Bill Wagner has put together a more complete example that gives you a better idea of what is involved. I found that once you discover dynamic code generation is possible, you realise that it is used everywhere, for instance:

Dapper (see this gist for ver1)
Entity Framework (it enables lazy-loading when doing Code-First)
protobuf-net
Json.NET
AutoMapper
and many more!

BTW if you ever find yourself needing to dynamically emit IL code, I’d recommend using the Sigil library that was created by some of the developers at StackOverflow. It takes away a lot of the pain associated with writing and debugging IL.
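To make the dynamic proxy idea more concrete, here is a minimal sketch of generating a subclass at runtime with raw System.Reflection.Emit (TypeBuilder, not Sigil). It is not how any particular mocking framework does it. As assumptions for the example, MockMe() returns an int rather than void so that the effect of the override is visible, and the ProxySketch and CreateProxy names are made up.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public class TestingMocking
{
    // Returns a value (unlike the void version above) so the override is observable
    public virtual int MockMe() => 1;
}

public static class ProxySketch
{
    // Builds a subclass of TestingMocking whose MockMe() returns 'mockedValue'
    public static TestingMocking CreateProxy(int mockedValue)
    {
        var name = "DynamicProxies_" + Guid.NewGuid().ToString("N");
        var assembly = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName(name), AssemblyBuilderAccess.Run);
        var module = assembly.DefineDynamicModule(name);
        var type = module.DefineType(
            "DynamicProxy", TypeAttributes.Public, typeof(TestingMocking));

        var method = type.DefineMethod(
            "MockMe",
            MethodAttributes.Public | MethodAttributes.Virtual | MethodAttributes.HideBySig,
            typeof(int), Type.EmptyTypes);

        // Method body: push the constant and return it
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldc_I4, mockedValue);
        il.Emit(OpCodes.Ret);

        // Tie the new method to the virtual slot of the base method
        type.DefineMethodOverride(method, typeof(TestingMocking).GetMethod("MockMe"));

        // No constructor is defined, so a default one that calls the base is supplied
        return (TestingMocking)Activator.CreateInstance(type.CreateType());
    }
}
```

ProxySketch.CreateProxy(42).MockMe() returns 42, even though the caller only ever sees a TestingMocking reference. The approach depends entirely on MockMe being virtual and TestingMocking being inheritable.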

However dynamically generated proxies will always run into the limitation that you can’t override non-virtual methods and they also can’t do anything with static methods or sealed classes (i.e. classes that can’t be inherited).

.NET Profiling API and JITCompilationStarted() Method

How Typemock and JustMock achieve what they do is hinted at in a StackOverflow answer by a Typemock employee and is also discussed in this blog post. But they only talk about the solution, I wanted to actually write a small proof-of-concept myself, to see what is involved.

To start with the .NET profiling API is what makes this possible, but a word of warning, it is a C++ API and it requires you to write a COM component to be able to interact with it, you can’t work with it from C#. To get started I used the excellent profiler demo project from Shaun Wilde. If you want to learn more about the profiling API and in particular how you can use it to re-write methods, I really recommend looking at this code step-by-step and also reading the accompanying slides.

By using the profiling API and in particular the JITCompilationStarted method, we are able to modify the IL of any method being run by the CLR (user code or the .NET runtime), before the JITter compiles it to machine code and it is executed. This means that we can modify a method that originally looks like this:

public sealed class ClassToMock
{
  public static int StaticMethodToMock()
  {
    Console.WriteLine("StaticMethodToMock called, returning 42");
    return 42;
  }
}

So that instead it does this:

public sealed class ClassToMock
{
  public static int StaticMethodToMock()
  {
    // Inject the IL to do this instead!!
    if (Mocked.ShouldMock("ProfilerTarget.ClassToMock.StaticMethodToMock"))
      return Mocked.MockedMethod();

    Console.WriteLine("StaticMethodToMock called, returning 42");
    return 42;
  }
}

For reference, the original IL looks like this:

IL_0000 ( 0) nop
IL_0001 ( 1) ldstr (70)00023F    //"StaticMethodToMockWhatWeWantToDo called, returning 42"
IL_0006 ( 6) call (06)000006     //call Console.WriteLine(..)
IL_000B (11) nop
IL_000C (12) ldc.i4.s 2A         //return 42;
IL_000E (14) stloc.0
IL_000F (15) br IL_0014
IL_0014 (20) ldloc.0
IL_0015 (21) ret

and after code injection, it ends up like this:

IL_0000 ( 0) ldstr (70)000135
IL_0005 ( 5) call (0A)00001B     //call ShouldMock(string methodNameAndPath)
IL_000A (10) brfalse.s IL_0012
IL_000C (12) call (0A)00001C     //call MockedMethod()
IL_0011 (17) ret
IL_0012 (18) nop
IL_0013 (19) ldstr (70)00023F    //"StaticMethodToMockWhatWeWantToDo called, returning 42"
IL_0018 (24) call (06)000006     //call Console.WriteLine(..)
IL_001D (29) nop
IL_001E (30) ldc.i4.s 2A         //return 42;
IL_0020 (32) stloc.0
IL_0021 (33) br IL_0026
IL_0026 (38) ldloc.0
IL_0027 (39) ret

And that is the basics of how you can modify any .NET method, it seems relatively simple when you know how! In my simple demo I just add in the relevant IL so that a mocked method is called instead, you can see the C++ code needed to achieve this here. Of course in reality it’s much more complicated, my simple demo only deals with a very simplistic scenario, a static method that returns an int. The commercial products that do this are way more powerful and have to deal with all the issues that you can encounter when you are re-writing code at the IL level, for instance if you aren’t careful you get exceptions like this:

Running the demo code

If you want to run my demo, you need to open the solution file under step5_main_injected_method_object_array and set “ProfilerHost” as the “Start-up Project” (right-click on the project in VS) before you run. When you run it, you should see something like this:

You can see the C# code that controls the mocking below. At the moment the API in the demo is fairly limited, it only lets you turn mocking on/off and set the value that is returned from the mocked method.

static void Main(string[] args)
{
  // Without mocking enabled (the default)
  Console.WriteLine(new string('#', 90));
  Console.WriteLine("Calling ClassToMock.StaticMethodToMock() (a static method in a sealed class)");
  var result = ClassToMock.StaticMethodToMock();
  Console.WriteLine("Result: " + result);
  Console.WriteLine(new string('#', 90) + "\n");

  // With mocking enabled, doesn't call the static method, calls mocked version instead
  Console.WriteLine(new string('#', 90));
  Mocked.SetReturnValue = 1;
  Console.WriteLine("Turning ON mocking of ProfilerTarget.ClassToMock.StaticMethodToMock");
  Mocked.Configure("ProfilerTarget.ClassToMock.StaticMethodToMock", mockMethod: true);

  Console.WriteLine("Calling ClassToMock.StaticMethodToMock() (a static method in a sealed class)");
  result = ClassToMock.StaticMethodToMock();
  Console.WriteLine("Result: " + result);
  Console.WriteLine(new string('#', 90) + "\n");
}
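The Mocked class itself isn't listed in this post, so the following is only a guess at what a minimal managed side could look like, based purely on the calls used above (ShouldMock, MockedMethod, Configure and SetReturnValue). The real demo code may well differ. ClassToMockSketch is a hypothetical hand-written stand-in for the method after IL injection, so the sketch can run without a profiler attached.

```csharp
using System;
using System.Collections.Concurrent;

public static class Mocked
{
    // Which methods currently have mocking switched on (assumed design)
    private static readonly ConcurrentDictionary<string, bool> Enabled =
        new ConcurrentDictionary<string, bool>();

    // Value handed back by MockedMethod() while mocking is on
    public static int SetReturnValue { get; set; }

    public static void Configure(string methodNameAndPath, bool mockMethod)
    {
        Enabled[methodNameAndPath] = mockMethod;
    }

    // This is the call the injected IL makes at the top of the rewritten method
    public static bool ShouldMock(string methodNameAndPath)
    {
        return Enabled.TryGetValue(methodNameAndPath, out var on) && on;
    }

    public static int MockedMethod()
    {
        return SetReturnValue;
    }
}

public static class ClassToMockSketch
{
    // Hand-written equivalent of StaticMethodToMock() after the IL has been injected
    public static int StaticMethodToMock()
    {
        if (Mocked.ShouldMock("ProfilerTarget.ClassToMock.StaticMethodToMock"))
            return Mocked.MockedMethod();

        return 42;
    }
}
```

With nothing configured the method returns 42. After setting Mocked.SetReturnValue = 1 and calling Mocked.Configure(..., mockMethod: true) it returns 1, which mirrors the two halves of the Main method above.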

Other Uses for IL re-writing

Again once you learn about this mechanism, you realise that it is used in lots of places, for instance

profilers, see this SO answer for more info (Ants and JetBrains)
test coverage (NCover)
production monitoring systems

Discuss on /r/csharp

The post How to mock sealed classes and static methods first appeared on my blog Performance is a Feature!

Know thy .NET object memory layout (Updated 2014-09-03)

2014-07-04T00:00:00+00:00

Apologies to Nitsan Wakart, from whom I shamelessly stole the title of this post!

tl;dr

The .NET port of HdrHistogram can control the field layout within a class, using the same technique that the original Java code does.

Recently I’ve spent some time porting HdrHistogram from Java to .NET, it’s been great to learn a bit more about Java and get a better understanding of some low-level code. In case you’re not familiar with it, the goals of HdrHistogram are to:

Provide an accurate mechanism for measuring latency at a full-range of percentiles (99.9%, 99.99% etc)
Minimise the overhead needed to perform the measurements, so as to not impact your application

You can find a full explanation of what it does and how point 1) is achieved in the project readme.

Minimising overhead

But it’s the 2nd of the points that I’m looking at in this post, by answering the question

How does HdrHistogram minimise its overhead?

But first it makes sense to start with the why, and it turns out it’s pretty simple. HdrHistogram is meant for measuring low-latency applications, so if it had a large overhead or caused the GC to do extra work, it would negatively affect the performance of the application it was meant to be measuring.

Also imagine for a minute that HdrHistogram took 1/10,000th of a second (0.1 milliseconds or 100,000 nanoseconds) to record a value. If this was the case you could only hope to accurately record events lasting down to a millisecond (1/1,000th of a second), anything faster would not be possible as the overhead of recording the measurement would take up too much time.

As it is HdrHistogram is much faster than that, so we don’t have to worry! From the readme:

Measurements show value recording times as low as 3-6 nanoseconds on modern (circa 2012) Intel CPUs.

So how does it achieve this, well it does a few things:

It doesn't do any memory allocations when storing a value, all allocations are done up front when you create the histogram. Upon creation you have to specify the range of measurements you would like to record and the precision. For instance if you want to record timings covering the range from 1 nanosecond (ns) to 1 hour (3,600,000,000,000 ns), with 3 significant decimal digits of precision, you would do the following:
Histogram histogram = new Histogram(3600000000000L, 3);
Uses a few low-level tricks to ensure that storing a value can be done as fast as possible. For instance putting the value in the right bucket (array location) is a constant lookup (no searching required) and on top of that it makes use of some nifty bit-shifting to ensure it happens as fast as possible.
Implements a slightly strange class-hierarchy to ensure that fields are laid out in the right location. If you look at the source you have AbstractHistogram and then the seemingly redundant class AbstractHistogramBase, why split the fields up like that? Well the comments give it away a little bit, it's due to false-sharing

False sharing

Update (2014-09-03): As pointed out by Nitsan in the comments, I got the wrong end of the stick with this entire section. It’s not about false-sharing at all, it’s the opposite, I’ll quote him to make sure I get it right this time!

The effort made in HdrHistogram towards controlling field ordering is not about False Sharing but rather towards ensuring certain fields are more likely to be loaded together as they are clumped together, thus avoiding a potential extra read miss.

So what is false sharing, to find out more I recommend reading Martin Thompson’s excellent post and this equally good one from Nitsan Wakart. But if you’re too lazy to do that, it’s summed up by the image below (from Martin’s post).

Image from the Mechanical Sympathy blog

The problem is that a CPU pulls data into its cache in lines, even if your code only wants to read a single variable/field. If 2 threads on different CPUs are writing to 2 fields (X and Y in the image) that are next to each other in memory, and so share a cache line, each write by one CPU invalidates that line in the other CPU’s cache, even though the threads never touch the same field. This invalidation costs time and in high-performance situations can slow down your program.

The opposite is also true, you can gain performance by ensuring that fields you know are accessed in succession are located together in memory. This means that once the first field is pulled into the CPU cache, subsequent accesses will be cheaper as the fields will be “Hot”. It is this scenario HdrHistogram is trying to achieve, but how do you know that fields in a .NET object are located together in memory?

Analysing the memory layout of a .NET Object

To do this you need to drop down into the debugger and use the excellent SOS or Son-of-Strike extension. This is because the .NET JITter is free to reorder fields as it sees fit, so the order you put the fields in your class does not determine the order they end up in. The JITter changes the layout to minimise the space needed for the object and to make sure that fields are aligned on boundaries appropriate for their size, it does this by packing them in the most efficient way.

To test out the difference between the Histogram with a class-hierarchy and without, the following code was written (you can find HistogramAllInOneClass in this gist):

Histogram testHistogram = new Histogram(3600000000000L, 3);
HistogramAllInOneClass combinedHistogram = new HistogramAllInOneClass();

Debugger.Launch();

GC.KeepAlive(combinedHistogram); // put a breakpoint on this line
GC.KeepAlive(testHistogram);

Then to actually test it, you need to perform the following steps:

Set the build to Release and x86
Build the test and then launch your .exe from OUTSIDE Visual Studio (VS), i.e. by double-clicking on it in Windows Explorer. You must not be debugging in VS when it starts up, otherwise the .NET JITter won't perform any optimisations.
When the "Just-In-Time Debugger" prompt pops up, select the instance of VS that is already opened (not a NEW one)
Then check "Manually choose the debugging engines." and click "Yes"
Finally make sure "Managed (...)", "Native" AND "Managed Compatibility Mode" are checked

Once the debugger has connected back to VS, you can type the following commands in the “Immediate Window”:

.load sos
!DumpStackObjects
!DumpObj <ADDRESS> (where ADDRESS is the value from the "Object" column in Step 2.)

If all that works, you will end up with an output like below:

Update (2014-09-03)

Since first writing this blog post, I came across a really clever technique for getting the offsets of fields in code, something that I initially thought was impossible. The full code to achieve this comes from the Jil JSON serialiser and was written to ensure that it accessed fields in the most efficient order.

It is based on a very clever trick, it dynamically emits IL code, making use of the Ldflda instruction. This is code you could not write in C#, but are able to write directly in IL.

The ldflda instruction pushes the address of a field located in an object onto the stack. The object must be on the stack as an object reference (type O), a managed pointer (type &), an unmanaged pointer (type native int), a transient pointer (type *), or an instance of a value type. The use of an unmanaged pointer is not permitted in verifiable code. The object's field is specified by a metadata token that must refer to a field member.

By putting this code into my project, I was able to verify that it gives exactly the same field offsets that you can see when using the SOS technique (above). So it’s a nice technique and the only option if you want to get this information without having to drop-down into a debugger.
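As a rough illustration of the Ldflda trick, the sketch below emits a DynamicMethod that reads the address of every instance field and then reports each one relative to the lowest address. It is written from scratch for this example and is not the code from Jil. FieldOffsetSketch, GetOffsets and the Sample class are hypothetical names, only fields declared on the type itself are covered (not private fields of base classes), and it assumes the GC doesn't move the object while the addresses are being read.

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Reflection.Emit;

// Example type to inspect (hypothetical)
public class Sample
{
    public long A;
    public int B;
    public int C;
}

public static class FieldOffsetSketch
{
    // Returns each instance field's offset relative to the lowest-addressed field
    public static (string Name, long Offset)[] GetOffsets<T>(T instance) where T : class
    {
        var fields = typeof(T).GetFields(
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);

        var method = new DynamicMethod(
            "ReadFieldAddresses", typeof(long[]), new[] { typeof(T) },
            typeof(T), skipVisibility: true);
        var il = method.GetILGenerator();
        var addresses = il.DeclareLocal(typeof(long[]));

        // addresses = new long[fields.Length]
        il.Emit(OpCodes.Ldc_I4, fields.Length);
        il.Emit(OpCodes.Newarr, typeof(long));
        il.Emit(OpCodes.Stloc, addresses);

        for (int i = 0; i < fields.Length; i++)
        {
            il.Emit(OpCodes.Ldloc, addresses);
            il.Emit(OpCodes.Ldc_I4, i);
            il.Emit(OpCodes.Ldarg_0);
            il.Emit(OpCodes.Ldflda, fields[i]); // push the address of the field
            il.Emit(OpCodes.Conv_U8);           // treat that address as a plain number
            il.Emit(OpCodes.Stelem_I8);         // addresses[i] = address
        }
        il.Emit(OpCodes.Ldloc, addresses);
        il.Emit(OpCodes.Ret);

        var read = (Func<T, long[]>)method.CreateDelegate(typeof(Func<T, long[]>));
        long[] raw = read(instance);
        long lowest = raw.Min();

        return fields
            .Select((f, i) => (Name: f.Name, Offset: raw[i] - lowest))
            .OrderBy(x => x.Offset)
            .ToArray();
    }
}
```

Calling FieldOffsetSketch.GetOffsets(new Sample()) gives one entry per field, ordered by where the runtime actually placed it, which can then be compared with the "Offset" column from !DumpObj.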

Results

After all these steps we end up with the results shown in the images below, where the rows are ordered by the “Offset” value.

AbstractHistogramBase.cs -> AbstractHistogram.cs -> Histogram.cs

You can see that with the class hierarchy in place, the fields remain grouped as we want them to (shown by the orange/green/blue highlighting). What is interesting is that the JITter has still rearranged fields within a single group, preferring to put Int64 (long) fields before Int32 (int) fields in this case. This is seen by comparing the ordering of the “Field” column with the “Offset” one, where the values in the “Field” column represent the original ordering of the fields as they appear in the source code.

However when we put all the fields in a single class, we lose the grouping:

Equivalent fields all in one class

Alternative Technique

To achieve the same effect you can use the StructLayout attribute, but this requires that you calculate all the offsets yourself, which can be cumbersome:

[StructLayout(LayoutKind.Explicit, Size = 28, CharSet = CharSet.Ansi)]
public class HistogramAllInOneClass
{
  // "Cold" accessed fields. Not used in the recording code path:
  [FieldOffset(0)]
  internal long identity;

  [FieldOffset(8)]
  internal long highestTrackableValue;
  [FieldOffset(16)]
  internal long lowestTrackableValue;
  [FieldOffset(24)]
  internal int numberOfSignificantValueDigits;

  ...
}
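If you do go down the explicit layout route, it is worth checking that the offsets you calculated are the ones you actually get. A minimal sketch of one way to do that is below, using Marshal.OffsetOf. Note that this reports the marshalled (unmanaged) layout, which I'm assuming matches the managed layout for a simple type like this one that only contains long and int fields. ExplicitLayoutSketch and LayoutCheck are made-up names for the example.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
public class ExplicitLayoutSketch
{
    [FieldOffset(0)]
    internal long identity;
    [FieldOffset(8)]
    internal long highestTrackableValue;
    [FieldOffset(16)]
    internal long lowestTrackableValue;
    [FieldOffset(24)]
    internal int numberOfSignificantValueDigits;
}

public static class LayoutCheck
{
    // Looks up the offset of a field by name; throws if the field doesn't exist
    public static int OffsetOf(string fieldName) =>
        (int)Marshal.OffsetOf(typeof(ExplicitLayoutSketch), fieldName);
}
```

For example LayoutCheck.OffsetOf("lowestTrackableValue") returns 16, matching the FieldOffset attribute on that field.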

If you are interested, the full results of this test are available

The post Know thy .NET object memory layout (Updated 2014-09-03) first appeared on my blog Performance is a Feature!

Measuring the impact of the .NET Garbage Collector - An Update

2014-06-23T00:00:00+00:00

tl;dr

Measuring performance accurately is hard. But it is a whole lot easier if someone with experience takes the time to explain your mistakes to you!!

This is an update to my previous post, if you haven’t read that, you might want to go back and read it first.

After I published that post, Gil Tene (@GilTene) the author of jHiccup, was kind enough to send me an email pointing out a few things I got wrong! It’s great that he took the time to do this and so (with his permission), I’m going to talk through his comments.

Firstly he pointed out that the premise for my investigation wasn’t in line with what jHiccup reports. So instead of answering the question:

what % of pauses do what?

jHiccup answers a different question:

what % of my operations will see what minimum possible latency levels?

He also explained that I wasn’t measuring only GC pauses. This was something which I alluded to in my post, but didn’t explicitly point out.

...I suspect that your current data is somewhat contaminated by hiccups that are not GC pauses (normal blips of 2+ msec due to scheduling, etc.). Raising the 2 msec recording threshold (e.g. to 5 or 10msec) may help with that, but then you may miss some actual GC pauses in your report. There isn't really a good way around this, since "very short" GC pauses and "other system noise" overlap in magnitude.

So in summary, it is better to describe my tests as measuring any pauses in a program, not just GC pauses. Again quoting from Gil:

Over time (and based on experience), I think you may find that just using the jHiccup approach of "whatever is stopping my apps from running" will become natural, and that you'll stop analyzing the pure "what percent of GC pauses do what" question (if you think about it, the answer to that question is meaningless to applications).

This is so true, it really doesn’t matter what is slowing your app down or causing the user to experience unacceptable pauses. What matters is finding out if and how often this is happening and then doing something about it.

Tweaks made

He also suggested some tweaks to make to the code (emphasis mine):

Record everything (good and bad): Your current code only records pauses (measurements above 2msec). To report from a "% of operations" viewpoint, you need to record everything, unconditionally. As you probably see in jHiccup, what I record as hiccups is the measured time minus the expected sleep time. Recording everything will have the obvious effect of shifting the percentile levels to the right.

Correct for coordinated omission. My "well trained" eye sees clear evidence of coordinated omission in your current charts (which is fine for "what % of pauses" question, but not for a "what % of operations" question): any vertical jumps in latency on a percentile chart are a strong indication of coordinated omission. While it is possible to have such jumps be "valid" and happening without coordinated omission in cases where the concurrently measured transactions are "either fast or slow, without blocking anything else" (e.g. a web page takes either 5msec or 250msec, and never any other number in between), these are very rare in the wild, and never happen in a jHiccup-like measurement. Then, whenever you see a 200 msec measurement, it also means that you "should have seen" measurements with the values 198, 196, 194, ... 4, but never got a chance to.

Based on these 2 suggestions, the code to record the timings becomes the following:

var timer = new Stopwatch();
var sleepTimeInMsecs = 1;
while (true)
{
  timer.Restart();
  Thread.Sleep(sleepTimeInMsecs);
  timer.Stop();

  // Record the pause (using the old method, for comparison)
  if (timer.ElapsedMilliseconds > 2)
    _oldhistogram.recordValue(timer.ElapsedMilliseconds);  

  // more accurate method, correct for coordinated omission
  _histogram.recordValueWithExpectedInterval(
        timer.ElapsedMilliseconds - sleepTimeInMsecs, 1);
}

To see what difference this made to the graphs I re-ran the test, this time just in Server GC mode. You can see the changes on the graph below, the dotted lines are the original (inaccurate) mode and the solid lines show the results after they have been corrected for coordinated omission.

Correcting for Coordinated Omission

This is an interesting subject and after becoming aware of it, I’ve spent some time reading up on it and trying to understand it more deeply. One way to comprehend it, is to take a look at the code in HdrHistogram that handles it:

recordCountAtValue(count, value);
if (expectedIntervalBetweenValueSamples <= 0)
    return;

for (long missingValue = value - expectedIntervalBetweenValueSamples;
     missingValue >= expectedIntervalBetweenValueSamples;
     missingValue -= expectedIntervalBetweenValueSamples) 
{
    recordCountAtValue(count, missingValue);
}

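As you can see, it fills in the missing values: starting one expected interval below the value you are actually storing, it steps down by the expected interval and records each value along the way, stopping once the next value would be smaller than the expected interval.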
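To make the back-filling concrete, here is a minimal C# sketch of the same loop. It is not the HdrHistogram API: `CoordinatedOmissionSketch`, `ValuesRecorded` and `Percentile` are hypothetical names made up for this illustration, the values are kept in a plain list rather than a histogram, and the percentile uses a simple nearest-rank rule, which is an assumption rather than how HdrHistogram calculates it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoordinatedOmissionSketch
{
    // Mirrors the HdrHistogram loop quoted above, but returns the values
    // it would record instead of writing them into a histogram.
    public static List<long> ValuesRecorded(long value, long expectedInterval)
    {
        var recorded = new List<long> { value };
        if (expectedInterval <= 0)
            return recorded;

        for (long missing = value - expectedInterval;
             missing >= expectedInterval;
             missing -= expectedInterval)
        {
            recorded.Add(missing);
        }
        return recorded;
    }

    // Nearest-rank percentile over a plain list (an assumption made for
    // this sketch, HdrHistogram works from bucketed counts instead).
    public static long Percentile(IEnumerable<long> samples, double percentile)
    {
        var sorted = samples.OrderBy(x => x).ToList();
        int rank = (int)Math.Ceiling(percentile * sorted.Count / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }

    public static void Main()
    {
        // A single 10 ms reading with a 2 ms expected interval.
        Console.WriteLine(string.Join(", ", ValuesRecorded(10, 2)));

        // 99 readings of 1 ms plus one 100 ms stall, expected interval 1 ms.
        var raw = Enumerable.Repeat(1L, 99).ToList();
        raw.Add(100L);
        var corrected = raw.SelectMany(v => ValuesRecorded(v, 1)).ToList();

        Console.WriteLine($"raw:       count={raw.Count}, p90={Percentile(raw, 90)}");
        Console.WriteLine($"corrected: count={corrected.Count}, p90={Percentile(corrected, 90)}");
    }
}
```

For the single 10 ms reading, the sketch records 10, 8, 6, 4 and 2. In the second case the raw data has 100 samples with a 90th percentile of 1 ms, whereas the corrected data has 199 samples with a 90th percentile of 81 ms. That is because the one 100 ms stall stands in for all the 1 ms readings that never got a chance to happen while the loop was blocked.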

It is comforting to know that I’m not alone in making this mistake: the authors of Disruptor and log4j2 both made the same one when measuring percentiles in their high-performance code.

Finally, if you want some more information on Coordinated Omission and the issue it is trying to prevent, take a look at this post from the Java Advent calendar (you need to scroll down past the calendar to see the actual post). The main point is that without correcting for it you will get inaccurate percentile values, which kind of defeats the point of making accurate performance measurements in the first place!

The post Measuring the impact of the .NET Garbage Collector - An Update first appeared on my blog Performance is a Feature!

Performance is a Feature!

Analysing .NET start-up time with Flamegraphs

Code Sample

Data Collection

Data Processing

Analysis of .NET Runtime Startup

Under the hood of "Default Interface Methods"

Background

Development Timeline and PRs

Initial work, Prototype and Timeline

Interesting PR’s done after the prototype (newest -> oldest)

Bug fixes done since the Prototype (newest -> oldest)

Possible future work

Default Interface Methods ‘in action’

Enabling Methods on an Interface

Resolving the Method Dispatch

Analysis of FindDefaultInterfaceImplementation(..)

Diamond Inheritance Problem

Summary

Research based on the .NET Runtime

.NET Runtime as a Case-Study

Microsoft Research

Mono Runtime

Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’

"Stubs" in the .NET Runtime

What are stubs?

Why are stubs needed?

CLR ‘Application Binary Interface’ (ABI)

Stub Management

Types of stubs

Precode

‘Just-in-time’ (JIT) and ‘Tiered’ Compilation

Stubs-as-IL

P/Invoke, Reverse P/Invoke and ‘calli’

Marshalling

Generics

Delegates

Singlecast Delegates

Shuffle Thunks

Unboxing

Arrays

Tail Calls

Virtual Stub Dispatch (VSD)

Other Types of Stubs

Stubs in the Mono Runtime

Conclusion

ASCII Art in .NET Code

Table of Contents

Dave Cutler

Syntax Trees

Timelines

Logic Tables

Class Hierarchies

Component Diagrams

Algorithms

Bit Packing

Data Structures

State Machines

RFC’s and Specs

Dates & Times

Stack Layouts

The Rest

Is C# a low-level language?

Line-by-line port

Performance

Results after a ‘naive’ line-by-line port

Results after simple performance improvements

Remove in-line array initialisation

Using MathF functions instead of Math

Further performance improvements

Is C# a low-level language?

Further Reading

"Stack Walking" in the .NET Runtime

Where does the CLR use ‘Stack Walking’?

Common Scenarios

Debugging/Diagnostics

Obscure Scenarios

Stack Crawl Marks

Exception Handling

The ‘Stack Walking’ API

Analysis of `FindDefaultInterfaceImplementation(..)`