Friday, September 30, 2011

GHC Projects for All

I just recently got back from the ICFP and related conferences for 2011. It was a great time, both on the social and work fronts. At the end of the Haskell Symposium there was a future of Haskell discussion session. One of the points I tried to make during the discussion is that we should try to solve the problem of getting more people involved in contributing to the various Haskell infrastructure projects (Cabal, Hackage, Haskell Platform...) by creating a central location where we keep a list of things people could help out with. Currently a lot of this information only exists in the heads of people. I realized that I also have a list of potential tasks in my head for work on GHC's LLVM backend. This post is about putting some information out there into the community about the current state of the LLVM backend and work that can be done to improve it. This is partially just to be informative but also a call for volunteers! There is a lot of fun work that can be done on the LLVM backend for GHC or at the backend level of GHC in general. I sadly don't think I'll have time to do any of it. I keep thinking I will but given that I've maybe managed to find a couple of hours a month in the last year to work on the LLVM backend, history doesn't support me. All the tiny amount of time I find to work on the LLVM backend is instead taken up by support work. Fixing the odd bug and making sure the latest version of LLVM is supported.

To begin, the LLVM backend has been a mixed success. It generates as good code as the native code generator (NCG) and is about a quarter of the size, with the code being far simpler. We get improvements to it for free all the time with new releases of LLVM (although so far none of the releases have really changed the performance numbers much for GHC). A lot of core members of GHC believe it is the way forward and there is no desire to put any significant work into the native code generator.

The main issue is that LLVM doesn't seem to be achieving its full potential. Probably for 85% of the code or more that you might throw at the LLVM backend, it only generates as good as code as the NCG. This is despite the fact that the NCG does nearly no optimization work while LLVM is running around 30 passes or more of optimizations. The extra optimization passes slows down the compilation by around 50%. Occasionally we see LLVM 'kick in', and when this happens its not unusual to see the code produced by LLVM outperform the NCG by 2 - 4 times.

The question is, why doesn't LLVM kick in more often? We don't have a definitive answer for that yet, which is part of the problem. There are some theories and I suspect all three are correct, just at different times and to varying amounts.

1. Bad alias analysis
The initial thinking for the bad performance was that LLVM wasn't doing a good job of figuring out the aliasing properties of the code generated by GHC. About a year ago Roman Leshchinskiy and I spent a few days trying to use the mechanisms that LLVM provides to tell it about what can and can't alias what in the generated code. This work was prompted by some bad code being generated by LLVM for Data Parallel Haskell (DPH) benchmarking code. From memory this work gave about a 1 - 2 % speed up on the nofib Haskell benchmark suite. It didn't improve the DPH benchmark at all though. As a test I wrote a custom alias analysis pass for LLVM that simply said nothing ever aliased anything else. The code generated for the DPH benchmark was basically the same still. This led me to tentatively conclude at the time that the issue was caused by two things. One is that there was some bad interplay between optimal instruction selection and register allocation. The other (which is linked to the former) is a problem with large functions in LLVM that have high register pressure as LLVM didn't do proper live range splitting. (e.g see this llvm bug)

There was still however definitely a problem with the alias analysis that LLVM was doing. The conclusion was more that we had found potentially another issue, not replaced one with another.

2. Live range splitting
The other problem that I came to think we might be having with LLVM was the lack of proper live range splitting, as discussed above. My low level compiler foo is pretty weak, so I can't support this argument particularly well other than the linked LLVM bug and testing I did with the custom 'no-alias' alias analysis pass.

3. Haskell itself
When discussions in regards to the backend performance of GHC and low level optimisation work on Haskell crop up I've often heard people talk about the nature of the language itself. The claim is that the kind of code you get from a lazy functional language is very often of a simple form. That is:
function f() {
garbage collection check
move/copy some memory around
jump g();
}

Lots of functions that are simply straight line code with little register pressure and lots of memory traffic. Having looked at a fair amount of assembly code generated by GHC, there is definitely a decent amount of code generated that fits this model. If this is inherent to the Haskell language (e.g Ruby is known to be slow partially as the language itself is slow due to all the dynamics) I don't know. Given the nature of laziness and thunks I would imagine so but strictness analysis, inlining, unboxing... etc can all fix this. This might mean that for some styles of Haskell programming there may be little improvements that can be made to the code at the final stage that LLVM operates at.

There are at least three reasons that the LLVM backend doesn't achieve the performance we hoped for. The question is, what kind of work can be done to improve the situation? Well the good news is we aren't short of ideas here, just time and volunteers.

1. Custom alias analysis pass for LLVM
Simple. Write a custom alias analysis pass for LLVM that knows specifically about GHC. This has been near the top of my todo list for a long time. The work should be pretty straight forward and potentially very beneficial.

Update: I have been preparing this post over a day or two, and this morning noticed an email from Max Bolingbroke to the LLVM mailing list. The extremely productive Max seems to have gone ahead and done this work! Can't wait to see the results.

2. Greedy register allocator in LLVM 3.0
This one is easy. LLVM 3.0 (to be released around November) will include a completely new register allocator called the 'greedy' register allocator. You can find a nice blog post on it here. It does proper live range splitting and I'm hoping it will bring some nice and free improvements to GHC.

3. Track live STG register information
The final IR that GHC uses, Cmm (a simple, procedural, C like language) has a few components to the language specifically designed to support the STG machine layer above it. You can find more information on the STG machine here. One of the language features in Cmm for supporting STG are the so called STG registers. These are basically a set of special global variables in the Cmm language. They are used by the STG machine to model things like the stack, heap, current usage of both of these for managing GC... etc. They reflect many of the real hardware registers you find on a machine up into the language. We use this to explicitly manage our own stack in GHC, instead of using the C stack. These STG registers are expected by GHC to be mapped / pinned to real hardware registers. On X86-32, the STG Stack Pointer register lives in the X86 hardware register '%ebp'. Given how frequently they are accessed this pinning of them to hardware registers instead of storing them on the heap is a big performance gain.

Unfortunately we don't do this in GHC in a particularly smart manner. The NCG for example permanently stores the STG registers in their corresponding hardware registers. They are exclusively reserved for the task and not used for anything else. The LLVM backend does a slightly better job and instead passes them around as function parameters. A custom calling convention is then used so that the parameters are passed in the correct hardware registers. The LLVM code looks a little like this:
function void f (Base, Sp, Hp, R1) {
...
g (Base, Sp', Hp', R1');
}

This technique allows LLVM to use the hardware registers that usually used for storing the STG registers, for something else in the middle of a function. A better register allocation can be made. However there is a problem here. On X86-32, only 4 STG registers are pinned to hardware registers and they are STG registers that are always used. On X86-64 16 STG registers are mapped to hardware registers. Many of these STG registers are ones that aren't frequently used. Yet the LLVM backend is always passing them around, threading through function calls. A lot of the time these STG registers are dead, storing nothing valuable but LLVM doesn't know this and may waste valuable time spilling and reloading them. It will also use less than optimal register allocations for functions since it needs to do this spill/reload to use most of the registers available to it.

A task that needs to be done at some point in GHC is to add the correct information to the Cmm representation to carry around the STG register liveness information. Then the LLVM backend can generate code like so:
function void f (Base, Sp, Hp, R1, R2, R3, R4, R5, R6, SpLim, F1, F2, F3, F4, D1, D2) {
...
g (Base', Sp', Hp', R1', undef, undef, undef, undef, undef, undef, SpLim', undef, undef, undef, undef, undef, undef)
}

And stop a lot of pointless spill/reload code being generated and allow LLVM to produce far better register allocations in many situations.

4. Fix up the LLVM representation used by the backend
The LLVM bindings I wrote aren't the best. They're mostly fine and easy to work with but could do with some love. I learned Haskell while writing the LLVM backend for GHC, so while the code I wrote is clean, its not ideal and explicitly avoids some of the more 'advance' language features I either didn't know or wasn't comfortable with at the time. Mostly this just corresponds to a more 'wordy' LLVM backend then is really needed. The main priority to fix up here is that the LLVM bindings I wrote uses Strings / Show for printing. GHC internally uses a Doc type for printing. There are some ugly transformations going on at times where you values parsed around like:
Doc -> SDoc -> String -> Doc

Or something similar. So a lot of pointless casting between types for no purpose. The LLVM bindings / backend should be changed to use GHC's internal types, such as FastString, LitString and Doc / SDoc. This will result in cleaner code but should also give a compile time speed improvement.

5. Use LLVM LTO
This one is a far less focused then the other tasks and more something to try for fun / experimentation. LLVM includes a whole program optimization framework. Its Link Time Optimization stuff. I'm not sure how much of a benefit this would be to GHC as it already does a whole lot of cross module optimization work. It may or may not be that much work though and would be worth trying out. The larget part of the challenge is getting GHC working with the Gold linker. This linker needs to be used for LTO to be used. I know Roman tried to use Gold a year or two back and it didn't support some features GHC required. Perhaps it does now. Gold is also meant to be much faster than the traditional binutils ld linker.

6. Figure out why LLVM doesn't work as well as hoped
I've offered some ideas on why LLVM doesn't work as well as we had hoped. They are mostly based on a small amount of data and a lot of intuition. While I would be very surprised to find the ideas are wrong I'd be more surprised to find that they're the full story. A lot of work could be done to really understand the mismatch between GHC and LLVM. This is a hard task and also probably very boring for most people. It requires 3 skills that are all difficult skills in themselves and very difficult to find in one person:

1. Haskell Guru: Need to really understand the Haskell language. I.e Don Stewart, Johan Tibell kind of level.
2. GHC Guru: Need to have a thorough understanding of how Haskell is compiled by GHC: STG machine, Cmm, the various optimisations done... ect
3. Backed Compiler knowledge: Need to have a good understanding of how native code generators like LLVM, GCC and all work: register allocation, instruction selection, GVN, CSE, strength reduction... ect. I wouldn't say you need to be a Guru, but you need a solid understanding of the landscape here.

I myself am decent in all three areas but wouldn't call myself a guru in any of them. I've found it pretty difficult to really understand the mismatch. The number of people meeting this criteria is very low.

6. Implement support for Tables-Next-To-Code in LLVM
GHC does this trick called Tables-Next-To-Code (TNTC). The trick is to place some top level data (so think a gobal variable containing a structure) immediately before a function. The global variable contains metadata about the function, such as information used by the garbage collector. In assembly code it looks something like this:
.text
.long B_a_srt-(B_a_info)+0
.long 0
.quad 0
.quad 4294967318
.globl B_a_info
.type B_a_info, @object
B_a_info:
.LcnF:
leaq -16(%rbp), %rax
...

Notice the code and data are in the same section and adjacent so we are guaranteed that they will be next to each other when loaded into memory. GHC does this to allow looking up the code and the metadata for the code with one pointer. This saves some heap space (one less pointer needed) and also gives some speed benefits. Partially from now only needing one indirect lookup instead of two but I believe its mostly from the locality gained in having both adjacent.

We can't implement this scheme in LLVM. We resort to post processing the assembly it produces. This works pretty well these days. The Haskell code for it is only about 80 lines, is the same code for x86-32, x86-64, ARM, Windows, Mac, Linux, and for LLVM versions 2.8 - 3.0. It's also pretty quick these days. From memory it processes a 500,000 line assembly file in 0.3 seconds on my machine.

It would be nice to do this just using LLVM. Two approaches here. One is to add some support to LLVM to allow ordering of functions / globals to be controlled. This would be the simplest solution but arguably a little of a hack. The correct way would be to add explicit support to LLVM for associating a global variable with a function such that LLVM produced code like above. The various LLVM optimisation could then know about the relation and understand that a negative pointer offset to the function is referring to the global variable. This would be a fair amount of work though and I'm not sure how keen the LLVM people would be to accept patches for it. It does seems like a useful feature for other compilers using LLVM, not just a specific trick for GHC. If someone is serious about undertaking this work then they would need to hold a serious discussion with the LLVM folks. They are a nice bunch and have been very receptive to us in the past, but as with any piece of work it would be up to us on the GHC side to manage and drive the work forward. A one of email won't cut it.

Closing

That about covers it. There are a few other smaller points of work that can be done on the LLVM backend. For a complete list, see here where I've had a list for a long time on this stuff. If anyone is inspired by this post (and I hope you are!) and wants to work on this stuff since you'll be waiting a long time for me to do it, then please contact me and I'm very happy to give you any support you need in tackling a project.

13 comments:

posts said...

Hi David,

Is it possible that the code generated by GHC is largely memory bandwidth bound, and as such largely impervious to optimisations that do not affect this bottleneck?

I'm not an expect in any of the 3 domains you list, but I do write a fair amount of high performance code. Some times you can move the code around as much as you like, but if you are bandwidth bound it's unlikely to make much of a difference.

I don't know if this is the case though, or even if it is, whether there are LLVM-level optimisations that could make a significant impact.

Does this sound plausible to you?

Cheers,
Sam

David Terei said...

Hi Sam,

Yes at times that is the case. This was basically what I was saying in point 3, about the nature of the Haskell language. Perhaps I didn't express that point very well. So I think there is a lot of code for which it may be hard to optimise the code due to the memory bandwidth but there is still a lot of code for which LLVM should be doing a better job than it is.

Mikhail Glushenkov said...

Thanks for posting this! On the topic of fixing the LLVM representation, what about ditching the strings altogether and generating LLVM bitcode directly instead? This will give even bigger compile speed improvement, but we'll have to depend on LLVM for building GHC.

abi said...

Thanks for the great post, I don't fully understand all the different aspects since I don't know enough about GHC but I had a question about the custom alias analyser that you suggested. If you found that replacing the alias analyser with something that basically told LLVM that nothing was aliased ever and found that the code was pretty much the same, how do you think the alias analysis was off?

David Terei said...

Mikhail: Yes this has been suggested at various times. It may be worth pursuing but I'm not convinced. I looked at the LLVM bindings at one stage and I found that they looked much more work to deal with then the simple representation I use. I've tried to measure as well in the past the amount of time taken up by printing out the string representation and having LLVM parse it. I think it's pretty minor. So I don't think it would be worth the effort.

abi: I'm not sure I understand the question. Is the question 'what prompted you to look into alias analysis in the first place?'. If that is the question, it was basically that LLVM was reloading some data from the stack / heap that it didn't need to. We though it was reloading it as it thought that some previous operation could have altered the data (assuming two pointers aliased). However it seems instead a bad instruction selection / register allocation trade off caused it to reload from memory instead to free up some registers for other code temporarily. I think this was the belief at the time anyway, I'm a little rusty :).

abi said...

Yep, that was pretty close to what I meant. I suppose the current state then is that alias analysis is still kinda inaccurate in the LLVM implementation but you're not really sure of the exact cause?

Marco Devillers said...

Is it possible that the code generated by GHC is largely memory bandwidth bound, and as such largely impervious to optimisations that do not affect this bottleneck?

I think this is the heart of the problem for most FP compilers. Somehow, one would like to optimize global transformations on the graph away to long but local bursts of activity. I have no idea how to achieve this, except for that targeting most VM will probably not work since one cannot convey the necessary information to the VM.

Anyone aware of any work on this issue?

Rohan said...

I created a patch for an older version of Gold so that it would work with GHC. The associated bug seems to be in limbo

http://sourceware.org/bugzilla/show_bug.cgi?id=11182

David Terei said...

Rohan: Cool. I'm not sure if Gold is the way to go anymore though. LLVM is starting to work on their own linker so that they won't be using GNU bin tools at all anymore soon.

George Colpitts said...

Hi David

Now that llvm 3.0 is out are there any updates wrt the above issues?

Regards
George

George Colpitts said...

Hi David

Now that llvm 3.0 is out are there any updates wrt the above issues?

Regards
George

David Terei said...

Yes. I did the work to implement tasks 1 & 3. Both work well and show promise but are somewhat experimental right now and need another push on them to be ready for prime time.

The greedy register allocator is a slight improvement in some situations but not as big a win as I was hoping. The earlier version of the greedy register allocator in 2.9 actually seems to produce better code in the small sample of situations I've looked at.

George Colpitts said...

Did your work on 3 ever make it into GHC?