Intro

I'm going to talk about a front end for a GPU assembly language: it takes the assembly, converts it to LLVM, and then you can do some interesting things with it, and I'll tell you a little about one of them. This work was done by me and two interns, one of them, Sean, from Sydney, and the other from Georgia Tech, and it was a lot of fun for us. One remark: this talk was originally prepared for a non-PTX and non-LLVM audience, so you may see some introductory material about LLVM that you probably already know.

PTX is basically an assembly language. It abstracts the computations of a GPU in a virtual instruction set; it's not exactly bytecode, it's more like bitcode. Our CUDA compilers generate it, OptiX, a new ray-tracing API from NVIDIA, generates PTX, and NVIDIA's OpenCL implementation also produces PTX as its output. The PTX is then typically JITted for the GPU by the driver, or it can be statically compiled and loaded into the driver. And LLVM, as you know, is, for the purposes of this talk, a pretty mature optimization and code-generation infrastructure, and we wanted to leverage that.

Goals and Motivations

Now, the goals fall into three categories. First, we wanted to build a bridge from PTX to LLVM. That gives you machine-independent LLVM IR, and it brings some interesting benefits: for example, our OpenCL implementation is also based on LLVM, and we have an LLVM-to-PTX back end that is getting to be production quality, so as a side effect of building this bridge we can have an offline PTX-to-PTX optimizer. We can also build analysis tools on top of PTX; I'm not going to talk about that today, but we have done some interesting work there, and maybe in the future we can tell you what it is. The thing I will talk about later is PTX to multi-core CPU: we convert PTX to LLVM, which has a variety of back ends, x86 among them, so we can compile PTX to x86 and run it on multi-core CPUs. The only thing we have to do, since the semantic models of the two languages, x86 and PTX, are different, is a threading-model transformation. At least the basics of that are not complicated; efficiency is the interesting problem.

That's the structure of PLANG: we take in a PTX file, a textual file like any typical assembly language. A component called ptxparse, part of our production source base, contains the parser, its own internal representation, and the standard library it uses underneath. On top of that we wrote the PTX-to-LLVM translator, which uses all the LLVM libraries needed for generating IR, and out comes a .bc or .ll file. The whole thing is currently designed to just drop into an LLVM 2.5 distribution: configure, rebuild, and it works, except for the intrinsics we have to add on top.

PTX Instruction Set

A little bit about the PTX instruction set. In PTX, CUDA, and OpenCL there is the concept of a kernel. One way to think about a kernel is that it is essentially the description of a single thread that the GPU runs. In addition, there is a concept of memory spaces on a GPU: there is shared memory, shared by the threads, there is global, there is private, and PTX expresses those concepts for data. The idea is that multiple threads start synchronously and keep executing the same instruction; they may diverge later, or not, and they may synchronize using syncthreads, that is, barrier semantics. That's the high-level idea.
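To make that model concrete, here is a tiny CUDA kernel of my own; it is an illustration, not an example from the talk. Each thread uses the built-in special registers to pick its data, the __shared__ buffer lives in the block's shared memory space, and __syncthreads() is the barrier that shows up as bar.sync at the PTX level.

    // Hypothetical illustration (not from the talk); assumes blocks of at
    // most 256 threads. The kernel text describes ONE thread's work.
    __global__ void smooth(const float *in, float *out, int n) {
        __shared__ float buf[256];                      // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in special registers
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                // bar.sync in PTX
        // Only after the barrier may a thread safely read its neighbor's slot.
        if (i < n)
            out[i] = 0.5f * (buf[threadIdx.x] +
                             buf[(threadIdx.x + 1) % blockDim.x]);
    }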
LLVM Internal Representation

As I said, this was originally prepared for a non-compiler audience, so just to put things in perspective: LLVM has strict type rules; it is SSA, and that's an interesting one; and, the last and quite useful point, 2.5 has a concept of address spaces, which is what we use to express the notion of memory spaces in PTX, and that worked out very nicely. The scalar and vector types come in handy in a few cases, which I'll describe later. So there is a pretty good match at the abstraction level. Not perfect, and there were cases where we had some problems, but nothing major.

The first issue is that in LLVM IR there is no concept of a scalar register coming in; basically, there is no move instruction. That's what it boils down to. In PTX you can declare a register and move one into another; you can't do that in LLVM, at least not coming out of the front end. Here's an example. A PTX line like ".reg .u32 %r<10>;" declares ten unsigned 32-bit registers numbered %r0 through %r9. When we convert to LLVM IR, at the beginning of the kernel's function we simply generate an alloca for each register. It works fine, it just looks kind of weird coming out, and we run mem2reg immediately to flatten it. Here's an actual PTX instruction: add registers %r5 and %r3 and put the result into %r7. The LLVM IR we generate dereferences the operands through temporaries, that is, loads, and then does a store into the memory object for %r7. Since these objects are separately allocated, there is really no aliasing problem, and they can be scalarized pretty easily. That's pretty much what comes out of the PLANG front end.

LLVM Intrinsic Functions

Now, in PTX, and on our GPUs, there are special registers, for example the block id, the block dimensions, and the thread id. These mean something special to the GPU; they are part of the programming model. A kernel can use them to index into data, so different threads can do different things on different data using the same code sequence. We had to handle all of those, so the built-in variable accessors become intrinsics; they are like function calls. All the transcendental functions that are part of the PTX specification naturally fall into math functions, and they become intrinsics too. For synchronization, bar.sync is a barrier operation among threads. All of these things, texture sampling and atomic operations included, become intrinsics: we have special definitions for them in LLVM, and the PTX parser essentially maps them to calls.

Here's an example with two PTX instructions. The first reads %ctaid; CTA stands for cooperative thread array, and this is a special register a kernel can use to find out which thread array it belongs to. It looks like a register move at the PTX level: the move says it is a 16-bit unsigned move, and %rh1 is a declared register of the right size. The second is a texture reference, which loads four registers with the contents from the texture unit. We just turn both into intrinsic calls; very straightforward.
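As a rough sketch of what that translation might look like in code, here is my own reconstruction using LLVM's modern C++ IRBuilder API; it is not PLANG's actual source, which targeted LLVM 2.5 and its older API, and ptx_read_ctaid_x is a hypothetical stand-in for the real intrinsic.

    // Sketch only: emits the IR described above for "add.u32 %r7, %r5, %r3"
    // and a ctaid read, using the modern IRBuilder API.
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Verifier.h"
    #include "llvm/Support/raw_ostream.h"
    #include <string>
    using namespace llvm;

    int main() {
        LLVMContext Ctx;
        Module M("plang_sketch", Ctx);
        IRBuilder<> B(Ctx);

        // A PTX kernel becomes an ordinary LLVM function.
        Function *F = Function::Create(FunctionType::get(B.getVoidTy(), false),
                                       Function::ExternalLinkage, "kernel", &M);
        B.SetInsertPoint(BasicBlock::Create(Ctx, "entry", F));

        // ".reg .u32 %r<10>;" -> one alloca per register; mem2reg later
        // promotes these memory objects back into SSA values.
        AllocaInst *R[10];
        for (int i = 0; i < 10; ++i)
            R[i] = B.CreateAlloca(B.getInt32Ty(), nullptr, "r" + std::to_string(i));

        // "add.u32 %r7, %r5, %r3" -> load the operands through temporaries,
        // add, then store into the memory object standing in for %r7.
        Value *A5 = B.CreateLoad(B.getInt32Ty(), R[5]);
        Value *A3 = B.CreateLoad(B.getInt32Ty(), R[3]);
        B.CreateStore(B.CreateAdd(A5, A3), R[7]);

        // The ctaid move -> a call to an intrinsic; here a plain external
        // declaration (hypothetical name) stands in for the real one.
        FunctionCallee Ctaid =
            M.getOrInsertFunction("ptx_read_ctaid_x", B.getInt16Ty());
        AllocaInst *RH1 = B.CreateAlloca(B.getInt16Ty(), nullptr, "rh1");
        B.CreateStore(B.CreateCall(Ctaid), RH1);

        B.CreateRetVoid();
        M.print(outs(), nullptr);
        return verifyModule(M, &errs());
    }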
PTX Types in LLVM

There is a slight difference in the notion of types between the two languages. In PTX, values are signed or unsigned, so you can say u32, s32, and so on; in LLVM it is just i32. Not a big deal, the IR sequence just looks lengthy because you have to do zero extensions and sign extensions, and the optimizer cleans them up when the sizes match. When we generate the IR we wanted to keep the parser and code generator simple, so we don't do anything fancy there. It's a slight disconnect, but everything is expressible.

Kernel Translation

This is the overview of how kernel translation works. Built-ins become intrinsics; the differing type notions are handled with truncation and extension; registers are represented by memory objects, local objects on the stack; and then we just go through the PTX instructions, find the corresponding translation, and generate labels and branches. A lot of detailed engineering work, but nothing complicated. That's the summary of PLANG. With two interns and me part-time, in a month or so we were able to get pretty good quality, and that speaks a lot for the ease of use of LLVM, especially for people who are already familiar with compilers: coming up to speed took us roughly a week of reading documentation and trying things out, and the rest was just a large number of test cases, roughly 700 of them from CUDA programs, to make sure everything worked without errors.

That was the first part of the talk. The rest is about one application.

Overview

The basic idea: currently, anything that generates PTX runs only on NVIDIA GPUs. Our customers like that, but some other people don't, and people have been saying wouldn't it be nice if you could run CUDA on an x86 CPU and use SSE instructions. When we thought of doing PLANG, this was the first thing we wanted to do; LLVM has a good code generator for x86, well maintained and of pretty good quality. So PLANG was partly an excuse to do this, but it has other applications too. It's all about leveraging existing effort and letting others leverage what we do.

There are three parts to this, which I'll go over quickly. The execution models of CPUs and GPUs are different, so we have to take the PTX kernel and apply a cooperative-thread transformation. NVCC, our CUDA compiler, currently generates PTX and then invokes the GPU compiler underneath; we modified it, under a special option, to go through PLANG and the x86 code generator instead, producing an x86 binary that you can test for correctness and so on. And we have a special runtime library that implements the runtime model on a CPU. In CUDA and PTX there is the notion of a CTA, a cooperative thread array, which consists of many threads; these are the thread blocks, and there are multiple thread blocks. In our model, each thread block runs on a Windows or Linux pthread or regular thread; within a thread block we go sequential, in a loop, and across thread blocks we go wide. That's the only difference between CUDA and multi-core.

Execution Model Translation

The way we do this: I won't go into all the details, just some of the interesting parts. A kernel is a piece of code that runs, then synchronizes, then runs again, and so on. We take each section between syncthreads barriers and put a thread loop around it, just a for loop, and we break the program up like that after we get it out of PTX and into LLVM IR. That's the main thread-model transformation. We also need to do scalar expansion: in this model the variables live on the stack, but when the GPU runs the kernel there are multiple instances of them, so we have to create those instances in software. Any scalar variable that is used across a barrier needs an array of instances; that is scalar expansion. The rest is bookkeeping: the remaining special intrinsics, the thread id and so on, become part of the runtime, and we map them to the appropriate runtime mechanism, and we have to allocate the expanded variables on the kernel's stack. That's the summary of the whole procedure.

Scalar Expansion & Thread Loop Placement

Here I'll give you a pseudo-code example; I didn't want to put a PTX or LLVM IR example here because it looks too big. We have a kernel with a variable, index, shown in red on the slide, and in the original CUDA program there was a barrier and no thread loop. Syntactically, the thread-loop transformation takes each region and surrounds it with a loop that counts the thread id up to the block dimension. Except that index is used across the barrier, so it cannot remain a scalar; you have to make one instance per thread. The way we do that is to give the scalar storage that is actually malloc'ed in the generated code, and replace the red variable with a green indexed access into that storage, which is allocated for the whole set of per-thread instances. That is what happens conceptually.
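In C-like terms, the transformation looks roughly like this. This is my own hedged reconstruction of the slide's pseudo-code; the names kernel_block and blockDimX and the arithmetic in the loop bodies are illustrative, not from the actual code.

    // Before the transformation, each GPU thread ran roughly:
    //
    //     int index = threadIdx.x * 2;   // scalar, live across the barrier
    //     __syncthreads();
    //     out[threadIdx.x] = index;      // used after the barrier
    //
    // Afterwards, one CPU thread runs the whole thread block:
    #include <cstdlib>

    void kernel_block(int blockDimX, int *out) {
        // Scalar expansion: "index" is used across a barrier, so it gets one
        // malloc'ed slot per simulated thread instead of being a scalar.
        int *index = (int *)std::malloc(blockDimX * sizeof(int));
        for (int tid = 0; tid < blockDimX; ++tid)   // thread loop, region 1
            index[tid] = tid * 2;
        // The __syncthreads() boundary: finishing the first loop means every
        // simulated thread has reached the barrier before region 2 starts.
        for (int tid = 0; tid < blockDimX; ++tid)   // thread loop, region 2
            out[tid] = index[tid];
        std::free(index);
    }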
All of this is done at the LLVM IR level, so the transformation is pretty much machine independent. We implement it as separate LLVM passes, run all the optimizations in opt, and then invoke the x86 code generator, and it mostly just works. We tested this on the SDK samples: there are about 30 SDK samples available for CUDA, and we made it work on all of them at the PTX level.

Performance Scaling

(Slide: multicore performance, normalized to single-core.) Here are some of the interesting ones, five or six applications, showing scaling across four cores on a Penryn machine. The generated code is pretty clean. It has not been vectorized, so it does not yet use SSE; that is something we are planning to do. Almost everything scales quite nicely except BlackScholes, and I think the reason is that the workload we are using is just too tiny, so it stays flat, but we haven't really investigated. The results are quite encouraging.

Demo

I could not show you a live demo, but I can show you actual screenshots of the demos running on x86 without a GPU. If they were running here, you would see that they run significantly slower, but the look and feel is exactly the same as the same applications running in CUDA, and these are pretty interactive demos. We did have to use smaller problem sizes, because the full ones wouldn't scale.

That's pretty much it. Thank you.