How I Optimized F-1 Spirit for Mobile Processors


A while ago, I found that the old MSX (and Gameboy) racing game F-1 Spirit had been recreated as open source (back in 2004) for modern systems, using OpenGL. I thought I would try a port. It was easy. At least, that's what I thought at first. After a quick conversion of GL calls to GLES (since the Open Pandora’s PowerVR chip only supports OpenGL ES), it was up and running a couple of hours after I started. I was ready to test it. The intro screen went fine, and I jumped into the game… Everything seemed to be going well, except for one minor issue: it was way too slow in-game to even play. I could only get a couple of frames per second, which is hardly ideal for a racing game. That’s when I decided to look at the code and see where it could be optimized.

[ This is a guest article from PtitSeb, the most prolific porter on the Open Pandora scene. Amen. ]

The game was written for desktop GPUs, and therefore wasn’t mobile friendly at all. Not such an easy port after all. I needed a total of 3 days of optimization (reaching a fourfold overall speed increase) to get playable framerates. Here are some of the methods I used to optimize this game. These optimizations are mostly focused on GLES (but most should apply to full GL too).

A word about optimization: you will not find any assembly- or NEON-based optimizations here, only changes to the algorithms, to the way things are done. Assembly or NEON code should come only after that, in fact.

GLES Optimization Guideline

First, it’s good to know what to optimize and where to spend your time. With GLES and the SGX graphics chip of the Open Pandora, there are a few things to keep in mind:

  • Try to limit the number of calls. It’s better to call 1 drawing command for 2 triangles than 2 drawing commands for 1 triangle each.

I found that out when porting Jedi Knight 2 – Jedi Outcast. There is a smart little function called R_DrawStripElements inside tr_shade.cpp that takes all the triangles and tries to build triangle strips out of them. It’s smart, and it probably gives a boost on a desktop computer. But not on the Pandora! I tested it (and ported that function), only to find out that it was more efficient to just draw all the triangles individually than to assemble them into strips. I haven’t bothered with that function in my other idTech3 engine ports, but if you are curious, the ported function is still in the JKII sources on my github (unused).

  • Avoid as much as possible glEnable(GL_ALPHA_TEST) in GLES 1.1, and the discard instruction in GLES 2.0 shaders.

The now well-known “Alpha Hack” is one of the things I look for in the emulators and games I port. The effect is large.

  • Avoid state changes: the less you change texture or other GLES state, the better. State changes can be a performance killer.

This is good practice for any type of GL driver anyway, and it’s why texture atlases are often used (more on that later in this article). As mentioned on Stack Overflow: “Textures are about the most expensive source of data to switch. So as a basic rule you sort your scene by use of textures, to switch textures as few as possible.” A minimal sketch of that sorting idea follows.
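To illustrate, here is a hypothetical sketch of the idea (none of these names are from F1-Spirit): queue the draw requests, sort them by texture handle, and bind each texture at most once per frame.

#include <GLES/gl.h>
#include <algorithm>
#include <vector>

struct DrawRequest {
    GLuint tex;   // GL texture handle this draw uses
    // ... vertex data for the draw would live here ...
};

void flushSortedByTexture(std::vector<DrawRequest> &draws)
{
    // Group draws by texture so each handle is bound at most once.
    std::sort(draws.begin(), draws.end(),
              [](const DrawRequest &a, const DrawRequest &b)
              { return a.tex < b.tex; });

    GLuint bound = 0;  // 0 is never returned by glGenTextures
    for (const DrawRequest &d : draws) {
        if (d.tex != bound) {
            glBindTexture(GL_TEXTURE_2D, d.tex);  // the only state change
            bound = d.tex;
        }
        // ... submit this request's geometry here ...
    }
}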

GL to GLES Conversion

At first, I decided to convert the GL calls to GLES, and then optimize them. I’m more fluent with the limited function set of GLES than with figuring out which (glshim-emulated) GL calls cost more. But now that it’s done, I know I could have stayed on GL… Still, converting GL to GLES is, most of the time, not too difficult, as long as the software doesn’t use fancy GL functions (sometimes it does, and then it can be a real pain in the butt; that’s when glshim can save the day). The basics are covered in the Pandora wiki, and with that, you have 80% of what is needed for typical GL-to-GLES conversions. But sometimes, just converting is not enough, and you have to change things to get better framerates. In many cases, when converting GL games, you have to deal with immediate mode and GL_QUADS. For example:

    glBegin(GL_QUADS);
    glTexCoord2f(0, 0);
    glVertex3f(float(x[i] - hot_x), float(y[i] - hot_y), 0);

    glTexCoord2f(0, tex_coord_y[i]);
    glVertex3f(float(x[i] - hot_x), float(y[i] + dy[i] - hot_y), 0);

    glTexCoord2f(tex_coord_x[i], tex_coord_y[i]);
    glVertex3f(float(x[i] + dx[i] - hot_x), float(y[i] + dy[i] - hot_y), 0);

    glTexCoord2f(tex_coord_x[i], 0);
    glVertex3f(float(x[i] + dx[i] - hot_x), float(y[i] - hot_y), 0);

    glEnd();

This will draw a simple textured quad on the screen. Using the wiki, you can quickly convert that simple quad using a GL_TRIANGLE_FAN. Basically, you take a QUAD defined by four vertices A, B, C and D, and turn it into a “triangle fan” of 2 triangles, ABC and ACD (see the figure and sketch below).

[Figure: a quad ABCD redrawn as a triangle fan of two triangles, ABC and ACD]
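Here is a minimal GLES 1.1 sketch of that conversion (reusing the variables from the snippet above; an illustration, not the game’s exact code): immediate mode becomes client-side arrays, and the quad becomes a 4-vertex triangle fan.

GLfloat vtx[4 * 3] = {
    x[i] - hot_x,         y[i] - hot_y,         0,  // A
    x[i] - hot_x,         y[i] + dy[i] - hot_y, 0,  // B
    x[i] + dx[i] - hot_x, y[i] + dy[i] - hot_y, 0,  // C
    x[i] + dx[i] - hot_x, y[i] - hot_y,         0   // D
};
GLfloat tex[4 * 2] = {
    0,              0,
    0,              tex_coord_y[i],
    tex_coord_x[i], tex_coord_y[i],
    tex_coord_x[i], 0
};
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, vtx);
glTexCoordPointer(2, GL_FLOAT, 0, tex);
glDrawArrays(GL_TRIANGLE_FAN, 0, 4);  // ABC + ACD in one call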

Easy. Now, to optimize things in GLES (and in general), you have to limit the number of calls. So let’s take the previous code, and put it in a loop where the game draws many objects (all using the same texture):

glBegin(GL_QUADS);
for (i = 0;i < nparts;i++) {
    glTexCoord2f(0, 0);
    glVertex3f(float(x[i] - hot_x), float(y[i] - hot_y), 0);

    glTexCoord2f(0, tex_coord_y[i]);
    glVertex3f(float(x[i] - hot_x), float(y[i] + dy[i] - hot_y), 0);

    glTexCoord2f(tex_coord_x[i], tex_coord_y[i]);
    glVertex3f(float(x[i] + dx[i] - hot_x), float(y[i] + dy[i] - hot_y), 0);

    glTexCoord2f(tex_coord_x[i], 0);
    glVertex3f(float(x[i] + dx[i] - hot_x), float(y[i] - hot_y), 0);
} 
glEnd();

But trying to convert this the same way, using TRIANGLE_FAN, will not work without breaking up the loop. With 2 quads, you would get something like this:

[Figure: two quads submitted as a single triangle fan, producing wrong connecting geometry]

Of course, the easy solution would be to draw each quad individually. But you would lose performance by multiplying draw calls. So here, the solution is to use not glDrawArrays, but glDrawElements. DrawElements draws the same arrays, but through an index table. So you keep the Vertex (and TexCoord, Color, Normal) arrays as-is, and you create an index table that describes the triangles to draw. For 1 quad (represented by indices 0 1 2 3), you can make 2 triangles, one being 0 1 2 and the other 0 2 3. So going from a list of QUADS to a list of TRIANGLES is a quite simple loop:

// if nbr is the number of vertices in the QUADS list
GLushort indices[nbr*6/4];
GLushort ind = 0;
for (int i=0; i<nbr; i+=4) {
    // 1st triangle
    indices[ind++] = i;
    indices[ind++] = i+1;
    indices[ind++] = i+2;
    // 2nd triangle
    indices[ind++] = i;
    indices[ind++] = i+2;
    indices[ind++] = i+3;
}
glDrawElements(GL_TRIANGLES, ind, GL_UNSIGNED_SHORT, indices);

So now you can batch QUAD drawing in GLES too.

How to find where to optimize

When you run a game and it’s slow as hell, the first task is to find out where to optimize the code. Finding that a game is slow is just a matter of running it, but finding what’s driving the slowdown, which function to optimize, is always a bit trickier.

Most of the time, I use the tool “perf”, which I run in an SSH terminal while the game is running.

Connecting with PuTTY from Windows, I use the perf included in Code::Blocks. With the command “sudo perf top”, you get the most CPU-intensive functions listed at the top. You can even see the detailed assembly code of each function, in case you are converting it to NEON or assembly.

In this particular case, the function names are sufficient.

[Figure: several “sudo perf top” captures from different games and emulators]

In that picture, you can see some examples of perf readings in different scenarios. Here are some hints on how to read them:

Most of the examples are emulators that use a JIT/dynarec engine (the same goes for Java- or Mono-based games). The recompiled code is not broken down in detail, so you only see a big “perf_XXXX_map” entry that eats most of the CPU time. This line is the resulting JITed code.

Along with this, you can sometimes see another line, like “omap3_enter_idle”, from the kernel. This line is important: it means the CPU is idle, waiting for something. Because the Pandora is a single-CPU machine, seeing the CPU idle means that most of the time it’s waiting for the GPU (unless it’s really waiting for the user, of course).

So in the sample perf profiles, you can see the first, third and fifth are idle, waiting for the GPU (these were PPSSPP on the Ridge Racer 2 car selection screen, Mupen64Plus on the two-player MarioKart demo, and F1-Spirit in the first demo loop). On the contrary, the second and fourth are more CPU limited, and most CPU cycles are eaten by the JITed code (these were PPSSPP on Ridge Racer 2 in-race, and Mupen64Plus on GoldenEye with the Rice plugin on GLES 1.1).

So for F1-Spirit, it was clear that the bottleneck was the GPU.

So in order to find out what was slowing down the GPU, you can try PVRTrace and other tools, or read the code to understand what’s happening. Reading the source code of F1-Spirit made it very clear that one of the main issues was texture handling: each tile on screen (a piece of grass, a block of road) was loading its own texture image and its own GL texture handle.

So the trick I used, and where I got the largest FPS gain, was to plug a “Texture Manager” into the code: if the same texture is loaded twice, the same GL texture handle is reused. With some modifications to the rendering loop, you can spare many state changes, and this is where you gain a much better framerate on a GPU-limited game.

One of the main challenges in implementing the Texture Manager was finding out which textures had already been loaded. Where I plugged it in, the name of the file is long lost, and only the binary texture data is available. So the trick was to use a hash of the texture to identify it. I took the algorithm from the Glide64 plugin (from Mupen64Plus): a simple CRC32 of the texture data, plus its size. In the source code, look inside “SDL_glutaux.cpp” for most of the code I added. A sketch of the idea follows.
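Here is a hypothetical sketch of that idea (illustrative names, not the actual SDL_glutaux.cpp code), using zlib’s crc32 as the checksum: identical pixel data maps to one GL handle.

#include <GLES/gl.h>
#include <zlib.h>
#include <stdint.h>
#include <map>

static std::map<uint64_t, GLuint> g_loaded;  // hash -> GL texture handle

GLuint getTexture(const unsigned char *pixels, int w, int h)
{
    uint32_t size = (uint32_t)(w * h * 4);  // assuming RGBA8888 data
    uint64_t key  = ((uint64_t)crc32(0L, pixels, size) << 32) | size;

    std::map<uint64_t, GLuint>::iterator it = g_loaded.find(key);
    if (it != g_loaded.end())
        return it->second;  // same texture already uploaded: reuse the handle

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    g_loaded[key] = tex;
    return tex;
}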


Using the Texture Manager, and reorganizing the rendering loop a bit (look inside RoadPieces.cpp) to avoid state changes (mainly to call glBindTexture as little as possible), I managed to go from 5 FPS to more than 30 FPS.

So now it’s CPU limited?!

After that kind of performance improvement, I decided to publish the game on the repository. It went well, but then some users reported that the “Dirt” level was running too slow.

I had already simplified the weather effect (the rain, in its initial code, was too dense and slowed things down way too much). So I did some additional perf analysis on this level to see what was going on.

Interestingly enough, this level was not GPU limited, but CPU limited. After a quick look at the code, the problem turned out to be the “holes” in the road, which ate too much CPU time. The issue was located inside “CTrack::over_rock” (in track.cpp).

This method checks whether the car is over a rock. Sounds simple: it just walks the current list of rocks, stopping if the car is over one, and otherwise continuing to the end of the list. The thing is, this “Dirt” level has many rocks, and the whole list is walked every frame, eating all the CPU cycles.

I decided to change the algorithm. Most of the time, we are not over any rock, and usually there is only 1 potential rock in the surroundings to check. So I sorted the rocks by their X coordinate and put them in a 1D array. I also used a second 1D array for quick access from an arbitrary X coordinate. So, with the car at coordinates {X,Y}, I first use the quick-access array to get a starting index into the rock list. Then I test the rocks from that index on, checking whether each is under the car or not. When the X coordinate of the tested rock is too far from the car, I can stop testing, since no further rock can possibly be near the car.

[Figure: rocks sorted by X coordinate, with the quick-access array pointing at the first rock to test]

Here’s a simple drawing to explain what’s happening. Of course, don’t take the values literally, or you’ll find the picture a bit misleading (going by the picture, only 2 rocks should be tested, but not going by the numbers).

This way, you can dramatically limit the number of collision tests. Naturally, there are probably smarter and more effective methods (like putting the rocks in a 2D array). But it was enough to ensure the game would be GPU limited again… A sketch of the approach is below.
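Here is a hypothetical sketch of that lookup (illustrative names and constants, not the actual track.cpp code): the rocks array is sorted by X, and start_index[] maps a coarse X bucket to the first rock worth testing.

#include <math.h>

#define BUCKET    64.0f  // width of one quick-access bucket (assumed)
#define MAX_REACH 32.0f  // assumed max rock radius: past this delta-X, stop

struct Rock { float x, y, radius; };

extern struct Rock rocks[];  // sorted by ascending x
extern int num_rocks;
extern int start_index[];    // start_index[b] = first rock with x >= b * BUCKET

int over_rock(float car_x, float car_y)
{
    for (int i = start_index[(int)(car_x / BUCKET)]; i < num_rocks; i++) {
        if (rocks[i].x - car_x > MAX_REACH)
            break;  // the list is sorted: no later rock can be close enough
        if (fabsf(rocks[i].x - car_x) <= rocks[i].radius &&
            fabsf(rocks[i].y - car_y) <= rocks[i].radius)
            return 1;  // the car is over this rock
    }
    return 0;
}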

Going further

So, this is the current state of F1-Spirit on the Pandora. Take a look at the code and you’ll see there is not much NEON code (a little, but since the game is GPU limited most of the time, there is little point spending time there).

To get even better performance (and reach a solid 50 FPS on a Gigahertz Pandora, for example), the next step would be to implement a Texture Atlas.

A Texture Atlas is a great tool to avoid texture state changes. Instead of keeping each texture separate, you pack them all into one bigger texture.

[Figure: several small textures packed into one big atlas texture, each with its own sub-rectangle of texture coordinates]

But to handle a texture atlas, you also need to remap the texture coordinates of each texture (see the above figure for a crude and simplified explanation), requiring more engine modifications. So this part has not been done (yet?). The remapping itself is simple, as the sketch below shows.
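The coordinate remapping would look something like this hypothetical sketch (not implemented in the port): each sub-texture stores its offset and scale inside the atlas, and local UVs are transformed into atlas space.

struct AtlasEntry {
    float u0, v0;           // top-left corner of the sub-texture in the atlas
    float u_scale, v_scale; // sub-texture size relative to the atlas
};

// Map a texture-local coordinate (u,v in [0,1]) into atlas space.
void atlas_uv(const struct AtlasEntry *e, float u, float v,
              float *out_u, float *out_v)
{
    *out_u = e->u0 + u * e->u_scale;
    *out_v = e->v0 + v * e->v_scale;
}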

And that concludes this article. I hope you found it useful, or at least entertaining.

2 thoughts on “How I Optimized F-1 Spirit for Mobile Processors”

  1. Antartica

    Thank you for this excellent post.
    Being curious about OpenGL and wanting to familiarize myself with the terminology, “best practices” and general way of structuring an OpenGL program, this has felt like a little gift. Moreover, it has been a pleasure to read with the simple explanations and nice diagrams. Thanks again… and more of these, please :-).

  2. Pickle

    Another possible optimization for the triangle batches is to keep them as strips and use degenerate indices to batch them together. I’ve used this in Quake 1 and 2.

