Monday, July 7, 2014

WebGL and Performance 2 - A lumpy start

Yes, I meant to say lumpy (as opposed to bumpy).

I have a basic experiment up and running, but I have not organized or cleaned the code and am not quite ready to start our performance journey.  Nonetheless, I wanted to explain my experiences thus far.

You will recall from the previous post that I am an experienced OpenGL programmer (as well as DirectX) across several languages (C++, Java, C#). However, I have written very little JavaScript in recent years beyond a few hours experimenting with HTML5 features (WebGL, WebSockets, WebWorkers, WebRTC).  The point is that I expect a slow start as I re-acquaint myself with the language, environment and tools.

A bit about native development

The first thing I did was revisit the over-simplified mental model I use when programming native GL apps. Simplified for our purposes here, it looks something like this:

The rendering process at this level is simple:
  1. My code sends commands to the GPU's driver
  2. The driver passes them to the GPU (likely after some re-packaging)
  3. The GPU stores my data (when I issue storage commands) and executes shaders, drawing fragments (pixels) into the back buffer
  4. When I am done with a scene, I call a method--SwapBuffers, glXSwapBuffers, Present, or some other API-specific call--that swaps the two buffers: back becomes front, front becomes back



In this way, while my code is rendering to the back buffer, the monitor is reading from the front buffer.

Note: I keep the driver in this model only to remind myself of the occasionally large differences between them. For example, some drivers do direct DMA transfers when you buffer data (send it to the GPU), whereas others stage it in separate memory (resulting in an extra copy); some drivers are pickier about shader syntax; and so on.

Buffer swapping and V-sync

The monitor will only read the buffer contents once every time it updates its display, which is typically every 1/60th of a second, or 1/120th on a few high-end displays.  So, it is ideal to have a new scene rendered during that interval and then swap buffers right before the monitor reads the next frame.

Locking rendering to the monitor's refresh rate is called v-sync, or "vertical synchronization."

However, it is often useful to render much faster than the display can show.

Why?

There are two reasons:

  1. Gamers will attest that a higher FPS (frames per second) often gives an edge in competitive gaming. At first this seems to make no sense--the player cannot see pixels that are never shown on the display--but the benefit is not visual.  Many games are naively built around one large game loop in which physics, player movement, networking and other logic run in lock-step with the screen updates (or at least with the frames being rendered on the GPU).  So, turning up the frame rate improves the responsiveness of the game even though it does not help the visual experience (and in fact it can hurt it: sometimes the buffers are swapped while the monitor is reading a frame, producing visible tearing across the screen).
  2. While you are developing a graphics-intensive app, the GPU will often be the performance bottleneck; this is the case with most popular 3D games.  Letting the GPU render frames as fast as possible gives you--the developer--an immediate sense of how your code changes affect performance.  For example, say my new competitor to Call of Duty renders on my development machine at 250 FPS, and after some optimization I see it rise to 400 FPS.  That feedback is much easier to get than locking updates to 60 FPS and then using external tools to analyze the GPU workload during tests.

And now on to the browser

So, with this model in mind I built some quads onto which I would draw circles, then set about seeing how fast they would render.
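
What follows is not the demo's code, just a minimal sketch of what I mean by drawing circles on quads: a single quad (two triangles) whose fragment shader discards every pixel outside a circle. All of the names here (drawScene, aPosition, vUV, and so on) are my own inventions.

    // A minimal "circle on a quad" sketch (not the demo's code): one quad made of two
    // triangles, with a fragment shader that discards every pixel outside a circle.
    var canvas = document.createElement('canvas');
    canvas.width = canvas.height = 512;
    document.body.appendChild(canvas);
    var gl = canvas.getContext('webgl');

    var vertexSource = [
      'attribute vec2 aPosition;',
      'attribute vec2 aUV;',
      'varying vec2 vUV;',
      'void main() {',
      '  vUV = aUV;',
      '  gl_Position = vec4(aPosition, 0.0, 1.0);',
      '}'
    ].join('\n');

    var fragmentSource = [
      'precision mediump float;',
      'varying vec2 vUV;',
      'void main() {',
      '  if (length(vUV - vec2(0.5)) > 0.5) discard;',   // outside the circle: draw nothing
      '  gl_FragColor = vec4(1.0, 0.4, 0.2, 1.0);',
      '}'
    ].join('\n');

    function compile(type, source) {
      var shader = gl.createShader(type);
      gl.shaderSource(shader, source);
      gl.compileShader(shader);
      return shader;
    }

    var program = gl.createProgram();
    gl.attachShader(program, compile(gl.VERTEX_SHADER, vertexSource));
    gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fragmentSource));
    gl.linkProgram(program);
    gl.useProgram(program);

    // Interleaved vertex data: x, y, u, v for each of the six vertices (two triangles).
    var quad = new Float32Array([
      -0.5, -0.5, 0.0, 0.0,   0.5, -0.5, 1.0, 0.0,   -0.5, 0.5, 0.0, 1.0,
      -0.5,  0.5, 0.0, 1.0,   0.5, -0.5, 1.0, 0.0,    0.5, 0.5, 1.0, 1.0
    ]);
    gl.bindBuffer(gl.ARRAY_BUFFER, gl.createBuffer());
    gl.bufferData(gl.ARRAY_BUFFER, quad, gl.STATIC_DRAW);

    var aPosition = gl.getAttribLocation(program, 'aPosition');
    var aUV = gl.getAttribLocation(program, 'aUV');
    gl.enableVertexAttribArray(aPosition);
    gl.enableVertexAttribArray(aUV);
    gl.vertexAttribPointer(aPosition, 2, gl.FLOAT, false, 16, 0);  // 16-byte stride
    gl.vertexAttribPointer(aUV, 2, gl.FLOAT, false, 16, 8);        // UVs start at byte 8

    function drawScene() {
      gl.clearColor(0, 0, 0, 1);
      gl.clear(gl.COLOR_BUFFER_BIT);
      gl.drawArrays(gl.TRIANGLES, 0, 6);   // one quad, one circle
    }
    drawScene();

The nice property of this approach is that the circle's edge is evaluated per fragment, so it stays round no matter how the quad is scaled.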

This is where my intuition malfunctioned because of the JavaScript environment, and I went through what I presume is a common learning process regarding how to get fast frames.

I found Chrome's and Opera's on-screen FPS meters and turned them on.

I tried a never-ending loop, resulting in a never-responsive browser.

I tried setTimeout(), which worked, but didn't give great performance.

I tried requestAnimationFrame(), which also didn't meet my expectations.
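
For reference, a minimal requestAnimationFrame loop looks roughly like this; drawScene here is just a stand-in for the drawing code in the sketch above, not part of any API.

    // The browser invokes the callback at most once per display refresh (v-sync),
    // so this loop tops out at the monitor's refresh rate.
    function render(timestamp) {
      drawScene();                    // stand-in for the WebGL draw calls
      requestAnimationFrame(render);  // request the next callback
    }
    requestAnimationFrame(render);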

Finally, I found postMessage(), but while my own timing showed around 1000 FPS, the browsers' FPS meters still never showed more than 60.
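
The postMessage() trick, roughly: posting a message to your own window queues a task that runs as soon as the event loop is free, without setTimeout's minimum-delay clamp. This is only my sketch of the technique--drawScene is again the stand-in drawing function from above, and the frame counter stands in for "my own timing."

    var frames = 0;

    // Posting a message to our own window queues a task with no minimum delay
    // (unlike setTimeout), so this loop runs as fast as the event loop allows.
    window.addEventListener('message', function (event) {
      if (event.data !== 'tick') return;
      drawScene();                      // stand-in for the WebGL draw calls
      frames++;
      window.postMessage('tick', '*');  // immediately queue the next frame
    });

    // Report our own frame rate once per second.
    setInterval(function () {
      console.log(frames + ' FPS by my count');
      frames = 0;
    }, 1000);

    window.postMessage('tick', '*');    // kick off the loop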

And then I understood

At this point I realized that the browsers--all of them--keep control of the buffer swap, and that swap is what their FPS meters report.  After all, there is no reason to think my code is the only thing in the browser using OpenGL, so the browser keeps everything in sync by retaining strict control over the buffer swap.  What's more, it turns out that if the browser can't consistently render right around 60 FPS, some will deliberately drop to 30, because studies show that a constant frame rate matters more to the user experience than a high one.

But what about my 1000 FPS?

I wondered: if the browser is taking such control, what is happening with all my additional frames? Is it dropping them, or are they executing fully?

To test this I did two things:
  1. I watched my GPU's load when rendering at 1000 FPS via postMessage vs. at 60 FPS using requestAnimationFrame.  The result: 1000 FPS caused much more load.
  2. I turned off double-buffering and rendered squares, animating them diagonally at one pixel per frame, and confirmed that (a) they moved 1000 pixels per second, and (b) they all rendered (I could snapshot the screen and see each box just one pixel from the next).  A sketch of one way to set up such a test follows below.
Result: all 1000 frames are being rendered; only the buffer swap is postponed.
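
Here is one hypothetical way to reproduce test (2). I am assuming the "double-buffering off" effect can be approximated with the preserveDrawingBuffer context attribute (which keeps the drawing buffer's contents between composites) plus never clearing the whole canvas; the original demo may well have done this differently.

    // Hypothetical reproduction of test (2). Assumption: preserveDrawingBuffer keeps the
    // drawing buffer's contents between composites, and we never clear the whole canvas,
    // so every frame's square remains visible for a screenshot.
    var canvas = document.createElement('canvas');
    canvas.width = canvas.height = 1200;
    document.body.appendChild(canvas);
    var gl = canvas.getContext('webgl', { preserveDrawingBuffer: true });

    var x = 0, frames = 0;

    window.addEventListener('message', function (event) {
      if (event.data !== 'tick' || frames >= 1000) return;
      // Paint a small square by clearing only a scissored region--no shaders needed.
      gl.enable(gl.SCISSOR_TEST);
      gl.scissor(x, x, 4, 4);            // diagonal motion, one pixel per frame
      gl.clearColor(1, 1, 1, 1);
      gl.clear(gl.COLOR_BUFFER_BIT);
      x += 1;
      frames += 1;
      window.postMessage('tick', '*');
    });
    window.postMessage('tick', '*');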

This is good news as we can use this "unlimited FPS" approach to see how quickly we can render frames, and in doing so deduce any additional overhead caused by the browser.

A few words about the physics

My goal in this exploration is to test the hypothesis that it is possible to approach native GL performance in the browser; I had no intention of tackling general JavaScript optimization except insofar as it was critical to that goal.

Well, it turns out it's important.  For example, my simple integrator used real elapsed-time deltas (via performance.now()) for all calculations.  But integrators have the property that the larger the simulated forces, the higher the integration frequency must be; and if the integration frequency is even near the lower limit for the simulation, erratic changes in the time delta between steps will cause strange behavior--like pulsing motion.

So it was with my simple demo: garbage-collection cycles changed the integration step just enough to destabilize the simulation, giving it a heartbeat appearance, and the browser was so sensitive to anything else done on my computer--touching another window, scrolling an editor--that those too would cause a "pulse" effect.

So, for now I have set the integrator to use 16-millisecond time increments, no matter the actual time between steps; this stabilized it, but it forces me to re-think the importance of tackling JavaScript--and in particular, Web Workers--during this exploration.
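
To illustrate what I mean by fixed time increments--this is a toy sketch, not the demo's integrator--the step below advances the simulation by exactly 16 ms of simulated time per rendered frame and never consults performance.now(). The forces and drawScene are again placeholders of my own.

    var FIXED_DT = 0.016;             // 16 ms of simulated time per step, always

    var position = { x: 0, y: 0 };
    var velocity = { x: 50, y: 0 };   // made-up starting velocity, pixels per second
    var GRAVITY = -98;                // made-up force for this toy example

    // Semi-implicit Euler with a fixed step: update velocity first, then position.
    function integrate() {
      velocity.y += GRAVITY * FIXED_DT;
      position.x += velocity.x * FIXED_DT;
      position.y += velocity.y * FIXED_DT;
    }

    // One simulation step per rendered frame; performance.now() is never consulted,
    // so a late frame just means the simulation falls slightly behind real time.
    function stepAndRender() {
      integrate();
      drawScene();                    // stand-in for the WebGL draw calls
      requestAnimationFrame(stepAndRender);
    }
    requestAnimationFrame(stepAndRender);

The usual refinement is to accumulate real elapsed time and run a whole number of fixed steps per frame, which keeps simulated time aligned with the clock while still shielding the integrator from jittery deltas.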

Conclusion, and what's next

So perhaps now it's clear why I said this was a "lumpy" start: I got a few lumps along the way, and performance in the browser tends to be lumpy.

The first demo is here (although I cannot promise it will stay at that address, I do promise this page will continue to have a valid link):
http://experiments.uhdcoder.com/circles1/

Next up, I'll clean the code, prepare some performance numbers (different browsers, with and without physics), probably add scaling and then lay out why this implementation is slow and how we will improve it.
