Sunday, February 10, 2013

Debugging a Resource Leak in WebKit


From time to time a really interesting bug comes up. Here's one that I investigated recently.

There have been multiple reports that in Chrome, pages using WebGL were holding on to GPU memory too long. The symptom was that as the page was reloaded, the WebGL canvas "de-rezzed", so that it got blockier and blockier as the page was refreshed. The code is actually designed to do this in certain circumstances. Chrome limits the amount of video memory it will dedicate to WebGL canvases' back buffers, and will allocate the canvas at lower resolution if necessary. However, it was clear that this limit was being hit too often. When navigating away from a page, the canvas and WebGL context should have been promptly garbage collected, and this wasn't happening in Chrome because of how the generational garbage collector in V8 worked. Recently, one fix was committed for this problem, which basically forced the JavaScript garbage collector to run after navigating away from a page containing WebGL. This mitigated the problem, but didn't really fix it.

My colleague Brandon Jones has been working on a better fix for this problem, which involves forcibly shutting down the WebGL context when navigating away from the page, rather than relying on the JavaScript garbage collector to clean up the Canvas element and its context. Since there had been multiple bug reports by people writing real-world applications, we wanted to create a regression test. One application in particular, Chaos to Perfection, triggered the problem repeatably, even with the hack in place to trigger the garbage collector. Brandon contacted the authors of the demo, studied the code, and created a reduced test case which demonstrated, incredibly, that creation of a Web Worker was causing the problem. The worker didn't even need to do anything; if it was just created and immediately exited, WebGL contexts started leaking. There was no good explanation for this behavior, and Brandon immediately filed another bug, which is the subject of this post.

For background, many of the C++ classes in WebKit implement a reference counting protocol through a few template base classes (RefCounted, ThreadSafeRefCounted, TreeShared) and smart pointer classes (RefPtr, PassRefPtr). All DOM nodes in WebKit are reference counted. Reference counting these objects simplifies their memory management. Unfortunately, one "leak" of a reference can keep an entire graph of objects alive.

Given that a particular object (the WebGLRenderingContext) was leaking, the question was: who was causing it to leak? Who was holding on to a stray reference either to this object, or to another object which pointed to it? It would be great if, for any reference-counted C++ object, we could ask the question, "who is holding a reference to this object?" Unfortunately, there is no way to answer this question after the fact. Once the reference has been taken (the ref() method has been called on the target object), the information about who took the reference is lost. WebKit's infrastructure can not be easily modified to walk backward from a given reference-counted object to the objects which have incremented its reference count. Virtual machines for garbage collected languages like JavaScript inherently keep track of this information, and provide tools to figure out why any given object is being kept alive. With WebKit's reference counting implementation, all that can be done is to watch the ref/deref calls as they happen, and try to figure out based on the absence of a deref call which object actually leaked.

The JavaScript engine adds another wrinkle to the reference counting mechanism. JavaScript can be thought of as a layer on top of the entire C++ object graph representing the DOM. When a given C++ object is exposed to the JavaScript engine, a "wrapper object" is created, and the C++ object's reference count is incremented. When the JavaScript wrapper object is garbage collected, the C++ object's reference count is decremented. JavaScript increments the reference count of the underlying C++ object at most once. Initially, it seemed that the JavaScript engine might somehow be holding on to the WebGL context longer than it should be. However, setting a breakpoint in CanvasRenderingContext::deref() showed that the wrapper object was being properly garbage collected, and that V8WebGLRenderingContext::derefObject was being called.

More debugging showed that the WebGLRenderingContext wasn't being destroyed because the HTMLCanvasElement which owned it had a non-null parent when the WebGL context was deref'ed for the last time. When the worker wasn't started, the HTMLCanvasElement didn't have a parent at that time, and it was destroyed, cleaning up the WebGL context along with it. The question then became: who was keeping the HTMLCanvasElement alive? Fortunately, the test case's DOM was simple, and a little logging proved that the HTMLDocument owning the canvas was never being destroyed if a worker was started. In other words, the entire DOM of the web page was leaking if a worker was ever used, even after the worker exited.

The next question was: who was keeping the HTMLDocument alive? This was a much harder question to answer. One might think to just set breakpoints in HTMLDocument::ref and deref, grab call stacks at each, and try to match them up. Unfortunately, one does not simply "set breakpoints in HTMLDocument::ref and deref".

HTMLDocument inherits its ref and deref methods through a template base class called TreeShared, and ref and deref are non-virtual. Therefore it isn't possible to simply override them in HTMLDocument. Because TreeShared is a template class, though, one might think that calling code might use RefPtr<HTMLDocument>, calling HTMLDocument::ref and deref directly rather than Node's ref and deref methods, and therefore that there might be some debugging information allowing a breakpoint to be set. However, attempting to set one in gdb resulted in the following output:

[0] cancel
[1] all
[2] WebCore::EventTarget::ref() at EventTarget.h:105
[3] WebCore::EventTarget::ref() at EventTarget.h:105
[4] WebCore::EventTarget::ref() at EventTarget.h:105
[5] WebCore::EventTarget::ref() at EventTarget.h:105
[6] WebCore::TreeShared<WebCore::Node, WebCore::ContainerNode>::ref() at TreeShared.h:63
[7] WebCore::TreeShared<WebCore::Node, WebCore::ContainerNode>::ref() at TreeShared.h:63
[8] WebCore::TreeShared<WebCore::Node, WebCore::ContainerNode>::ref() at TreeShared.h:63
[9] WebCore::TreeShared<WebCore::Node, WebCore::ContainerNode>::ref() at TreeShared.h:63
...

(There were 32 variants of this ref() call in total.) This would not work.

In order to be able to set a breakpoint when ref() was called against an HTMLDocument, I added a flag to the base class TreeShared to track whether the instance was actually an HTMLDocument, and set that flag in HTMLDocument's constructor. (In hindsight, ref() and deref() could have simply been made virtual in TreeShared, overridden in HTMLDocument, and a breakpoint set there. At the time, I had been concerned about changing the calling convention of these methods, but I hadn't realized that call sites which called ref() and deref() could not parameterize over completely unrelated types like TreeShared and RefCounted. For this reason, it would have been perfectly fine, and less code, to make TreeShared's ref and deref methods virtual.)

After making these code changes, I found that a lot of calls were being made to HTMLDocument::ref(). Examples included HTML parsing (WebCore::DocumentWriter::begin and WebCore::V8DOMWindowShell::updateDocument), WebCore::LiveNodeListBase::LiveNodeListBase (called from WebKit::WebDocument::forms and autofill::PasswordAutofillManager::SendPasswordForms), parsing scripts, sending events, and others. There were too many calls being made to see a difference between when a worker was being started and when it wasn't.

To try to whittle down the ref() calls to only the essential ones, I added a mechanism to prevent my new code from firing during the next call to ref(). This helped, but only marginally. The problem was that nested calls were made to ref() and deref(), and it was infeasible to instrument all of the code involved. Instead, I added a stack-allocated helper class to suppress the calling of the new code in nested fashion. This worked much better. (Think about this – it's basically an "anti-RefPtr".) Tediously, I examined all of the calls to HTMLDocument::ref in the debugger and suppressed those apparently unrelated to debugging the task at hand. This involved sprinkling the suppression calls in various places such as document parsing, script execution, and event dispatch.

After reducing the calls to ref() to a manageable number for the given WebGL-related test case, I examined the call stacks of each in gdb during a reload of the test case, both with and without the creation of a worker, hoping that there would be a difference when the worker was started, and that that would indicate the leak. Unfortunately, the calls were exactly the same. I had clearly suppressed too many of the calls to ref(), and missed the essential one causing the leak.

This was disappointing. A significant amount of time had already been invested to this point, involving many full rebuilds of WebKit and Chromium, and it wasn't clear which of the code changes so far had prevented catching the bug. Instead, I focused on calls to HTMLDocument::deref(), hoping that one of them would provide a clue about the leak. Luckily, after examining only a few call stacks, the following showed up:

#5  0x06e8cd2d in WTF::RefPtr<WebCore::ScriptExecutionContext>::~RefPtr (this=0x7b438d7c) at RefPtr.h:56
#6  0x06e7435b in WTF::RefPtr<WebCore::ScriptExecutionContext>::~RefPtr (this=0x7b438d7c) at RefPtr.h:56
#7  0x01acc740 in WebCore::WorkerMessagingProxy::~WorkerMessagingProxy (this=0x7b438d70) at ../../third_party/WebKit/Source/WebCore/workers/WorkerMessagingProxy.cpp:272

Finally, this was a worker-related call to ref/deref. Looking at the related code, I found that WebCore's WorkerMessagingProxy class was being instantiated from Chromium-specific code: specifically, a class called WebWorkerClientImpl. Setting an appropriate breakpoint, I found that the stack trace constructing the WebWorkerClientImpl was as follows:

#0  WebCore::WorkerMessagingProxy::WorkerMessagingProxy (this=0x7a692b70, workerObject=0x7a6b1de0) at ../../third_party/WebKit/Source/WebCore/workers/WorkerMessagingProxy.cpp:262
#1  0x01acc189 in WebCore::WorkerMessagingProxy::WorkerMessagingProxy (this=0x7a692b70, workerObject=0x7a6b1de0) at ../../third_party/WebKit/Source/WebCore/workers/WorkerMessagingProxy.cpp:265
#2  0x03c4a6fa in WebKit::WebWorkerClientImpl::WebWorkerClientImpl (this=0x7a697aa0, worker=0x7a6b1de0, webFrame=0x7a64f4f0) at ../../third_party/WebKit/Source/WebKit/chromium/src/WebWorkerClientImpl.cpp:251
#3  0x03c48e87 in WebKit::WebWorkerClientImpl::WebWorkerClientImpl (this=0x7a697aa0, worker=0x7a6b1de0, webFrame=0x7a64f4f0) at ../../third_party/WebKit/Source/WebKit/chromium/src/WebWorkerClientImpl.cpp:252
#4  0x03c48da5 in WebKit::WebWorkerClientImpl::createWorkerContextProxy (worker=0x7a6b1de0) at ../../third_party/WebKit/Source/WebKit/chromium/src/WebWorkerClientImpl.cpp:81
#5  0x03c4f46c in WebCore::WorkerContextProxy::create (worker=0x7a6b1de0) at ../../third_party/WebKit/Source/WebKit/chromium/src/WorkerContextProxy.cpp:44
#6  0x01abcef2 in WebCore::Worker::Worker (this=0x7a6b1de0, context=0x7a8ce658) at ../../third_party/WebKit/Source/WebCore/workers/Worker.cpp:56
#7  0x01abc879 in WebCore::Worker::Worker (this=0x7a6b1de0, context=0x7a8ce658) at ../../third_party/WebKit/Source/WebCore/workers/Worker.cpp:58
#8  0x01abb8f6 in WebCore::Worker::create (context=0x7a8ce658, url=@0xbff38f10, ec=@0xbff38f54) at ../../third_party/WebKit/Source/WebCore/workers/Worker.cpp:65
#9  0x00373da9 in WebCore::V8Worker::constructorCallback (args=@0xbff39120) at V8Worker.cpp:99
#10 0x08d549c2 in v8::internal::HandleApiCallHelper<true> (args={<v8::internal::Arguments> = {<v8::internal::Embedded> = {<No data fields>}, length_ = 3, arguments_ = 0xbff3929c}, <No data fields>}, isolate=0x7d88fc00) at ../../v8/src/builtins.cc:1372
#11 0x08d5434a in v8::internal::Builtin_Impl_HandleApiCallConstruct (args={<v8::internal::Arguments> = {<v8::internal::Embedded> = {<No data fields>}, length_ = 3, arguments_ = 0xbff3929c}, <No data fields>}, isolate=0x7d88fc00) at ../../v8/src/builtins.cc:1395
#12 0x08d4a2cc in v8::internal::Builtin_HandleApiCallConstruct (args={<v8::internal::Arguments> = {<v8::internal::Embedded> = {<No data fields>}, length_ = 3, arguments_ = 0xbff3929c}, <No data fields>}, isolate=0x7d88fc00) at ../../v8/src/builtins.cc:1394
#13 0x3e90a3f6 in ?? ()
#14 0x3e9222f7 in ?? ()
#15 0x3e92daa9 in ?? ()
#16 0x3e9223b9 in ?? ()
#17 0x3e9134ea in ?? ()
#18 0x08dd4e8c in v8::internal::Invoke (is_construct=false, function={location_ = 0x7e0bce28}, receiver={location_ = 0x7e0bce2c}, argc=0, args=0x0, has_pending_exception=0xbff39547) at ../../v8/src/execution.cc:118
#19 0x08dd47b9 in v8::internal::Execution::Call (callable={location_ = 0x7e0bce28}, receiver={location_ = 0x7e0bce2c}, argc=0, argv=0x0, pending_exception=0xbff39547, convert_receiver=false) at ../../v8/src/execution.cc:179
#20 0x08cbf911 in v8::Script::Run (this=0x7e0bce1c) at ../../v8/src/api.cc:1726
#21 0x01008e01 in WebCore::ScriptRunner::runCompiledScript (script={val_ = 0x7e0bce1c}, context=0x7a8ce658) at ../../third_party/WebKit/Source/WebCore/bindings/v8/ScriptRunner.cpp:52
#22 0x00fd8c32 in WebCore::ScriptController::compileAndRunScript (this=0x7a8bc87c, source=@0xbff39980) at ../../third_party/WebKit/Source/WebCore/bindings/v8/ScriptController.cpp:281
#23 0x00fd8f29 in WebCore::ScriptController::evaluate (this=0x7a8bc87c, sourceCode=@0xbff39980) at ../../third_party/WebKit/Source/WebCore/bindings/v8/ScriptController.cpp:305
#24 0x06e6f202 in WebCore::ScriptElement::executeScript (this=0x7b4297f0, sourceCode=@0xbff39980) at ../../third_party/WebKit/Source/WebCore/dom/ScriptElement.cpp:305
#25 0x06e6d9c4 in WebCore::ScriptElement::prepareScript (this=0x7b4297f0, scriptStartPosition=@0xbff39c58, supportLegacyTypes=WebCore::ScriptElement::DisallowLegacyTypeInTypeAttribute) at ../../third_party/WebKit/Source/WebCore/dom/ScriptElement.cpp:242
#26 0x042752b8 in WebCore::HTMLScriptRunner::runScript (this=0x7cd3fc00, script=0x7b4297b0, scriptStartPosition=@0xbff39c58) at ../../third_party/WebKit/Source/WebCore/html/parser/HTMLScriptRunner.cpp:290
#27 0x04275013 in WebCore::HTMLScriptRunner::execute (this=0x7cd3fc00, scriptElement=@0xbff39c40, scriptStartPosition=@0xbff39c58) at ../../third_party/WebKit/Source/WebCore/html/parser/HTMLScriptRunner.cpp:170
#28 0x0425ba16 in WebCore::HTMLDocumentParser::runScriptsForPausedTreeBuilder (this=0x7e0e2400) at ../../third_party/WebKit/Source/WebCore/html/parser/HTMLDocumentParser.cpp:200
#29 0x0425bb09 in WebCore::HTMLDocumentParser::canTakeNextToken (this=0x7e0e2400, mode=WebCore::HTMLDocumentParser::AllowYield, session=@0xbff39d40) at ../../third_party/WebKit/Source/WebCore/html/parser/HTMLDocumentParser.cpp:217

The reason this hadn't shown up in the previous stack traces of calls to HTMLDocument::ref was that I had added one key suppression which was too eager: suppressing calls to HTMLDocument::ref from within execution of script nodes. The worker was naturally being instantiated from JavaScript during execution of a script node, and by suppressing all calls to ref and deref from these script nodes, I'd accidentally hidden the cause of the resource leak. In hindsight, this should have been obvious; since the leak was happening when a worker was started, I should have been looking specifically for worker-related objects being construted which referred to the document.

Why was the leak happening? After examining the code, the answer was simple. Chromium-specific code wrapped the WebCore class WorkerContextProxy in order to implement both dedicated (in-process) and shared (out-of-process) workers. The WebWorkerClientImpl was a thin wrapper object which delegated most of its methods to the WorkerContextProxy instance. Unfortunately, the WorkerContextProxy managed its own lifetime. When both the main thread and the worker thread had exited, the WorkerContextProxy would delete itself. It did not send out any notifications when this occurred, so the Chromium-specific WebWorkerClientImpl wrapper object would live forever. The WebWorkerClientImpl contained a RefPtr to the HTMLDocument in the form of a RefPtr<WebCore::ScriptExecutionContext> to prevent the document from being destroyed before the worker it had started. This was the cause of the leak.

How was the leak fixed? The fix involved splitting up Chromium's implementations of dedicated and shared workers and a moderate amount of surgery to the class hierarchy to avoid creating a wrapper object for WorkerContextProxy in the case of dedicated workers. Refer to the WebKit bug for the patch implementing the final fix, in addition to the patch used to debug the problem.

I hope that this post provided some insight into techniques used to track down a moderately tricky resource leak in WebKit. If you have comments on how this diagnosis was done or suggestions on how it could have been done more efficiently, please leave them in the comments or send email.