1 “WALK IN” SLIDE August 14-15, 2006
2 Windows® Performance Topics for Games. Kev Gee, Game Technology Group; Steve Pronovost, GPU Team; Microsoft
3 Introduction: The Game Technology Group (GTG) team helps many games during production in our performance lab. We often identify age-old performance problems as well as new ones. Our goal is to help games reach their full performance potential.
4 Performance Specifics: CPU; I/O; Graphics and Display Driver Models
5 CPU Best Practices: The majority of games are CPU-bound. This is usually because: the CPU is a valuable resource for running the game; the CPU is also a resource for feeding the GPU; common CPU-heavy code paths are not optimized; APIs are misused; convenient coding techniques are over-relied on during development. Budget the CPU across the entire scope of the game.
6 Performance Prerequisites: Profile early and often. Allow time for changes in both the engine and the artwork. Profile on multiple hardware configurations. Use automated profiling to identify serious regressions earlier; PIX can help with this (see the docs). Clearly define performance goals: you need to know what target you are aiming for in order to hit it, e.g.
7 Profiling: Many profiling tools are available for CPU performance (Visual Studio® Team System Profiler, VTUNE, etc.). Ensure you are profiling the right build: enable release-level optimizations and generate symbols (.pdb files). Generating symbols does not affect runtime performance, and storing them in a symbol server is a good idea. Remove asserts, logging, and debug code that might pollute the results: debug output can seriously distort the performance profile, and don't forget to remove the string formatting too. Handle low frame rates gracefully in the delta-time computation. Run as fast as you can to stress the game and rendering systems; disable VSync, as it locks the frame rate to the display refresh rate.
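The delta-time point on the Profiling slide can be sketched as follows. This is a minimal illustration, not the presenters' code: `ClampDeltaSeconds`, `FrameTimer`, and the 0.1 second cap are all hypothetical names and values chosen for the example.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical cap chosen for illustration: frames slower than 10 fps are
// treated as 10 fps. Tune the value per title.
constexpr double kMaxDeltaSeconds = 0.1;

// Clamp a raw frame time so that one very slow frame (for example a hitch
// while running under a profiler) cannot feed a huge step into simulation.
inline double ClampDeltaSeconds(double rawSeconds) {
    return std::max(0.0, std::min(rawSeconds, kMaxDeltaSeconds));
}

// Measures the time since the previous Tick() and returns the clamped delta.
class FrameTimer {
    using Clock = std::chrono::steady_clock;

public:
    FrameTimer() : last_(Clock::now()) {}

    double Tick() {
        const Clock::time_point now = Clock::now();
        const std::chrono::duration<double> elapsed = now - last_;
        last_ = now;
        return ClampDeltaSeconds(elapsed.count());
    }

private:
    Clock::time_point last_;
};
```

Calling `Tick()` once per frame and passing the result to the update code keeps the simulation stable when the profiler or a slow machine stretches a frame.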
8 Performance Specifics: CPU; I/O; Graphics and Display Drivers
9 I/O: File I/O performance often gets less attention than graphics or other aspects of the game. As a result we often see: long load times in titles; heavy use of on-the-fly conversions and/or suboptimal formats (e.g. shader compilation at load time). Issues are sometimes inherited from adopted file format libraries; check that they are optimized for Windows best practices. Development convenience is an early goal, but plans must be made for addressing the inevitable performance impact.
10 I/O Performance: Implementing high-performance I/O requires use of Win32-specific operations. Generally you should: preprocess assets; bundle resources; use asynchronous I/O and/or worker threads; use memory-mapped files where virtual address space allows. Profile your file system usage to see what your application is really doing.
11 File I/O Internals (simplified): Interactions between the system File Cache / Memory Manager and the File System Driver for the following 3 modes of operation: 1) Fast I/O (as with memory-mapped files or a sequential scan where the pages are in the cache); 2) Creation and processing of an IRP (as with a ReadFile request). [Diagram: User mode: Application, Kernel32.dll / NTDll.dll. Kernel mode (NTOSKRNL.EXE): I/O Manager, Cache Manager, Memory Manager, File System Driver, Disk Driver, connected by I/O Request Packets (IRP) and Fast I/O requests.]
12 File Cache Hints: Use the FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS flags (passed to CreateFile) when appropriate. These are valuable hints for the file cache system: they allow the cache manager to prefetch and discard more efficiently.
13 Debugging File I/O: The Sysinternals FileMon tool is very useful for analyzing I/O behavior.
14 I/O Best Practice: OpenFile() / CreateFile() is expensive, and the cost is increasing due to security features and on-demand virus scanners. Limit the number of files you need to open. Avoid repeatedly opening and closing the same file. "File exists" tests involve opening the file, so it is cheaper to try to open the file and handle the error. If your title supports mods or overridden files, don't check for them unless the game is run in a specific mode to do so; this can be as simple as a command-line switch. It saves the majority of users from paying the cost of looking for non-existent files.
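A minimal sketch of the "try to open and handle the error" advice, written with portable C++ file calls rather than CreateFile so it stays short. The helper names, the `mods/` and `data/` directories, and the `modsEnabled` switch are assumptions made for the example.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Reads a whole file with a single open attempt. Returns false when the file
// cannot be opened, so callers need no separate "does it exist?" probe.
inline bool TryReadFile(const std::string& path, std::vector<char>& out) {
    std::FILE* file = std::fopen(path.c_str(), "rb");
    if (file == nullptr) {
        return false;  // missing or inaccessible: the caller handles it
    }
    out.clear();
    char buffer[4096];
    std::size_t got = 0;
    while ((got = std::fread(buffer, 1, sizeof(buffer), file)) > 0) {
        out.insert(out.end(), buffer, buffer + got);
    }
    std::fclose(file);
    return true;
}

// Looks for an override only when mods are enabled (for example through a
// hypothetical command-line switch); otherwise goes straight to shipped data.
inline bool LoadAsset(const std::string& name, bool modsEnabled,
                      std::vector<char>& out) {
    if (modsEnabled && TryReadFile("mods/" + name, out)) {
        return true;
    }
    return TryReadFile("data/" + name, out);
}
```

With `modsEnabled` false, each asset costs exactly one open attempt, which is the saving the slide describes for the majority of users.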
15 Performance Specifics: CPU; I/O; Graphics and Display Driver Models
16 Graphics and Display Drivers: The following slides detail the internals of XPDM and WDDM, focusing on resource management and scheduling. They are useful for understanding how the GPU is fed by the driver and what that involves.
17 [Diagram: XPDM driver stack. User mode: Application, D3D Runtime. Kernel mode: Win32k & Dxg, Display Driver, Videoport, Miniport Driver. A Session Space region is marked.]
18 [Diagram: WDDM driver stack. User mode: Application Process (Application, D3D Runtime, User Mode Driver) and DWM Process (DWM). Kernel mode: Win32k, CDD, Dxgkrnl, Kernel Driver. A Session Space region is marked.]
19 Work is sent to the GPU by accumulating commands in buffers, which are read directly by the GPU. There is a limited amount of DMA buffer space available; once it is full, the CPU has to wait for the GPU to consume some. Progress on the GPU is reported in the form of fences.
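The relationship between a limited pool of DMA buffers and fences can be modeled with a small toy class. This is a simplified sketch under stated assumptions, not how any real driver is written: `CommandRing` and its methods are hypothetical, and "the GPU" is just a counter advanced by the caller.

```cpp
#include <cstddef>
#include <cstdint>

// Toy model of a fixed number of in-flight command buffers shared by a CPU
// producer and a GPU consumer. Fence values increase by one per submission.
class CommandRing {
public:
    explicit CommandRing(std::size_t capacity) : capacity_(capacity) {}

    // CPU side: queue one buffer. Returns its fence value, or 0 when every
    // slot is in flight and the CPU would have to wait for the GPU.
    std::uint64_t TrySubmit() {
        if (InFlight() == capacity_) {
            return 0;
        }
        return ++lastSubmitted_;
    }

    // GPU side: finish the oldest pending buffer and advance the fence.
    void GpuCompleteOne() {
        if (lastCompleted_ < lastSubmitted_) {
            ++lastCompleted_;
        }
    }

    // True once the GPU has passed the given fence value.
    bool IsFenceSignaled(std::uint64_t fence) const {
        return fence <= lastCompleted_;
    }

    std::size_t InFlight() const {
        return static_cast<std::size_t>(lastSubmitted_ - lastCompleted_);
    }

private:
    std::size_t capacity_;
    std::uint64_t lastSubmitted_ = 0;
    std::uint64_t lastCompleted_ = 0;
};
```

A failed `TrySubmit()` is the point where a real application stalls: the CPU has produced work faster than the GPU has consumed it.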
20 [Diagram: GPU surrounded by a ring of numbered entries, 40 through 47.]
21 [Diagram: XPDM command submission. User mode: Application, D3D Runtime, DP2 Buffer. Kernel mode: Win32k & Dxg, Display Driver, DMA Buffer, GPU. Labels: Global Lock (user mode and kernel mode), Wait.]
22 [Diagram: WDDM command submission. User mode: Application #1 and Application #2, each with a D3D Runtime, UMD, and Command Buffer. Kernel mode: Win32k & dxgkrnl, KMD, DMA Buffer, Scheduling Thread, GPU Scheduler Database. Labels: Process Lock, Wait, steps 1 and 2.]
23 XPDM: Designed for a single application making heavy use of the GPU. The global lock and direct insertion into the GPU ring make sharing the GPU problematic. WDDM: Designed for multiple applications using the GPU simultaneously. Applications don't block on each other. GPU work is scheduled rather than first come, first served.
24 GPU memory used to be allocated on a first-come, first-served basis. The managed pool (D3DPOOL_MANAGED) was introduced to allow an application to over-commit video memory and gracefully handle a lost device. The managed pool is limited to a single application; the resource manager is not cross-process.
25 [Diagram: CPU Physical Address Space and Application Virtual Address Space, with 2 GB and 4 GB marks. Labels: System RAM; Application EXE; Application heap; Application DLL; Surface; Non-local video memory (256 MB), shown twice; Local video memory (256 MB), shown twice.]
26 [Diagram: same layout as the previous slide, with a Surface backing store added next to the Surface.]
27 With WDDM, video memory is now virtualized: it is limited only by the application address space. Virtualization applies to all APIs; over-commitment is allowed on D3D10. Available memory is limited through the old APIs to avoid breaking old assumptions. The managed pool is removed from the newer API. WDDM usually uses less VA space than XPDM; this is not true on some 512-MB adapters with the managed pool.
28 [Diagram: CPU Physical Address Space and Application Virtual Address Space, with 2 GB and 4 GB marks. Labels: System RAM; Application EXE; Application heap; Application DLL; Non-local video memory (256 MB); Local video memory (256 MB); Surface, shown twice; VidMm backing store.]
29 [Diagram: same layout as the previous slide, with a Surface runtime backing store added alongside the Surface VidMm backing store.]
30 Lost Device Improvements: Device lost now occurs only rarely: mode switch (ex: screen saver); unexpected error. But now we have device removed (i.e. physically removed), which is similar but much less frequent. Device removed occurs on: PnP stop (ex: driver upgrade); GPU reset (aka TDR).
31 No GPU virtual addresses, yet; a handle is used to virtualize. DMA buffers are associated with a list of the allocations they reference and their patch locations. The scheduling thread ensures all referenced allocations are visible to the GPU, then patches the DMA buffer with the actual placement locations.
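The patching step described above can be illustrated with a toy model. Everything here is hypothetical and heavily simplified: the struct layouts, `PatchDmaBuffer`, and the idea of a `placement` map from handle to address are assumptions for the sketch, not the WDDM data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using AllocationHandle = std::uint32_t;

// One patch-list entry: which DMA buffer slot refers to which allocation.
struct PatchLocation {
    std::size_t slot;         // index into DmaBuffer::words
    AllocationHandle handle;  // allocation referenced by that slot
    std::uint64_t offset;     // byte offset inside the allocation
};

struct DmaBuffer {
    std::vector<std::uint64_t> words;    // command stream with placeholders
    std::vector<PatchLocation> patches;  // slots that need a real address
};

// Writes the current address of each referenced allocation into the buffer.
// Returns false, leaving the buffer untouched, if any allocation has no
// placement yet or a patch points outside the buffer.
inline bool PatchDmaBuffer(
    DmaBuffer& buffer,
    const std::unordered_map<AllocationHandle, std::uint64_t>& placement) {
    for (const PatchLocation& p : buffer.patches) {
        if (p.slot >= buffer.words.size() ||
            placement.find(p.handle) == placement.end()) {
            return false;
        }
    }
    for (const PatchLocation& p : buffer.patches) {
        buffer.words[p.slot] = placement.at(p.handle) + p.offset;
    }
    return true;
}
```

Because the buffer only stores handles until this step, the memory manager stays free to move allocations right up to the moment the work is scheduled.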
32 [Diagram: Scheduling Thread and GPU Scheduler Database. Steps shown (numbered 1 to 3 in the figure): patch DMA buffer; find a location for each resource; bring resources into memory.]
33 Call to Action! Make performance analysis a habit throughout development. Identify and fix issues early. Test your games on Windows Vista. Move to Direct3D 10 using WDDM for the best graphics performance.
34 Resources: DirectX Developer Center; Game Development MSDN Forums; Xbox 360 Central; XNA Web site. © 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
35 Backup Slides
36 Using Profilers: Common CPU problems to look for: avoidable string and memory copy/comparison operations; improper API usage causing extra work in OS components; expensive operators, type casts, or math functions; high-frequency small functions that should have been inlined. Use the profiler to guide your optimization efforts: if something doesn't appear in the profile, it isn't likely to be worth the effort.
37 Algorithm Selection: Has the largest impact on performance, but is hard to change late in the project. Utilize the GPU for computations where possible rather than relying solely on the CPU. Approximate math routines may be good enough. Quaternions can often be faster than matrices. We often see tests like polygon/line intersections being a major CPU hit, which indicates a poor choice or optimization of the collision/culling algorithm. Over-engineering can cause additional overhead.
38 Dynamic-Link Libraries: Applications often link to many dynamic-link libraries that are loaded when the executable starts. To improve the speed of application startup: use the "Delay Load" linker setting, or alternatively use explicit LoadLibrary() calls where possible to load DLLs. Statically initialize global variables rather than using code in DllMain. Call DisableThreadLibraryCalls() when possible; this avoids the extra notifications sent to the DLL when threads are created or destroyed. Always check that your application isn't linking to DLLs it doesn't actually need.
39 Compiler: The compiler is your best performance tool, but is often overlooked. Newer versions typically have much better optimizers. Crank up the warning level, and understand why the compiler is generating a warning rather than just silencing it. Where available, take advantage of static code analysis tools.
40 Compiler Settings: Explicitly set the CPU target rather than relying on the default. Target P-III or better generation processors, so the compiler can make use of SSE instructions, which are available on all modern CPUs. Specify the type of exception handling used in your code (e.g. C++ vs. structured). Optimize for size, then favor fast code: fitting into the code cache is usually a big win, and these settings still perform aggressive inlining. Intrinsic functions can yield a performance win. Using Whole Program Optimization (WPO) and Profile-Guided Optimization (PGO) on test/release builds can give a significant performance boost. PGO is critical for native 64-bit.
41 C++ Language Features: Run-Time Type Information and dynamic_cast are powerful, but not free; try to design systems that don't require heavy use of this feature. Be aware of when the compiler is creating temporaries, and use language best practices that avoid unnecessary temporaries: watch for naive use of C++ operator overloading, passing parameters by value instead of by reference, etc. References: "Effective C++", Scott Meyers; "More Effective C++", Scott Meyers.
42 STL: Used properly, the STL is a huge productivity boost. Be mindful of the performance and memory behavior of libraries such as the STL. Be aware of STL best practices generally, for games, and for your platform. Be wary about using one container type for all problems. Try to resist the temptation to "roll your own" until you know your need really demands it. Making good use of a template library requires high proficiency with C++. VS 2005 solves the debugging nightmare. References: "C++ Coding Standards", Herb Sutter; "Effective STL", Scott Meyers.
43 Multithreading: Well-designed multithreading is the only way to get a speed-up from multiple cores. Have only 1 heavyweight worker thread per core. Use lightweight threads for lightweight tasks. Avoid using D3DCREATE_MULTITHREADED. Correct use of thread synchronization primitives is key to multithreaded performance; lock-free algorithms are non-trivial to implement. Hold locks for as short a time as possible to avoid unnecessary synchronization costs. References: "Programming Applications for Microsoft Windows", Jeffrey Richter; "Microsoft Windows Internals", Russinovich, Solomon.
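One common way to hold a lock for as short a time as possible is to swap the shared container out under the lock and process the items afterwards. The sketch below assumes a simple queue of `int` work items; `WorkQueue` and its methods are hypothetical names for the example.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// A queue where producers call Push() and one consumer calls Drain().
class WorkQueue {
public:
    void Push(int item) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(item);
    }

    // Takes everything queued so far and processes it. The lock is held only
    // for the swap, never while the (possibly slow) callback runs.
    template <typename Fn>
    std::size_t Drain(Fn&& process) {
        std::vector<int> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (int item : batch) {
            process(item);
        }
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<int> pending_;
};
```

Producers calling `Push()` are blocked only for the duration of a vector swap, regardless of how expensive the per-item work is.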
44 Memory: Per-frame allocation and/or deallocation of memory will kill performance. Use memory pools, fixed-size allocation, and other techniques to manage temporary memory needs. Misaligned memory reads can cause serious performance problems; cache line size plays a big part in this. Tools like VTUNE can help find memory alignment issues.
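A fixed-size pool is one of the techniques named above. The following is a minimal single-threaded sketch with hypothetical names (`FixedPool`, `Allocate`, `Free`); it reserves all storage up front so that per-frame requests never reach the general-purpose heap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hands out equally sized blocks carved from one up-front allocation.
class FixedPool {
public:
    FixedPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(RoundUp(blockSize)), storage_(blockSize_ * blockCount) {
        assert(blockSize > 0);
        freeList_.reserve(blockCount);
        for (std::size_t i = 0; i < blockCount; ++i) {
            freeList_.push_back(storage_.data() + i * blockSize_);
        }
    }

    // Returns a block, or nullptr when the pool is exhausted.
    void* Allocate() {
        if (freeList_.empty()) {
            return nullptr;
        }
        void* block = freeList_.back();
        freeList_.pop_back();
        return block;
    }

    // Returns a block previously obtained from Allocate() to the pool.
    void Free(void* block) {
        assert(Owns(block));
        freeList_.push_back(static_cast<unsigned char*>(block));
    }

    std::size_t FreeBlocks() const { return freeList_.size(); }

private:
    // Round the block size up so every block keeps the default alignment.
    static std::size_t RoundUp(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        return (size + align - 1) / align * align;
    }

    bool Owns(const void* block) const {
        const unsigned char* p = static_cast<const unsigned char*>(block);
        return p >= storage_.data() && p < storage_.data() + storage_.size();
    }

    std::size_t blockSize_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> freeList_;
};
```

Both `Allocate()` and `Free()` are a single vector push or pop, and rounding the block size keeps blocks from straddling alignment boundaries.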
45 Memory Copying: Be watchful for wasteful memory copies. Heavy use of strings can cause a lot of copying, since string classes typically make use of heap memory; hash tables and token lists are cheaper. For STL containers holding large structures, store pointers to the structures instead, since the STL uses copy semantics. Large and complex C++ constructors can often hide copying from the class user.
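The "hash table, token list" idea can be sketched as a small interning table: each distinct string is stored once, and the rest of the code passes around an integer token. `TokenTable` and its methods are hypothetical names used only for this illustration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Maps each distinct string to a small integer token. After interning,
// comparing or copying a name is an integer operation, not a string copy.
class TokenTable {
public:
    using Token = std::uint32_t;

    // Returns the existing token for text, or assigns the next free one.
    Token Intern(const std::string& text) {
        const auto found = tokens_.find(text);
        if (found != tokens_.end()) {
            return found->second;
        }
        const Token token = static_cast<Token>(names_.size());
        names_.push_back(text);
        tokens_.emplace(text, token);
        return token;
    }

    // Looks up the original text, for example for logging or tools.
    const std::string& Name(Token token) const { return names_.at(token); }

private:
    std::unordered_map<std::string, Token> tokens_;
    std::vector<std::string> names_;
};
```

Interning is best done at load time, so that per-frame code only ever compares tokens.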
46 Virtual Memory: Windows titles can take advantage of large physical memories and virtual memory support, which means your title can make use of more precomputation and lookup on Windows. However, be careful about random memory access patterns, as these could result in a lot of paging and/or thrashing. Use of const helps keep your read/write data and read-only data more densely packed and in separate VM pages.
47 Memory Protections: Often memory is allocated, filled once on load, and then used in a read-only mode for a long period of time. Ideally you would mark the memory as read-only after initializing it. This helps both the performance of the VM system and quickly catching errant writing code. It requires using the VirtualAlloc, VirtualProtect, and VirtualFree API calls and a custom memory allocator for some data. Read-only can be faster too.
48 CPU and Audio: Audio mixing is increasingly being handled by the CPU; budget cycles for this purpose. If sample rate conversion ends up being a serious CPU hit, you are doing something wrong. On XP, watch for KMIXER.SYS as a "hot spot" in the profile. Try to stick with the same sample rate format for all your samples and/or streaming channels. Only use lower-quality rates if it truly gives you a perf win; 16-bit 44 kHz is totally acceptable on modern CPUs.