RENDERING BATTLEFIELD 4 WITH MANTLE
Johan Andersson, Yuriy O'Donnell
Electronic Arts

DX11 vs Mantle (Core i7-3970x, AMD Radeon R9 290x, 1080p ULTRA):
- DX11: Avg 78 fps, Min 42 fps
- Mantle: Avg 120 fps, Min 94 fps (+58%!)

BF4 MANTLE GOALS
Goals:
- Significantly improve CPU performance
- More consistent & stable performance
- Improve GPU performance where possible
- Add support for a new Mantle rendering backend in a live game
- Minimize changes to engine interfaces
- Compatible with built PC content
- Work on a wide set of hardware, APU to quad-GPU
  - But x64 only (32-bit Windows needs to die)
Non-goals:
- Design a new renderer from scratch for Mantle
- Take advantage of asymmetric MGPU (APU + discrete)
- Optimize video memory consumption

BF4 MANTLE STRATEGIC GOALS
- Prove that low-level graphics APIs work outside of consoles
- Push the industry towards low-level graphics APIs everywhere
- Build a foundation for the future that we can build great games on

AGENDA
- Shaders
- Pipelines
- Memory
- Resources
- Command buffers
- Queues
- Multiple GPUs

SHADERS

SHADER CONVERSION
- DX11 bytecode shaders get converted to AMDIL & a mapping is applied using the ILC tool
  - Done at load time
- Don't have to change our shaders!
- Have full source & control over the process
  - Could write AMDIL directly or use other frontends if wanted

SHADER RESOURCES
- Shader resource bind points are replaced with a resource table object, a descriptor set
  - This is how the hardware accesses the shader resources
- Flat list of images, buffers and samplers used by any of the shader stages
- Vertex shader streams converted to vertex shader buffer loads
- The engine assigns each shader resource to a specific slot in the descriptor set(s)
  - Can share slots between shader stages = smaller descriptor sets
- The mapping takes a while to wrap one's head around

DESCRIPTOR SETS
- Very simple usage in BF4: for each draw call, write a flat list of resources
  - Essentially a direct replacement of SetTexture/SetConstantBuffer/SetInputStream
- Single dynamic descriptor set object per frame
  - Sub-allocate for each draw call and write the list of resources
- ~15000 resource slots written per frame in BF4, still very fast
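A minimal sketch of the per-draw-call sub-allocation described above. This is neither Frostbite code nor the actual Mantle descriptor set API; FrameDescriptorSet, Slot and the bump-allocation scheme are hypothetical stand-ins for reserving a slot range per draw call out of one large dynamic per-frame set:

```cpp
// Sketch (hypothetical types): sub-allocate slots for one draw call out of a
// single large per-frame descriptor set, then write a flat resource list into them.
#include <atomic>
#include <cassert>
#include <cstdint>

enum class SlotType { Image, Buffer, Sampler };

struct Slot { SlotType type; const void* resource; };   // one flat descriptor slot

struct FrameDescriptorSet
{
    Slot*                 slots;     // persistently mapped descriptor memory (assumed)
    uint32_t              capacity;  // total slots available this frame
    std::atomic<uint32_t> next{0};   // bump pointer, safe to use from many threads

    // Reserve 'count' consecutive slots for one draw call; returns the base slot index.
    uint32_t allocate(uint32_t count)
    {
        uint32_t base = next.fetch_add(count, std::memory_order_relaxed);
        assert(base + count <= capacity && "per-frame descriptor set exhausted");
        return base;
    }
};

// Direct replacement for SetTexture/SetConstantBuffer/SetInputStream:
// write the resources one draw call needs into its own slot range.
uint32_t writeDrawCallDescriptors(FrameDescriptorSet& set,
                                  const Slot* resources, uint32_t count)
{
    uint32_t base = set.allocate(count);
    for (uint32_t i = 0; i < count; ++i)
        set.slots[base + i] = resources[i];
    return base; // bound together with the draw, e.g. as a descriptor set offset
}
```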

DESCRIPTOR SETS FUTURE OPTIMIZATIONS
- Use static descriptor sets when possible
- Reduce resource duplication by reusing & sharing more across shader stages
- Nested descriptor sets

PIPELINES

COMPUTE PIPELINES
- 1:1 mapping between pipeline & shader
- No state built into the pipeline
- Can execute in parallel with rendering
- ~100 compute pipelines in BF4

GRAPHICS PIPELINES
- All graphics shader stages combined into a single pipeline object together with important graphics state
- ~10000 graphics pipelines in BF4 on a single level, ~25 MB of video memory
- Could use a smaller working pool of active state objects to keep a reasonable amount in memory
  - Has not been required for us

PRE-BUILDING PIPELINES
- Graphics pipeline creation is an expensive operation, do it at load time instead of runtime!
  - Creating one of our graphics pipelines takes ~10-60 ms each
- Pre-build using N parallel low-priority jobs
- Avoids 99.9% of runtime stalls caused by pipeline creation!
- Requires knowing the graphics pipeline state that will be used with the shaders
  - Primitive type
  - Render target formats
  - Render target write masks
  - Blend modes
- Not fully trivial to know all state, may require engine changes / pre-defining use cases
- Important to design for!

PIPELINE CACHE
- Cache built pipelines both in a memory cache and a disk cache
  - Improved loading times
- Max 300 MB
- Simple LRU policy
- LZ4 compressed
- Database signature:
  - Driver version
  - Vendor ID
  - Device ID
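A sketch of the cache idea above: a signature struct that invalidates the on-disk database when the driver or device changes, and a size-bounded LRU in memory. The types and the exact eviction policy are assumptions for illustration, not the Frostbite implementation; LZ4 compression of the blobs is left out:

```cpp
// Sketch (hypothetical types): pipeline cache keyed by a driver/device signature,
// with a size-bounded LRU holding the built pipeline blobs in memory.
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct CacheSignature          // invalidates the disk cache when any of these change
{
    std::string driverVersion;
    uint32_t    vendorId;
    uint32_t    deviceId;
};

class PipelineCache
{
public:
    explicit PipelineCache(size_t maxBytes) : maxBytes_(maxBytes) {}

    // Returns the cached pipeline blob (decompressed elsewhere) or nullptr on miss.
    const std::vector<uint8_t>* find(uint64_t pipelineHash)
    {
        auto it = map_.find(pipelineHash);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);   // mark most recently used
        return &it->second->blob;
    }

    void insert(uint64_t pipelineHash, std::vector<uint8_t> blob)
    {
        bytes_ += blob.size();
        lru_.push_front({pipelineHash, std::move(blob)});
        map_[pipelineHash] = lru_.begin();
        while (bytes_ > maxBytes_ && !lru_.empty())    // evict least recently used
        {
            bytes_ -= lru_.back().blob.size();
            map_.erase(lru_.back().hash);
            lru_.pop_back();
        }
    }

private:
    struct Entry { uint64_t hash; std::vector<uint8_t> blob; };
    size_t maxBytes_ = 0, bytes_ = 0;
    std::list<Entry> lru_;
    std::unordered_map<uint64_t, std::list<Entry>::iterator> map_;
};
```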

DYNAMIC STATE OBJECTS
- Graphics state is only set with the pipeline object and 5 dynamic state objects
- State objects: color blend, raster, viewport, depth-stencil, MSAA
- No other parameters such as in DX11 with stencil ref or SetViewport functions
- Frostbite use case:
  - Pre-create when possible
  - Otherwise on-demand creation (hash map)
  - Only ~100 state objects!
- Still possible to end up with lots of state objects
  - Esp. with state object float & integer values (depth bounds, depth bias, viewport)
  - But no need to store all permutations in memory, objects are fast to create & the app manages lifetimes
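The on-demand creation path can be pictured as a small hash-map cache. Everything here is a hypothetical stand-in (RasterStateDesc, RasterStateObject, the byte-wise FNV hash), a sketch of the idea rather than Frostbite's state object code:

```cpp
// Sketch (hypothetical types): on-demand creation of a dynamic state object,
// cached in a hash map keyed on the state description.
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct RasterStateDesc   { uint32_t fillMode = 0, cullMode = 0; float depthBias = 0.0f; };
struct RasterStateObject { RasterStateDesc desc; };  // stand-in for the created API object

class RasterStateCache
{
public:
    RasterStateObject* getOrCreate(const RasterStateDesc& desc)
    {
        uint64_t key = hash(desc);
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(key);
        if (it != cache_.end())
            return it->second;
        // State objects are cheap to create, so creating on demand is acceptable.
        RasterStateObject* obj = new RasterStateObject{desc};
        cache_.emplace(key, obj);
        return obj;
    }

private:
    static uint64_t hash(const RasterStateDesc& d)   // trivial FNV-1a over the bytes
    {
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&d);
        uint64_t h = 1469598103934665603ull;
        for (size_t i = 0; i < sizeof(d); ++i) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    std::mutex mutex_;
    std::unordered_map<uint64_t, RasterStateObject*> cache_;
};
```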

MEMORY

MEMORY MANAGEMENT
- Mantle devices expose multiple memory heaps with different characteristics
- Can differ between devices, drivers and OSes

Type    Size       Page   CPU access                                               GPU Read  GPU Write  CPU Read  CPU Write
Local   256 MB     65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined   130       170        0.0058    2.8
Local   4096 MB    65535  -                                                        130       180        0         0
Remote  16106 MB   65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined   2.6       2.6        0.1       3.3
Remote  16106 MB   65535  CpuVisible|CpuGpuCoherent                                2.6       2.6        3.2       2.9
(read/write columns are throughput in GB/s)

- The user explicitly places resources in the wanted heaps
- The driver suggests preferred heaps when creating objects, but it is not a requirement

FROSTBITE MEMORY HEAPS

- System Shared Mapped: CPU memory that is GPU visible. Write combined & persistently mapped = easy & fast to write to in parallel at any time
- System Shared Pinned: CPU cached, for readback. Not used much
- Video Shared: GPU memory accessible by the CPU. Used for descriptor sets and dynamic buffers
  - Max 256 MB (legacy constraint)
  - Avoid keeping it persistently mapped, as VidMM doesn't like this and can decide to move it back to CPU memory
- Video Private: GPU private memory. Used for render targets, textures and other resources the CPU does not need to access

MEMORY REFERENCES
- WDDM needs to know which memory allocations are referenced for each command buffer
  - In order to make sure they are resident and not paged out
- Max ~1700 memory references are supported
- Overhead with having lots of references
- The engine needs to keep track of what memory is referenced while building the command buffers
  - Easy & fast to do

- Each reference is either read-only or read/write
- We use a simple global list of references shared by all command buffers

MEMORY POOLING
- Pooling memory allocations was required for us
- Sub-allocate within larger 1-32 MB chunks
- All resources store memory handle + offset
  - Not as elegant as just a void* on consoles
- Fragmentation can be a concern, but has not been much of an issue for us in practice
- GPU virtual memory mapping is fully supported, can simplify & optimize management
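A sketch of the pooling scheme above: resources live as (memory handle, offset) inside larger chunks. GpuMemHandle and the simple first-fit bump allocation are illustrative assumptions; freeing and defragmentation are intentionally left out:

```cpp
// Sketch (hypothetical types): sub-allocating resources out of larger pooled chunks
// so each resource is addressed as (memory handle, offset) rather than a raw pointer.
#include <cstdint>
#include <vector>

using GpuMemHandle = uint64_t;                 // stand-in for a GPU memory object handle

struct SubAllocation { GpuMemHandle memory; uint64_t offset; uint64_t size; };

class MemoryPool
{
public:
    explicit MemoryPool(uint64_t chunkSize = 32ull << 20) : chunkSize_(chunkSize) {}

    // Assumes size <= chunkSize; freeing/compaction omitted for brevity.
    SubAllocation allocate(uint64_t size, uint64_t alignment)
    {
        for (Chunk& c : chunks_)                         // simple first-fit over chunks
        {
            uint64_t offset = alignUp(c.used, alignment);
            if (offset + size <= c.size)
            {
                c.used = offset + size;
                return { c.memory, offset, size };
            }
        }
        chunks_.push_back({ allocateDeviceMemory(chunkSize_), size, chunkSize_ }); // grow pool
        return { chunks_.back().memory, 0, size };
    }

private:
    struct Chunk { GpuMemHandle memory; uint64_t used; uint64_t size; };

    static uint64_t alignUp(uint64_t v, uint64_t a) { return (v + a - 1) & ~(a - 1); }
    static GpuMemHandle allocateDeviceMemory(uint64_t /*bytes*/)
    {
        static GpuMemHandle next = 1;                    // placeholder for a real GPU allocation
        return next++;
    }

    uint64_t chunkSize_;
    std::vector<Chunk> chunks_;
};
```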

OVERCOMMITTING VIDEO MEMORY
- Avoid overcommitting video memory!
  - Will lead to severe stalls as VidMM moves blocks and shuffles memory back and forth
  - VidMM is a black box
  - One of the biggest issues we ran into during development
- Recommendations:
  - Balance memory pools
  - Make sure to use read-only memory references
  - Use memory priorities

MEMORY PRIORITIES
- Setting priorities on the memory allocations helps VidMM choose what to page out when it has to
- 5 priority levels:
  - Very high = render targets with MSAA
  - High = render targets and UAVs
  - Normal = textures
  - Low = shader & constant buffers
  - Very low = vertex & index buffers
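The mapping above is simple enough to write down directly. The enum names below are hypothetical (they are not the Mantle priority enum), a sketch of routing resource categories to the five levels listed on the slide:

```cpp
// Sketch (hypothetical enums): map resource categories to the five allocation
// priority levels described above.
enum class MemoryPriority { VeryLow, Low, Normal, High, VeryHigh };

enum class ResourceKind { RenderTargetMsaa, RenderTarget, Uav, Texture,
                          ShaderCode, ConstantBuffer, VertexBuffer, IndexBuffer };

MemoryPriority priorityFor(ResourceKind kind)
{
    switch (kind)
    {
    case ResourceKind::RenderTargetMsaa:               return MemoryPriority::VeryHigh;
    case ResourceKind::RenderTarget:
    case ResourceKind::Uav:                            return MemoryPriority::High;
    case ResourceKind::Texture:                        return MemoryPriority::Normal;
    case ResourceKind::ShaderCode:
    case ResourceKind::ConstantBuffer:                 return MemoryPriority::Low;
    case ResourceKind::VertexBuffer:
    case ResourceKind::IndexBuffer:                    return MemoryPriority::VeryLow;
    }
    return MemoryPriority::Normal;
}
```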

MEMORY RESIDENCY FUTURE
- For best results, manage which resources are in video memory yourself & keep only ~80% used
  - Avoid all stalls
  - Can async DMA in and out
- We are thinking of redesigning to fully avoid the possibility of overcommitting
- Hoping WDDM's memory residency management can be simplified & improved in the future

RESOURCE MANAGEMENT

RESOURCE LIFETIMES
- The app manages the lifetime of all resources
  - Have to make sure the GPU is not using an object or memory while we are freeing it on the CPU
  - How we've always worked with GPUs on the consoles
  - Multi-GPU adds some additional complexity that consoles do not have
- We keep track of lifetimes at a per-frame granularity
  - Queues for object destruction & free-memory operations
  - Add to a queue at any time on the CPU
  - Process the queues when the GPU command buffers for the frame are done executing
  - Tracked with command buffer fences
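A sketch of the deferred-destruction queue just described: destruction requests are enqueued from the CPU and only executed once the fence for the frame that last used the object has signaled. FrameFence and the callback-based API are assumptions for illustration:

```cpp
// Sketch (hypothetical types): defer object destruction until the GPU fence for the
// frame that last used the object has signaled.
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>

struct FrameFence { bool signaled() const { return done; } bool done = false; }; // stand-in

class DeferredDeleteQueue
{
public:
    // Called from any CPU thread: the object is destroyed later, not immediately.
    void enqueue(uint64_t frameIndex, std::function<void()> destroy)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back({ frameIndex, std::move(destroy) });
    }

    // Called once per frame: run destructors for frames whose command buffers finished.
    void collect(const std::function<const FrameFence&(uint64_t)>& fenceForFrame)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!pending_.empty() && fenceForFrame(pending_.front().frame).signaled())
        {
            pending_.front().destroy();
            pending_.pop_front();
        }
    }

private:
    struct Entry { uint64_t frame; std::function<void()> destroy; };
    std::mutex mutex_;
    std::deque<Entry> pending_;
};
```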

LINEAR FRAME ALLOCATOR
- We use multiple linear allocators with Mantle for both transient buffers & images
- Used for a huge amount of small constant data and other GPU frame data that the CPU writes
- Easy to use and very low overhead
  - Don't have to care about lifetimes or state
- Fixed memory buffers for each frame
- Super cheap sub-allocation from any thread
- If full, use heap allocation (also fast due to pooling)
- Alternative: ring buffers
  - Requires being able to stall & drain the pipeline at any allocation if full, additional complexity for us
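A minimal sketch of such a per-frame linear allocator: a lock-free bump pointer over a persistently mapped buffer, with the heap fallback stubbed out. The types and the over-reserve alignment trick are assumptions, not the Frostbite implementation:

```cpp
// Sketch (hypothetical types): per-frame linear allocator over a persistently mapped
// buffer, with lock-free bump allocation from any thread and a heap fallback.
#include <atomic>
#include <cstdint>

struct FrameAllocation { void* cpuPtr; uint64_t gpuOffset; };

class LinearFrameAllocator
{
public:
    LinearFrameAllocator(void* mappedBase, uint64_t capacity)
        : base_(static_cast<uint8_t*>(mappedBase)), capacity_(capacity) {}

    // Returns space valid for the current frame only; no per-allocation lifetime tracking.
    FrameAllocation allocate(uint64_t size, uint64_t alignment)
    {
        // Over-reserve by 'alignment' so the aligned region always fits in what we claimed.
        uint64_t offset  = cursor_.fetch_add(size + alignment, std::memory_order_relaxed);
        uint64_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + size > capacity_)
            return allocateFromPooledHeap(size, alignment);   // fallback path (not shown)
        return { base_ + aligned, aligned };
    }

    // Reset once the frame's GPU work is known to be finished (fence-tracked).
    void reset() { cursor_.store(0, std::memory_order_relaxed); }

private:
    FrameAllocation allocateFromPooledHeap(uint64_t, uint64_t) { return { nullptr, 0 }; } // placeholder

    uint8_t*              base_;
    uint64_t              capacity_;
    std::atomic<uint64_t> cursor_{0};
};
```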

TILING
- Textures should be tiled for performance
  - Explicitly handled in Mantle, the user selects linear or tiled
  - Some formats (BC) can't be accessed as linear by the GPU
- On consoles we handle tiling offline as part of our data processing pipeline
  - We know the exact tiling formats and have separate resources per platform
- For Mantle:
  - Tiling formats are opaque, can be different between GPU architectures and image types
  - Tile textures with a DMA image upload from SystemShared to VideoPrivate
    - Linear source, tiled destination
    - Free

COMMAND BUFFERS

COMMAND BUFFERS
- Command buffers are the atomic unit of work dispatched to the GPU
- Separate creation from execution

- No immediate context a la DX11 that can execute work at any call
  - Makes resource synchronization and setup significantly easier & faster
- Typical BF4 scenes have around ~50 command buffers per frame
  - Reasonable tradeoff for us between submission overhead and CPU load-balancing

COMMAND BUFFER SOURCES
- Frostbite has 2 separate sources of command buffers
- World rendering
  - Rendering the world with tons of objects, lots of draw calls. Have all frame data up front
  - All resources except for render targets are read-only
  - No resource state transitions
  - Generated in parallel up front each frame
- Immediate rendering (the rest)
  - Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
  - Managing resource state, memory and running on different queues (graphics, compute, DMA)
  - Sequentially generated in a single job; simulate an immediate context by splitting the command buffer
- Both are very important and have different requirements

RESOURCE TRANSITIONS
- Key design in Mantle to significantly lower driver overhead & complexity
- Explicit hazard tracking by the app/engine
- Drives architecture-specific caches & compression

  - AMD: FMASK, CMASK, HTILE
- Enables explicit memory management
- Examples: optimal render target writes, graphics shader read-only, compute shader write-only, DrawIndirect arguments
- Mantle has a strong validation layer that tracks transitions, which is a major help

MANAGING RESOURCE TRANSITIONS
- Engines need a clear design on how to handle state transitions
- Multiple approaches possible:
  - Sequential in-order command buffers
    - Generate one command buffer at a time, in order
    - Transition resources on demand when doing an operation on them, very simple
    - Recommendation: start with this (see the sketch after this slide)
  - Out-of-order multiple command buffers
    - Track state per command buffer, fix up transitions when the order of command buffers is known
  - Hybrid approaches & more
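A sketch of the simple in-order approach recommended above: keep one current state per resource and record a transition only when the required state differs. The state enum and CommandBufferRecorder are hypothetical stand-ins, not the actual Mantle state values or commands:

```cpp
// Sketch (hypothetical types/states): keep one current state per resource and
// record a transition only when an operation needs a different state.
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class ResourceState { Undefined, RenderTarget, GraphicsShaderRead,
                           ComputeShaderWrite, IndirectArgs, CopyDest };

struct Transition { uint64_t resource; ResourceState before, after; };

struct CommandBufferRecorder            // stand-in for a real command buffer
{
    std::vector<Transition> recordedTransitions;
    void cmdTransition(const Transition& t) { recordedTransitions.push_back(t); }
};

class ResourceStateTracker
{
public:
    // Call before any operation that needs 'resource' in the 'required' state.
    void require(CommandBufferRecorder& cb, uint64_t resource, ResourceState required)
    {
        ResourceState& current = states_[resource];      // new entries start as Undefined
        if (current != required)
        {
            cb.cmdTransition({ resource, current, required });
            current = required;
        }
    }

private:
    std::unordered_map<uint64_t, ResourceState> states_; // one state per resource, not subresource
};
```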

MANAGING RESOURCE TRANSITIONS IN FROSTBITE
- Current approach in Frostbite is quite basic:
  - We keep track of a single state for each resource (not subresource)
  - The immediate rendering transitions resources as needed depending on the operation
  - The out-of-order world rendering command buffers don't need to transition states
    - They already have write access to RTs and read access to all resources, set up outside them
    - Avoids the problem of them not knowing the state during generation
- Works now, but as we do more general parallel rendering it will have to change
  - Track resource state for each command buffer & fix up between command buffers

QUEUES

QUEUES
- The universal queue can do graphics, compute and presents
- We also use additional queues to parallelize GPU operations:
  - DMA queue - improve perf with faster transfers & avoid idling graphics while transferring
  - Compute queue - improve perf by utilizing idle ALU and update resources simultaneously with gfx
- More GPUs = more queues!

QUEUES SYNCHRONIZATION
- Order of execution within a queue is sequential
- Synchronize multiple queues with GPU semaphores (signal & wait)
- Also works across multiple GPUs
[Diagram: compute and graphics queues synchronized with semaphore signal (S) and wait (W) operations]

QUEUES SYNCHRONIZATION CONT
- Started out with explicit semaphores
  - Error prone to handle when having lots of different semaphores & queues
  - Difficult to visualize & debug
- Switched to a representation more similar to a job graph
  - Just a model on top of the semaphores

GPU JOB GRAPH
- Each GPU job has a list of dependencies (other command buffers)
- Dependencies have to finish before a job can run on its queue
- The dependencies can be from any queue
[Diagram: DMA, compute and two graphics queues connected by job dependencies]
- Was easier to work with, debug and visualize
- Really extensible going forward
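A sketch of lowering such a job graph to semaphores: each job signals one semaphore when its command buffer completes, and waits on the semaphores of its dependencies before executing. QueueOps, GpuSemaphore and the submission order are illustrative assumptions, not the Mantle queue API:

```cpp
// Sketch (hypothetical types): a GPU job graph where each job lists the jobs it
// depends on, lowered to one semaphore signal per job plus waits on its dependencies.
#include <cstddef>
#include <cstdint>
#include <vector>

enum class QueueType { Graphics, Compute, Dma };

struct GpuSemaphore { uint64_t id; };          // stand-in for a queue semaphore object

struct GpuJob
{
    QueueType             queue;
    std::vector<size_t>   dependencies;        // indices of jobs that must finish first
    GpuSemaphore          doneSemaphore;       // signaled when this job's command buffer completes
};

struct QueueOps                                // stand-in for queue submission calls
{
    void wait(QueueType, GpuSemaphore) {}
    void submitCommandBuffer(QueueType, size_t /*jobIndex*/) {}
    void signal(QueueType, GpuSemaphore) {}
};

// Submit jobs in a topological order; cross-queue ordering falls out of the semaphores.
void submitJobGraph(QueueOps& ops, const std::vector<GpuJob>& jobs)
{
    for (size_t i = 0; i < jobs.size(); ++i)
    {
        const GpuJob& job = jobs[i];
        for (size_t dep : job.dependencies)
            ops.wait(job.queue, jobs[dep].doneSemaphore);   // wait on every dependency
        ops.submitCommandBuffer(job.queue, i);
        ops.signal(job.queue, job.doneSemaphore);           // let dependents proceed
    }
}
```

Same-queue dependencies are already ordered by the queue itself, so a real implementation could skip those waits; keeping them makes the sketch simpler.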

ASYNC DMA
- AMD GPUs have dedicated hardware DMA engines, let's use them!
- Uploading through DMA is faster than on the universal queue, even if blocking
- DMA has alignment restrictions, have to support falling back to copies on the universal queue
- Use case: frame buffer & texture uploads
  - Used by resource initial data uploads and our UpdateSubresource
  - Guaranteed to be finished before the GPU universal queue starts rendering the frame
- Use case: multi-GPU frame buffer copy
  - Peer-to-peer copy of the frame buffer to the GPU that will present it

ASYNC COMPUTE
- Frostbite has lots of compute shader passes that could run in parallel with graphics work
  - HBAO, blurring, classification, tile-based lighting, etc.
- Running as async compute can improve GPU performance by utilizing free ALU
  - For example while doing shadowmap rendering (ROP bound)

ASYNC COMPUTE TILE-BASED LIGHTING

- 3 sequential compute shaders
  - Input: zbuffer & gbuffer
  - Output: HDR texture/UAV
- Runs in parallel with the graphics pipeline that renders to other targets
[Diagram: graphics queue (gbuffer, shadowmaps, reflection, distort, transparency) overlapping the compute queue (tile Z, cull lights, lighting), synchronized with semaphore signals and waits]
- We manually prepare the resources for the async compute
  - Important to not access the resources on other queues at the same time (unless in a read-only state)
  - Have to transition resources on the queue that last used them
- Up to 80% faster in our initial tests, but not fully reliable
  - But it is a pretty small part of the frame time
- Not in BF4 yet
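A sketch of the submission pattern in the diagram above: the compute queue waits for the gbuffer, runs the three lighting shaders while the graphics queue does shadowmaps and reflections, and the graphics queue waits for the lighting result before distortion and transparency. Queue, Semaphore and the named command buffers are hypothetical stand-ins:

```cpp
// Sketch (hypothetical types): overlap the tiled-lighting compute passes with
// graphics work, synchronized with two semaphores as in the diagram above.
struct Semaphore {};
struct Queue
{
    void signal(Semaphore&) {}
    void wait(Semaphore&)  {}
    void submit(const char* /*commandBufferName*/) {}
};

void submitFrame(Queue& graphics, Queue& compute)
{
    Semaphore gbufferReady, lightingDone;

    // Graphics queue: produce the inputs, then keep rendering to other targets.
    graphics.submit("gbuffer");
    graphics.signal(gbufferReady);          // zbuffer & gbuffer now readable by compute
    graphics.submit("shadowmaps");          // ROP-bound work that overlaps the compute below
    graphics.submit("reflection");

    // Compute queue: the 3 sequential lighting shaders, running alongside graphics.
    compute.wait(gbufferReady);
    compute.submit("tileZ");
    compute.submit("cull_lights");
    compute.submit("lighting");             // writes the HDR texture/UAV
    compute.signal(lightingDone);

    // Graphics queue: passes that consume the lighting output must wait for it.
    graphics.wait(lightingDone);
    graphics.submit("distort");
    graphics.submit("transparency");
}
```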

MULTI-GPU

MULTI-GPU
- Multi-GPU alternatives:
  - AFR - Alternate Frame Rendering (1-4 GPUs of the same power)
  - Heterogeneous AFR - 1 small + 1 big GPU (APU + discrete)
  - SFR - Split Frame Rendering
  - Multi-GPU job graph - primary strong GPU + slave GPUs helping
- Frostbite supports AFR natively
  - No synchronization points within the frame
  - For resources that are not rendered every frame: re-render the resources for each GPU
    - Example: sky envmap update on weather change
- With Mantle multi-GPU is explicit and we have to build support for it ourselves

MULTI-GPU AFR WITH MANTLE
- All resources explicitly duplicated on each GPU with async DMA
  - Hidden internally in our rendering abstraction
- Every frame, alternate which GPU we build command buffers for and use resources from
- Our UpdateSubresource has to make sure it updates resources on all GPUs
- Presenting the screen has to, in some modes, copy the frame buffer to the GPU that owns the display
- Bonus: can simulate multi-GPU mode even with a single GPU to debug AFR issues!
- Multi-GPU works in windowed mode!
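A sketch of how the per-GPU duplication can be hidden behind the rendering abstraction: each logical resource owns one copy per GPU, the frame index picks which copy the current frame's command buffers use, and updates must touch every copy. The types are hypothetical, not Frostbite's abstraction:

```cpp
// Sketch (hypothetical types): AFR-style selection of the per-GPU copy of a
// resource based on which GPU renders the current frame.
#include <cstddef>
#include <cstdint>
#include <vector>

struct GpuResource { /* per-GPU copy of a texture, buffer, etc. */ };

struct AfrResource
{
    std::vector<GpuResource> perGpu;     // one copy per GPU, uploaded via async DMA

    // Alternate copies frame by frame; the same indexing can be run against a single
    // physical GPU to debug AFR issues without a second card.
    GpuResource& forFrame(uint64_t frameIndex)
    {
        return perGpu[frameIndex % perGpu.size()];   // assumes at least one copy exists
    }
};

// UpdateSubresource-style updates must touch every GPU's copy, not just the current one.
void updateAllCopies(AfrResource& res, const void* data, size_t size)
{
    for (GpuResource& copy : res.perGpu)
    {
        // Queue an async DMA upload of 'data' (size bytes) into this GPU's copy (not shown).
        (void)copy; (void)data; (void)size;
    }
}
```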

MULTI-GPU ISSUES
- GPUs independently rendering & presenting to the screen can cause micro-stuttering
- Frames are not presented at regular intervals
- Frame rate can be high, but presentation & gameplay are not smooth
- FCAT is a good tool to analyse this
[Diagrams: GPU0 and GPU1 frame timelines contrasting an irregular presentation interval with the ideal presentation interval]
- We need to introduce dependency & dampening between the GPUs to alleviate this - frame pacing

FRAME PACING
- Measure the average frame rate on each GPU
  - Short history (10-30 frames)
  - Filter out spikes
- Insert a delay on the GPU before each present
  - Forces the frame times to become more regular and the GPUs to align
  - The delay value is based on the calculated avg frame rate
[Diagram: GPU0 and GPU1 timelines with a delay inserted before a present to regularize the presentation intervals]
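A sketch of the pacing heuristic above: keep a short, spike-filtered history of frame times and derive the delay to insert before a present. The history length, the min/max filtering and the delay formula are illustrative assumptions, not the shipped BF4 logic:

```cpp
// Sketch (hypothetical logic): derive a per-GPU present delay from a short,
// spike-filtered history of frame times.
#include <algorithm>
#include <cstddef>
#include <deque>

class FramePacer
{
public:
    // Record the latest frame time (seconds) for this GPU; keep a short history.
    void addFrameTime(double seconds)
    {
        history_.push_back(seconds);
        if (history_.size() > kHistory)
            history_.pop_front();
    }

    // Average frame time with the extremes dropped, so single spikes don't skew pacing.
    double filteredAverage() const
    {
        if (history_.size() < 3)
            return history_.empty() ? 0.0 : history_.back();
        std::deque<double> sorted(history_);
        std::sort(sorted.begin(), sorted.end());
        double sum = 0.0;
        for (size_t i = 1; i + 1 < sorted.size(); ++i)   // drop min and max
            sum += sorted[i];
        return sum / double(sorted.size() - 2);
    }

    // Delay to insert before this GPU's present so presents land roughly evenly apart.
    double delayBeforePresent(double timeSinceOtherGpuPresent, int gpuCount) const
    {
        double target = filteredAverage() / double(gpuCount);   // ideal spacing between GPUs
        return std::max(0.0, target - timeSinceOtherGpuPresent);
    }

private:
    static constexpr size_t kHistory = 20;   // 10-30 frames per the slide
    std::deque<double> history_;
};
```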

CONCLUSION

MANTLE DEV RECOMMENDATIONS
- The validation layer is a critical friend!
- You'll end up with a lot of object & memory management code; try to share it with console code
- Make sure you have control over memory usage and can avoid overcommitting video memory
- Build a robust solution for resource state management early
- Figure out how to pre-create your graphics pipelines, can require engine design changes
- Build for multi-GPU support from the start, easier than retrofitting it

FUTURE
- Second wave of Frostbite Mantle titles
- Adapt the Frostbite core rendering layer based on learnings from Mantle
  - Refine binding & buffer updates to further reduce overhead
  - Virtual memory management
  - More async compute & async DMAs
  - Multi-GPU job graph R&D
- Linux
  - Would like to see how our Mantle renderer behaves with a different memory management & driver model

QUESTIONS?
Email: [email protected]
Web: http://frostbite.com

Twitter: @repi

MANTLE SQUAD
Frostbite: Johan Andersson, Jasper Bekkers, Yuriy O'Donnell, Arne Schober, Graham Wihlidal
AMD: Brian Bennett, Michael Grossfeld, Guennadi Riguer
