The Server Architecture Behind Killzone Shadow Fall
Jorrit Rouw, Lead Game Tech, Guerrilla Games

Overview
  Hosting Choices
  Server Architecture
  Match Making
  Zero Downtime
  Updating Without Patching
  How Telemetry Helped Fix Issues

Introduction
  Killzone Shadow Fall is a first-person shooter
  24-player, team-based gameplay
  Warzone = customizable game: selection of maps / missions / rules
  Official and community-created Warzones
  Multiple active games in a Warzone
  - Bottom-right screenshot shows 300+ players in one Warzone

Hosting Choices

Hosting Choices - Need for Scaling
  Killzone 3 used traditional hosting
  Have to manage many environments
  Adding or scaling an environment needs planning
  Shared load-testing environment with other titles
  - Example environments: Guerrilla Internal, FPQA, FQA, Demo, Press, Production, Load Test

Hosting Choices - Need for Scaling (2)
  For Killzone Shadow Fall we needed flexibility
  Launch title = no public beta effect
  PlayStation Plus pay wall?
  Used Amazon for easy scaling
  Load testing: 100K simulated clients, 100 real players
  - Some servers are still physically hosted in Sony data centers

Server Architecture

Server Architecture - Overview
  Most components use standard web technology: Java, Tomcat, REST
  Simpler hosting by using Amazon components
  Storage: DynamoDB / S3 / Redshift
  Deployment: CloudFormation / Elastic Beanstalk
  Monitoring: CloudWatch

Server Architecture - Overview (2)
  [Diagram: the game talks HTTPS REST to a load balancer in front of the webapp cluster (matchmaking, leaderboards, warzones, Live Tiles, killzone.com), which stores state in DynamoDB; regional Turbine clusters handle gameplay over UDP, announce themselves to the webapps, and receive "create game" requests; webapps talk to each other via Hazelcast.]
  - Webapps all run the same components
  - Turbine is hosted partially in Sony data centers, partially in AWS
  - Hazelcast is a peer-to-peer communication library

Server Architecture - Gameplay (Turbine)
  [Diagram: for each game (Game 1 ... Game N), Player 1 ... Player N send to Turbine, which broadcasts to the other players in that game.]
  C++ / libRUDP
  Messaging hub between the players in a game
  Game logic runs on the PS4
  Advantages: prevents NAT issues, reduces required bandwidth
  Disadvantages: we pay for bandwidth, introduces an extra hop
  - libRUDP is part of the PS4 SDK and provides reliable UDP communication (a relay sketch in plain UDP follows below)
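The following is a minimal sketch of the "messaging hub" idea, assuming plain UDP via java.nio. The real Turbine is C++ on libRUDP (reliable UDP from the PS4 SDK), so the class name GameRelay, the join-on-first-packet registration, and the lack of any reliability handling here are simplifications for illustration only.

    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    public class GameRelay {
        private final DatagramChannel channel;
        // Endpoints of every player currently in this game.
        private final Set<SocketAddress> players = new CopyOnWriteArraySet<>();

        public GameRelay(int port) throws Exception {
            channel = DatagramChannel.open();
            channel.bind(new InetSocketAddress(port));
        }

        // Receive one datagram and rebroadcast it to every other player in the game.
        // Because all traffic flows through the relay, players never connect to each
        // other directly (no NAT traversal) and a client uploads one copy of its state
        // instead of one per peer -- less client bandwidth, at the cost of an extra hop.
        public void pumpOnce() throws Exception {
            ByteBuffer buf = ByteBuffer.allocate(1400);     // stay under a typical MTU
            SocketAddress sender = channel.receive(buf);    // blocks until a packet arrives
            players.add(sender);                            // naive registration on first packet
            buf.flip();
            for (SocketAddress player : players) {
                if (!player.equals(sender)) {
                    channel.send(buf.duplicate(), player);  // fan the payload out to the rest
                }
            }
        }
    }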

Match Making

Match Making - General Principles
  Linear scaling
  State stored in memory
  Minimal communication between servers
  Fault tolerant
  All servers equal
  Client repeats requests
  Client sends match updates
  - One server is not aware of games hosted on another server
  - Since state is stored in memory, clients need to repeat requests so state can be reconstructed after a server outage

Match Making - Algorithm
  Setup: three webapps behind an Amazon ELB load balancer. Webapp 1 hosts Warzone B, which is small. Warzone A is too big to be hosted by one server, so it is split into two partitions: partition 1 on Webapp 2 and partition 2 on Webapp 3.
  1. The client creates a "Join Warzone A" request with a random ID X.
  2. The load balancer is an Amazon ELB, so the request is forwarded to a random webapp.
  3. The receiving webapp looks up who hosts which warzone in a shared Hazelcast map.
  4. The request is forwarded to the Warzone A partition selected by consistent_hash(X), here Webapp 2, which keeps the pending request in memory. Let's say no match is found the first time, so Webapp 2 returns a "matchmaking in progress" status.
  5. After 5 seconds the client repeats the request. It has the same ID X, so it ends up on Webapp 2 again.
  6. A game is found and the assignment for Game Y is sent back.
  7. The game uses the IP / port returned in the game assignment to connect to Turbine.
  8. One client in the game sends a match update ("Game Y, Warzone A") when a player joins, and periodically every 30 seconds. The list of players in Game Y is kept in memory on the webapp owning partition consistent_hash(Y), which can be a different webapp than the one that handled the join.
  9. There is no cross-partition matching: requests only match with games on the same partition, and the maximum partition size is around 10K players. A request can, however, create a new game on a different node. (A routing sketch follows below.)
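Below is a sketch of that routing step, assuming the partition-owner lookup lives in a shared map (Hazelcast in the talk) and using a simple modulo hash where the talk uses consistent_hash(X). PartitionRouter, handleLocally and forwardTo are hypothetical names, not Guerrilla's actual code.

    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    public class PartitionRouter {
        // Shared between webapps (a Hazelcast map in the talk): warzone -> partition owners.
        private final Map<String, List<String>> partitionOwners;
        private final String selfAddress;

        public PartitionRouter(Map<String, List<String>> partitionOwners, String selfAddress) {
            this.partitionOwners = partitionOwners;
            this.selfAddress = selfAddress;
        }

        // Pick the partition for a request or game ID. Repeated requests reuse the same
        // random ID, so they hash to the same partition and end up on the same webapp
        // even though the ELB in front forwards randomly.
        static int partitionFor(UUID id, int partitionCount) {
            return Math.floorMod(id.hashCode(), partitionCount);
        }

        public String routeJoin(String warzone, UUID requestId) {
            List<String> owners = partitionOwners.get(warzone);
            String owner = owners.get(partitionFor(requestId, owners.size()));
            if (owner.equals(selfAddress)) {
                return handleLocally(warzone, requestId);   // match against in-memory pending requests
            }
            return forwardTo(owner, warzone, requestId);    // one extra hop to the owning webapp
        }

        private String handleLocally(String warzone, UUID requestId) { return "MATCHMAKING_IN_PROGRESS"; }

        private String forwardTo(String owner, String warzone, UUID requestId) { return "FORWARDED_TO_" + owner; }
    }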

Match Making - Finding Game
  Greedy algorithm for finding the best game
  Weighted scoring, in priority order:
    Region, based on ping
    Good / bad ping players
    Biggest games first
    Long-unbalanced games first
    Elo / Glicko skill
    Don't join the game you just left
  - Ping all data centers on application startup
  - Match outside the closest region after a period of time
  - Good ping < 100 ms
  A scoring sketch follows below.
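Here is a sketch of that greedy scoring, assuming a simple weighted sum in the priority order from the slide. The Candidate record, the weights, and the field names are illustrative assumptions, not Guerrilla's actual values.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    public class GameFinder {
        record Candidate(String gameId, String region, int pingMs, int players,
                         long unbalancedForMs, double avgSkill) {}

        static double score(Candidate g, String playerRegion, double playerSkill, String lastGameId) {
            if (g.gameId().equals(lastGameId)) return Double.NEGATIVE_INFINITY; // never the game you just left
            double s = 0;
            if (g.region().equals(playerRegion)) s += 10_000;       // same region dominates everything else
            if (g.pingMs() < 100)               s += 1_000;         // "good ping" threshold from the talk
            s += g.players() * 10;                                   // prefer filling the biggest games
            s += Math.min(g.unbalancedForMs() / 1_000.0, 100);       // long-unbalanced games get priority
            s -= Math.abs(g.avgSkill() - playerSkill);               // smaller skill gap is better
            return s;
        }

        static Optional<Candidate> best(List<Candidate> games, String region, double skill, String lastGameId) {
            return games.stream().max(Comparator.comparingDouble(g -> score(g, region, skill, lastGameId)));
        }
    }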

Match Making - Round End
  Merge small games into large ones
  Put players back into their own region
  Balance team sizes
  Balance skill
  - Take the best 1 or 2 players from one team and exchange them for the worst players on the other team (sketched below)
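A minimal sketch of that swap, assuming each team is a skill-sorted list; the class name and the use of raw skill values are illustrative.

    import java.util.Collections;
    import java.util.List;

    public class TeamBalancer {
        // Both lists are sorted ascending by skill; swap up to `swaps` best-for-worst pairs.
        static void balance(List<Double> strongTeam, List<Double> weakTeam, int swaps) {
            for (int i = 0; i < swaps && !strongTeam.isEmpty() && !weakTeam.isEmpty(); i++) {
                double bestOfStrong = strongTeam.remove(strongTeam.size() - 1); // best player on strong team
                double worstOfWeak  = weakTeam.remove(0);                        // worst player on weak team
                strongTeam.add(worstOfWeak);
                weakTeam.add(bestOfStrong);
            }
            Collections.sort(strongTeam);
            Collections.sort(weakTeam);
        }
    }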

Zero Downtime

Zero Downtime - Overview
  Killzone 3 often had hour-long maintenance
  Zero-downtime deployment / scaling
  Manage our own deployments
  Being on call 24/7 makes us want to fix issues!
  Every service redundant
  So far only 2 outages (< 1 hr) due to our servers failing
  Survived multiple cable pulls

Zero Downtime Deployment - Webapp
  Starting point: the game and Turbine resolve the webapp via DNS lookup and hit the load balancer in front of the production environment, which is backed by DynamoDB.
  Step 1: Use Amazon to create a second server set (the staging environment) behind its own load balancer. The database is shared; the new server is backwards compatible with the old one.
  Step 2: A "Clone" handles each request on the production environment and replicates it to the staging environment, ignoring the result. The clone warms the caches and the staging load balancer. Take care! A sketch of the idea follows below.
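A minimal sketch of such a clone, assuming a servlet filter on the production Tomcat stack. The staging URL, the filter name, and the GET-only replay are illustrative assumptions; a real clone would also replay request bodies and headers.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    public class CloneToStagingFilter implements Filter {
        private static final String STAGING_URL = "https://staging.example.internal"; // hypothetical
        private final HttpClient client = HttpClient.newHttpClient();

        @Override public void init(FilterConfig config) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            chain.doFilter(req, res);                        // production handles the request as usual
            HttpServletRequest http = (HttpServletRequest) req;
            HttpRequest copy = HttpRequest.newBuilder()
                    .uri(URI.create(STAGING_URL + http.getRequestURI()))
                    .GET()                                   // sketch: replays GETs only
                    .build();
            // Fire and forget: the staging response is discarded, it only warms caches.
            client.sendAsync(copy, HttpResponse.BodyHandlers.discarding());
        }
    }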

  Step 3: QA uses a DNS override to point at the staging load balancer and can now test the staging environment while production traffic is unaffected.
  Step 4: Switch the clone to "Forward" mode: requests are no longer processed on the production environment, only on the staging environment.
  Step 5: Swap the DNS record so game and Turbine traffic goes directly to the staging load balancer.
  Step 6: Wait 24 hours. Some ISPs don't respect DNS record expiry, so you need to wait before tearing anything down.
  Step 7: Delete the old production environment; the staging environment is now production.

Zero Downtime Deployment - Turbine
  Starting point: games run on Turbine 1 and Turbine 2, which announce themselves to the load balancer.
  Step 1: Put 50% of the machines (Turbine 1) in maintenance mode. Maintenance mode lets current games keep playing, and new players can still join those games, but no new games land there.
  Step 2: On round rollover (after approximately 20 minutes) each game is moved to a Turbine server that is not in maintenance.
  Step 3: We can now update the drained Turbine servers; Turbine 1 (new) starts announcing itself again.
  Step 4: Put the other 50% (Turbine 2) in maintenance mode and repeat: on round rollover games move to the updated servers, Turbine 2 is updated, and both halves end up announcing the new version.

  Do A/B testing!
  - Update one server first and let it run for a couple of days
  - Check telemetry results and crashes to validate behaviour, because validating this in QA is difficult given the number of testers it would take

Update Without Patching
  Patch release cycle is very long: 1 week for emergency patches, 4+ weeks for normal fixes
  Developed a server-side tweaks system:
    Patching gameplay settings (e.g. weapon stats)
    Adding collision volumes
    Changes are versioned; each game round uses a specific version
    Update to the latest version on round rollover
  Fixed 50+ exploits!
  - A server tweak is a simple key / value pair
  - A special notation patches game settings, e.g. (ClassName)ObjectName.AttributeName = Value (see the parsing sketch below)
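A sketch of parsing that tweak notation. How the engine applies the parsed tweak to a live object is game-specific (the real code runs in C++ on the PS4); the Tweak record, the regex, and the example weapon name are illustrative.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TweakParser {
        record Tweak(String className, String objectName, String attributeName, String value) {}

        // Matches "(ClassName)ObjectName.AttributeName = Value"
        private static final Pattern TWEAK =
                Pattern.compile("\\((\\w+)\\)(\\w+)\\.(\\w+)\\s*=\\s*(.+)");

        static Tweak parse(String line) {
            Matcher m = TWEAK.matcher(line.trim());
            if (!m.matches()) throw new IllegalArgumentException("Not a tweak: " + line);
            return new Tweak(m.group(1), m.group(2), m.group(3), m.group(4).trim());
        }

        public static void main(String[] args) {
            // e.g. tweak a (hypothetical) weapon stat without shipping a patch
            Tweak t = parse("(WeaponStats)AssaultRifle.Damage = 35");
            System.out.println(t);
        }
    }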

How Telemetry Helped Fix Issues

Telemetry Visualization
  [Diagram: the game and the webapps send plain-text events to a collector, which stores them in S3; Hive aggregates them into Redshift, and Tableau visualizes the result.]
  Very powerful! We can drill down to an individual user to investigate complaints.
  - Example: Player Spawn and Player Die events get aggregated by Hive into a Player Life table in Redshift (a sketch of that pairing follows below)
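A small sketch of what that aggregation does. In production it runs as a Hive job over events in S3 and lands in a Redshift table; this Java version, with illustrative event and field names, just pairs each spawn with the next death for the same player.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PlayerLifeAggregator {
        record Event(String playerId, String type, long timestampMs) {}   // type: "SPAWN" or "DIE"
        record Life(String playerId, long spawnMs, long dieMs) {}

        // Events must be sorted by timestamp, as the Hive job would see them per round.
        static List<Life> aggregate(List<Event> events) {
            Map<String, Long> openSpawns = new HashMap<>();
            List<Life> lives = new ArrayList<>();
            for (Event e : events) {
                if (e.type().equals("SPAWN")) {
                    openSpawns.put(e.playerId(), e.timestampMs());
                } else if (e.type().equals("DIE")) {
                    Long spawn = openSpawns.remove(e.playerId());
                    if (spawn != null) lives.add(new Life(e.playerId(), spawn, e.timestampMs()));
                }
            }
            return lives;
        }
    }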

Errors Per Hour Over Time
  [Graph: errors per hour over time, annotated with a PlayStation Plus check issue, PSN authentication issues, and a Turbine fix.]
  - PlayStation Plus check = the check that you have purchased PlayStation Plus and can play online

Errors Per Hour Per Router
  A Technicolor modem triggers errors on round rollover.

Technicolor TG587n
  One UDP socket fails to send to multiple IPs: the first contacted server IP works, but sends from the same socket to a second server IP fail.
  Need to allocate a UDP socket per destination IP (a sketch of the workaround follows below).
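A sketch of that workaround: keep a dedicated socket per destination instead of one socket sending to several server IPs. The real fix lives in the PS4 client (C++); this Java version, with an illustrative class name, only shows the socket-per-IP pattern.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.util.HashMap;
    import java.util.Map;

    public class PerDestinationSockets {
        private final Map<InetSocketAddress, DatagramChannel> channels = new HashMap<>();

        public void send(InetSocketAddress server, ByteBuffer payload) throws Exception {
            DatagramChannel ch = channels.get(server);
            if (ch == null) {
                ch = DatagramChannel.open();
                ch.connect(server);          // dedicated socket per server IP keeps this router happy
                channels.put(server, ch);
            }
            ch.write(payload);               // connected channel: write() always targets this server
        }
    }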

And Then We Released Patch 1.10
  The community complained about 5+ second lag
  We didn't see it and couldn't measure it

Key Metric - Kill Lag
  [Diagram: Player A fires the last bullet, the message travels via Turbine to Player B, who dies.]
  Kill Lag = PingA + PingB + Processing
  Server and clients run at 60 Hz, so processing adds roughly 5 * 0.5 * (1/60 s), about 40 ms (five steps, each waiting half a 60 Hz frame on average)
  Best ping is about 5 ms, so PingA + PingB is about 10 ms
  Best-case Kill Lag is therefore about 50 ms
  We consider a kill laggy when Kill Lag > 500 ms
  Most games (and gamers) look at ping; ping is a bad metric!

Kill Lag Measured
  [Graph: cumulative percentage of kills whose kill lag is below the value on the X axis. Kill lag starts around 50 ms; 95% of kills are under 500 ms, and given the measured pings 98% is the maximum achievable.]
  - For every kill you make we send telemetry with the measured kill lag

Laggy Kills Over Time
  [Graph: percentage of laggy kills over time in US-East, annotated with a US-East gateway failure, high packet loss on the backup gateway, a libRUDP resend bug and its fix, the release of patch 1.10, a switch to Amazon, a switch back, and an A/B test.]
  - The vertical axis is the percentage of laggy kills
  - The graph shows US-East only; other regions were unaffected, which is why we didn't see the problem ourselves

Amazon Isn't Perfect Either
  One server triggers a lot of network errors
  Terminate it, and it's automagically replaced by a working instance
  - The vertical axis shows the error count per server; colour is the region

Average Ping Between Regions
  No region has a next-closest region under 100 ms
  Matchmaking tries to avoid cross-region play!
  Smaller regions / warzones don't have enough players
  Friends may want to play cross-region
  98%+ same-region success in large regions, 80% in the smallest region
  - All pings > 100 ms are grey in the table

Telemetry - Conclusion
  You NEED telemetry to fix your game
  It is difficult to draw conclusions from forums
  The previous examples were very difficult to find
  Consider key metrics; for us it is Kill Lag
  Tie telemetry into monitoring
  - Monitoring = a real-time check that the number of laggy kills stays constant; if it increases, we get called (a sketch of such a check follows below)
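A small sketch of that monitoring check, assuming kill-lag telemetry arrives as per-kill millisecond values for a time window. The 500 ms cut-off comes from the talk; the class name and the alert threshold are illustrative assumptions.

    import java.util.List;

    public class LaggyKillMonitor {
        private static final long LAGGY_THRESHOLD_MS = 500;   // "laggy kill" cut-off from the talk
        private final double alertAbovePercent;               // e.g. 5.0 -- illustrative, not Guerrilla's value

        public LaggyKillMonitor(double alertAbovePercent) {
            this.alertAbovePercent = alertAbovePercent;
        }

        // killLagsMs: measured kill lag (PingA + PingB + processing) for one time window.
        // Returns true if the on-call team should be paged for this window.
        public boolean shouldAlert(List<Long> killLagsMs) {
            if (killLagsMs.isEmpty()) return false;
            long laggy = killLagsMs.stream().filter(lag -> lag > LAGGY_THRESHOLD_MS).count();
            double percent = 100.0 * laggy / killLagsMs.size();
            return percent > alertAbovePercent;
        }
    }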
