Data-Oriented Design

Data-Oriented Design (DOD) is a design philosophy that focuses on the data being processed, rather than the code that processes it. This is in contrast to Object-Oriented Design (OOD), which organizes programs around the code that processes the data.

DOD is a useful design philosophy for game development and embedded systems, as they are often concerned with processing large amounts of data. DOD can help to improve performance, reduce memory usage, and make code easier to maintain.

Data-Oriented Design Principles

As a general rule, the purpose of all programs (and all parts of those programs) is to transform data from one form to another. This is true whether the program is a game, a web server, a database, or anything else. The data is the most important part of the program, and the code is just a means to this end.

  1. If you don’t understand the data you are transforming, you don’t understand the problem you are solving and you can’t write any amount of code to solve your problems effectively.
  2. If you do understand the data, then you understand your problems and you can write code to solve them effectively.
  3. If the problem changes, then the data changes, and the code must change to reflect this. The same holds in the other directions: a change to the code or to the data implies a change to the others. Simply put, different problems require different solutions.
  4. If you don’t understand the cost of solving a problem, then you don’t understand the problem you are solving.
  5. If you don’t understand the platform (hardware) you can’t reason about the cost of solving the problem. Therefore, you don’t understand the problem. The set of platforms is finite (real or imagined) and so their characteristics must be understood.
  6. Everything is a data problem. If you can’t reason about the problem in terms of data, then you don’t understand the problem. If you can’t reason about the solution in terms of data, then you don’t understand the solution. Usability, maintainability, and performance are all data problems.
  7. Solving problems you probably don’t have (or likely will never have) creates more problems you definitely will have. You should know what problems you have because you should have analyzed and understood the data. (YAGNI)
  8. Latency and throughput are only the same in sequential systems.
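Point 8 can be made concrete with a quick calculation (a Python sketch; the stage count and timings are invented numbers): in a pipelined system, per-item latency stays the same while throughput improves, so the two only coincide when the work is strictly sequential.

```python
# Latency vs. throughput in a hypothetical 3-stage pipeline.
# Each stage takes 1 ms; in the pipelined case, stages overlap across items.
stage_time_ms = 1.0
stages = 3
items = 1000

# Sequential: one item at a time, no overlap between stages.
seq_latency_ms = stages * stage_time_ms               # 3 ms per item
seq_total_ms = items * seq_latency_ms                 # 3000 ms overall
seq_throughput = items / seq_total_ms                 # items per ms

# Pipelined: once the pipeline is full, one item completes per stage time.
pipe_latency_ms = stages * stage_time_ms              # still 3 ms per item
pipe_total_ms = (stages + items - 1) * stage_time_ms  # 1002 ms overall
pipe_throughput = items / pipe_total_ms               # items per ms

print(seq_latency_ms, pipe_latency_ms)  # 3.0 3.0  (latency unchanged)
print(round(seq_throughput, 3), round(pipe_throughput, 3))  # 0.333 0.998
```

Same latency per item, roughly triple the throughput: the two measures diverge as soon as work overlaps.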

We can clearly see a relationship between the problem, the data, and the solution (code). If the problem changes, then the data and code must change too. If the data changes, then we have a different problem requiring a different solution. If the solution changes or is difficult to derive, then some part of the data or problem likely needs to change, because we can’t reason about the problem effectively enough to come to a consensus on its solution.

Rules of thumb:

  • Where there is one, there are many. Evaluate your data on the axes of time and space and address the most common, real cases as first priority. Don’t waste time on imagined problems you don’t have.
  • The more context you have, the better you can design the solution. Don’t throw away data you know you need.

    Where is the data going to be used? How is it going to be used? What is the cost of using it, and of not using it? What is the cost of getting it, of storing it, or of not having it at all?

  • Non-uniform memory access (NUMA) is a real problem. It extends to I/O and pre-built data all the way back to the data source. Understand the cost of memory access and the cost of cache misses. Understand the cost of memory allocation and deallocation. Understand the cost of memory fragmentation. This is a problem for all systems, not just resource-constrained systems. Cloud systems are not immune to this problem.
  • Think of your system holistically, from the first interaction of an agent (a user or another system) to the last interaction of that agent, and from the first artifact to its retirement. What does the data access look like throughout the pipeline over time?
  • Reason must prevail, if what is being done isn’t a sound, reasoned judgment then it has to be questioned.
  • The compiler is a tool, not a magic wand. It can only reason about 1-10% of your performance and cannot solve the significant problems you will face.
  • Ignoring facts that are inconvenient is not engineering, it is dogma.
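A back-of-the-envelope illustration of the memory-access costs mentioned above (a Python sketch; the 64-byte cache line is typical of current hardware, and the record layout is hypothetical):

```python
# Suppose a loop reads one 4-byte field from each record, but the records
# are stored as 64-byte structs (array-of-structs layout).
cache_line = 64       # bytes fetched per miss on typical hardware
record_size = 64      # bytes per record in the AoS layout
field_size = 4        # bytes the loop actually uses per record
records = 1_000_000

# AoS: every record begins a fresh cache line, so each read fetches a
# full line and uses only 4 of its 64 bytes.
aos_bytes_fetched = records * cache_line   # ~64 MB of memory traffic

# SoA: the hot field is packed contiguously, 16 fields per cache line,
# so only the bytes actually needed are moved.
soa_bytes_fetched = records * field_size   # ~4 MB of memory traffic

print(aos_bytes_fetched // soa_bytes_fetched)  # 16
```

Sixteen times the memory traffic for the same loop, from layout alone; no compiler optimization recovers that.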

The Three Big Lies

  1. Software is a platform in and of itself

    Hardware / infrastructure is the platform. If you have different hardware, you need different solutions. Writing software for a specific processor requires a different solution than writing software for a different processor. The same is true for different memory architectures, different network architectures, cloud services, etc. Differing platforms have differing constraints, differing costs, differing limitations, and, most importantly, differing solutions.

    You cannot create a solution that is independent of the platform or set of platforms. Reality is not the limitation you’re forced to deal with to solve an abstract, theoretical problem. Reality is the problem you’re solving.

  2. Code should be designed around a model of reality (the world)

    Hiding data is implicit in modeling the world. This is bad because it confounds two things:

    • Maintenance (changes to data and data access)
    • Understanding properties of data (critical for solving problems)

    In attempting to hide the data in order to make it more maintainable, we have made it less understandable.

    Modeling data in this way implies some relationship to real data or transforming processes, but in real life objects are fundamentally similar (a chair is a chair). In terms of data transformations, objects are only superficially similar. A ‘WoodenChair’ and a ‘StaticChair’ may derive from ‘Chair’, but they are not the same thing. They have different properties and different transformations and are treated differently.

    This type of data modeling is not useful for solving problems. It leads to monolithic, unrelated data structures and transforms.

    You can’t make a problem simpler than it is. Attempting to model the world is equivalent to attempting to solve a problem through storytelling or analogy rather than by engineering.

  3. Code is more important than data

    Code is a minor issue in the grand scheme of things. It is a means to an end. The data is the most important part of the program. The data is the thing that we are reasoning about to solve the problem. Code exists to transform data from one form to another.

    The job of a programmer is NOT to write code but to solve (data transformation) problems. To that end, we should only write code that has direct, provable value in solving the problem. We do this by transforming data in a meaningful way. There is no ideal, abstract solution to problems. You can’t “future proof” your code.
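The chair example under lie 2 can be sketched in data terms (a Python sketch; the field names and transforms are invented for illustration). A burnable chair and a static scenery chair share a noun, not data or transforms, so each gets the layout its transform needs instead of a shared ‘Chair’ base:

```python
from dataclasses import dataclass

# Data that is actually transformed: chairs that can burn and break.
# Stored as packed parallel arrays because the transform touches every
# element each frame.
@dataclass
class BreakableChairs:
    hit_points: list[float]
    burn_rates: list[float]

    def apply_fire(self, dt: float) -> None:
        # The transform: reduce hit points by burn damage over dt seconds.
        for i in range(len(self.hit_points)):
            self.hit_points[i] -= self.burn_rates[i] * dt

# Static scenery chairs have no per-frame transform at all; they are
# just positions baked into the level data. No common base class needed.
@dataclass
class StaticScenery:
    positions: list[tuple[float, float, float]]

chairs = BreakableChairs(hit_points=[10.0, 5.0], burn_rates=[1.0, 2.0])
chairs.apply_fire(2.0)
print(chairs.hit_points)  # [8.0, 1.0]
```

The two types live in separate structures shaped by their transforms; the superficial “is-a chair” relationship never appears in the data, because no transform ever uses it.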

The Problems that The Lies Cause:

  • Poor Performance
  • Poor Concurrency
  • Poor Optimizability
  • Poor Stability
  • Poor Testability

Solve for transforming the data you have, given the constraints of the platform, and nothing else.

Example

Dictionary lookup

Given a dictionary of key-value pairs, we want to look up the value for a given key.

Statistically, most of the time will be spent iterating over the keys in the dictionary to find the one we are looking for. As we get more keys, this will take longer and longer and most of the data that is loaded into memory will be keys that we don’t care about.

We may have modeled this problem as an array of key-value pairs, but realistically these are two separate arrays. The keys are the data we are transforming and the values are the result of that transformation. We need mechanisms to find a key in the array of keys, transform that match into an index, and then use the index to find the value in the array of values.
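The two-array layout described above can be sketched as follows (Python; a real implementation would also sort or hash the keys, which is omitted here). The search scans only the packed key array, so no value data is pulled into memory until a match is found:

```python
def make_table(pairs):
    # Split key-value pairs into two parallel arrays. The lookup only
    # ever scans `keys`, so the values stay out of the hot loop.
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    return keys, values

def lookup(keys, values, wanted):
    # Transform the key into an index, then use the index on the values.
    for i, k in enumerate(keys):
        if k == wanted:
            return values[i]
    return None

keys, values = make_table([("north", 1), ("south", 2), ("east", 3)])
print(lookup(keys, values, "south"))  # 2
```

The transform is explicit: key to index, index to value, with the common case (scanning keys) touching the minimum amount of data.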

Solve for the ‘most common case’ first, not the most generic.

Simple wins

Simple, obvious things to look for and some back of the envelope calculations to make sound, reasoned judgments can yield substantial wins.
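One such back-of-the-envelope calculation (a Python sketch; the 20 GiB/s figure is an assumed ballpark for sustained single-core memory bandwidth, not a measurement): if a pass has to touch a gigabyte of data, memory traffic alone sets a floor on its runtime, no matter how clever the code is.

```python
data_bytes = 1 * 1024**3   # 1 GiB the pass must read
bandwidth = 20 * 1024**3   # assumed ~20 GiB/s sustained bandwidth

floor_s = data_bytes / bandwidth
print(floor_s)  # 0.05 -> the pass cannot finish faster than ~50 ms
```

Five seconds of arithmetic tells you whether a reported 2-second runtime is a data-layout problem (40x over the floor) or already near the limit.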

Organized data makes maintenance, debugging, and concurrency much easier

The Three Truths

  1. Hardware is the platform
  2. Design around the data, not an idealized world
  3. Solve the data transforms first, not the code design

Notes on Mike Acton’s talk “Data-Oriented Design and C++”

How do you remove code duplication without templates (generics in C++)?

This is almost certainly not as big a problem as people think it is (read: an imagined issue). People invent things to duplicate in order to use generics to solve the problem. For the cases in which generics would be useful, you could also generate the code. Templates (or generics) are just a poor man’s text processor, and one you largely can’t reason about.
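The “text processor” remark points at the alternative (a Python sketch; the template string and type list are invented): generate the few specializations you actually need as plain source you can read, diff, and check in, rather than instantiating them invisibly at compile time.

```python
# Generate monomorphic `max_<name>` functions instead of a generic one.
# Unlike a template instantiation, the output is ordinary source text.
TEMPLATE = """\
{ctype} max_{name}({ctype} a, {ctype} b) {{
    return a > b ? a : b;
}}
"""

def generate(types):
    # `types` is a list of (name, C type) pairs to specialize for.
    return "\n".join(
        TEMPLATE.format(ctype=ctype, name=name)
        for name, ctype in types
    )

print(generate([("int", "int"), ("f32", "float")]))
```

The generated text is concrete: you can see exactly which specializations exist and reason about each one directly.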

What if I don’t know the hardware that my code will run on?

It might be true that you don’t know the hardware that your code will run on, but that doesn’t mean it isn’t some finite set of hardware. You can reason about the characteristics of that finite set: what is the most likely case, the least likely, the best case, the worst case? This finite set should be understood. General portability is a fool’s errand.

How can we apply data-oriented design to a code base that doesn’t share this mindset?

One step at a time. Take any of the code and reason about the data. Find the most common case, the most common transformations, and the most common data access patterns. Start there. You can’t change everything at once, but you can change one thing at a time.

The key performance indicator at my company is not performance, but the ability for a programmer to get things done. We are constrained by our engineering resources not by time-to-interactive.

We all worry about our resources; who doesn’t? But our worries aren’t orthogonal to what we actually do. What we do are the things that are the most valuable, and we focus on those things. We focus on understanding the problem and our constraints. We focus on understanding the data so that we can produce a solution that provides value to our customers.

And bottom line, performance matters. We pay for things in terms of performance, so we may be willing to sacrifice some efficiency in order to ‘do the right things’ and be effective at getting things done, but don’t be surprised when your data is a mess and it takes forever for your programs to load and run.

It’s better to thrive than to survive and better is good.