From 97cb8defd333ccf35c33fa1c3ed4372b5e724587 Mon Sep 17 00:00:00 2001
From: Richard G Brown <richard@r3cev.com>
Date: Sat, 23 Apr 2016 18:29:39 +0100
Subject: [PATCH] Minor edits to data model page and additional section on
 rationale for UTXO model (including some content from Mike)

---
 docs/source/conf.py        |   2 +-
 docs/source/data-model.rst | 166 ++++++++++++++++++++++++++++++++++---
 2 files changed, 155 insertions(+), 13 deletions(-)

diff --git a/docs/source/conf.py b/docs/source/conf.py
index 5ac80f6164..eeef39187a 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -45,7 +45,7 @@ source_suffix = '.rst'
 master_doc = 'index'
 
 # General information about the project.
-project = u'R3 Prototyping'
+project = u'R3 Corda'
 copyright = u'2016, Distributed Ledger Group, LLC'
 author = u'R3 DLG'
 
diff --git a/docs/source/data-model.rst b/docs/source/data-model.rst
index 1bd401b23b..205ced41b7 100644
--- a/docs/source/data-model.rst
+++ b/docs/source/data-model.rst
@@ -1,13 +1,13 @@
 Data model
 ==========
 
-Description
------------
-
 This article covers the data model: how *states*, *transactions* and *code contracts* interact with each other and
 how they are represented in the code. It doesn't attempt to give detailed design rationales or information on future
 design elements: please refer to the R3 wiki for background information.
 
+Overview
+--------
+
 We begin with the idea of a global ledger. In our model, although the ledger is shared, it is not always the case that
 transactions and ledger entries are globally visible. In cases where a set of transactions stays within a small subgroup of
 users it should be possible to keep the relevant data purely within that group.
@@ -19,16 +19,21 @@ consume/destroy, these are called **inputs**, and contains a set of new states t
 **outputs**.
 
 States contain arbitrary data, but they always contain at minimum a hash of the bytecode of a
-**code contract**, which is a program expressed in some byte code that runs sandboxed inside a virtual machine. Code
-contracts (or just "contracts" in the rest of this document) are globally shared pieces of business logic. Contracts
-define a **verify function**, which is a pure function given the entire transaction as input.
+**contract code** file, which is a program expressed in JVM byte code that runs sandboxed inside a Java virtual machine.
+Contract code files (or just "contracts" in the rest of this document) are globally shared pieces of business logic.
 
-To be considered valid, the transaction must be **accepted** by the verify function of every contract pointed to by the
-input and output states. Beyond inputs and outputs, transactions may also contain **commands**, small data packets that
+Contracts define a **verify function**, which is a pure function given the entire transaction as input. To be considered
+valid, the transaction must be **accepted** by the verify function of every contract pointed to by the
+input and output states.
+
+Beyond inputs and outputs, transactions may also contain **commands**, small data packets that
 the platform does not interpret itself, but which can parameterise execution of the contracts. They can be thought of as
 arguments to the verify function. Each command has a list of **public keys** associated with it. The platform ensures
-that the transaction is signed by every key listed in the commands before the contracts start to execute. Public keys
-may be random/identityless for privacy, or linked to a well known legal identity via a *public key infrastructure* (PKI).
+that the transaction is signed by every key listed in the commands before the contracts start to execute. Thus, a verify
+function can trust that all listed keys have signed the transaction, but it remains responsible for checking that every
+key it requires for the transaction to be valid is actually included in that list. Public keys
+may be random/identityless for privacy, or linked to a well-known legal identity, for example via a
+*public key infrastructure* (PKI).
 
 Commands are always embedded inside a transaction. Sometimes, there's a larger piece of data that can be reused across
 many different transactions. For this use case, we have **attachments**. Every transaction can refer to zero or more
@@ -50,8 +55,8 @@ attachment if the fact it's creating is relatively static and may be referred to
 
 As the same terminology often crops up in different distributed ledger designs, let's compare this to other
 distributed ledger systems you may be familiar with. You can find more detailed design rationales for why the platform
-differs from existing systems in `the R3 wiki <https://r3-cev.atlassian.net/wiki/>`_, but to summarise, the driving
-factors are:
+differs from existing systems in `the R3 wiki <https://r3-cev.atlassian.net/wiki/display/AWG/Platform+Stream%3A+Corda>`_,
+but to summarise, the driving factors are:
 
 * Improved contract flexibility vs Bitcoin
 * Improved scalability vs Ethereum, as well as ability to keep parts of the transaction graph private (yet still uniquely addressable)
@@ -114,3 +119,140 @@ Differences:
 * Ethereum claims to be a platform not only for financial logic, but literally any kind of application at all. Our
   platform considers non-financial applications to be out of scope.
 
+Rationale for and tradeoffs in adopting a UTXO-style model
+----------------------------------------------------------
+
+As discussed above, Corda uses the so-called "UTXO set" model (unspent transaction output). In this model, the database
+does not track accounts or balances. Instead all database entries are immutable. An entry is either spent or not spent
+but it cannot be changed. In Bitcoin, spentness is implemented simply as deletion – the inputs of an accepted transaction
+are deleted and the outputs created.
+
+This approach has some advantages and some disadvantages, which is why some platforms like Ethereum have tried
+(or are trying) to abstract this choice away and support a more traditional account-like model.  We have explicitly
+chosen *not* to do this and our decision to adopt a UTXO-style model is a deliberate one.  In the sections below,
+the rationale for this decision and its pros and cons are outlined.
+
+Rationale
+---------
+
+Corda, in common with other blockchain-like platforms, is designed to bring parties to shared sets of data into
+consensus as to the existence, content and allowable evolutions of those data sets. However, Corda is designed with the
+explicit aim of avoiding, to the extent possible, the scalability and privacy implications that arise from those platforms'
+decisions to adopt a global broadcast model.
+
+Whilst the privacy implications of a global consensus model are easy to understand, the scalability implications are
+perhaps more subtle, yet serious. In a consensus system, it is critical that all processors of a transaction reach
+precisely the same conclusion as to its effects.  In situations where two transactions may act on the same data set,
+it means that the two transactions must be processed in the same *order* by all nodes. If this were not the case then it
+would be possible to devise situations where nodes processed transactions in different orders and reached different
+conclusions as to the state of the system.  It is for this reason that systems like Ethereum effectively run
+single-threaded, meaning the speed of the system is limited by the single-threaded performance of the slowest
+machine on the network.
+
+In Corda, we assume the data being processed represents financial agreements between identifiable parties and that these
+institutions will adopt the system only if a significant number of such agreements can be managed by the platform.
+As such, the system has to be able to support parallelisation of execution to the greatest extent possible,
+whilst ensuring correct transaction ordering when two transactions seek to act on the same piece of shared state.
+
+To achieve this, we must minimise the number of parties who need to receive and process copies of any given
+transaction and we must minimise the extent to which two transactions seek to mutate (or supersede) any given piece
+of shared state.
+
+A key design decision, therefore, is what the smallest atomic unit of shared data in the system should be.  This decision
+also has profound privacy implications: the more coarsely defined the shared data units, the larger the set of
+actors who will likely have a stake in their accuracy and who must process and observe any update to them.
+
+This becomes most obvious when we consider two models for representing cash balances and payments.
+
+A simple account model for cash would define a data structure that maintained a balance at a particular bank for each
+"account holder". Every holder of a balance would need a copy of this structure and would thus need to process and
+validate every payment transaction, learning about everybody else's payments and balances in the process.
+All payments across that set of accounts would have to be single-threaded across the platform, limiting maximum
+throughput.
+
+A more sophisticated example might create a data structure per account holder.
+But, even here, I would leak my account balance to anybody to whom I ever made
+a payment and I could only ever make one payment at a time, for the same reasons as above.
+
+A UTXO model would define a data structure that represented an *instance* of a claim against the bank. An account
+holder could hold *many* such instances, the aggregate of which would reveal their balance at that institution.  However,
+the account holder now only needs to reveal to their payee those instances consumed in making a payment to that payee.
+This also means the payer could make several payments in parallel.  A downside is that the model is harder to understand.
+However, we consider the privacy and scalability advantages to outweigh the modest additional cognitive load this places
+on those attempting to learn the system.
+
+In what follows, further advantages and disadvantages of this design decision are explored.
+
+Pros
+----
+
+The UTXO model has these advantages:
+
+* Immutable ledger entries give the usual advantages that a more functional approach brings: it's easy to do analysis
+  on a static snapshot of the data and reason about the contents.
+* Because there are no accounts, it's very easy to apply transactions in parallel, even for high-traffic legal entities,
+  assuming sufficiently granular entries.
+* Transaction ordering becomes trivial: it is impossible to mis-order transactions due to the reliance on hash functions
+  to identify previous states. There is no need for sequence numbers or other things that are hard to provide in a
+  fully distributed system.
+* Conflict resolution boils down to the double spending problem, which places extremely minimal demands on consensus
+  algorithms (as the variable you're trying to reach consensus on is a set of booleans).
+
+Cons
+----
+
+It also comes with some pretty serious complexities that in practice must be abstracted from developers:
+
+* Representing numeric amounts using immutable entries is unnatural. For instance, if you receive $1000 and wish
+  to send someone $100, you have to consume the $1000 output and then create two more: $100 for the recipient and
+  $900 back to yourself as change. The fact that this happens can leak private information to an observer.
+* Because users do need to think in terms of balances and statements, you have to layer this on top of the
+  underlying ledger: you can't just read someone's balance out of the system. Hence, the "wallet" / position manager.
+  Experience from those who have developed wallets for Bitcoin and other systems is that they can be complex pieces of code,
+  although the bulk of wallets' complexity in public systems is handling the lack of finality (and key management).
+* Whilst transactions can be applied in parallel, it is much harder to create them in parallel due to the need to
+  strictly enforce a total ordering.
+
+With respect to parallel creation, if the user is single-threaded this is fine, but in a more complex situation
+where you might want to be preparing multiple transactions in flight this can prove a limitation – in
+the worst case where you have a single output that represents all your value, this forces you to serialise
+the creation of every transaction. If transactions can be created and signed very fast that's not a concern.
+If there's only a single user, that's not a concern.
+
+Both cases are typically true in the Bitcoin world, so users don't suffer from this much. In the context of a
+complex business with a large pool of shared funds, in which creation of transactions may be very slow due to the
+need to get different humans to approve a transaction using a signing device, this could quickly lead to frustrating
+conflicts where someone approves a transaction and then discovers that it has become a double spend and
+they must sign again. In the absolute worst case you could get a form of human livelock.
+
+The tricky part about solving these problems is that the simplest way to express a payment request
+("send me $1000 to public key X") inherently results in you receiving a single output, which then can
+prove insufficiently granular to be convenient. In the Bitcoin space Mike Hearn and Gavin Andresen designed "BIP 70"
+to solve this: it's a simple binary format for requesting a payment and specifying exactly how you'd like to get paid,
+including things like the shape of the transaction. It may seem that it's an overly complex approach: could you not
+just immediately respend the big output back to yourself in order to split it? And yes, you could, until you hit
+scenarios like "the machine requesting the payment doesn't have the keys needed to spend it",
+which turn out to be very common. So it's really more effective for a recipient to be able to say to the
+sender, "here's the kind of transaction I want you to send me".  The :doc:`protocol framework <protocol-state-machines>`
+may provide a vehicle to make such negotiations simpler.
+
+A further challenge is privacy. Whilst our goal of not sending transactions to nodes that don't "need to know"
+helps, to verify a transaction you still need to verify all its dependencies and that can result in you receiving
+lots of transactions that involve random third parties. The problems start when you have received lots of separate
+payments and been careful not to make them linkable to your identity, but then you need to combine them all in a
+single transaction to make a payment.
+
+Mike Hearn described this problem, and techniques to minimise it, in
+`this article <https://medium.com/@octskyward/merge-avoidance-7f95a386692f>`_ from 2013. The article
+coined the term "merge avoidance" for a technique that has never been implemented in the Bitcoin space,
+although not due to lack of practicality.
+
+A piece of future work for the wallet implementation will be to implement automated "grooming" of the wallet
+to "reshape" outputs to useful/standardised sizes, for example, and to send outputs of complex transactions
+back to their issuers for reissuance to "sever" long privacy-breaching chains.
+
+Finally, it should be noted that some of the issues described here are not really "cons" of
+the UTXO model; they're just fundamental.
+If you used many different anonymous accounts to preserve some privacy and then needed to
+spend the contents of them all simultaneously, you'd hit the same problem, so it's not
+something that can be trivially fixed with data model changes.