• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


10 Rules for scalable Performance in simple operation Datastores

Page history last edited by Nathan T Suver 5 years, 3 months ago



Michael Stonebraker , Rick Cattell, 10 rules for scalable performance in 'simple operation' datastores, Communications of the ACM, v.54 n.6, June 2011 




the relational model of data was proposed in 1970 by Ted Codd5 as the best solution for the DBMS problems of the day—business data processing. Early relational systems included System R 2 and Ingres,9 and almost all commercial relational DBMS (RDBMS) implementations today trace their roots to these two systems. As such, unless you squint, the dominant commercial vendors—Oracle, IBM, and Microsoft—as well as the major open source systems—MySQL and PostgreSQL— all look about the same today; we term these systems general-purpose traditional row stores, or GPTRS, 

sharing the following features: ˲˲ Disk-oriented storage; ˲˲ Tables stored row-by-row on disk, hence, a row store; ˲˲ B-trees as the indexing mechanism; ˲˲ Dynamic locking as the concurrency-control mechanism; ˲˲ A write-ahead log, or WAL, for crash recovery; ˲˲ SQL as the access language; and ˲˲ A “row-oriented” query optimizer and executor, pioneered in System R.



This article lists a number of approaches for optimizing performance in "simple operation" datastores.  The article is essentially just a list of "things" to look at when considering optimizing these types of data-stores, but some of their conclusions and approaches could be valid if applied to scaling out a process for a monolithic database system.  Here are some of the main take-aways:

  • Shared nothing scalability.  Ideally, for a node processing a unit of work, that node should be processed in isolation of other nodes, only connected via some sort of networking interface.
  • Relying on high level languages (such as Sql) are not necessarily a bad thing, query optimizers are usually pretty good at optimizing a given work unit, unless the query is extremely complex.  They also mention that running a transaction via a database access technology (such as JDBC, ODBC) can cause performance problems, because of the traffic required to maintain the transaction over TCP.  This is a useful observation, and something that should be addressed carefully when we attempt to solve the problem.  perhaps this can be extended to investigate the cost of potentially running a transaction in code outside of a stored procedure, especially when network latency becomes an issue.
  • Don't build ACID yourself: if you require true ACID compliance (specifically consistency), choose a DBMS that offers it, and leverage that instead of trying to come up with your own plan.


Expanding the approach:

I'm unsure there is anything that can be directly expanded upon here, but some of their guidance will be relevant when we come up with a solution (e.g. avoiding, when possible, running transactions over tcp/ip, ensuring as best we can that nodes can be processed independently of each other, etc..)



Relevant Content:

pg 74

These (eventual consistency) systems are driven by a variety of motivations. For some, it is dissatisfaction with the relational model or the “heaviness” of RDBMSs. For others, it is the needs of large Web properties with some of the most demanding SO problems around. Large Web properties were frequently start-ups lucky enough to experience explosive growth, the so-called hockey-stick effect. They typically use an open source DBMS, because it is free or already understood by the staff. A single-node DBMS solution might be built for version 1, which quickly exhibits scalability problems. The conventional wisdom is then to “shard,” or partitioning the application data over multiple nodes that share the load. A table can be partitioned this way; for example, employee names can be partitioned onto 26 nodes by putting all the “A”s on node 1 and so forth. It is now up to application logic to direct each query and update to the correct node. However, such sharding in application logic has a number of severe drawbacks: ˲˲ If a cross-shard filter or join must be performed, then it must be coded in the application; ˲˲ If updates are required within a transaction to multiple shards, then the application is responsible for somehow guaranteeing data consistency across nodes; ˲˲ Node failures are more common as the system scales. A difficult problem is how to maintain consistent replicas, detect failures, fail over to replicas, and replace failed nodes in a running system; ˲˲ Making schema changes without taking shards “offline” is a challenge; and ˲˲ Reprovisioning the hardware to add additional nodes or change the configuration is extremely tedious and, likewise, much more difficult if the shards cannot be taken offline.


page 76:

Rule 2. High-level languages are good and need not hurt performance. Work in a SQL transaction can include the following components: ˲˲ Overhead resulting from the optimizer choosing an inferior execution plan; ˲˲ Overhead of communicating with the DBMS; ˲˲ Overhead inherent in coding in a high-level language; ˲˲ Overhead for services (such as concurrency control, crash recovery, and data integrity); and ˲˲ Truly useful work to be performed, no matter what. Here, we cover the first three, leaving the last two for Rule 3. Hierarchical and network systems were the dominant DBMS solutions in the 1960s and 1970s, offering a low-level procedural interface to data. The high-level language of RDBMSs was instrumental in displacing these DBMSs for three reasons: ˲˲ A high-level language system requires the programmer write less code that is easier to understand; ˲˲ Users state what they want instead of writing a disk-oriented algorithm on how to access the data they need; a programmer need not understand complex storage optimizations; and ˲˲ A high-level language system has a better chance of allowing a program to survive a change in the schema without maintenance or recoding; as such, low-level systems require far more maintenance. 


Hence, one of the key lessons in the DBMS field over the past 25 years is that high-level languages are good and do not hurt performance. Some new systems provide SQL or a more limited higher-level language; others provide only a “database assembly language,” or individual index and object operations. This low-level interface may be adequate for very simple applications, but, in all other cases, high-level languages provide compelling advantages


Comments (0)

You don't have permission to comment on this page.