| 
  • If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!

View
 

10 Rules for scalable Performance in simple operation Datastores

This version was saved 5 years, 6 months ago View current version     Page history
Saved by Nathan T Suver
on March 10, 2019 at 12:02:55 pm
 

Abstract: 

the relatioNal MoDel of data was proposed in 1970 by Ted Codd5 as the best solution for the DBMS problems of the day—business data processing. Early relational systems included System R 2 and Ingres,9 and almost all commercial relational DBMS (RDBMS) implementations today trace their roots to these two systems. As such, unless you squint, the dominant commercial vendors—Oracle, IBM, and Microsoft—as well as the major open source systems—MySQL and PostgreSQL— all look about the same today; we term these systems general-purpose traditional row stores, or GPTRS, 

sharing the following features: ˲˲ Disk-oriented storage; ˲˲ Tables stored row-by-row on disk, hence, a row store; ˲˲ B-trees as the indexing mechanism; ˲˲ Dynamic locking as the concurrency-control mechanism; ˲˲ A write-ahead log, or WAL, for crash recovery; ˲˲ SQL as the access language; and ˲˲ A “row-oriented” query optimizer and executor, pioneered in System R.

 

Biblio:

 

Michael Stonebraker , Rick Cattell, 10 rules for scalable performance in 'simple operation' datastores, Communications of the ACM, v.54 n.6, June 2011 

 

Relevant Content:

pg 74

These (eventual consistency) systems are driven by a variety of motivations. For some, it is dissatisfaction with the relational model or the “heaviness” of RDBMSs. For others, it is the needs of large Web properties with some of the most demanding SO problems around. Large Web properties were frequently start-ups lucky enough to experience explosive growth, the so-called hockey-stick effect. They typically use an open source DBMS, because it is free or already understood by the staff. A single-node DBMS solution might be built for version 1, which quickly exhibits scalability problems. The conventional wisdom is then to “shard,” or partitioning the application data over multiple nodes that share the load. A table can be partitioned this way; for example, employee names can be partitioned onto 26 nodes by putting all the “A”s on node 1 and so forth. It is now up to application logic to direct each query and update to the correct node. However, such sharding in application logic has a number of severe drawbacks: ˲˲ If a cross-shard filter or join must be performed, then it must be coded in the application; ˲˲ If updates are required within a transaction to multiple shards, then the application is responsible for somehow guaranteeing data consistency across nodes; ˲˲ Node failures are more common as the system scales. A difficult problem is how to maintain consistent replicas, detect failures, fail over to replicas, and replace failed nodes in a running system; ˲˲ Making schema changes without taking shards “offline” is a challenge; and ˲˲ Reprovisioning the hardware to add additional nodes or change the configuration is extremely tedious and, likewise, much more difficult if the shards cannot be taken offline.

 

page 76:

Rule 2. High-level languages are good and need not hurt performance. Work in a SQL transaction can include the following components: ˲˲ Overhead resulting from the optimizer choosing an inferior execution plan; ˲˲ Overhead of communicating with the DBMS; ˲˲ Overhead inherent in coding in a high-level language; ˲˲ Overhead for services (such as concurrency control, crash recovery, and data integrity); and ˲˲ Truly useful work to be performed, no matter what. Here, we cover the first three, leaving the last two for Rule 3. Hierarchical and network systems were the dominant DBMS solutions in the 1960s and 1970s, offering a low-level procedural interface to data. The high-level language of RDBMSs was instrumental in displacing these DBMSs for three reasons: ˲˲ A high-level language system requires the programmer write less code that is easier to understand; ˲˲ Users state what they want instead of writing a disk-oriented algorithm on how to access the data they need; a programmer need not understand complex storage optimizations; and ˲˲ A high-level language system has a better chance of allowing a program to survive a change in the schema without maintenance or recoding; as such, low-level systems require far more maintenance. 

 

Hence, one of the key lessons in the DBMS field over the past 25 years is that high-level languages are good and do not hurt performance. Some new systems provide SQL or a more limited higher-level language; others provide only a “database assembly language,” or individual index and object operations. This low-level interface may be adequate for very simple applications, but, in all other cases, high-level languages provide compelling advantages

 

Comments (0)

You don't have permission to comment on this page.