(WIP) Notes from "Designing data-intensive applications"

These notes collect things I found interesting or want to go back to while reading Designing Data-Intensive Applications by Martin Kleppmann. I add to them as I’m reading the book (boi is it dense!)

Chapter 1: Reliable, scalable and maintainable applications

General-purpose tools (databases, servers, etc.) are composed together to build a special-purpose tool or service. When you compose many general-purpose tools to provide a service, you have to think as a system designer rather than an application developer. When you design a data system, you have to think of:

  • Data integrity
  • Consistent performance
  • How does the system scale?
  • What’s a good API?

You also have to address the following concerns:

  • Scalability: How the system copes, and keeps performing well, as its data or traffic volume grows. It is multidimensional and very use-case specific.
  • Reliability: Reliable software keeps working even when things go wrong (hardware or software failure, human error, a tornado knocks out your data centre).
  • Maintainability: This describes the operability, simplicity (managed complexity) and evolvability (software can be extended without sacrificing the first 2 properties) of the system.

To find out how your system performs when load increases, increase the load while keeping system resources unchanged and watch how performance changes. When tracking response times, do not rely on a single number (such as an average); look at the distribution, for example percentiles.
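To make the "distribution, not a single data point" idea concrete, here is a minimal sketch that summarizes response times with percentiles. The sample values and the helper names (`percentile`, `summarize`) are made up for illustration, and the nearest-rank method is just one of several ways to define a percentile.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(samples):
    """Return the mean alongside a few percentiles of the samples."""
    ordered = sorted(samples)
    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


# Hypothetical response times in milliseconds: mostly fast, two slow outliers.
response_times_ms = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]
summary = summarize(response_times_ms)
print(summary)
```

With these made-up numbers the mean is pulled far above what a typical request sees, while the median stays close to the common case and the high percentiles expose the slow outliers.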

Chapter 2: Data models and query languages

A domain is modeled in layers, from application/domain-specific representations down to more generic data structures, with APIs hiding the complexity of the layer below. Anything meaningful to humans is subject to change, which is why an ID with no human meaning makes a safer reference than a text value.

SQL is declarative: you describe the result you want, not how to compute it. This hides the complexity of the database operations and leaves the query optimizer free to decide how to execute them.
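A small sketch of the difference, using Python's built-in sqlite3 module. The `animals` data and both function names are hypothetical. The first function spells out how to find the rows; the second only states what is wanted and leaves the how to the database.

```python
import sqlite3

# Hypothetical (name, family) rows.
animals = [
    ("Great white", "Sharks"),
    ("Blue whale", "Whales"),
    ("Hammerhead", "Sharks"),
]


def get_sharks_imperative(rows):
    """Imperative: we dictate the loop, the comparison and the order."""
    sharks = []
    for name, family in rows:
        if family == "Sharks":
            sharks.append(name)
    return sharks


def get_sharks_declarative(rows):
    """Declarative: we state the condition; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
        conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT name FROM animals WHERE family = ? ORDER BY name",
            ("Sharks",),
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


print(get_sharks_imperative(animals))
print(get_sharks_declarative(animals))
```

If an index on `family` were added later, the SQL version could benefit without any change to the query text, whereas the loop would have to be rewritten by hand.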

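NoSQL databases usually don’t support joins, or support them only weakly. Joins then have to be done in the application layer, typically by running several queries and combining the results in code.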
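As a rough sketch of what a join in the application layer looks like: the two lists below stand in for the results of two separate queries against a document store. The collections, field names and `join_users_with_orgs` are all invented for this example.

```python
# Hypothetical results of two separate queries against a document store.
users = [
    {"id": 1, "name": "Ada", "org_id": 10},
    {"id": 2, "name": "Linus", "org_id": 20},
    {"id": 3, "name": "Grace", "org_id": 99},  # refers to an unknown organization
]
organizations = [
    {"id": 10, "name": "Analytical Engines"},
    {"id": 20, "name": "Kernel Co"},
]


def join_users_with_orgs(users, organizations):
    """Emulate a left join on users.org_id = organizations.id in code."""
    # Build a lookup table once, so each user is resolved in constant time.
    orgs_by_id = {org["id"]: org for org in organizations}
    joined = []
    for user in users:
        org = orgs_by_id.get(user["org_id"])
        joined.append({
            "user": user["name"],
            "organization": org["name"] if org else None,
        })
    return joined


for row in join_users_with_orgs(users, organizations):
    print(row)
```

The database does none of this work here: the lookup strategy, the handling of dangling references and any consistency between the two queries become the application's problem.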

Data tends to grow relationships as more features are added, and if you pick a NoSQL database, you might find modeling those relationships awkward and less performant. In a document database, the entire document is usually rewritten on every update, so documents should be kept fairly small. A schemaless database means that the schema is assumed by the code that reads the data but not enforced by the database itself (schema-on-read). A document data model is appropriate when you have no relationships or mostly one-to-many relationships.
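Schema-on-read can be sketched as follows: the database stores whatever it is given, and the reading code decides how to interpret each document. The document shapes and the `read_first_name` helper are assumptions made up for this example, imagining a change from a single "name" field to a separate "first_name" field.

```python
def read_first_name(document):
    """Interpret a user document at read time, tolerating both the old
    and the new (hypothetical) document shape."""
    if "first_name" in document:
        # Newer documents store the first name explicitly.
        return document["first_name"]
    if "name" in document:
        # Older documents only have a full name; derive the first name.
        return document["name"].split(" ")[0]
    # The database never enforced a schema, so the field may be missing.
    return None


# Documents written at different times, stored side by side.
old_doc = {"name": "Martin Kleppmann"}
new_doc = {"first_name": "Martin", "last_name": "Kleppmann"}

print(read_first_name(old_doc))
print(read_first_name(new_doc))
```

No migration was needed to start writing the new shape, but every reader now has to know about both shapes.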

MapReduce sits between imperative and declarative approaches. The query logic lives in snippets of code, so the query engine has little room to optimize it the way it can a declarative query.
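Here is a toy, single-process sketch of the MapReduce programming model, not of any particular database's API. The `observations` data and the names `map_observation`, `reduce_counts` and `run_map_reduce` are invented for illustration. The framework part (`run_map_reduce`) decides when to call the two functions, but what they do is opaque code.

```python
from collections import defaultdict

# Hypothetical observation documents.
observations = [
    {"date": "1995-12-25", "family": "Sharks", "num_animals": 3},
    {"date": "1995-12-12", "family": "Sharks", "num_animals": 4},
    {"date": "1996-01-03", "family": "Whales", "num_animals": 2},
    {"date": "1996-01-09", "family": "Sharks", "num_animals": 1},
]


def map_observation(doc):
    """Map step: emit (year-month, count) for shark sightings only."""
    if doc["family"] == "Sharks":
        yield doc["date"][:7], doc["num_animals"]


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for the same key."""
    return sum(values)


def run_map_reduce(documents, map_fn, reduce_fn):
    """Tiny stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


sharks_per_month = run_map_reduce(observations, map_observation, reduce_counts)
print(sharks_per_month)
```

Because the filter on "Sharks" is buried inside `map_observation`, the framework cannot see it and, for example, use an index to skip the other documents.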

A (property) graph database is composed of vertices and edges, each with the following:

  • Vertex:
    • Unique identifier
    • Properties (key-value pairs)
    • Incoming edges
    • Outgoing edges
  • Edge:
    • Unique identifier
    • Properties (key-value pairs)
    • Head vertex (where the edge ends)
    • Tail vertex (where the edge starts)
    • Relationship name (label)
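
The vertex and edge structure listed above can be sketched as two small Python classes. This is only an in-memory illustration under my own naming (`Vertex`, `Edge`, `connect` and the sample data are assumptions), not how any real graph database stores its data.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity; vertices and edges reference each other
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)
    incoming: list = field(default_factory=list)  # edges whose head is this vertex
    outgoing: list = field(default_factory=list)  # edges whose tail is this vertex


@dataclass(eq=False)
class Edge:
    id: str
    tail: Vertex  # vertex where the edge starts
    head: Vertex  # vertex where the edge ends
    label: str    # relationship name
    properties: dict = field(default_factory=dict)


def connect(edge_id, tail, head, label, **properties):
    """Create an edge and register it on both of its endpoints."""
    edge = Edge(edge_id, tail, head, label, properties)
    tail.outgoing.append(edge)
    head.incoming.append(edge)
    return edge


# Hypothetical data: a person and the place they were born in.
lucy = Vertex("v1", {"type": "person", "name": "Lucy"})
idaho = Vertex("v2", {"type": "location", "name": "Idaho"})
born_in = connect("e1", lucy, idaho, "BORN_IN", since=1980)

# Traversal: follow Lucy's outgoing BORN_IN edges to their head vertices.
birthplaces = [e.head.properties["name"] for e in lucy.outgoing if e.label == "BORN_IN"]
print(birthplaces)
```

Keeping both incoming and outgoing edges on each vertex is what lets a traversal walk the graph in either direction.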