Databases

Tables/collections

Tables (collections in non-relational stores) and indices are generally each stored in their own file. Because of this, you can get extremely high query performance when the entire index or table fits into RAM. It is particularly important for indexes to fit into RAM, since they are what queries typically run against. You might then ask yourself, “why don’t I just aggressively break my tables into tiny logical pieces? It should be easy to fit any particular table into RAM if they are all small.”

The first reason not to do this is that you lose the ability to do broad queries across all of your data. When you have a single table, a search query can span all rows in a single command. However, if you break your tables into many smaller ones (say, one table per user account), you may have to issue one query per table and merge the results yourself. For similar reasons, splitting complicates foreign keys and relational mappings across tables.
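
To make this concrete, here is a minimal sketch using Python’s sqlite3 module (the table and column names are hypothetical) of the same question answered against one unified table versus one table per user:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # One unified table: a broad query spans every row in one statement.
    conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)",
                     [("matt", "login"), ("jack", "login"), ("paul", "logout")])
    total = conn.execute(
        "SELECT COUNT(*) FROM events WHERE action = 'login'").fetchone()[0]

    # One table per user: the same question needs one query per table,
    # with the results merged in application code.
    for user in ("matt", "jack", "paul"):
        conn.execute(f"CREATE TABLE events_{user} (action TEXT)")
    conn.execute("INSERT INTO events_matt VALUES ('login')")
    conn.execute("INSERT INTO events_jack VALUES ('login')")
    conn.execute("INSERT INTO events_paul VALUES ('logout')")

    split_total = sum(
        conn.execute(f"SELECT COUNT(*) FROM events_{user} "
                     "WHERE action = 'login'").fetchone()[0]
        for user in ("matt", "jack", "paul"))

    assert total == split_total == 2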

Another reason not to do this depends on your access patterns. Say you have a 100GB data set, and you split it into 100,000 1MB tables. If you nearly always have to query across all 100,000 tables, you’ve actually made your system slower, because 1) you have to issue 100,000 separate queries, and 2) the data isn’t going to fit in RAM anyway, so you will be constantly evicting data from your cache. However, if you generally only query the same 20 tables, and the rest are rarely accessed, your effective working set is only 20MB, and you’ve significantly improved your performance. This sort of behavior is common if you keep time series data and only access the most recent results, or if you have certain hotspot data. Intelligently partitioning your tables is critically important here.
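
Here is a minimal sketch, again with hypothetical names, of time-partitioned tables where writes and typical reads concentrate on the most recent partition, keeping the hot working set small:

    import sqlite3
    from datetime import date

    conn = sqlite3.connect(":memory:")

    def partition_for(day: date) -> str:
        """Map a date to its monthly partition table, creating it on demand."""
        name = f"metrics_{day:%Y_%m}"
        conn.execute(f"CREATE TABLE IF NOT EXISTS {name} (day TEXT, value REAL)")
        return name

    # Writes always target the current partition (the "hot" table).
    today = date(2024, 5, 17)
    conn.execute(f"INSERT INTO {partition_for(today)} VALUES (?, ?)",
                 (today.isoformat(), 42.0))

    # Typical reads only touch the most recent partition, which easily
    # fits in RAM; older partitions are hit only by rare historical queries.
    recent_avg = conn.execute(
        f"SELECT AVG(value) FROM {partition_for(today)}").fetchone()[0]

The partition_for helper stands in for the routing logic that databases with native partitioning support (e.g. MySQL or PostgreSQL table partitioning) would otherwise handle for you.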

As a rule, don’t break your system up into too many tables, as it becomes difficult to manage, and don’t prematurely optimize things. If your data set is always going to remain small, it’s better to just leave things as a unified set.

Columnar databases

Databases traditionally store data row by row: when you write a few rows of data, each row’s fields are stored sequentially on disk. You can also store data column by column, where the values of each column are grouped together across rows. Take a database that stores name-count pairings, for instance. This is how the data set would look on disk for the two methods of storage:

Row-based:

matt 3 jack 4 paul 5 dudley 6

Column-based:

matt jack paul dudley 3 4 5 6
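
A small Python sketch of the two layouts above makes the difference tangible; note that pulling just the counts out of the column-based layout is a single contiguous slice:

    records = [("matt", 3), ("jack", 4), ("paul", 5), ("dudley", 6)]

    # Row-based: each record's fields stored together, record after record.
    row_layout = " ".join(f"{name} {count}" for name, count in records)
    # -> "matt 3 jack 4 paul 5 dudley 6"

    # Column-based: all names first, then all counts.
    names = [name for name, _ in records]
    counts = [str(count) for _, count in records]
    column_layout = " ".join(names + counts)
    # -> "matt jack paul dudley 3 4 5 6"

    # Reading one column from the columnar layout is a contiguous read;
    # from the row layout it means skipping over every other field.
    counts_from_columns = column_layout.split()[len(records):]
    counts_from_rows = row_layout.split()[1::2]
    assert counts_from_columns == counts_from_rows == ["3", "4", "5", "6"]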

This has a ton of implications, the first of which is how disk paging works. Say you read a bunch of records from just one of the columns. In row-based storage, you have to jump to many locations on the disk, since that column’s data is scattered among the other fields. In column-based storage, you don’t have to jump around as much on the drive, since the column’s data is grouped together, and repeated access will be able to use page-cached data. The inverse is true if you are using data from most or all of the columns.

Columnar databases also tend to be able to compress their data better, since the values within a single column are typically far more similar to one another than the mixed fields of a row. On the other hand, columnar databases are not great at writing data incrementally: appending a single row means writing to a separate on-disk location for every column.
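
As a rough illustration of the compression point (the exact numbers depend on the data and the compressor), here is a sketch with hypothetical fields that compresses the same records with zlib in both layouts. A low-cardinality column becomes highly repetitive once it is stored contiguously, so the column layout typically compresses smaller:

    import random
    import zlib

    random.seed(0)
    user_id = [str(random.getrandbits(64)) for _ in range(10_000)]
    status = [random.choice(["active", "inactive"]) for _ in range(10_000)]

    # Row layout interleaves dissimilar fields; column layout groups them.
    row_bytes = " ".join(f"{u} {s}" for u, s in zip(user_id, status)).encode()
    col_bytes = (" ".join(user_id) + " " + " ".join(status)).encode()

    print("row layout:", len(zlib.compress(row_bytes)), "bytes")
    print("col layout:", len(zlib.compress(col_bytes)), "bytes")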

Via https://www.flydata.com/blog/whats-unique-about-a-columnar-database/

To summarize, columnar databases are good for:

    Queries that involve only a few columns
    Aggregation queries against vast amounts of data
    Column-wise compression

But are not so good at:

    Queries against only a few rows
    Incremental data loading
    Online Transaction Processing (OLTP) usage

Rules of thumb

http://mysql.rjweb.org/doc.php/ricksrots