- NTDB: Redesigning The Trivial DataBase
- Rusty Russell, IBM Corporation
- 19 June 2012
- Abstract
- The Trivial DataBase on-disk format is 32 bits; with usage cases
- heading towards the 4G limit, that must change. This required
- breakage provides an opportunity to revisit TDB's other design
- decisions and reassess them.
- 1 Introduction
- The Trivial DataBase was originally written by Andrew Tridgell as
- a simple key/data pair storage system with the same API as dbm,
- but allowing multiple readers and writers while being small
- enough (< 1000 lines of C) to include in SAMBA. The simple design
- created in 1999 has proven surprisingly robust and performant,
- used in Samba versions 3 and 4 as well as numerous other
- projects. Its useful life was greatly increased by the
- (backwards-compatible!) addition of transaction support in 2005.
- The wider variety and greater demands of TDB-using code have led
- to some organic growth of the API, as well as some compromises on
- the implementation. None of these, by themselves, are seen as
- show-stoppers, but the cumulative effect is a loss of elegance
- over the initial, simple TDB implementation. Here is a table of
- the approximate number of lines of implementation code and number
- of API functions at the end of each year:
- +-----------+----------------+--------------------------------+
- | Year End | API Functions | Lines of C Code Implementation |
- +-----------+----------------+--------------------------------+
- | 1999 | 13 | 1195 |
- +-----------+----------------+--------------------------------+
- | 2000 | 24 | 1725 |
- +-----------+----------------+--------------------------------+
- | 2001 | 32 | 2228 |
- +-----------+----------------+--------------------------------+
- | 2002 | 35 | 2481 |
- +-----------+----------------+--------------------------------+
- | 2003 | 35 | 2552 |
- +-----------+----------------+--------------------------------+
- | 2004 | 40 | 2584 |
- +-----------+----------------+--------------------------------+
- | 2005 | 38 | 2647 |
- +-----------+----------------+--------------------------------+
- | 2006 | 52 | 3754 |
- +-----------+----------------+--------------------------------+
- | 2007 | 66 | 4398 |
- +-----------+----------------+--------------------------------+
- | 2008 | 71 | 4768 |
- +-----------+----------------+--------------------------------+
- | 2009 | 73 | 5715 |
- +-----------+----------------+--------------------------------+
- This review is an attempt to catalog and address all the known
- issues with TDB and create solutions which address the problems
- without significantly increasing complexity; all involved are far
- too aware of the dangers of second system syndrome in rewriting a
- successful project like this.
- Note: the final decision was to make ntdb a separate library,
- with a separate 'ntdb' namespace so both can potentially be
- linked together. This document still refers to “tdb” everywhere,
- for simplicity.
- 2 API Issues
- 2.1 tdb_open_ex Is Not Expandable
- The tdb_open() call was expanded to tdb_open_ex(), which added an
- optional hashing function and an optional logging function
- argument. Additional arguments to open would require the
- introduction of a tdb_open_ex2 call etc.
- 2.1.1 Proposed Solution<attributes>
- tdb_open() will take a linked-list of attributes:
- /* enum and union tags share one namespace in C, so the enum
-  * needs a tag distinct from union tdb_attribute. */
- enum tdb_attribute_type {
-     TDB_ATTRIBUTE_LOG = 0,
-     TDB_ATTRIBUTE_HASH = 1
- };
- struct tdb_attribute_base {
-     enum tdb_attribute_type attr;
-     union tdb_attribute *next;
- };
- struct tdb_attribute_log {
-     struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
-     tdb_log_func log_fn;
-     void *log_private;
- };
- struct tdb_attribute_hash {
-     struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
-     tdb_hash_func hash_fn;
-     void *hash_private;
- };
- union tdb_attribute {
-     struct tdb_attribute_base base;
-     struct tdb_attribute_log log;
-     struct tdb_attribute_hash hash;
- };
- This allows future attributes to be added, even if this expands
- the size of the union.
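- As a rough illustration of how an open call might consume such a
- list, the sketch below walks the chain and collects the callbacks.
- The callback signatures, struct open_settings, apply_attributes()
- and the example_* callbacks are all hypothetical names invented
- here. The enum is tagged tdb_attribute_type because enum and union
- tags share one namespace in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback signatures, invented for this sketch. */
typedef void (*tdb_log_func)(const char *msg, void *log_private);
typedef uint64_t (*tdb_hash_func)(const void *key, size_t len,
                                  void *hash_private);

enum tdb_attribute_type {
    TDB_ATTRIBUTE_LOG = 0,
    TDB_ATTRIBUTE_HASH = 1
};

union tdb_attribute;

struct tdb_attribute_base {
    enum tdb_attribute_type attr;
    union tdb_attribute *next;
};

struct tdb_attribute_log {
    struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
    tdb_log_func log_fn;
    void *log_private;
};

struct tdb_attribute_hash {
    struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
    tdb_hash_func hash_fn;
    void *hash_private;
};

union tdb_attribute {
    struct tdb_attribute_base base;
    struct tdb_attribute_log log;
    struct tdb_attribute_hash hash;
};

/* Hypothetical: the settings an open call might gather from the list. */
struct open_settings {
    tdb_log_func log_fn;
    void *log_private;
    tdb_hash_func hash_fn;
    void *hash_private;
};

/* Walk the list; return 0 on success, -1 on an unrecognized attribute. */
int apply_attributes(const union tdb_attribute *attr,
                     struct open_settings *out)
{
    for (; attr != NULL; attr = attr->base.next) {
        switch (attr->base.attr) {
        case TDB_ATTRIBUTE_LOG:
            out->log_fn = attr->log.log_fn;
            out->log_private = attr->log.log_private;
            break;
        case TDB_ATTRIBUTE_HASH:
            out->hash_fn = attr->hash.hash_fn;
            out->hash_private = attr->hash.hash_private;
            break;
        default:
            return -1; /* attribute from a newer API: refuse it */
        }
    }
    return 0;
}

/* Trivial example callbacks used to exercise the list. */
void example_log(const char *msg, void *log_private)
{
    (void)msg;
    ++*(int *)log_private; /* count the messages */
}

uint64_t example_hash(const void *key, size_t len, void *hash_private)
{
    (void)key;
    (void)hash_private;
    return len; /* deliberately poor hash: demonstration only */
}
```

- A caller fills in one union per attribute, chains them through
- base.next, and passes the head of the list to the open call.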
- 2.1.2 Status
- Complete.
- 2.2 tdb_traverse Makes Impossible Guarantees
- tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions,
- and it was thought that it was important to guarantee that all
- records which exist at the start and end of the traversal would
- be included, and no record would be included twice.
- This adds complexity (see [Reliable-Traversal-Adds]) and does not
- work anyway for records which are altered (in particular, those
- which are expanded may be effectively deleted and re-added behind
- the traversal).
- 2.2.1 <traverse-Proposed-Solution>Proposed Solution
- Abandon the guarantee. You will see every record if no changes
- occur during your traversal, otherwise you will see some subset.
- You can prevent changes by using a transaction or the locking
- API.
- 2.2.2 Status
- Complete. Delete-during-traverse will still delete every record,
- too (assuming no other changes).
- 2.3 Nesting of Transactions Is Fraught
- TDB has alternated between allowing nested transactions and not
- allowing them. Various paths in the Samba codebase assume that
- transactions will nest, and in a sense they can: the operation is
- only committed to disk when the outer transaction is committed.
- There are two problems, however:
- 1. Canceling the inner transaction will cause the outer
- transaction commit to fail, and will not undo any operations
- since the inner transaction began. This problem is soluble with
- some additional internal code.
- 2. An inner transaction commit can be cancelled by the outer
- transaction. This is desirable in the way which Samba's
- database initialization code uses transactions, but could be a
- surprise to any users expecting a successful transaction commit
- to expose changes to others.
- The current solution is to specify the behavior at tdb_open(),
- with the default currently that nested transactions are allowed.
- This flag can also be changed at runtime.
- 2.3.1 Proposed Solution
- Given the usage patterns, it seems that the “least-surprise”
- behavior of disallowing nested transactions should become the
- default. Additionally, it seems the outer transaction is the only
- code which knows whether inner transactions should be allowed, so
- a flag to indicate this could be added to tdb_transaction_start.
- However, this behavior can be simulated with a wrapper which uses
- tdb_add_flags() and tdb_remove_flags(), so the API should not be
- expanded for this relatively-obscure case.
- 2.3.2 Status
- Complete; the nesting flag has been removed.
- 2.4 Incorrect Hash Function is Not Detected
- tdb_open_ex() allows the calling code to specify a different hash
- function to use, but does not check that all other processes
- accessing this tdb are using the same hash function. The result
- is that records are missing from tdb_fetch().
- 2.4.1 Proposed Solution
- The header should contain an example hash result (eg. the hash of
- 0xdeadbeef), and tdb_open_ex() should check that the given hash
- function produces the same answer, or fail the tdb_open call.
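- A minimal sketch of this check follows. The header layout, the
- function names and the choice of FNV-1a as the stand-in hash are
- assumptions made for illustration; only the idea of storing the
- hash of 0xdeadbeef and comparing it at open time comes from the
- text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*hash_func)(const void *key, size_t len);

/* Hypothetical header: only the field the check needs. */
struct example_header {
    uint64_t hash_test; /* hash of the probe, recorded at creation */
};

/* 0xdeadbeef spelled as bytes, so the result does not depend on
 * the byte order of the host. */
static const unsigned char hash_probe[4] = { 0xde, 0xad, 0xbe, 0xef };

/* FNV-1a, used here only as a stand-in hash function. */
uint64_t fnv1a_hash(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = 14695981039346656037ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* A deliberately different function: it always disagrees with
 * fnv1a_hash because it returns the bitwise complement. */
uint64_t inverted_hash(const void *key, size_t len)
{
    return ~fnv1a_hash(key, len);
}

/* At creation time, record what the chosen hash makes of the probe. */
void header_init(struct example_header *hdr, hash_func fn)
{
    hdr->hash_test = fn(hash_probe, sizeof(hash_probe));
}

/* At open time: 0 if fn matches the creator's hash, -1 if the open
 * should fail. */
int header_check_hash(const struct example_header *hdr, hash_func fn)
{
    return fn(hash_probe, sizeof(hash_probe)) == hdr->hash_test ? 0 : -1;
}
```

- A single probe cannot prove two hash functions identical, but it
- catches the common case of two entirely different functions.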
- 2.4.2 Status
- Complete.
- 2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
- In response to scalability issues with the free list (see
- [TDB-Freelist-Is]), two API workarounds have been incorporated in
- TDB: tdb_set_max_dead() and the TDB_VOLATILE flag to tdb_open. The
- latter actually calls the former with an argument of “5”.
- This code allows deleted records to accumulate without putting
- them in the free list. On delete we iterate through each chain
- and free them in a batch if there are more than max_dead entries.
- These are never otherwise recycled except as a side-effect of a
- tdb_repack.
- 2.5.1 Proposed Solution
- With the scalability problems of the freelist solved, this API
- can be removed. The TDB_VOLATILE flag may still be useful as a
- hint that store and delete of records will be at least as common
- as fetch in order to allow some internal tuning, but initially
- will become a no-op.
- 2.5.2 Status
- Complete. Unknown flags cause tdb_open() to fail as well, so they
- can be detected at runtime.
- 2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times
- In The Same Process
- No process can open the same TDB twice; we check and disallow it.
- This is an unfortunate side-effect of fcntl locks, which operate
- on a per-file rather than per-file-descriptor basis, and do not
- nest. Thus, closing any file descriptor on a file clears all the
- locks obtained by this process, even if they were placed using a
- different file descriptor!
- Note that even if this were solved, deadlock could occur if
- operations were nested: this is a more manageable programming
- error in most cases.
- 2.6.1 Proposed Solution
- We could lobby POSIX to fix the perverse rules, or at least lobby
- Linux to violate them so that the most common implementation does
- not have this restriction. This would be a generally good idea
- for other fcntl lock users.
- Samba uses a wrapper which hands out the same tdb_context to
- multiple callers if this happens, and does simple reference
- counting. We should do this inside the tdb library, which already
- emulates lock nesting internally; it would need to recognize when
- deadlock occurs within a single process. This would create a new
- failure mode for tdb operations (while we currently handle
- locking failures, they are impossible in normal use and a process
- encountering them can do little but give up).
- I do not see benefit in an additional tdb_open flag to indicate
- whether re-opening is allowed, although there may be some
- benefit to adding a call to detect when a tdb_context is shared,
- to allow others to create such an API.
- 2.6.2 Status
- Complete.
- 2.7 TDB API Is Not POSIX Thread-safe
- The TDB API uses an error code which can be queried after an
- operation to determine what went wrong. This programming model
- does not work with threads, unless specific additional guarantees
- are given by the implementation. In addition, even
- otherwise-independent threads cannot open the same TDB (as in
- [TDB-Files-Cannot]).
- 2.7.1 Proposed Solution
- Rearchitecting the API to include a tdb_errcode pointer would be a
- great deal of churn, but fortunately most functions return 0 on
- success and -1 on error: we can change these to return 0 on
- success and a negative error code on error, and the API remains
- similar to the previous one. The tdb_fetch, tdb_firstkey and tdb_nextkey
- functions need to take a TDB_DATA pointer and return an error
- code. It is also simpler to have tdb_nextkey replace its key
- argument in place, freeing up any old .dptr.
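- The calling convention described in this paragraph might look
- roughly like the sketch below. The enum, the structs and the toy
- one-record "database" are hypothetical; the point is only that the
- error travels in the return value and the data comes back through
- a pointer, so no per-context error state is shared between
- threads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes: 0 on success, negative on failure. */
enum example_error {
    EX_SUCCESS = 0,
    EX_ERR_NOEXIST = -1
};

struct example_data {
    const unsigned char *dptr;
    size_t dsize;
};

/* A toy database holding exactly one record. */
struct example_db {
    struct example_data key;
    struct example_data value;
};

/* Fetch returns an error code; the record comes back through *out. */
enum example_error example_fetch(const struct example_db *db,
                                 struct example_data key,
                                 struct example_data *out)
{
    if (key.dsize == db->key.dsize &&
        memcmp(key.dptr, db->key.dptr, key.dsize) == 0) {
        *out = db->value;
        return EX_SUCCESS;
    }
    return EX_ERR_NOEXIST;
}
```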
- Internal locking is required to make sure that fcntl locks do not
- overlap between threads, and also that the global list of tdbs is
- maintained.
- The aim is that building tdb with -DTDB_PTHREAD will result in a
- pthread-safe version of the library, and otherwise no overhead
- will exist. Alternatively, a hooking mechanism similar to that
- proposed for [Proposed-Solution-locking-hook] could be used to
- enable pthread locking at runtime.
- 2.7.2 Status
- Incomplete; API has been changed but thread safety has not been
- implemented.
- 2.8 *_nonblock Functions And *_mark Functions Expose
- Implementation
- CTDB[footnote:
- Clustered TDB, see http://ctdb.samba.org
- ] wishes to operate on TDB in a non-blocking manner. This is
- currently done as follows:
- 1. Call the _nonblock variant of an API function (eg.
- tdb_lockall_nonblock). If this fails:
- 2. Fork a child process, and wait for it to call the normal
- variant (eg. tdb_lockall).
- 3. If the child succeeds, call the _mark variant to indicate we
- already have the locks (eg. tdb_lockall_mark).
- 4. Upon completion, tell the child to release the locks (eg.
- tdb_unlockall).
- 5. Indicate to tdb that it should consider the locks removed (eg.
- tdb_unlockall_mark).
- There are several issues with this approach. Firstly, adding two
- new variants of each function clutters the API for an obscure
- use, and so not all functions have three variants. Secondly, it
- assumes that all paths of the functions ask for the same locks,
- otherwise the parent process will have to get a lock which the
- child doesn't have under some circumstances. I don't believe this
- is currently the case, but it constrains the implementation.
- 2.8.1 <Proposed-Solution-locking-hook>Proposed Solution
- Implement a hook for locking methods, so that the caller can
- control the calls to create and remove fcntl locks. In this
- scenario, ctdbd would operate as follows:
- 1. Call the normal API function, eg tdb_lockall().
- 2. When the lock callback comes in, check if the child has the
- lock (initially, this is always false). If it does, return 0.
- Otherwise, try to obtain it in non-blocking mode. If that
- fails, return EWOULDBLOCK.
- 3. Release locks in the unlock callback as normal.
- 4. If tdb_lockall() fails, see if we recorded a lock failure; if
- so, call the child to repeat the operation.
- 5. The child records what locks it obtains, and returns that
- information to the parent.
- 6. When the child has succeeded, goto 1.
- This is flexible enough to handle any potential locking scenario,
- even when lock requirements change. It can be optimized so that
- the parent does not release locks, just tells the child which
- locks it doesn't need to obtain.
- It also keeps the complexity out of the API, and in ctdbd where
- it is needed.
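- Step 2 of the procedure above can be sketched as a lock callback.
- Everything named here (struct hook_state, lock_hook and the
- try_lock_* helpers) is hypothetical; a real hook would attempt a
- non-blocking fcntl lock where this sketch calls try_lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state the daemon keeps for one database. */
struct hook_state {
    bool child_has_lock;        /* initially false */
    bool saw_failure;           /* did a non-blocking attempt fail? */
    int (*try_lock)(void *arg); /* 0 on success, -1 with errno set */
    void *arg;
};

/* Lock callback: returns 0 on success or an errno value. */
int lock_hook(struct hook_state *st)
{
    if (st->child_has_lock)
        return 0;               /* the child already holds it for us */
    if (st->try_lock(st->arg) == 0)
        return 0;               /* obtained without blocking */
    st->saw_failure = true;     /* remembered for step 4 */
    return EWOULDBLOCK;
}

/* Stand-ins for a non-blocking lock attempt. */
int try_lock_busy(void *arg)
{
    (void)arg;
    errno = EAGAIN;
    return -1;
}

int try_lock_free(void *arg)
{
    (void)arg;
    return 0;
}
```

- When the top-level call fails and saw_failure is set, the parent
- asks the child to take the locks, sets child_has_lock once the
- child reports success, and retries.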
- 2.8.2 Status
- Complete.
- 2.9 tdb_chainlock Functions Expose Implementation
- tdb_chainlock locks some number of records, including the record
- indicated by the given key. This gave atomicity guarantees;
- no-one can start a transaction, alter, read or delete that key
- while the lock is held.
- It also makes the same guarantee for any other key in the chain,
- which is an internal implementation detail and potentially a
- cause for deadlock.
- 2.9.1 Proposed Solution
- None. It would be nice to have an explicit single entry lock
- which affected no other keys. Unfortunately, this won't work for
- an entry which doesn't exist. Thus while chainlock may be
- implemented more efficiently for the existing case, it will still
- have overlap issues with the non-existing case. So it is best to
- keep the current (lack of) guarantee about which records will be
- affected to avoid constraining our implementation.
- 2.10 Signal Handling is Not Race-Free
- The tdb_setalarm_sigptr() call allows the caller's signal handler
- to indicate that the tdb locking code should return with a
- failure, rather than trying again when a signal is received (and
- errno == EAGAIN). This is usually used to implement timeouts.
- Unfortunately, this does not work in the case where the signal is
- received before the tdb code enters the fcntl() call to place the
- lock: the code will sleep within the fcntl() code, unaware that
- the signal wants it to exit. In the case of long timeouts, this
- does not happen in practice.
- 2.10.1 Proposed Solution
- The locking hooks proposed in [Proposed-Solution-locking-hook]
- would allow the user to decide on whether to fail the lock
- acquisition on a signal. This allows the caller to choose their
- own compromise: they could narrow the race by checking
- immediately before the fcntl call.[footnote:
- It may be possible to make this race-free in some implementations
- by having the signal handler alter the struct flock to make it
- invalid. This will cause the fcntl() lock call to fail with
- EINVAL if the signal occurs before the kernel is entered,
- otherwise EAGAIN.
- ]
- 2.10.2 Status
- Complete.
- 2.11 The API Uses Gratuitous Typedefs, Capitals
- typedefs are useful for providing source compatibility when types
- can differ across implementations, or arguably in the case of
- function pointer definitions which are hard for humans to parse.
- Otherwise it is simply obfuscation and pollutes the namespace.
- Capitalization is usually reserved for compile-time constants and
- macros.
- TDB_CONTEXT There is no reason to use this over 'struct
- tdb_context'; the definition isn't visible to the API user
- anyway.
- TDB_DATA There is no reason to use this over struct TDB_DATA;
- the struct needs to be understood by the API user.
- struct TDB_DATA This would normally be called 'struct
- tdb_data'.
- enum TDB_ERROR Similarly, this would normally be enum
- tdb_error.
- 2.11.1 Proposed Solution
- None. Introducing lower case variants would please pedants like
- myself, but if it were done the existing ones should be kept.
- There is little point forcing a purely cosmetic change upon tdb
- users.
- 2.12 <tdb_log_func-Doesnt-Take>tdb_log_func Doesn't Take The
- Private Pointer
- For API compatibility reasons, the logging function needs to call
- tdb_get_logging_private() to retrieve the pointer registered by
- the tdb_open_ex for logging.
- 2.12.1 Proposed Solution
- It should simply take an extra argument, since we are prepared to
- break the API/ABI.
- 2.12.2 Status
- Complete.
- 2.13 Various Callback Functions Are Not Typesafe
- The callback functions in tdb_set_logging_function (after
- [tdb_log_func-Doesnt-Take] is resolved), tdb_parse_record,
- tdb_traverse, tdb_traverse_read and tdb_check all take void * and
- must internally convert it to the argument type they were expecting.
- If this type changes, the compiler will not produce warnings on
- the callers, since it only sees void *.
- 2.13.1 Proposed Solution
- With careful use of macros, we can create callback functions
- which give a warning when used on gcc and the types of the
- callback and its private argument differ. Unsupported compilers
- will not give a warning, which is no worse than now. In addition,
- the callbacks become clearer, as they need not use void * for
- their parameter.
- See CCAN's typesafe_cb module at
- http://ccan.ozlabs.org/info/typesafe_cb.html
- 2.13.2 Status
- Complete.
- 2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens,
- tdb_reopen_all Problematic
- The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB
- file should be cleared if the caller discovers it is the only
- process with the TDB open. However, if any caller does not
- specify TDB_CLEAR_IF_FIRST it will not be detected, so will have
- the TDB erased underneath them (usually resulting in a crash).
- There is a similar issue on fork(); if the parent exits (or
- otherwise closes the tdb) before the child calls tdb_reopen_all()
- to establish the lock used to indicate the TDB is opened by
- someone, a TDB_CLEAR_IF_FIRST opener at that moment will believe
- it alone has opened the TDB and will erase it.
- 2.14.1 Proposed Solution
- Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
- see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
- 2.14.2 Status
- Complete. An open hook is provided to replicate this
- functionality if required.
- 2.15 Extending The Header Is Difficult
- We have reserved (zeroed) words in the TDB header, which can be
- used for future features. If the future features are compulsory,
- the version number must be updated to prevent old code from
- accessing the database. But if the future feature is optional, we
- have no way of telling if older code is accessing the database or
- not.
- 2.15.1 Proposed Solution
- The header should contain a “format variant” value (64-bit). This
- is divided into two 32-bit parts:
- 1. The lower part reflects the format variant understood by code
- accessing the database.
- 2. The upper part reflects the format variant you must understand
- to write to the database (otherwise you can only open for
- reading).
- The latter field can only be written at creation time, the former
- should be written under the OPEN_LOCK when opening the database
- for writing, if the variant of the code is lower than the current
- lowest variant.
- This should allow backwards-compatible features to be added, and
- detection if older code (which doesn't understand the feature)
- writes to the database.
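- The two halves of the field can be illustrated with a few helper
- functions. The function names are hypothetical, and the sketch
- assumes variants are ordered numbers where code with a higher
- variant understands everything a lower one does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Lower 32 bits: lowest variant of any code that has written. */
uint32_t variant_accessed(uint64_t field)
{
    return (uint32_t)(field & 0xffffffffu);
}

/* Upper 32 bits: variant needed to write; fixed at creation time. */
uint32_t variant_required(uint64_t field)
{
    return (uint32_t)(field >> 32);
}

uint64_t variant_make(uint32_t required, uint32_t accessed)
{
    return ((uint64_t)required << 32) | accessed;
}

/* May code that understands variant `mine` open for writing? */
bool variant_can_write(uint64_t field, uint32_t mine)
{
    return mine >= variant_required(field);
}

/* Called under the open lock when opening for writing: lower the
 * "accessed" half if this code is older than any previous writer. */
uint64_t variant_note_writer(uint64_t field, uint32_t mine)
{
    if (mine < variant_accessed(field))
        return variant_make(variant_required(field), mine);
    return field;
}
```

- Newer code can then compare variant_accessed() with the variant
- that introduced an optional feature to learn whether older code
- may have written to the database since.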
- 2.15.2 Status
- Complete.
- 2.16 Record Headers Are Not Expandable
- If we later want to add (say) checksums on keys and data, it
- would require another format change, which we'd like to avoid.
- 2.16.1 Proposed Solution
- We often have extra padding at the tail of a record. If we ensure
- that the first byte (if any) of this padding is zero, we will
- have a way for future changes to detect code which doesn't
- understand a new format: the new code would write (say) a 1 at
- the tail, and thus if there is no tail or the first byte is 0, we
- would know the extension is not present on that record.
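- A small sketch of that convention, with hypothetical function
- names: `pad` points at the padding after a record's key and data,
- and `pad_len` is the number of padding bytes (possibly zero).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Current code: make sure the first padding byte, if any, is zero. */
void write_padding_plain(unsigned char *pad, size_t pad_len)
{
    if (pad_len > 0)
        pad[0] = 0;
}

/* Hypothetical future code: mark the record as carrying an
 * extension. Returns false if there is no padding byte to mark. */
bool write_padding_extended(unsigned char *pad, size_t pad_len)
{
    if (pad_len == 0)
        return false;
    pad[0] = 1;
    return true;
}

/* No tail, or a zero first byte, means no extension on this record. */
bool record_has_extension(const unsigned char *pad, size_t pad_len)
{
    return pad_len > 0 && pad[0] != 0;
}
```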
- 2.16.2 Status
- Complete.
- 2.17 TDB Does Not Use Talloc
- Many users of TDB (particularly Samba) use the talloc allocator,
- and thus have to wrap TDB in a talloc context to use it
- conveniently.
- 2.17.1 Proposed Solution
- The allocation within TDB is not complicated enough to justify
- the use of talloc, and I am reluctant to force another
- (excellent) library on TDB users. Nonetheless a compromise is
- possible. An attribute (see [attributes]) can be added later to
- tdb_open() to provide an alternate allocation mechanism,
- specifically for talloc but usable by any other allocator (which
- would ignore the “context” argument).
- This would form a talloc hierarchy as expected, but the caller
- would still have to attach a destructor to the tdb context
- returned from tdb_open to close it. All TDB_DATA fields would be
- children of the tdb_context, and the caller would still have to
- manage them (using talloc_free() or talloc_steal()).
- 2.17.2 Status
- Complete, using the NTDB_ATTRIBUTE_ALLOCATOR attribute.
- 3 Performance And Scalability Issues
- 3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
- Imposes Performance Penalty
- When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is
- placed at offset 4 (aka. the ACTIVE_LOCK). While these locks
- never conflict in normal tdb usage, they do add substantial
- overhead for most fcntl lock implementations when the kernel
- scans to detect if a lock conflict exists. This is often a singly
- linked list, making the time to acquire and release a fcntl lock
- O(N) where N is the number of processes with the TDB open, not
- the number actually doing work.
- In a Samba server it is common to have huge numbers of clients
- sitting idle, and thus they have weaned themselves off the
- TDB_CLEAR_IF_FIRST flag.[footnote:
- There is a flag to tdb_reopen_all() which is used for this
- optimization: if the parent process will outlive the child, the
- child does not need the ACTIVE_LOCK. This is a workaround for
- this very performance issue.
- ]
- 3.1.1 Proposed Solution
- Remove the flag. It was a neat idea, but even trivial servers
- tend to know when they are initializing for the first time and
- can simply unlink the old tdb at that point.
- 3.1.2 Status
- Complete.
- 3.2 TDB Files Have a 4G Limit
- This seems to be becoming an issue (so much for “trivial”!),
- particularly for ldb.
- 3.2.1 Proposed Solution
- A new, incompatible TDB format which uses 64 bit offsets
- internally rather than 32 bit as now. For simplicity of endian
- conversion (which TDB does on the fly if required), all values
- will be 64 bit on disk. In practice, some upper bits may be used
- for other purposes, but at least 56 bits will be available for
- file offsets.
- tdb_open() will automatically detect the old version, and will
- even create a file in the old format if TDB_VERSION6 is specified
- to tdb_open.
- 32 bit processes will still be able to access TDBs larger than 4G
- (assuming that their off_t allows them to seek to 64 bits); they
- will gracefully fall back as they fail to mmap. This can happen
- already with large TDBs.
- Old versions of tdb will fail to open the new TDB files (since 28
- August 2009, commit 398d0c29290: prior to that any unrecognized
- file format would be erased and initialized as a fresh tdb!)
- 3.2.2 Status
- Complete.
- 3.3 TDB Records Have a 4G Limit
- This has not been a reported problem, and the API uses size_t
- which can be 64 bit on 64 bit platforms. However, other limits
- may have made such an issue moot.
- 3.3.1 Proposed Solution
- Record sizes will be 64 bit, with an error returned on 32 bit
- platforms which try to access such records (the current
- implementation would return TDB_ERR_OOM in a similar case). It
- seems unlikely that 32 bit keys will be a limitation, so the
- implementation may not support this (see [sub:Records-Incur-A]).
- 3.3.2 Status
- Complete.
- 3.4 Hash Size Is Determined At TDB Creation Time
- TDB contains a number of hash chains in the header; the number is
- specified at creation time, and defaults to 131. This is such a
- bottleneck on large databases (as each hash chain gets quite
- long), that LDB uses 10,000 for this hash. In general it is
- impossible to know what the 'right' answer is at database
- creation time.
- 3.4.1 <sub:Hash-Size-Solution>Proposed Solution
- After comprehensive performance testing on various scalable hash
- variants[footnote:
- http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94
- This was annoying because I was previously convinced that an
- expanding tree of hashes would be very close to optimal.
- ], it became clear that it is hard to beat a straight linear hash
- table which doubles in size when it reaches saturation.
- Unfortunately, altering the hash table introduces serious locking
- complications: the entire hash table needs to be locked to
- enlarge the hash table, and others might be holding locks.
- Particularly insidious are insertions done under tdb_chainlock.
- Thus an expanding layered hash will be used: an array of hash
- groups, with each hash group exploding into pointers to lower
- hash groups once it fills, turning into a hash tree. This has
- implications for locking: we must lock the entire group in case
- we need to expand it, yet we don't know how deep the tree is at
- that point.
- Note that bits from the hash table entries should be stolen to
- hold more hash bits to reduce the penalty of collisions. We can
- use the otherwise-unused lower 3 bits. If we limit the size of
- the database to 64 petabytes (2^56 bytes), we can use the top 8 bits of the
- hash entry as well. These 11 bits would reduce false positives
- down to 1 in 2000 which is more than we need: we can use one of
- the bits to indicate that the extra hash bits are valid. This
- means we can choose not to re-hash all entries when we expand a
- hash group; simply use the next bits we need and mark them
- invalid.
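- One possible packing of such an entry is sketched below. The exact
- bit assignment is an assumption made for illustration: bit 0 is
- the "extra hash bits valid" flag, bits 1-2 and the top 8 bits hold
- 10 extra hash bits, and the remaining bits hold an 8-byte-aligned
- offset below 2^56. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_OFF_MASK  0x00fffffffffffff8ULL /* bits 3..55: offset */
#define ENTRY_VALID_BIT 0x1ULL                /* bit 0: hash bits valid */

/* Pack an offset with 10 extra hash bits. The offset is expected to
 * be 8-byte aligned and below 2^56; other bits are masked off. */
uint64_t entry_pack(uint64_t off, uint32_t extra)
{
    return (off & ENTRY_OFF_MASK)
        | ((uint64_t)(extra & 0xff) << 56)       /* low 8 hash bits */
        | ((uint64_t)((extra >> 8) & 0x3) << 1)  /* next 2 hash bits */
        | ENTRY_VALID_BIT;
}

uint64_t entry_offset(uint64_t e)
{
    return e & ENTRY_OFF_MASK;
}

bool entry_valid(uint64_t e)
{
    return (e & ENTRY_VALID_BIT) != 0;
}

uint32_t entry_extra(uint64_t e)
{
    return (uint32_t)(e >> 56) | (uint32_t)(((e >> 1) & 0x3) << 8);
}

/* Cheap pre-check before reading the record itself: false means
 * the entry definitely belongs to a different key. */
bool entry_may_match(uint64_t e, uint32_t extra)
{
    if (!entry_valid(e))
        return true; /* no stored bits to compare: must read record */
    return entry_extra(e) == (extra & 0x3ff);
}

/* Drop the stored hash bits, e.g. when a group is expanded without
 * re-hashing its entries. */
uint64_t entry_invalidate(uint64_t e)
{
    return e & ENTRY_OFF_MASK;
}
```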
- 3.4.2 Status
- Ignore. Scaling the hash automatically proved inefficient at
- small hash sizes; we default to an 8192-element hash (changeable
- via NTDB_ATTRIBUTE_HASHSIZE), and when buckets clash we expand to
- an array of hash entries. This scales slightly better than the
- tdb chain (due to the 8 top bits containing extra hash).
- 3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
- TDB uses a singly linked list for the free list. Allocation
- occurs as follows, using heuristics which have evolved over time:
- 1. Get the free list lock for this whole operation.
- 2. Multiply length by 1.25, so we always over-allocate by 25%.
- 3. Set the slack multiplier to 1.
- 4. Examine the current freelist entry: if it is > length but <
- the current best case, remember it as the best case.
- 5. Multiply the slack multiplier by 1.05.
- 6. If our best fit so far is less than length * slack multiplier,
- return it. The slack will be turned into a new free record if
- it's large enough.
- 7. Otherwise, go onto the next freelist entry.
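- The allocation steps above can be sketched over an in-memory list
- as follows. This is not the TDB source: struct free_entry and
- pick_free_entry() are hypothetical, the free list lock of step 1
- is omitted, and returning the best candidate when the list runs
- out is an assumption, since the steps do not say what happens
- then.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for an on-disk free list entry. */
struct free_entry {
    size_t len;
    const struct free_entry *next;
};

/* Return the chosen free entry, or NULL if none is large enough. */
const struct free_entry *pick_free_entry(const struct free_entry *head,
                                         size_t length)
{
    const struct free_entry *best = NULL;
    double slack = 1.0;                      /* step 3 */

    length += length / 4;                    /* step 2: ask for 25% more */

    for (const struct free_entry *e = head; e != NULL; e = e->next) {
        /* Step 4: big enough, and smaller than the best so far? */
        if (e->len > length && (best == NULL || e->len < best->len))
            best = e;

        slack *= 1.05;                       /* step 5 */

        /* Step 6: accept a looser fit the longer we have searched. */
        if (best != NULL && (double)best->len < (double)length * slack)
            return best;
        /* Step 7: otherwise move on to the next entry. */
    }
    return best; /* assumption: fall back to the best fit seen */
}
```

- The growing slack multiplier bounds the walk: each entry examined
- makes a less tight fit acceptable.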
- Deleting a record occurs as follows:
- 1. Lock the hash chain for this whole operation.
- 2. Walk the chain to find the record, keeping the prev pointer
- offset.
- 3. If max_dead is non-zero:
- (a) Walk the hash chain again and count the dead records.
- (b) If it's more than max_dead, bulk free all the dead ones
- (similar to steps 4 and below, but the lock is only obtained
- once).
- (c) Simply mark this record as dead and return.
- 4. Get the free list lock for the remainder of this operation.
- 5. <right-merging>Examine the following block to see if it is
- free; if so, enlarge the current block and remove that block
- from the free list. This was disabled, as removal from the free
- list was O(entries-in-free-list).
- 6. Examine the preceding block to see if it is free: for this
- reason, each block has a 32-bit tailer which indicates its
- length. If it is free, expand it to cover our new block and
- return.
- 7. Otherwise, prepend ourselves to the free list.
- Disabling right-merging (step[right-merging]) causes
- fragmentation; the other heuristics proved insufficient to
- address this, so the final answer to this was that when we expand
- the TDB file inside a transaction commit, we repack the entire
- tdb.
- The single list lock limits our allocation rate; due to the other
- issues this is not currently seen as a bottleneck.
- 3.5.1 Proposed Solution
- The first step is to remove all the current heuristics, as they
- obviously interact, then examine them once the lock contention is
- addressed.
- The free list must be split to reduce contention. Assuming
- perfect free merging, we can at most have 1 free list entry for
- each entry. This implies that the number of free lists is related
- to the size of the hash table, but as it is rare to walk a large
- number of free list entries we can use far fewer, say 1/32 of the
- number of hash buckets.
- It seems tempting to try to reuse the hash implementation which
- we use for records here, but we have two ways of searching for
- free entries: for allocation we search by size (and possibly
- zone) which produces too many clashes for our hash table to
- handle well, and for coalescing we search by address. Thus an
- array of doubly-linked free lists seems preferable.
- There are various benefits in using per-size free lists
- (see[sub:TDB-Becomes-Fragmented]) but it's not clear this would
- reduce contention in the common case where all processes are
- allocating/freeing the same size.
- Thus we almost certainly need to divide in other ways: the most
- obvious is to divide the file into zones, and use a free list
- (or table of free lists) for each. This approximates address
- ordering.
- Unfortunately it is difficult to know what heuristics should be
- used to determine zone sizes, and our transaction code relies on
- being able to create a “recovery area” by simply appending to the
- file (difficult if it would need to create a new zone header).
- Thus we use a linked-list of free tables; currently we only ever
- create one, but if there is more than one we choose one at random
- to use. In future we may use heuristics to add new free tables on
- contention. We only expand the file when all free tables are
- exhausted.
- The basic algorithm is as follows. Freeing is simple:
- 1. Identify the correct free list.
- 2. Lock the corresponding list.
- 3. Re-check the list (we didn't have a lock, sizes could have
- changed): relock if necessary.
- 4. Place the freed entry in the list.
- Allocation is a little more complicated, as we perform delayed
- coalescing at this point:
- 1. Pick a free table; usually the previous one.
- 2. Lock the corresponding list.
- 3. If the top entry is large enough, remove it from the list and
- return it.
- 4. Otherwise, coalesce entries in the list. If there was no entry
- large enough, unlock the list and try the next largest list.
- 5. If no list has an entry which meets our needs, try the next
- free table.
- 6. If no free table satisfies, expand the file.
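- A minimal sketch of steps 3 to 5 within a single free table,
- assuming an in-memory array of doubly-linked lists bucketed by
- size. NUM_BUCKETS, the bucket boundaries and all function names
- are hypothetical; locking and the delayed coalescing pass are
- left out.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BUCKETS 8   /* hypothetical; not the real table size */

struct free_rec {
    size_t len;
    struct free_rec *prev, *next;   /* doubly linked for easy deletion */
};

struct free_table {
    struct free_rec *bucket[NUM_BUCKETS];
};

/* Hypothetical mapping: bucket b holds lengths below 16 << b;
 * the last bucket is unbounded. */
static unsigned size_to_bucket(size_t len)
{
    unsigned b = 0;
    while (b < NUM_BUCKETS - 1 && len >= ((size_t)16 << b))
        b++;
    return b;
}

/* Freeing: push the entry onto the head of its size bucket. */
static void table_add(struct free_table *t, struct free_rec *r)
{
    unsigned b = size_to_bucket(r->len);
    r->prev = NULL;
    r->next = t->bucket[b];
    if (r->next)
        r->next->prev = r;
    t->bucket[b] = r;
}

/* Unlink r from bucket b without walking the list. */
static void table_remove(struct free_table *t, unsigned b, struct free_rec *r)
{
    if (r->prev)
        r->prev->next = r->next;
    else
        t->bucket[b] = r->next;
    if (r->next)
        r->next->prev = r->prev;
}

/* Allocation: test only the top entry of each list, then move on
 * to the next largest list. */
static struct free_rec *table_alloc(struct free_table *t, size_t len)
{
    assert(len > 0);
    for (unsigned b = size_to_bucket(len); b < NUM_BUCKETS; b++) {
        struct free_rec *top = t->bucket[b];
        if (top && top->len >= len) {
            table_remove(t, b, top);
            return top;
        }
        /* a real implementation would coalesce this list here */
    }
    return NULL;   /* caller tries the next free table, then expands */
}
```

- Looking only at the top entry keeps the common path short; the
- price is that a usable entry deeper in the same list is skipped
- until a coalescing pass reorders things.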
- This optimizes rapid insert/delete of free list entries by not
- coalescing them all the time. First-fit address ordering
- seems to be fairly good for keeping fragmentation low
- (see[sub:TDB-Becomes-Fragmented]). Note that address ordering
- does not need a tailer to coalesce, though if we needed one we
- could have one cheaply: see[sub:Records-Incur-A].
- Each free entry has the free table number in the header: less
- than 255. It also contains a doubly-linked list for easy
- deletion.
- 3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
- Much of this is a result of allocation strategy[footnote:
- The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1998
- ftp://ftp.cs.utexas.edu/pub/garbage/malloc/ismm98.ps
- ] and deliberate hobbling of coalescing; internal fragmentation
- (aka overallocation) is deliberately set at 25%, and external
- fragmentation is only cured by the decision to repack the entire
- db when a transaction commit needs to enlarge the file.
- 3.6.1 Proposed Solution
- The 25% overhead on allocation works in practice for ldb because
- indexes tend to expand by one record at a time. This internal
- fragmentation can be resolved by having an “expanded” bit in the
- header to note entries that have previously expanded, and
- allocating more space for them.
- There is a spectrum of possible solutions for external
- fragmentation: one is to use a fragmentation-avoiding allocation
- strategy such as a best-fit address-order allocator. The other end
- of the spectrum would be to use a bump allocator (very fast and
- simple) and simply repack the file when we reach the end.
- There are three problems with efficient fragmentation-avoiding
- allocators: they are non-trivial, they tend to use a single free
- list for each size, and there's no evidence that tdb allocation
- patterns will match those recorded for general allocators (though
- it seems likely).
- Thus we don't spend too much effort on external fragmentation; we
- will be no worse than the current code if we need to repack on
- occasion. More effort is spent on reducing freelist contention,
- and reducing overhead.
- 3.7 <sub:Records-Incur-A>Records Incur A 28-Byte Overhead
- Each TDB record has a header as follows:
- struct tdb_record {
-     tdb_off_t next;     /* offset of the next record in the list */
-     tdb_len_t rec_len;  /* total byte length of record */
-     tdb_len_t key_len;  /* byte length of key */
-     tdb_len_t data_len; /* byte length of data */
-     uint32_t full_hash; /* the full 32 bit hash of the key */
-     uint32_t magic;     /* try to catch errors */
-     /* the following union is implied:
-        union {
-            char record[rec_len];
-            struct {
-                char key[key_len];
-                char data[data_len];
-            }
-            uint32_t totalsize; (tailer)
-        }
-     */
- };
- Naively, this would double to a 56-byte overhead on a 64 bit
- implementation.
- 3.7.1 Proposed Solution
- We can use various techniques to reduce this for an allocated
- block:
- 1. The 'next' pointer is not required, as we are using a flat
- hash table.
- 2. 'rec_len' can instead be expressed as an addition to key_len
- and data_len (it accounts for wasted or overallocated length in
- the record). Since the record length is always a multiple of 8,
- we can conveniently fit it in 32 bits (representing up to 35
- bits).
- 3. 'key_len' and 'data_len' can be reduced. I'm unwilling to
- restrict 'data_len' to 32 bits, but instead we can combine the
- two into one 64-bit field and use a 5-bit value which
- indicates at what bit to divide the two. Keys are unlikely to
- scale as fast as data, so I'm assuming a maximum key size of 32
- bits.
- 4. 'full_hash' is used to avoid a memcmp on the “miss” case, but
- this is diminishing returns after a handful of bits (at 10
- bits, it avoids 99.9% of false memcmps). As an aside, as the
- lower bits are already incorporated in the hash table
- resolution, the upper bits should be used here. Note that it's
- not clear that these bits will be a win, given the extra bits
- in the hash table itself (see[sub:Hash-Size-Solution]).
- 5. 'magic' does not need to be enlarged: it currently reflects
- one of 5 values (used, free, dead, recovery, and
- unused_recovery). It is useful for quick sanity checking
- however, and should not be eliminated.
- 6. 'tailer' is only used to coalesce free blocks (so a block to
- the right can find the header to check if this block is free).
- This can be replaced by a single 'free' bit in the header of
- the following block (and the tailer only exists in free
- blocks).[footnote:
- This technique from Thomas Standish. Data Structure Techniques.
- Addison-Wesley, Reading, Massachusetts, 1980.
- ] The current proposed coalescing algorithm doesn't need this,
- however.
- This produces a 16 byte used header like this:
- struct tdb_used_record {
-     uint32_t used_magic : 16,
-              key_data_divide : 5,
-              top_hash : 11;
-     uint32_t extra_octets;
-     uint64_t key_and_data_len;
- };
- And a free record like this:
- struct tdb_free_record {
-     uint64_t free_magic : 8,
-              prev : 56;
-     uint64_t free_table : 8,
-              total_length : 56;
-     uint64_t next;
- };
- Note that by limiting valid offsets to 56 bits, we can pack
- everything we need into three 64-bit words (24 bytes). As the
- used header takes 16 bytes, our minimum record size is 8 bytes.
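- To illustrate the combined key/data length field, here is a
- hedged sketch of one possible encoding. It assumes the low
- 'divide' bits hold the key length and the remaining bits hold
- the data length; the actual bit assignment is not specified
- above, and pack_lens() and the unpack helpers are hypothetical
- names. With a plain 5-bit divide the key is limited to 31 bits
- in this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 for v == 0). */
static unsigned bits_needed(uint64_t v)
{
    unsigned n = 0;
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/*
 * Assumed layout: key length in the low `divide` bits, data length
 * in the bits above. Returns 0 on success, -1 if the lengths do
 * not fit.
 */
static int pack_lens(uint64_t key_len, uint64_t data_len,
                     unsigned *divide, uint64_t *packed)
{
    unsigned kbits = bits_needed(key_len);

    if (kbits > 31)                      /* divide must fit in 5 bits */
        return -1;
    if (bits_needed(data_len) > 64 - kbits)
        return -1;
    *divide = kbits;
    *packed = (data_len << kbits) | key_len;
    return 0;
}

static uint64_t unpack_key_len(unsigned divide, uint64_t packed)
{
    assert(divide <= 31);
    return packed & (((uint64_t)1 << divide) - 1);
}

static uint64_t unpack_data_len(unsigned divide, uint64_t packed)
{
    assert(divide <= 31);
    return packed >> divide;
}
```

- Small keys leave almost the whole 64 bits for the data length,
- which is the point of making the divide variable.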
- 3.7.2 Status
- Complete.
- 3.8 Transaction Commit Requires 4 fdatasync
- The current transaction algorithm is:
- 1. write_recovery_data();
- 2. sync();
- 3. write_recovery_header();
- 4. sync();
- 5. overwrite_with_new_data();
- 6. sync();
- 7. remove_recovery_header();
- 8. sync();
- On current ext3, each sync flushes all data to disk, so the next
- 3 syncs are relatively inexpensive. But this could become a
- performance bottleneck on other filesystems such as ext4.
- 3.8.1 Proposed Solution
- Neil Brown points out that this is overzealous, and only one sync
- is needed:
- 1. Bundle the recovery data, a transaction counter and a strong
- checksum of the new data.
- 2. Strong checksum that whole bundle.
- 3. Store the bundle in the database.
- 4. Overwrite the oldest of the two recovery pointers in the
- header (identified using the transaction counter) with the
- offset of this bundle.
- 5. sync.
- 6. Write the new data to the file.
- Checking for recovery means identifying the latest bundle with a
- valid checksum and using the new data checksum to ensure that it
- has been applied. This is more expensive than the current check,
- but need only be done at open. For running databases, a separate
- header field can be used to indicate a transaction in progress;
- we need only check for recovery if this is set.
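- The recovery check described above might look something like the
- following sketch. Everything here is an assumption made for
- illustration: struct bundle carries only the counter and
- checksums (the recovery data itself is omitted), and csum() is a
- FNV-1a placeholder which is not a strong checksum; a real
- implementation would substitute one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one recovery bundle. */
struct bundle {
    uint64_t counter;        /* transaction counter */
    uint64_t new_data_csum;  /* checksum of the data written last */
    uint64_t bundle_csum;    /* checksum over the two fields above */
};

/* Placeholder only: FNV-1a is NOT a strong checksum. */
static uint64_t csum(const void *p, size_t len)
{
    const unsigned char *c = p;
    uint64_t h = 14695981039346656037ULL;

    while (len--) {
        h ^= *c++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Checksum of everything in the bundle before bundle_csum itself. */
static uint64_t bundle_csum(const struct bundle *b)
{
    return csum(b, offsetof(struct bundle, bundle_csum));
}

/* Return the newest bundle with a valid checksum, or NULL. */
static const struct bundle *latest_valid(const struct bundle *a,
                                         const struct bundle *b)
{
    int a_ok = a->bundle_csum == bundle_csum(a);
    int b_ok = b->bundle_csum == bundle_csum(b);

    if (a_ok && b_ok)
        return a->counter > b->counter ? a : b;
    if (a_ok)
        return a;
    if (b_ok)
        return b;
    return NULL;
}

/* Recovery is needed if the new data never fully reached the file. */
static int needs_recovery(const struct bundle *latest,
                          const void *data, size_t len)
{
    return latest && latest->new_data_csum != csum(data, len);
}
```

- A torn write of the newer bundle simply fails its checksum, so
- the older of the two recovery pointers remains usable.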
- 3.8.2 Status
- Deferred.
- 3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
- 3.9.1 Proposed Solution
- None. At some point you say “use a real database” (but
- see[replay-attribute]).
- But as a thought experiment, if we implemented transactions to
- only overwrite free entries (this is tricky: there must not be a
- header in each entry which indicates whether it is free; instead
- freeness must be determined by presence in metadata elsewhere),
- and kept a pointer to the hash table, we could create an
- entirely new commit without destroying existing data. Then it
- would be easy to implement snapshots in a similar way.
- This would not allow arbitrary changes to the database, such as
- tdb_repack does, and would require more space (since we have to
- preserve the current and future entries at once). If we used hash
- trees rather than one big hash table, we might only have to
- rewrite some sections of the hash, too.
- We could then implement snapshots using a similar method, using
- multiple different hash tables/free tables.
- 3.9.2 Status
- Deferred.
- 3.10 Transactions Cannot Operate in Parallel
- This would be useless for ldb, as it hits the index records with
- just about every update. It would add significant complexity in
- resolving clashes, and cause all the transaction callers to write
- their code to loop in the case where the transactions spuriously
- failed.
- 3.10.1 Proposed Solution
- None (but see[replay-attribute]). We could solve a small part of
- the problem by providing read-only transactions. These would
- allow one write transaction to begin, but it could not commit
- until all r/o transactions are done. This would require a new
- RO_TRANSACTION_LOCK, which would be upgraded on commit.
- 3.10.2 Status
- Deferred.
- 3.11 Default Hash Function Is Suboptimal
- The Knuth-inspired multiplicative hash used by tdb is fairly slow
- (especially if we expand it to 64 bits), and works best when the
- hash bucket size is a prime number (which also means a slow
- modulus). In addition, it is highly predictable which could
- potentially lead to a Denial of Service attack in some TDB uses.
- 3.11.1 Proposed Solution
- The Jenkins lookup3 hash[footnote:
- http://burtleburtle.net/bob/c/lookup3.c
- ] is a fast and superbly-mixing hash. It's used by the Linux
- kernel and almost everything else. This has the particular
- properties that it takes an initial seed, and produces two 32 bit
- hash numbers, which we can combine into a 64-bit hash.
- The seed should be created at tdb-creation time from some random
- source, and placed in the header. This is far from foolproof, but
- adds a little bit of protection against hash bombing.
- 3.11.2 Status
- Complete.
- 3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity
- We lock a record during traversal iteration, and try to grab that
- lock in the delete code. If that grab on delete fails, we simply
- mark it deleted and continue onwards; traversal checks for this
- condition and does the delete when it moves off the record.
- If traversal terminates, the dead record may be left
- indefinitely.
- 3.12.1 Proposed Solution
- Remove reliability guarantees; see[traverse-Proposed-Solution].
- 3.12.2 Status
- Complete.
- 3.13 Fcntl Locking Adds Overhead
- Placing a fcntl lock means a system call, as does removing one.
- This is actually one reason why transactions can be faster
- (everything is locked once at transaction start). In the
- uncontended case, this overhead can theoretically be eliminated.
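- For reference, a single-byte fcntl lock looks roughly like this.
- byte_lock() is a hypothetical helper, not the tdb locking code;
- it only shows that each lock and each unlock costs one fcntl()
- call.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Lock or unlock one byte at `off`, blocking until the lock is
 * granted. `type` is F_RDLCK, F_WRLCK or F_UNLCK.
 * Returns 0 on success, -1 on error (errno set by fcntl).
 */
static int byte_lock(int fd, off_t off, short type)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = 1;
    return fcntl(fd, F_SETLKW, &fl);   /* one system call per invocation */
}
```

- The kernel drops these locks when the owning process exits,
- which is the automatic cleanup that the alternatives discussed
- below would lose.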
- 3.13.1 Proposed Solution
- None.
- We tried this before with spinlock support, in the early days of
- TDB, and it didn't make much difference except in manufactured
- benchmarks.
- We could use spinlocks (with futex kernel support under Linux),
- but it means that we lose automatic cleanup when a process dies
- with a lock. There is a method of auto-cleanup under Linux, but
- it's not supported by other operating systems. We could
- reintroduce a clear-if-first-style lock and sweep for dead
- futexes on open, but that wouldn't help the normal case of one
- concurrent opener dying. Increasingly elaborate repair schemes
- could be considered, but they require an ABI change (everyone
- must use them) anyway, so there's no need to do this at the same
- time as everything else.
- 3.14 Some Transactions Don't Require Durability
- Volker points out that gencache uses a CLEAR_IF_FIRST tdb for
- normal (fast) usage, and occasionally empties the results into a
- transactional TDB. This kind of usage prioritizes performance
- over durability: as long as we are consistent, data can be lost.
- This would be more neatly implemented inside tdb: a “soft”
- transaction commit (i.e. syncless) which meant that data may be
- reverted on a crash.
- 3.14.1 Proposed Solution
- None.
- Unfortunately any transaction scheme which overwrites old data
- requires a sync before that overwrite to avoid the possibility of
- corruption.
- It seems possible to use a scheme similar to that described in
- [sub:TDB-Does-Not], where transactions are committed without
- overwriting existing data, and an array of top-level pointers
- were available in the header. If the transaction is “soft” then
- we would not need a sync
- at all: existing processes would pick up the new hash table and
- free list and work with that.
- At some later point, a sync would allow recovery of the old data
- into the free lists (perhaps when the array of top-level pointers
- filled). On crash, tdb_open() would examine the array of top
- levels, and apply the transactions until it encountered an
- invalid checksum.
- 3.15 Tracing Is Fragile, Replay Is External
- The current TDB has compile-time-enabled tracing code, but it
- often breaks as it is not enabled by default. In a similar way,
- the ctdb code has an external wrapper which does replay tracing
- so it can coordinate cluster-wide transactions.
- 3.15.1 Proposed Solution<replay-attribute>
- Tridge points out that an attribute can be later added to
- tdb_open (see[attributes]) to provide replay/trace hooks, which
- could become the basis for this and future parallel transactions
- and snapshot support.
- 3.15.2 Status
- Deferred.
|