
tdb2: update design doc.

Rusty Russell 15 years ago
parent
commit
b8d05b195b
3 changed files with 387 additions and 20 deletions
  1. 152 4
      ccan/tdb2/doc/design.lyx
  2. 187 7
      ccan/tdb2/doc/design.lyx,v
  3. 48 9
      ccan/tdb2/doc/design.txt

+ 152 - 4
ccan/tdb2/doc/design.lyx

@@ -53,8 +53,8 @@ Rusty Russell, IBM Corporation
 
 \change_deleted 0 1283307542
 26-July
-\change_inserted 0 1284016854
-9-September
+\change_inserted 0 1284423485
+14-September
 \change_unchanged
 -2010
 \end_layout
@@ -476,6 +476,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional
 
 \begin_layout Subsubsection
 Proposed Solution
+\change_inserted 0 1284422789
+
+\begin_inset CommandInset label
+LatexCommand label
+name "attributes"
+
+\end_inset
+
+
+\change_unchanged
+
 \end_layout
 
 \begin_layout Standard
@@ -1289,13 +1300,69 @@ Proposed Solution
 
 \begin_layout Standard
 
-\change_inserted 0 1284016847
+\change_inserted 0 1284422552
 We often have extra padding at the tail of a record.
  If we ensure that the first byte (if any) of this padding is zero, we will
  have a way for future changes to detect code which doesn't understand a
  new format: the new code would write (say) a 1 at the tail, and thus if
  there is no tail or the first byte is 0, we would know the extension is
  not present on that record.
+\end_layout
+
+\begin_layout Subsection
+
+\change_inserted 0 1284422568
+TDB Does Not Use Talloc
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284422646
+Many users of TDB (particularly Samba) use the talloc allocator, and thus
+ have to wrap TDB in a talloc context to use it conveniently.
+\end_layout
+
+\begin_layout Subsubsection
+
+\change_inserted 0 1284422656
+Proposed Solution
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423065
+The allocation within TDB is not complicated enough to justify the use of
+ talloc, and I am reluctant to force another (excellent) library on TDB
+ users.
+ Nonetheless a compromise is possible.
+ An attribute (see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "attributes"
+
+\end_inset
+
+) can be added later to tdb_open() to provide an alternate allocation mechanism,
+ specifically for talloc but usable by any other allocator (which would
+ ignore the 
+\begin_inset Quotes eld
+\end_inset
+
+context
+\begin_inset Quotes erd
+\end_inset
+
+ argument).
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423042
+This would form a talloc hierarchy as expected, but the caller would still
+ have to attach a destructor to the tdb context returned from tdb_open to
+ close it.
+ All TDB_DATA fields would be children of the tdb_context, and the caller
+ would still have to manage them (using talloc_free() or talloc_steal()).
 \change_unchanged
 
 \end_layout
@@ -1875,7 +1942,7 @@ status open
 
 \begin_layout Plain Layout
 
-\change_inserted 0 1283310945
+\change_inserted 0 1284424151
 Using 
 \begin_inset Formula $2^{16+N*3}$
 \end_inset
@@ -1886,6 +1953,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal
 
  byte zone.
  Zones range in factor of 8 steps.
+ Given the zone size for the zone the current record is in, we can determine
+ the start of the zone.
 \change_unchanged
 
 \end_layout
@@ -2330,6 +2399,8 @@ TDB Does Not Have Snapshot Support
 
 \begin_layout Subsubsection
 Proposed Solution
+\change_deleted 0 1284423472
+
 \end_layout
 
 \begin_layout Standard
@@ -2342,7 +2413,23 @@ use a real database
 \begin_inset Quotes erd
 \end_inset
 
+
+\change_inserted 0 1284423891
+ 
+\change_deleted 0 1284423891
 .
+
+\change_inserted 0 1284423901
+ (but see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "replay-attribute"
+
+\end_inset
+
+).
+\change_unchanged
+
 \end_layout
 
 \begin_layout Standard
@@ -2365,6 +2452,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack
 \begin_layout Standard
 We could then implement snapshots using a similar method, using multiple
  different hash tables/free tables.
+\change_inserted 0 1284423495
+
 \end_layout
 
 \begin_layout Subsection
@@ -2384,6 +2473,18 @@ Proposed Solution
 \end_layout
 
 \begin_layout Standard
+
+\change_inserted 0 1284424201
+None (but see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "replay-attribute"
+
+\end_inset
+
+).
+ 
+\change_unchanged
 We could solve a small part of the problem by providing read-only transactions.
  These would allow one write transaction to begin, but it could not commit
  until all r/o transactions are done.
@@ -2569,6 +2670,53 @@ At some later point, a sync would allow recovery of the old data into the
  free lists (perhaps when the array of top-level pointers filled).
  On crash, tdb_open() would examine the array of top levels, and apply the
  transactions until it encountered an invalid checksum.
+\change_inserted 0 1284423555
+
+\end_layout
+
+\begin_layout Subsection
+
+\change_inserted 0 1284423617
+Tracing Is Fragile, Replay Is External
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423719
+The current TDB has compile-time-enabled tracing code, but it often breaks
+ as it is not enabled by default.
+ In a similar way, the ctdb code has an external wrapper which does replay
+ tracing so it can coordinate cluster-wide transactions.
+\end_layout
+
+\begin_layout Subsubsection
+
+\change_inserted 0 1284423864
+Proposed Solution
+\begin_inset CommandInset label
+LatexCommand label
+name "replay-attribute"
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423850
+Tridge points out that an attribute can be later added to tdb_open (see
+ 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "attributes"
+
+\end_inset
+
+) to provide replay/trace hooks, which could become the basis for this and
+ future parallel transactions and snapshot support.
+\change_unchanged
+
 \end_layout
 
 \end_body

+ 187 - 7
ccan/tdb2/doc/design.lyx,v

@@ -1,10 +1,15 @@
-head	1.9;
+head	1.10;
 access;
 symbols;
 locks; strict;
 comment	@# @;
 
 
+1.10
+date	2010.09.14.00.33.57;	author rusty;	state Exp;
+branches;
+next	1.9;
+
 1.9
 date	2010.09.09.07.25.12;	author rusty;	state Exp;
 branches;
@@ -56,9 +61,9 @@ desc
 @
 
 
-1.9
+1.10
 log
-@Extension mechanism.
+@Tracing attribute, talloc support.
 @
 text
 @#LyX 1.6.5 created this file. For more info see http://www.lyx.org/
@@ -116,8 +121,8 @@ Rusty Russell, IBM Corporation
 
 \change_deleted 0 1283307542
 26-July
-\change_inserted 0 1284016854
-9-September
+\change_inserted 0 1284423485
+14-September
 \change_unchanged
 -2010
 \end_layout
@@ -539,6 +544,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional
 
 \begin_layout Subsubsection
 Proposed Solution
+\change_inserted 0 1284422789
+
+\begin_inset CommandInset label
+LatexCommand label
+name "attributes"
+
+\end_inset
+
+
+\change_unchanged
+
 \end_layout
 
 \begin_layout Standard
@@ -1352,13 +1368,69 @@ Proposed Solution
 
 \begin_layout Standard
 
-\change_inserted 0 1284016847
+\change_inserted 0 1284422552
 We often have extra padding at the tail of a record.
  If we ensure that the first byte (if any) of this padding is zero, we will
  have a way for future changes to detect code which doesn't understand a
  new format: the new code would write (say) a 1 at the tail, and thus if
  there is no tail or the first byte is 0, we would know the extension is
  not present on that record.
+\end_layout
+
+\begin_layout Subsection
+
+\change_inserted 0 1284422568
+TDB Does Not Use Talloc
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284422646
+Many users of TDB (particularly Samba) use the talloc allocator, and thus
+ have to wrap TDB in a talloc context to use it conveniently.
+\end_layout
+
+\begin_layout Subsubsection
+
+\change_inserted 0 1284422656
+Proposed Solution
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423065
+The allocation within TDB is not complicated enough to justify the use of
+ talloc, and I am reluctant to force another (excellent) library on TDB
+ users.
+ Nonetheless a compromise is possible.
+ An attribute (see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "attributes"
+
+\end_inset
+
+) can be added later to tdb_open() to provide an alternate allocation mechanism,
+ specifically for talloc but usable by any other allocator (which would
+ ignore the 
+\begin_inset Quotes eld
+\end_inset
+
+context
+\begin_inset Quotes erd
+\end_inset
+
+ argument).
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423042
+This would form a talloc hierarchy as expected, but the caller would still
+ have to attach a destructor to the tdb context returned from tdb_open to
+ close it.
+ All TDB_DATA fields would be children of the tdb_context, and the caller
+ would still have to manage them (using talloc_free() or talloc_steal()).
 \change_unchanged
 
 \end_layout
@@ -1938,7 +2010,7 @@ status open
 
 \begin_layout Plain Layout
 
-\change_inserted 0 1283310945
+\change_inserted 0 1284424151
 Using 
 \begin_inset Formula $2^{16+N*3}$
 \end_inset
@@ -1949,6 +2021,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal
 
  byte zone.
  Zones range in factor of 8 steps.
+ Given the zone size for the zone the current record is in, we can determine
+ the start of the zone.
 \change_unchanged
 
 \end_layout
@@ -2393,6 +2467,8 @@ TDB Does Not Have Snapshot Support
 
 \begin_layout Subsubsection
 Proposed Solution
+\change_deleted 0 1284423472
+
 \end_layout
 
 \begin_layout Standard
@@ -2405,7 +2481,23 @@ use a real database
 \begin_inset Quotes erd
 \end_inset
 
+
+\change_inserted 0 1284423891
+ 
+\change_deleted 0 1284423891
 .
+
+\change_inserted 0 1284423901
+ (but see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "replay-attribute"
+
+\end_inset
+
+).
+\change_unchanged
+
 \end_layout
 
 \begin_layout Standard
@@ -2428,6 +2520,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack
 \begin_layout Standard
 We could then implement snapshots using a similar method, using multiple
  different hash tables/free tables.
+\change_inserted 0 1284423495
+
 \end_layout
 
 \begin_layout Subsection
@@ -2447,6 +2541,18 @@ Proposed Solution
 \end_layout
 
 \begin_layout Standard
+
+\change_inserted 0 1284424201
+None (but see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "replay-attribute"
+
+\end_inset
+
+).
+ 
+\change_unchanged
 We could solve a small part of the problem by providing read-only transactions.
  These would allow one write transaction to begin, but it could not commit
  until all r/o transactions are done.
@@ -2632,6 +2738,53 @@ At some later point, a sync would allow recovery of the old data into the
  free lists (perhaps when the array of top-level pointers filled).
  On crash, tdb_open() would examine the array of top levels, and apply the
  transactions until it encountered an invalid checksum.
+\change_inserted 0 1284423555
+
+\end_layout
+
+\begin_layout Subsection
+
+\change_inserted 0 1284423617
+Tracing Is Fragile, Replay Is External
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423719
+The current TDB has compile-time-enabled tracing code, but it often breaks
+ as it is not enabled by default.
+ In a similar way, the ctdb code has an external wrapper which does replay
+ tracing so it can coordinate cluster-wide transactions.
+\end_layout
+
+\begin_layout Subsubsection
+
+\change_inserted 0 1284423864
+Proposed Solution
+\begin_inset CommandInset label
+LatexCommand label
+name "replay-attribute"
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1284423850
+Tridge points out that an attribute can be later added to tdb_open (see
+ 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "attributes"
+
+\end_inset
+
+) to provide replay/trace hooks, which could become the basis for this and
+ future parallel transactions and snapshot support.
+\change_unchanged
+
 \end_layout
 
 \end_body
@@ -2639,6 +2792,33 @@ At some later point, a sync would allow recovery of the old data into the
 @
 
 
+1.9
+log
+@Extension mechanism.
+@
+text
+@d56 2
+a57 2
+\change_inserted 0 1284016854
+9-September
+d479 11
+d1303 1
+a1303 1
+\change_inserted 0 1284016847
+d1310 56
+d1945 1
+a1945 1
+\change_inserted 0 1283310945
+d1956 2
+d2402 2
+d2416 4
+d2421 12
+d2455 2
+d2476 12
+d2673 47
+@
+
+
 1.8
 log
 @Remove bogus footnote

+ 48 - 9
ccan/tdb2/doc/design.txt

@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase
 
 Rusty Russell, IBM Corporation
 
-9-September-2010
+14-September-2010
 
 Abstract
 
@@ -74,7 +74,7 @@ optional hashing function and an optional logging function
 argument. Additional arguments to open would require the 
 introduction of a tdb_open_ex2 call etc.
 
-2.1.1 Proposed Solution
+2.1.1 Proposed Solution<attributes>
 
 tdb_open() will take a linked-list of attributes:
 
@@ -519,6 +519,28 @@ understand a new format: the new code would write (say) a 1 at
 the tail, and thus if there is no tail or the first byte is 0, we 
 would know the extension is not present on that record.
 
+2.17 TDB Does Not Use Talloc
+
+Many users of TDB (particularly Samba) use the talloc allocator, 
+and thus have to wrap TDB in a talloc context to use it 
+conveniently.
+
+2.17.1 Proposed Solution
+
+The allocation within TDB is not complicated enough to justify 
+the use of talloc, and I am reluctant to force another 
+(excellent) library on TDB users. Nonetheless a compromise is 
+possible. An attribute (see [attributes]) can be added later to 
+tdb_open() to provide an alternate allocation mechanism, 
+specifically for talloc but usable by any other allocator (which 
+would ignore the “context” argument).
+
+This would form a talloc hierarchy as expected, but the caller 
+would still have to attach a destructor to the tdb context 
+returned from tdb_open to close it. All TDB_DATA fields would be 
+children of the tdb_context, and the caller would still have to 
+manage them (using talloc_free() or talloc_steal()).
+
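A minimal sketch of the allocator attribute proposed in 2.17.1, written in C. Every name here (sketch_attr, SKETCH_ATTR_ALLOCATOR, find_allocator and so on) is hypothetical and chosen for illustration only; none of it is the tdb2 API. It assumes only what the text states: attributes form a linked list, and the allocator hooks receive a "context" argument that a non-talloc allocator would ignore.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical attribute types; not taken from tdb2. */
enum sketch_attr_type { SKETCH_ATTR_LOG, SKETCH_ATTR_ALLOCATOR };

/* One node in the linked list of attributes described in 2.1.1. */
struct sketch_attr {
	enum sketch_attr_type type;
	struct sketch_attr *next;
	/* Allocator hooks; ctx is the "context" argument from the text. */
	void *(*alloc)(void *ctx, size_t len);
	void (*release)(void *ctx, void *p);
	void *ctx;
};

/* A malloc-backed allocator simply ignores the context. */
static void *malloc_alloc(void *ctx, size_t len)
{
	(void)ctx;
	return malloc(len);
}

static void malloc_release(void *ctx, void *p)
{
	(void)ctx;
	free(p);
}

/* Walk the list and return the first allocator attribute, or NULL. */
static const struct sketch_attr *find_allocator(const struct sketch_attr *attrs)
{
	for (; attrs; attrs = attrs->next)
		if (attrs->type == SKETCH_ATTR_ALLOCATOR)
			return attrs;
	return NULL;
}
```

A talloc-based caller would supply hooks that pass ctx through to talloc instead of discarding it; closing the database would still need a destructor attached by the caller, as the text notes.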
 3 Performance And Scalability Issues
 
 3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST 
@@ -790,7 +812,9 @@ question “what zone is this record in?” much harder (and “pick a
 random zone”, but that's less common). It could be done with as 
 few as 4 bits from the record header.[footnote:
 Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives 
-the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
+the maximal 2^{61} byte zone. Zones range in factor of 8 steps. 
+Given the zone size for the zone the current record is in, we can 
+determine the start of the zone.
 ]
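The footnote's arithmetic can be checked with a short C sketch. The size formula 2^(16+3N) comes from the footnote itself. The zone_start() helper is an assumption: the text says only that the start can be determined from the zone size, so this version supposes zones sit at multiples of their own size measured from a hypothetical base offset. Both function names are illustrative.

```c
#include <stdint.h>

/* Zone size in bytes for a 4-bit zone code n (0..15): 2^(16 + 3n). */
static uint64_t zone_size(unsigned n)
{
	return (uint64_t)1 << (16 + 3 * n);
}

/*
 * Assumed layout: zones are aligned to their own size, counted from
 * 'base'. Rounds a record offset down to the start of its zone.
 * The caller must ensure off >= base.
 */
static uint64_t zone_start(uint64_t off, unsigned n, uint64_t base)
{
	uint64_t size = zone_size(n);

	return base + ((off - base) / size) * size;
}
```

Each step of n multiplies the size by 8, which is the "factor of 8 steps" the footnote mentions.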
 
 
 3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
@@ -1009,7 +1033,8 @@ we need only check for recovery if this is set.
 
 3.9.1 Proposed Solution
 
-None. At some point you say “use a real database”.
+None. At some point you say “use a real database” (but see
+[replay-attribute]).
 
 
 But as a thought experiment, if we implemented transactions to 
 only overwrite free entries (this is tricky: there must not be a 
@@ -1038,11 +1063,11 @@ failed.
 
 3.10.1 Proposed Solution
 
-We could solve a small part of the problem by providing read-only 
-transactions. These would allow one write transaction to begin, 
-but it could not commit until all r/o transactions are done. This 
-would require a new RO_TRANSACTION_LOCK, which would be upgraded 
-on commit.
+None (but see [replay-attribute]). We could solve a small part of 
+the problem by providing read-only transactions. These would 
+allow one write transaction to begin, but it could not commit 
+until all r/o transactions are done. This would require a new 
+RO_TRANSACTION_LOCK, which would be upgraded on commit.
 
 
 3.11 Default Hash Function Is Suboptimal
 
@@ -1137,3 +1162,17 @@ filled). On crash, tdb_open() would examine the array of top
 levels, and apply the transactions until it encountered an 
 invalid checksum.
 
+3.15 Tracing Is Fragile, Replay Is External
+
+The current TDB has compile-time-enabled tracing code, but it 
+often breaks as it is not enabled by default. In a similar way, 
+the ctdb code has an external wrapper which does replay tracing 
+so it can coordinate cluster-wide transactions.
+
+3.15.1 Proposed Solution<replay-attribute>
+
+Tridge points out that an attribute can be later added to 
+tdb_open (see [attributes]) to provide replay/trace hooks, which 
+could become the basis for this and future parallel transactions 
+and snapshot support.
+
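One way the replay/trace hook of 3.15.1 might look, sketched in C. The design text gives no interface, so everything below is hypothetical: the operation enum, the attribute struct, and the trace_op() dispatch helper are illustrative names, not tdb2 symbols. The sketch shows only the general shape: the caller supplies a callback plus private data, and the library calls it for each operation when a hook is present.

```c
#include <stddef.h>

/* Hypothetical set of operations a hook could be told about. */
enum sketch_op { SKETCH_OP_STORE, SKETCH_OP_DELETE, SKETCH_OP_FETCH };

/* Hypothetical trace attribute: a callback plus caller-owned state. */
struct sketch_trace_attr {
	void (*hook)(void *data, enum sketch_op op,
		     const void *key, size_t keylen);
	void *data;
};

/* Example hook state: per-operation counters. */
struct op_counts {
	unsigned store, del, fetch;
};

/* Example hook: counts operations by type and ignores the key. */
static void count_hook(void *data, enum sketch_op op,
		       const void *key, size_t keylen)
{
	struct op_counts *c = data;

	(void)key;
	(void)keylen;
	switch (op) {
	case SKETCH_OP_STORE:  c->store++; break;
	case SKETCH_OP_DELETE: c->del++;   break;
	case SKETCH_OP_FETCH:  c->fetch++; break;
	}
}

/* Library side: invoke the hook only if one was supplied. */
static void trace_op(const struct sketch_trace_attr *t, enum sketch_op op,
		     const void *key, size_t keylen)
{
	if (t && t->hook)
		t->hook(t->data, op, key, keylen);
}
```

Because the hook is always compiled in and only called when set, it would avoid the bit-rot the text attributes to compile-time-enabled tracing; a replay wrapper could log each call instead of counting it.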