|
|
@@ -1,4679 +0,0 @@
|
|
|
-head 1.13;
|
|
|
-access;
|
|
|
-symbols;
|
|
|
-locks; strict;
|
|
|
-comment @# @;
|
|
|
-
|
|
|
-
|
|
|
-1.13
|
|
|
-date 2011.03.01.11.46.54; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.12;
|
|
|
-
|
|
|
-1.12
|
|
|
-date 2010.12.01.12.20.49; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.11;
|
|
|
-
|
|
|
-1.11
|
|
|
-date 2010.12.01.11.55.20; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.10;
|
|
|
-
|
|
|
-1.10
|
|
|
-date 2010.09.14.00.33.57; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.9;
|
|
|
-
|
|
|
-1.9
|
|
|
-date 2010.09.09.07.25.12; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.8;
|
|
|
-
|
|
|
-1.8
|
|
|
-date 2010.09.02.02.29.05; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.7;
|
|
|
-
|
|
|
-1.7
|
|
|
-date 2010.09.01.10.58.12; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.6;
|
|
|
-
|
|
|
-1.6
|
|
|
-date 2010.08.02.00.21.43; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.5;
|
|
|
-
|
|
|
-1.5
|
|
|
-date 2010.08.02.00.21.16; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.4;
|
|
|
-
|
|
|
-1.4
|
|
|
-date 2010.05.10.13.09.11; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.3;
|
|
|
-
|
|
|
-1.3
|
|
|
-date 2010.05.10.11.58.37; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.2;
|
|
|
-
|
|
|
-1.2
|
|
|
-date 2010.05.10.05.35.13; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next 1.1;
|
|
|
-
|
|
|
-1.1
|
|
|
-date 2010.05.04.02.29.16; author rusty; state Exp;
|
|
|
-branches;
|
|
|
-next ;
|
|
|
-
|
|
|
-
|
|
|
-desc
|
|
|
-@First draft
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.13
|
|
|
-log
|
|
|
-@Thread-safe API
|
|
|
-@
|
|
|
-text
|
|
|
-@#LyX 1.6.7 created this file. For more info see http://www.lyx.org/
|
|
|
-\lyxformat 345
|
|
|
-\begin_document
|
|
|
-\begin_header
|
|
|
-\textclass article
|
|
|
-\use_default_options true
|
|
|
-\language english
|
|
|
-\inputencoding auto
|
|
|
-\font_roman default
|
|
|
-\font_sans default
|
|
|
-\font_typewriter default
|
|
|
-\font_default_family default
|
|
|
-\font_sc false
|
|
|
-\font_osf false
|
|
|
-\font_sf_scale 100
|
|
|
-\font_tt_scale 100
|
|
|
-
|
|
|
-\graphics default
|
|
|
-\paperfontsize default
|
|
|
-\use_hyperref false
|
|
|
-\papersize default
|
|
|
-\use_geometry false
|
|
|
-\use_amsmath 1
|
|
|
-\use_esint 1
|
|
|
-\cite_engine basic
|
|
|
-\use_bibtopic false
|
|
|
-\paperorientation portrait
|
|
|
-\secnumdepth 3
|
|
|
-\tocdepth 3
|
|
|
-\paragraph_separation indent
|
|
|
-\defskip medskip
|
|
|
-\quotes_language english
|
|
|
-\papercolumns 1
|
|
|
-\papersides 1
|
|
|
-\paperpagestyle default
|
|
|
-\tracking_changes true
|
|
|
-\output_changes true
|
|
|
-\author "Rusty Russell,,,"
|
|
|
-\author ""
|
|
|
-\end_header
|
|
|
-
|
|
|
-\begin_body
|
|
|
-
|
|
|
-\begin_layout Title
|
|
|
-TDB2: Redesigning The Trivial DataBase
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Author
|
|
|
-Rusty Russell, IBM Corporation
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Date
|
|
|
-1-December-2010
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Abstract
|
|
|
-The Trivial DataBase on-disk format is 32 bits; with usage cases heading
|
|
|
- towards the 4G limit, that must change.
|
|
|
- This required breakage provides an opportunity to revisit TDB's other design
|
|
|
- decisions and reassess them.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Section
|
|
|
-Introduction
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The Trivial DataBase was originally written by Andrew Tridgell as a simple
|
|
|
- key/data pair storage system with the same API as dbm, but allowing multiple
|
|
|
- readers and writers while being small enough (< 1000 lines of C) to include
|
|
|
- in SAMBA.
|
|
|
- The simple design created in 1999 has proven surprisingly robust and performant
|
|
|
-, used in Samba versions 3 and 4 as well as numerous other projects.
|
|
|
- Its useful life was greatly increased by the (backwards-compatible!) addition
|
|
|
- of transaction support in 2005.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The wider variety and greater demands of TDB-using code have led to some
|
|
|
- organic growth of the API, as well as some compromises on the implementation.
|
|
|
- None of these, by themselves, are seen as show-stoppers, but the cumulative
|
|
|
- effect is a loss of elegance over the initial, simple TDB implementation.
|
|
|
- Here is a table of the approximate number of lines of implementation code
|
|
|
- and number of API functions at the end of each year:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-\begin_inset Tabular
|
|
|
-<lyxtabular version="3" rows="12" columns="3">
|
|
|
-<features>
|
|
|
-<column alignment="center" valignment="top" width="0">
|
|
|
-<column alignment="center" valignment="top" width="0">
|
|
|
-<column alignment="center" valignment="top" width="0">
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-Year End
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-API Functions
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-Lines of C Code Implementation
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-1999
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-13
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-1195
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2000
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-24
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-1725
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2001
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-32
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2228
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2002
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-35
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2481
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2003
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-35
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2552
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2004
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-40
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2584
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2005
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-38
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2647
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2006
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-52
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-3754
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2007
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-66
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-4398
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2008
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-71
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-4768
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-<row>
|
|
|
-<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-2009
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-73
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
|
|
|
-\begin_inset Text
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-5715
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-</cell>
|
|
|
-</row>
|
|
|
-</lyxtabular>
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This review is an attempt to catalog and address all the known issues with
|
|
|
- TDB and create solutions which address the problems without significantly
|
|
|
- increasing complexity; all involved are far too aware of the dangers of
|
|
|
- second system syndrome in rewriting a successful project like this.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Section
|
|
|
-API Issues
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-tdb_open_ex Is Not Expandable
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The tdb_open() call was expanded to tdb_open_ex(), which added an optional
|
|
|
- hashing function and an optional logging function argument.
|
|
|
- Additional arguments to open would require the introduction of a tdb_open_ex2
|
|
|
- call etc.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "attributes"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-tdb_open() will take a linked-list of attributes:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-enum tdb_attribute_type {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- TDB_ATTRIBUTE_LOG = 0,
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- TDB_ATTRIBUTE_HASH = 1
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-struct tdb_attribute_base {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- enum tdb_attribute_type attr;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- union tdb_attribute *next;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-struct tdb_attribute_log {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- tdb_log_func log_fn;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- void *log_private;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-struct tdb_attribute_hash {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- tdb_hash_func hash_fn;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- void *hash_private;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-union tdb_attribute {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- struct tdb_attribute_base base;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- struct tdb_attribute_log log;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- struct tdb_attribute_hash hash;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This allows future attributes to be added, even if this expands the size
|
|
|
- of the union.
|
|
|
-\end_layout
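As a minimal sketch of how a caller might build such a list and how the library might walk it, consider the following. The callback signatures, find_attribute(), example_hash(), example_log() and attribute_list_demo() are assumptions made for illustration and are not part of the proposal. The enum is spelled tdb_attribute_type here because C keeps enum, struct and union tags in a single namespace, so it cannot share the tag of union tdb_attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types; the text does not give their signatures. */
typedef void (*tdb_log_func)(void *log_private, const char *msg);
typedef unsigned long (*tdb_hash_func)(const void *key, size_t len,
				       void *hash_private);

/* Tag differs from the union's tag: C tags share one namespace. */
enum tdb_attribute_type {
	TDB_ATTRIBUTE_LOG = 0,
	TDB_ATTRIBUTE_HASH = 1
};

union tdb_attribute;

struct tdb_attribute_base {
	enum tdb_attribute_type attr;
	union tdb_attribute *next;
};

struct tdb_attribute_log {
	struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
	tdb_log_func log_fn;
	void *log_private;
};

struct tdb_attribute_hash {
	struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
	tdb_hash_func hash_fn;
	void *hash_private;
};

union tdb_attribute {
	struct tdb_attribute_base base;
	struct tdb_attribute_log log;
	struct tdb_attribute_hash hash;
};

/* Hypothetical helper: first attribute of the given type, or NULL. */
static union tdb_attribute *find_attribute(union tdb_attribute *list,
					   enum tdb_attribute_type type)
{
	for (; list != NULL; list = list->base.next) {
		if (list->base.attr == type)
			return list;
	}
	return NULL;
}

/* Toy callbacks, used only so the list has something to point at. */
static unsigned long example_hash(const void *key, size_t len, void *priv)
{
	const unsigned char *p = key;
	unsigned long h = 0;

	(void)priv;
	while (len--)
		h = h * 31 + *p++;
	return h;
}

static void example_log(void *priv, const char *msg)
{
	(void)priv;
	(void)msg;
}

/* Chain log -> hash, then look the hash attribute up again. */
static int attribute_list_demo(void)
{
	union tdb_attribute log_attr, hash_attr;
	union tdb_attribute *found;

	hash_attr.hash.base.attr = TDB_ATTRIBUTE_HASH;
	hash_attr.hash.base.next = NULL;
	hash_attr.hash.hash_fn = example_hash;
	hash_attr.hash.hash_private = NULL;

	log_attr.log.base.attr = TDB_ATTRIBUTE_LOG;
	log_attr.log.base.next = &hash_attr;
	log_attr.log.log_fn = example_log;
	log_attr.log.log_private = NULL;

	found = find_attribute(&log_attr, TDB_ATTRIBUTE_HASH);
	return found == &hash_attr && found->hash.hash_fn == example_hash;
}
```

Because every member of the union starts with the same base, code that only understands older attribute types can still follow the next pointers past newer ones.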
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-tdb_traverse Makes Impossible Guarantees
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it
|
|
|
- was thought that it was important to guarantee that all records which exist
|
|
|
- at the start and end of the traversal would be included, and no record
|
|
|
- would be included twice.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This adds complexity (see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "Reliable-Traversal-Adds"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-) and does not work anyway for records which are altered (in particular,
|
|
|
- those which are expanded may be effectively deleted and re-added behind
|
|
|
- the traversal).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "traverse-Proposed-Solution"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Abandon the guarantee.
|
|
|
- You will see every record if no changes occur during your traversal, otherwise
|
|
|
- you will see some subset.
|
|
|
- You can prevent changes by using a transaction or the locking API.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
- Delete-during-traverse will still delete every record, too (assuming no
|
|
|
- other changes).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Nesting of Transactions Is Fraught
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-TDB has alternated between allowing nested transactions and not allowing
|
|
|
- them.
|
|
|
- Various paths in the Samba codebase assume that transactions will nest,
|
|
|
- and in a sense they can: the operation is only committed to disk when the
|
|
|
- outer transaction is committed.
|
|
|
- There are two problems, however:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Canceling the inner transaction will cause the outer transaction commit
|
|
|
- to fail, and will not undo any operations since the inner transaction began.
|
|
|
- This problem is soluble with some additional internal code.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-An inner transaction commit can be cancelled by the outer transaction.
|
|
|
- This is desirable in the way which Samba's database initialization code
|
|
|
- uses transactions, but could be a surprise to any users expecting a successful
|
|
|
- transaction commit to expose changes to others.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The current solution is to specify the behavior at tdb_open(), with the
|
|
|
- default currently that nested transactions are allowed.
|
|
|
- This flag can also be changed at runtime.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Given the usage patterns, it seems that the
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-least-surprise
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- behavior of disallowing nested transactions should become the default.
|
|
|
- Additionally, it seems the outer transaction is the only code which knows
|
|
|
- whether inner transactions should be allowed, so a flag to indicate this
|
|
|
- could be added to tdb_transaction_start.
|
|
|
- However, this behavior can be simulated with a wrapper which uses tdb_add_flags
|
|
|
-() and tdb_remove_flags(), so the API should not be expanded for this relatively
|
|
|
--obscure case.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1298979572
|
|
|
-Incomplete; nesting flag is still defined as per tdb1.
|
|
|
-\change_inserted 0 1298979584
|
|
|
-Complete; the nesting flag has been removed.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Incorrect Hash Function is Not Detected
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-tdb_open_ex() allows the calling code to specify a different hash function
|
|
|
- to use, but does not check that all other processes accessing this tdb
|
|
|
- are using the same hash function.
|
|
|
- The result is that records are missing from tdb_fetch().
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The header should contain an example hash result (eg.
|
|
|
- the hash of 0xdeadbeef), and tdb_open_ex() should check that the given
|
|
|
- hash function produces the same answer, or fail the tdb_open call.
|
|
|
-\end_layout
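The check described above might look roughly like this. The header layout, the function names and both hash functions are hypothetical stand-ins chosen for the sketch; only the idea of storing the hash of 0xdeadbeef and comparing it at open time comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hash signature and header field. */
typedef uint64_t (*example_hash_fn)(const void *key, size_t len);

struct example_header {
	uint64_t hash_test; /* hash of the example value, written at creation */
};

/* FNV-1a, standing in for whatever hash the creator of the file used. */
static uint64_t hash_fnv1a(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint64_t h = 14695981039346656037ULL;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/* A deliberately different function: the complement of the one above. */
static uint64_t hash_other(const void *key, size_t len)
{
	return ~hash_fnv1a(key, len);
}

static uint64_t hash_of_example(example_hash_fn fn)
{
	uint32_t example = 0xdeadbeef;

	return fn(&example, sizeof(example));
}

/* Called when the file is created. */
static void header_record_hash(struct example_header *hdr, example_hash_fn fn)
{
	hdr->hash_test = hash_of_example(fn);
}

/* Called on open: 0 if the caller's hash matches the file, -1 otherwise. */
static int header_check_hash(const struct example_header *hdr,
			     example_hash_fn fn)
{
	return hdr->hash_test == hash_of_example(fn) ? 0 : -1;
}

/* Create with one hash, open with another, report the open result. */
static int hash_check_demo(example_hash_fn creator, example_hash_fn opener)
{
	struct example_header hdr;

	header_record_hash(&hdr, creator);
	return header_check_hash(&hdr, opener);
}
```

A single example value cannot prove two hash functions are identical, but it is cheap and catches the common mistake of opening a file with an entirely different function.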
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-tdb_set_max_dead/TDB_VOLATILE Expose Implementation
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-In response to scalability issues with the free list (
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "TDB-Freelist-Is"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-) two API workarounds have been incorporated in TDB: tdb_set_max_dead()
|
|
|
- and the TDB_VOLATILE flag to tdb_open.
|
|
|
- The latter actually calls the former with an argument of
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-5
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
-.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This code allows deleted records to accumulate without putting them in the
|
|
|
- free list.
|
|
|
- On delete we iterate through each chain and free them in a batch if there
|
|
|
- are more than max_dead entries.
|
|
|
- These are never otherwise recycled except as a side-effect of a tdb_repack.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-With the scalability problems of the freelist solved, this API can be removed.
|
|
|
- The TDB_VOLATILE flag may still be useful as a hint that store and delete
|
|
|
- of records will be at least as common as fetch in order to allow some internal
|
|
|
- tuning, but initially will become a no-op.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
- TDB_VOLATILE still defined, but implementation should fail on unknown flags
|
|
|
- to be future-proof.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "TDB-Files-Cannot"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB Files Cannot Be Opened Multiple Times In The Same Process
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-No process can open the same TDB twice; we check and disallow it.
|
|
|
- This is an unfortunate side-effect of fcntl locks, which operate on a per-file
|
|
|
- rather than per-file-descriptor basis, and do not nest.
|
|
|
- Thus, closing any file descriptor on a file clears all the locks obtained
|
|
|
- by this process, even if they were placed using a different file descriptor!
|
|
|
-\end_layout
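The fcntl behavior described above can be observed with a small test program. This is only a sketch, assuming a POSIX system with a writable /tmp; the function names are made up for the example. It locks a file through one descriptor, closes a second, unrelated descriptor on the same file, and asks a child process whether the lock is still held.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child and ask whether another process holds a lock on fd's file.
 * Returns 1 if a lock is held, 0 if not, -1 on error. */
static int child_sees_lock(int fd)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct flock probe;

		memset(&probe, 0, sizeof(probe));
		probe.l_type = F_WRLCK;
		probe.l_whence = SEEK_SET; /* start 0, len 0: whole file */
		if (fcntl(fd, F_GETLK, &probe) != 0)
			_exit(2);
		_exit(probe.l_type == F_UNLCK ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)
	    || WEXITSTATUS(status) > 1)
		return -1;
	return WEXITSTATUS(status);
}

/* Returns 1 if the lock taken via fd1 vanished when fd2 was closed,
 * 0 if it survived, -1 on setup failure. */
static int lock_dropped_by_unrelated_close(void)
{
	char path[] = "/tmp/fcntl-demo-XXXXXX";
	struct flock fl;
	int fd1, fd2, before, after;

	fd1 = mkstemp(path);
	if (fd1 < 0)
		return -1;
	fd2 = open(path, O_RDWR);
	if (fd2 < 0) {
		close(fd1);
		unlink(path);
		return -1;
	}

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	if (fcntl(fd1, F_SETLK, &fl) != 0) {
		close(fd2);
		close(fd1);
		unlink(path);
		return -1;
	}

	before = child_sees_lock(fd1); /* lock placed via fd1 is visible */
	close(fd2);                    /* never used for locking */
	after = child_sees_lock(fd1);  /* lock is gone anyway */

	close(fd1);
	unlink(path);
	return before == 1 && after == 0;
}
```

This is why a second tdb_open() of the same file in one process is dangerous: closing the second handle would silently release locks the first handle still relies on.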
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Note that even if this were solved, deadlock could occur if operations were
|
|
|
- nested: this is a more manageable programming error in most cases.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We could lobby POSIX to fix the perverse rules, or at least lobby Linux
|
|
|
- to violate them so that the most common implementation does not have this
|
|
|
- restriction.
|
|
|
- This would be a generally good idea for other fcntl lock users.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Samba uses a wrapper which hands out the same tdb_context to multiple callers
|
|
|
- if this happens, and does simple reference counting.
|
|
|
- We should do this inside the tdb library, which already emulates lock nesting
|
|
|
- internally; it would need to recognize when deadlock occurs within a single
|
|
|
- process.
|
|
|
- This would create a new failure mode for tdb operations (while we currently
|
|
|
- handle locking failures, they are impossible in normal use and a process
|
|
|
- encountering them can do little but give up).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-I do not see benefit in an additional tdb_open flag to indicate whether
|
|
|
- re-opening is allowed, though there may be some benefit to adding a
|
|
|
- call to detect when a tdb_context is shared, to allow others to create such
|
|
|
- an API.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-TDB API Is Not POSIX Thread-safe
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The TDB API uses an error code which can be queried after an operation to
|
|
|
- determine what went wrong.
|
|
|
- This programming model does not work with threads, unless specific additional
|
|
|
- guarantees are given by the implementation.
|
|
|
- In addition, even otherwise-independent threads cannot open the same TDB
|
|
|
- (as in
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "TDB-Files-Cannot"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Rearchitecting the API to include a tdb_errcode pointer would be a great
|
|
|
- deal of churn
|
|
|
-\change_inserted 0 1298979557
|
|
|
-, but fortunately most functions return 0 on success and -1 on error: we
|
|
|
- can change these to return 0 on success and a negative error code on error,
|
|
|
- and the API remains similar to previous.
|
|
|
- The tdb_fetch, tdb_firstkey and tdb_nextkey functions need to take a TDB_DATA
|
|
|
- pointer and return an error code.
|
|
|
- It is also simpler to have tdb_nextkey replace its key argument in place,
|
|
|
- freeing up any old .dptr.
|
|
|
-\end_layout
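To make the calling convention concrete, here is a minimal sketch of a fetch-style function that returns 0 on success or a negative error code and hands the data back through a pointer. The enum names and values, struct example_data (standing in for TDB_DATA) and the tiny in-memory table are all assumptions for illustration and not the actual TDB2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical error codes: 0 on success, negative on error. */
enum example_error {
	EXAMPLE_SUCCESS = 0,
	EXAMPLE_ERR_NOEXIST = -1,
	EXAMPLE_ERR_OOM = -2
};

/* Stand-in for TDB_DATA. */
struct example_data {
	unsigned char *dptr;
	size_t dsize;
};

/* Fetch from a fixed in-memory table. The result is returned through
 * 'out' and the error code is the return value, so no per-database
 * "last error" state is shared between threads. */
static enum example_error example_fetch(const char *key,
					struct example_data *out)
{
	static const struct {
		const char *key;
		const char *value;
	} table[] = {
		{ "hello", "world" },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strcmp(table[i].key, key) != 0)
			continue;
		out->dsize = strlen(table[i].value);
		out->dptr = malloc(out->dsize);
		if (out->dptr == NULL)
			return EXAMPLE_ERR_OOM;
		memcpy(out->dptr, table[i].value, out->dsize);
		return EXAMPLE_SUCCESS;
	}
	return EXAMPLE_ERR_NOEXIST;
}
```

Existing callers that test for a non-zero or negative return keep working, which is what keeps the churn small.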
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1298979438
|
|
|
-; we are better to guarantee that the tdb_errcode is per-thread so the current
|
|
|
- programming model can be maintained.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1298979438
|
|
|
-This requires dynamic per-thread allocations, which is awkward with POSIX
|
|
|
- threads (pthread_key_create space is limited and we cannot simply allocate
|
|
|
- a key for every TDB).
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Internal locking is required to make sure that fcntl locks do not overlap
|
|
|
- between threads, and also that the global list of tdbs is maintained.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe
|
|
|
- version of the library, and otherwise no overhead will exist.
|
|
|
- Alternatively, a hooking mechanism similar to that proposed for
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "Proposed-Solution-locking-hook"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- could be used to enable pthread locking at runtime.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete
|
|
|
-\change_inserted 0 1298979681
|
|
|
-; API has been changed but thread safety has not been implemented.
|
|
|
-\change_deleted 0 1298979669
|
|
|
-.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-*_nonblock Functions And *_mark Functions Expose Implementation
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-CTDB
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-Clustered TDB, see http://ctdb.samba.org
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- wishes to operate on TDB in a non-blocking manner.
|
|
|
- This is currently done as follows:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Call the _nonblock variant of an API function (eg.
|
|
|
- tdb_lockall_nonblock).
|
|
|
- If this fails:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Fork a child process, and wait for it to call the normal variant (eg.
|
|
|
- tdb_lockall).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If the child succeeds, call the _mark variant to indicate we already have
|
|
|
- the locks (eg.
|
|
|
- tdb_lockall_mark).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Upon completion, tell the child to release the locks (eg.
|
|
|
- tdb_unlockall).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Indicate to tdb that it should consider the locks removed (eg.
|
|
|
- tdb_unlockall_mark).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-There are several issues with this approach.
|
|
|
- Firstly, adding two new variants of each function clutters the API for
|
|
|
- an obscure use, and so not all functions have three variants.
|
|
|
- Secondly, it assumes that all paths of the functions ask for the same locks,
|
|
|
- otherwise the parent process will have to get a lock which the child doesn't
|
|
|
- have under some circumstances.
|
|
|
- I don't believe this is currently the case, but it constrains the implementatio
|
|
|
-n.
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "Proposed-Solution-locking-hook"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Implement a hook for locking methods, so that the caller can control the
|
|
|
- calls to create and remove fcntl locks.
|
|
|
- In this scenario, ctdbd would operate as follows:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Call the normal API function, eg tdb_lockall().
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-When the lock callback comes in, check if the child has the lock.
|
|
|
- Initially, this is always false.
|
|
|
- If the child does hold it, return 0.
|
|
|
- Otherwise, try to obtain it in non-blocking mode.
|
|
|
- If that fails, return EWOULDBLOCK.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Release locks in the unlock callback as normal.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If tdb_lockall() fails, see if we recorded a lock failure; if so, call the
|
|
|
- child to repeat the operation.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-The child records what locks it obtains, and returns that information to
|
|
|
- the parent.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-When the child has succeeded, goto 1.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This is flexible enough to handle any potential locking scenario, even when
|
|
|
- lock requirements change.
|
|
|
- It can be optimized so that the parent does not release locks, just tells
|
|
|
- the child which locks it doesn't need to obtain.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-It also keeps the complexity out of the API, and in ctdbd where it is needed.
|
|
|
-\end_layout
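The lock callback in step 2 can be sketched as follows. This is only an illustration under stated assumptions: the hook signature, struct lock_state and nonblock_lock_hook are hypothetical names, not part of any existing TDB API; only fcntl(), struct flock and F_SETLK are real POSIX interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical per-caller state; nothing here is a real TDB structure. */
struct lock_state {
    bool child_has_lock;  /* set once the child reports that it holds the lock */
    bool saw_contention;  /* remembered so the caller knows to involve the child */
};

/* Hypothetical lock hook: returns 0 on success, otherwise an errno value. */
static int nonblock_lock_hook(int fd, short rw, off_t off, off_t len, void *priv)
{
    struct lock_state *st = priv;
    struct flock fl = { 0 };

    assert(st != NULL);
    if (st->child_has_lock)
        return 0;               /* the child already holds it on our behalf */

    fl.l_type = rw;             /* F_RDLCK or F_WRLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = len;
    if (fcntl(fd, F_SETLK, &fl) == 0)   /* F_SETLK never blocks */
        return 0;

    int err = errno;
    if (err == EACCES || err == EAGAIN) {
        st->saw_contention = true;      /* someone else holds a conflicting lock */
        return EWOULDBLOCK;
    }
    return err;                 /* a real error, passed through unchanged */
}
```

When the top-level call fails and saw_contention is set, the caller would hand the operation to the child (steps 4 to 6) and retry once the child reports success.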
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-tdb_chainlock Functions Expose Implementation
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-tdb_chainlock locks some number of records, including the record indicated
|
|
|
- by the given key.
|
|
|
- This gives atomicity guarantees; no-one can start a transaction, alter,
|
|
|
- read or delete that key while the lock is held.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-It also makes the same guarantee for any other key in the chain, which is
|
|
|
- an internal implementation detail and potentially a cause for deadlock.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None.
|
|
|
- It would be nice to have an explicit single entry lock which affected no
|
|
|
- other keys.
|
|
|
- Unfortunately, this won't work for an entry which doesn't exist.
|
|
|
- Thus while chainlock may be implemented more efficiently for the existing
|
|
|
- case, it will still have overlap issues with the non-existing case.
|
|
|
- So it is best to keep the current (lack of) guarantee about which records
|
|
|
- will be affected to avoid constraining our implementation.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Signal Handling is Not Race-Free
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate
|
|
|
- that the tdb locking code should return with a failure, rather than trying
|
|
|
- again when a signal is received (and errno == EINTR).
|
|
|
- This is usually used to implement timeouts.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Unfortunately, this does not work in the case where the signal is received
|
|
|
- before the tdb code enters the fcntl() call to place the lock: the code
|
|
|
- will sleep within the fcntl() code, unaware that the signal wants it to
|
|
|
- exit.
|
|
|
- In the case of long timeouts, this does not happen in practice.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The locking hooks proposed in
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "Proposed-Solution-locking-hook"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- would allow the user to decide on whether to fail the lock acquisition
|
|
|
- on a signal.
|
|
|
- This allows the caller to choose their own compromise: they could narrow
|
|
|
- the race by checking immediately before the fcntl call.
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-It may be possible to make this race-free in some implementations by having
|
|
|
- the signal handler alter the struct flock to make it invalid.
|
|
|
- This will cause the fcntl() lock call to fail with EINVAL if the signal
|
|
|
- occurs before the kernel is entered, otherwise EINTR.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-The API Uses Gratuitous Typedefs, Capitals
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-typedefs are useful for providing source compatibility when types can differ
|
|
|
- across implementations, or arguably in the case of function pointer definitions
|
|
|
- which are hard for humans to parse.
|
|
|
- Otherwise it is simply obfuscation and pollutes the namespace.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Capitalization is usually reserved for compile-time constants and macros.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Description
|
|
|
-TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the
|
|
|
- definition isn't visible to the API user anyway.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Description
|
|
|
-TDB_DATA There is no reason to use this over struct TDB_DATA; the struct
|
|
|
- needs to be understood by the API user.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Description
|
|
|
-struct
|
|
|
-\begin_inset space ~
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB_DATA This would normally be called 'struct tdb_data'.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Description
|
|
|
-enum
|
|
|
-\begin_inset space ~
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB_ERROR Similarly, this would normally be enum tdb_error.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None.
|
|
|
- Introducing lower case variants would please pedants like myself, but if
|
|
|
- it were done the existing ones should be kept.
|
|
|
- There is little point forcing a purely cosmetic change upon tdb users.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "tdb_log_func-Doesnt-Take"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-tdb_log_func Doesn't Take The Private Pointer
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-For API compatibility reasons, the logging function needs to call tdb_get_logging_private()
|
|
|
- to retrieve the pointer registered by tdb_open_ex for logging.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-It should simply take an extra argument, since we are prepared to break
|
|
|
- the API/ABI.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Various Callback Functions Are Not Typesafe
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The callback functions in tdb_set_logging_function (after
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "tdb_log_func-Doesnt-Take"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check
|
|
|
- all take void * and must internally convert it to the argument type they
|
|
|
- were expecting.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-If this type changes, the compiler will not produce warnings on the callers,
|
|
|
- since it only sees void *.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-With careful use of macros, we can create callback functions which give
|
|
|
- a warning when used on gcc and the types of the callback and its private
|
|
|
- argument differ.
|
|
|
- Unsupported compilers will not give a warning, which is no worse than now.
|
|
|
- In addition, the callbacks become clearer, as they need not use void *
|
|
|
- for their parameter.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html
|
|
|
-\end_layout
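A minimal sketch of the macro technique, assuming gcc or a compatible compiler (it relies on the __typeof__, __builtin_types_compatible_p and __builtin_choose_expr extensions). The names traverse, traverse_, struct total and add_value are made up for illustration; this is not the CCAN module itself nor a TDB function. The cast is applied only when the callback's private argument type matches the pointer actually passed; otherwise the callback is passed through uncast and the compiler warns about the mismatch.

```c
#include <assert.h>
#include <stddef.h>

/* Untyped core, analogous to a traverse function taking void *. */
static int traverse_(const int *vals, size_t n,
                     int (*fn)(int, void *), void *arg)
{
    assert(fn != NULL);
    for (size_t i = 0; i < n; i++) {
        int ret = fn(vals[i], arg);
        if (ret != 0)
            return ret;         /* a non-zero return stops the walk */
    }
    return 0;
}

/* Typed wrapper: cast fn only if it takes exactly typeof(arg). */
#define traverse(vals, n, fn, arg)                                        \
    traverse_((vals), (n),                                                \
        __builtin_choose_expr(                                            \
            __builtin_types_compatible_p(__typeof__(&*(fn)),              \
                                         int (*)(int, __typeof__(arg))),  \
            (int (*)(int, void *))(fn), (fn)),                            \
        (arg))

/* The callback names its real argument type instead of void *. */
struct total { long sum; };

static int add_value(int v, struct total *t)
{
    t->sum += v;
    return 0;
}
```

Calling through the cast pointer assumes that void * and struct pointers are passed identically, which holds on the platforms tdb targets but is not guaranteed by the C standard.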
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should
|
|
|
- be cleared if the caller discovers it is the only process with the TDB
|
|
|
- open.
|
|
|
- However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not
|
|
|
- be detected, so will have the TDB erased underneath them (usually resulting
|
|
|
- in a crash).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-There is a similar issue on fork(); if the parent exits (or otherwise closes
|
|
|
- the tdb) before the child calls tdb_reopen_all() to establish the lock
|
|
|
- used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener
|
|
|
- at that moment will believe it alone has opened the TDB and will erase
|
|
|
- it.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Remove TDB_CLEAR_IF_FIRST.
|
|
|
- Other workarounds are possible, but see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "TDB_CLEAR_IF_FIRST-Imposes-Performance"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1298979699
|
|
|
-Incomplete, TDB_CLEAR_IF_FIRST still defined, but not implemented.
|
|
|
-\change_inserted 0 1298979700
|
|
|
-Complete.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Extending The Header Is Difficult
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We have reserved (zeroed) words in the TDB header, which can be used for
|
|
|
- future features.
|
|
|
- If the future features are compulsory, the version number must be updated
|
|
|
- to prevent old code from accessing the database.
|
|
|
- But if the future feature is optional, we have no way of telling if older
|
|
|
- code is accessing the database or not.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The header should contain a
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-format variant
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- value (64-bit).
|
|
|
- This is divided into two 32-bit parts:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-The lower part reflects the format variant understood by code accessing
|
|
|
- the database.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-The upper part reflects the format variant you must understand to write
|
|
|
- to the database (otherwise you can only open for reading).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The latter field can only be written at creation time, the former should
|
|
|
- be written under the OPEN_LOCK when opening the database for writing, if
|
|
|
- the variant of the code is lower than the current lowest variant.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This should allow backwards-compatible features to be added, and detection
|
|
|
- if older code (which doesn't understand the feature) writes to the database.
|
|
|
-\end_layout
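The two halves of the format variant could be handled along these lines. This is a sketch only: MY_VARIANT and the function names are hypothetical, and the on-disk position of the field is not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the newest format variant this build of the code understands. */
#define MY_VARIANT 3u

/* Lower 32 bits: lowest variant among code that has opened the file for writing. */
static uint32_t lowest_accessor(uint64_t v)
{
    return (uint32_t)(v & 0xffffffffu);
}

/* Upper 32 bits: variant required to write; fixed at creation time. */
static uint32_t write_required(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Older code may still open the file, but only for reading. */
static bool may_write(uint64_t v)
{
    return MY_VARIANT >= write_required(v);
}

/* To be called under the open lock when opening for writing: lower the
 * recorded variant if we are older than every previous writer. */
static uint64_t note_writer(uint64_t v)
{
    if (MY_VARIANT < lowest_accessor(v))
        v = (v & ~(uint64_t)0xffffffffu) | MY_VARIANT;
    return v;
}
```

Newer code can then compare lowest_accessor() against the variant that introduced an optional feature to learn whether older code may have written to the database.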
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Record Headers Are Not Expandable
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-If we later want to add (say) checksums on keys and data, it would require
|
|
|
- another format change, which we'd like to avoid.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We often have extra padding at the tail of a record.
|
|
|
- If we ensure that the first byte (if any) of this padding is zero, we will
|
|
|
- have a way for future changes to detect code which doesn't understand a
|
|
|
- new format: the new code would write (say) a 1 at the tail, and thus if
|
|
|
- there is no tail or the first byte is 0, we would know the extension is
|
|
|
- not present on that record.
|
|
|
-\end_layout
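The check this implies is tiny. A sketch, with has_extension as a hypothetical helper name and the padding passed in as a plain buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true only if the record has tail padding and its first byte is
 * non-zero, i.e. newer code has marked an extension as present. */
static bool has_extension(const unsigned char *pad, size_t pad_len)
{
    assert(pad != NULL || pad_len == 0);
    return pad_len > 0 && pad[0] != 0;
}
```

A record with no tail at all is treated exactly like one whose first padding byte is zero, so code which predates the extension never needs to change what it writes.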
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Incomplete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-TDB Does Not Use Talloc
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Many users of TDB (particularly Samba) use the talloc allocator, and thus
|
|
|
- have to wrap TDB in a talloc context to use it conveniently.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The allocation within TDB is not complicated enough to justify the use of
|
|
|
- talloc, and I am reluctant to force another (excellent) library on TDB
|
|
|
- users.
|
|
|
- Nonetheless a compromise is possible.
|
|
|
- An attribute (see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "attributes"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-) can be added later to tdb_open() to provide an alternate allocation mechanism,
|
|
|
- specifically for talloc but usable by any other allocator (which would
|
|
|
- ignore the
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-context
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- argument).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This would form a talloc hierarchy as expected, but the caller would still
|
|
|
- have to attach a destructor to the tdb context returned from tdb_open to
|
|
|
- close it.
|
|
|
- All TDB_DATA fields would be children of the tdb_context, and the caller
|
|
|
- would still have to manage them (using talloc_free() or talloc_steal()).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Deferred.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Section
|
|
|
-Performance And Scalability Issues
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "TDB_CLEAR_IF_FIRST-Imposes-Performance"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB_CLEAR_IF_FIRST Imposes Performance Penalty
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset
|
|
|
- 4 (aka.
|
|
|
- the ACTIVE_LOCK).
|
|
|
- While these locks never conflict in normal tdb usage, they do add substantial
|
|
|
- overhead for most fcntl lock implementations when the kernel scans to detect
|
|
|
- if a lock conflict exists.
|
|
|
- This is often a singly linked list, making the time to acquire and release
|
|
|
- a fcntl lock O(N) where N is the number of processes with the TDB open,
|
|
|
- not the number actually doing work.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-In a Samba server it is common to have huge numbers of clients sitting idle,
|
|
|
- and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag.
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-There is a flag to tdb_reopen_all() which is used for this optimization:
|
|
|
- if the parent process will outlive the child, the child does not need the
|
|
|
- ACTIVE_LOCK.
|
|
|
- This is a workaround for this very performance issue.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Remove the flag.
|
|
|
- It was a neat idea, but even trivial servers tend to know when they are
|
|
|
- initializing for the first time and can simply unlink the old tdb at that
|
|
|
- point.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1298979837
|
|
|
-Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing.
|
|
|
-\change_inserted 0 1298979837
|
|
|
-Complete.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-TDB Files Have a 4G Limit
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This seems to be becoming an issue (so much for
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-trivial
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
-!), particularly for ldb.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-A new, incompatible TDB format which uses 64 bit offsets internally rather
|
|
|
- than 32 bit as now.
|
|
|
- For simplicity of endian conversion (which TDB does on the fly if required),
|
|
|
- all values will be 64 bit on disk.
|
|
|
- In practice, some upper bits may be used for other purposes, but at least
|
|
|
- 56 bits will be available for file offsets.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-tdb_open() will automatically detect old-version files, and even create them
|
|
|
- if TDB_VERSION6 is specified to tdb_open.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-32 bit processes will still be able to access TDBs larger than 4G (assuming
|
|
|
- that their off_t allows them to seek to 64 bits); they will gracefully
|
|
|
- fall back as they fail to mmap.
|
|
|
- This can happen already with large TDBs.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Old versions of tdb will fail to open the new TDB files (since 28 August
|
|
|
- 2009, commit 398d0c29290: prior to that any unrecognized file format would
|
|
|
- be erased and initialized as a fresh tdb!).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-TDB Records Have a 4G Limit
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This has not been a reported problem, and the API uses size_t which can
|
|
|
- be 64 bit on 64 bit platforms.
|
|
|
- However, other limits may have made such an issue moot.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Record sizes will be 64 bit, with an error returned on 32 bit platforms
|
|
|
- which try to access such records (the current implementation would return
|
|
|
- TDB_ERR_OOM in a similar case).
|
|
|
- It seems unlikely that 32 bit keys will be a limitation, so the implementation
|
|
|
- may not support this (see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "sub:Records-Incur-A"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Hash Size Is Determined At TDB Creation Time
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-TDB contains a number of hash chains in the header; the number is specified
|
|
|
- at creation time, and defaults to 131.
|
|
|
- This is such a bottleneck on large databases (as each hash chain gets quite
|
|
|
- long), that LDB uses 10,000 for this hash.
|
|
|
- In general it is impossible to know what the 'right' answer is at database
|
|
|
- creation time.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "sub:Hash-Size-Solution"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-After comprehensive performance testing on various scalable hash variants
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying
|
|
|
- because I was previously convinced that an expanding tree of hashes would
|
|
|
- be very close to optimal.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-, it became clear that it is hard to beat a straight linear hash table which
|
|
|
- doubles in size when it reaches saturation.
|
|
|
- Unfortunately, altering the hash table introduces serious locking complications
|
|
|
-: the entire hash table needs to be locked to enlarge the hash table, and
|
|
|
- others might be holding locks.
|
|
|
- Particularly insidious are insertions done under tdb_chainlock.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Thus an expanding layered hash will be used: an array of hash groups, with
|
|
|
- each hash group exploding into pointers to lower hash groups once it fills,
|
|
|
- turning into a hash tree.
|
|
|
- This has implications for locking: we must lock the entire group in case
|
|
|
- we need to expand it, yet we don't know how deep the tree is at that point.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Note that bits from the hash table entries should be stolen to hold more
|
|
|
- hash bits to reduce the penalty of collisions.
|
|
|
- We can use the otherwise-unused lower 3 bits.
|
|
|
- If we limit the size of the database to 64 petabytes, we can use the top
|
|
|
- 8 bits of the hash entry as well.
|
|
|
- These 11 bits would reduce false positives down to 1 in 2000 which is more
|
|
|
- than we need: we can use one of the bits to indicate that the extra hash
|
|
|
- bits are valid.
|
|
|
- This means we can choose not to re-hash all entries when we expand a hash
|
|
|
- group; simply use the next bits we need and mark them invalid.
|
|
|
-\end_layout
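One possible packing of a hash table entry, as a sketch. The bit assignment is an assumption made for illustration (bit 0 as the valid flag, bits 1 and 2 plus the top 8 bits carrying ten extra hash bits); the text above only fixes the total of 11 stolen bits, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFF_MASK  0x00fffffffffffff8ULL  /* bits 3..55: 8-byte aligned offset */
#define VALID_BIT 0x1ULL                 /* bit 0: extra hash bits are valid  */

/* Pack an offset and ten extra hash bits into one 64-bit entry. */
static uint64_t encode_entry(uint64_t off, uint32_t extra10)
{
    assert((off & ~OFF_MASK) == 0);      /* aligned and below 2^56 */
    return off | VALID_BIT
         | ((uint64_t)(extra10 & 0x3) << 1)
         | ((uint64_t)((extra10 >> 2) & 0xff) << 56);
}

static uint64_t entry_offset(uint64_t e)
{
    return e & OFF_MASK;
}

/* False means the record cannot match, so it need not be read at all. */
static bool entry_may_match(uint64_t e, uint32_t extra10)
{
    if (!(e & VALID_BIT))
        return true;                     /* no information: read the record */
    uint32_t stored = (uint32_t)((e >> 1) & 0x3) | ((uint32_t)(e >> 56) << 2);
    return stored == (extra10 & 0x3ff);
}
```

When a hash group expands, entries can be moved with VALID_BIT cleared rather than being re-hashed; lookups on them simply fall back to reading the record.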
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "TDB-Freelist-Is"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB Freelist Is Highly Contended
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-TDB uses a singly linked list for the free list.
|
|
|
- Allocation occurs as follows, using heuristics which have evolved over
|
|
|
- time:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Get the free list lock for this whole operation.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Multiply length by 1.25, so we always over-allocate by 25%.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Set the slack multiplier to 1.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Examine the current freelist entry: if it is > length but < the current
|
|
|
- best case, remember it as the best case.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Multiply the slack multiplier by 1.05.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If our best fit so far is less than length * slack multiplier, return it.
|
|
|
- The slack will be turned into a new free record if it's large enough.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Otherwise, go onto the next freelist entry.
|
|
|
-\end_layout
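Taking steps 2 to 7 literally, the scan can be sketched over a plain array of free entry sizes. The function name is hypothetical, locking is omitted, and falling back to the best entry seen when the list runs out is an assumption; the steps above do not say what happens at the end of the list.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the chosen free entry, or -1 if none is big enough. */
static int pick_free_entry(const size_t *sizes, size_t n, size_t length)
{
    double want = (double)length * 1.25;   /* step 2: over-allocate by 25% */
    double slack = 1.0;                    /* step 3 */
    int best = -1;

    assert(sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* step 4: big enough, and smaller than the best seen so far */
        if ((double)sizes[i] > want && (best < 0 || sizes[i] < sizes[best]))
            best = (int)i;
        slack *= 1.05;                     /* step 5 */
        /* step 6: accept a fit which wastes less than the current slack */
        if (best >= 0 && (double)sizes[best] < want * slack)
            return best;
    }
    return best;                           /* assumption: settle for best seen */
}
```

Because the slack grows by 5% for every entry examined, the longer the walk, the worse a fit the allocator is prepared to accept.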
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Deleting a record occurs as follows:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Lock the hash chain for this whole operation.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Walk the chain to find the record, keeping the prev pointer offset.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If max_dead is non-zero:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_deeper
|
|
|
-\begin_layout Enumerate
|
|
|
-Walk the hash chain again and count the dead records.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If it's more than max_dead, bulk free all the dead ones (similar to steps
|
|
|
- 4 and below, but the lock is only obtained once).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Simply mark this record as dead and return.
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_deeper
|
|
|
-\begin_layout Enumerate
|
|
|
-Get the free list lock for the remainder of this operation.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "right-merging"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-Examine the following block to see if it is free; if so, enlarge the current
|
|
|
- block and remove that block from the free list.
|
|
|
- This was disabled, as removal from the free list was O(entries-in-free-list).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Examine the preceding block to see if it is free: for this reason, each
|
|
|
- block has a 32-bit tailer which indicates its length.
|
|
|
- If it is free, expand it to cover our new block and return.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Otherwise, prepend ourselves to the free list.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Disabling right-merging (step
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "right-merging"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-) causes fragmentation; the other heuristics proved insufficient to address
|
|
|
- this, so the final answer to this was that when we expand the TDB file
|
|
|
- inside a transaction commit, we repack the entire tdb.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The single list lock limits our allocation rate; due to the other issues
|
|
|
- this is not currently seen as a bottleneck.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The first step is to remove all the current heuristics, as they obviously
|
|
|
- interact, then examine them once the lock contention is addressed.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The free list must be split to reduce contention.
|
|
|
- Assuming perfect free merging, we can at most have 1 free list entry for
|
|
|
- each entry.
|
|
|
- This implies that the number of free lists is related to the size of the
|
|
|
- hash table, but as it is rare to walk a large number of free list entries
|
|
|
- we can use far fewer, say 1/32 of the number of hash buckets.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-It seems tempting to try to reuse the hash implementation which we use for
|
|
|
- records here, but we have two ways of searching for free entries: for allocatio
|
|
|
-n we search by size (and possibly zone) which produces too many clashes
|
|
|
- for our hash table to handle well, and for coalescing we search by address.
|
|
|
- Thus an array of doubly-linked free lists seems preferable.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-There are various benefits in using per-size free lists (see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "sub:TDB-Becomes-Fragmented"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-) but it's not clear this would reduce contention in the common case where
|
|
|
- all processes are allocating/freeing the same size.
|
|
|
- Thus we almost certainly need to divide in other ways: the most obvious
|
|
|
- is to divide the file into zones, and use a free list (or table of free
|
|
|
- lists) for each.
|
|
|
- This approximates address ordering.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Unfortunately it is difficult to know what heuristics should be used to
|
|
|
- determine zone sizes, and our transaction code relies on being able to
|
|
|
- create a
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-recovery area
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- by simply appending to the file (difficult if it would need to create a
|
|
|
- new zone header).
|
|
|
- Thus we use a linked-list of free tables; currently we only ever create
|
|
|
- one, but if there is more than one we choose one at random to use.
|
|
|
- In future we may use heuristics to add new free tables on contention.
|
|
|
- We only expand the file when all free tables are exhausted.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The basic algorithm is as follows.
|
|
|
- Freeing is simple:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Identify the correct free list.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Lock the corresponding list.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Re-check the list (we didn't have a lock, sizes could have changed): relock
|
|
|
- if necessary.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Place the freed entry in the list.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Allocation is a little more complicated, as we perform delayed coalescing
|
|
|
- at this point:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Pick a free table; usually the previous one.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Lock the corresponding list.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If the top entry is large enough, remove it from the list and return it.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Otherwise, coalesce entries in the list. If there was no entry large enough,
|
|
|
- unlock the list and try the next largest list.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If no list has an entry which meets our needs, try the next free table.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If no free table satisfies, expand the file.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This optimizes rapid insert/delete of free list entries by not coalescing
|
|
|
- them all the time.
|
|
|
- First-fit address ordering seems to be fairly good for keeping
|
|
|
- fragmentation low (see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "sub:TDB-Becomes-Fragmented"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-).
|
|
|
- Note that address ordering does not need a tailer to coalesce, though if
|
|
|
- we needed one we could have one cheaply: see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "sub:Records-Incur-A"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-.
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Each free entry has the free table number in the header: less than 255.
|
|
|
- It also contains a doubly-linked list for easy deletion.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "sub:TDB-Becomes-Fragmented"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB Becomes Fragmented
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Much of this is a result of allocation strategy
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1998 ftp://ftp.cs.ute
|
|
|
-xas.edu/pub/garbage/malloc/ismm98.ps
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- and deliberate hobbling of coalescing; internal fragmentation (aka overallocati
|
|
|
-on) is deliberately set at 25%, and external fragmentation is only cured
|
|
|
- by the decision to repack the entire db when a transaction commit needs
|
|
|
- to enlarge the file.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The 25% overhead on allocation works in practice for ldb because indexes
|
|
|
- tend to expand by one record at a time.
|
|
|
- This internal fragmentation can be resolved by having an
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-expanded
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- bit in the header to note entries that have previously expanded, and allocating
|
|
|
- more space for them.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-There is a spectrum of possible solutions for external fragmentation:
|
|
|
- one is to use a fragmentation-avoiding allocation strategy such as a best-fit
|
|
|
- address-order allocator.
|
|
|
- The other end of the spectrum would be to use a bump allocator (very fast
|
|
|
- and simple) and simply repack the file when we reach the end.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-There are three problems with efficient fragmentation-avoiding allocators:
|
|
|
- they are non-trivial, they tend to use a single free list for each size,
|
|
|
- and there's no evidence that tdb allocation patterns will match those recorded
|
|
|
- for general allocators (though it seems likely).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Thus we don't spend too much effort on external fragmentation; we will be
|
|
|
- no worse than the current code if we need to repack on occasion.
|
|
|
- More effort is spent on reducing freelist contention, and reducing overhead.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "sub:Records-Incur-A"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-Records Incur A 28-Byte Overhead
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Each TDB record has a header as follows:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-struct tdb_record {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- tdb_off_t next; /* offset of the next record in the list */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- tdb_len_t rec_len; /* total byte length of record */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- tdb_len_t key_len; /* byte length of key */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- tdb_len_t data_len; /* byte length of data */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint32_t full_hash; /* the full 32 bit hash of the key */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint32_t magic; /* try to catch errors */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- /* the following union is implied:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- union {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- char record[rec_len];
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- struct {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- char key[key_len];
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- char data[data_len];
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- }
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint32_t totalsize; (tailer)
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- }
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- */
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Naively, this would double to a 56-byte overhead on a 64 bit implementation.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We can use various techniques to reduce this for an allocated block:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-The 'next' pointer is not required, as we are using a flat hash table.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-'rec_len' can instead be expressed as an addition to key_len and data_len
|
|
|
- (it accounts for wasted or overallocated length in the record).
|
|
|
- Since the record length is always a multiple of 8, we can conveniently
|
|
|
- fit it in 32 bits (representing up to 35 bits).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-'key_len' and 'data_len' can be reduced.
|
|
|
- I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine
|
|
|
- the two into one 64-bit field and use a 5 bit value which indicates at
|
|
|
- what bit to divide the two.
|
|
|
- Keys are unlikely to scale as fast as data, so I'm assuming a maximum key
|
|
|
- size of 32 bits.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-'full_hash' is used to avoid a memcmp on the
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-miss
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- case, but this is diminishing returns after a handful of bits (at 10 bits,
|
|
|
- it reduces 99.9% of false memcmp).
|
|
|
- As an aside, as the lower bits are already incorporated in the hash table
|
|
|
- resolution, the upper bits should be used here.
|
|
|
- Note that it's not clear that these bits will be a win, given the extra
|
|
|
- bits in the hash table itself (see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "sub:Hash-Size-Solution"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-'magic' does not need to be enlarged: it currently reflects one of 5 values
|
|
|
- (used, free, dead, recovery, and unused_recovery).
|
|
|
- It is useful for quick sanity checking however, and should not be eliminated.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-'tailer' is only used to coalesce free blocks (so a block to the right can
|
|
|
- find the header to check if this block is free).
|
|
|
- This can be replaced by a single 'free' bit in the header of the following
|
|
|
- block (and the tailer only exists in free blocks).
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-This technique from Thomas Standish.
|
|
|
- Data Structure Techniques.
|
|
|
- Addison-Wesley, Reading, Massachusetts, 1980.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- The current proposed coalescing algorithm doesn't need this, however.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This produces a 16 byte used header like this:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-struct tdb_used_record {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint32_t used_magic : 16,
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- key_data_divide: 5,
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- top_hash: 11;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint32_t extra_octets;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint64_t key_and_data_len;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
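As an illustration of the combined length field, here is a minimal sketch of how key_len and data_len might share key_and_data_len. It assumes the 5-bit key_data_divide value is simply the number of low bits that hold key_len, which caps keys at 31 bits in this sketch; the real encoding may scale or offset that value. The helpers `bits_needed`, `pack_lens`, `unpack_key_len` and `unpack_data_len` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 for v == 0). */
static unsigned bits_needed(uint64_t v)
{
    unsigned n = 0;
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Pack both lengths into one 64-bit word: key_len in the low 'divide' bits,
 * data_len in the bits above. Returns 0 on success, -1 if they do not fit. */
static int pack_lens(uint64_t key_len, uint64_t data_len,
                     unsigned *divide, uint64_t *packed)
{
    unsigned kbits = bits_needed(key_len);

    if (kbits > 31)                          /* divide is a 5-bit value */
        return -1;
    if (bits_needed(data_len) > 64 - kbits)  /* data must fit in the rest */
        return -1;
    *divide = kbits;
    *packed = (data_len << kbits) | key_len;
    return 0;
}

static uint64_t unpack_key_len(unsigned divide, uint64_t packed)
{
    assert(divide <= 31);
    return packed & (((uint64_t)1 << divide) - 1);
}

static uint64_t unpack_data_len(unsigned divide, uint64_t packed)
{
    assert(divide <= 31);
    return packed >> divide;
}
```

Choosing the divide per record means a small key leaves nearly all 64 bits for data_len, which matches the stated reluctance to cap data at 32 bits.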
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-And a free record like this:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-struct tdb_free_record {
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint64_t free_magic: 8,
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- prev : 56;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint64_t free_table: 8,
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- total_length : 56;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint64_t next;
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-};
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1291206079
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-Note that by limiting valid offsets to 56 bits, we can pack everything we
|
|
|
- need into 3 64-bit words, meaning our minimum record size is 8 bytes.
|
|
|
-\end_layout
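The three-word layout can also be written with explicit shifts instead of bitfields, since the placement of C bitfields is implementation-defined. The sketch below assumes the 8-bit value sits in the top byte of each word and the 56-bit value in the low bits; that choice, and the names `free_rec_words`, `pack8_56`, `top8` and `low56`, are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define OFF_BITS 56
#define OFF_MASK ((UINT64_C(1) << OFF_BITS) - 1)

/* A free record as three 64-bit words (24 bytes in total). */
struct free_rec_words {
    uint64_t magic_and_prev;   /* free_magic (8 bits) + prev (56 bits) */
    uint64_t ftable_and_len;   /* free_table (8 bits) + total_length (56 bits) */
    uint64_t next;             /* offset of the next free record */
};

/* Combine an 8-bit tag with a 56-bit offset or length. */
static uint64_t pack8_56(uint8_t hi, uint64_t lo)
{
    assert(lo <= OFF_MASK);    /* offsets are limited to 56 bits */
    return ((uint64_t)hi << OFF_BITS) | lo;
}

static uint8_t top8(uint64_t w)
{
    return (uint8_t)(w >> OFF_BITS);
}

static uint64_t low56(uint64_t w)
{
    return w & OFF_MASK;
}
```

With a 16-byte used header, a used record carrying 8 bytes of payload occupies the same 24 bytes, so any freed record can hold these three words.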
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Transaction Commit Requires 4 fdatasync
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The current transaction algorithm is:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-write_recovery_data();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-sync();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-write_recovery_header();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-sync();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-overwrite_with_new_data();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-sync();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-remove_recovery_header();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-sync();
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-On current ext3, each sync flushes all data to disk, so the next 3 syncs
|
|
|
- are relatively inexpensive.
|
|
|
- But this could become a performance bottleneck on other filesystems such
|
|
|
- as ext4.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Neil Brown points out that this is overzealous, and only one sync is needed:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Bundle the recovery data, a transaction counter and a strong checksum of
|
|
|
- the new data.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Compute a strong checksum of that whole bundle.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Store the bundle in the database.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Overwrite the oldest of the two recovery pointers in the header (identified
|
|
|
- using the transaction counter) with the offset of this bundle.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-sync.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Write the new data to the file.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Checking for recovery means identifying the latest bundle with a valid checksum
|
|
|
- and using the new data checksum to ensure that it has been applied.
|
|
|
- This is more expensive than the current check, but need only be done at
|
|
|
- open.
|
|
|
- For running databases, a separate header field can be used to indicate
|
|
|
- a transaction in progress; we need only check for recovery if this is set.
|
|
|
-\end_layout
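The slot selection and the recovery check described above might look roughly like this. It is a sketch under several assumptions: the two recovery pointers are modelled as two in-memory slots rather than file offsets, 64-bit FNV-1a stands in for the strong checksum (it is not one), the check that the new data has been applied is omitted, and every name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A recovery bundle as it might be referenced from the header (model only). */
struct bundle {
    int present;                  /* 0 if this slot was never written */
    uint64_t txn_counter;         /* increases with every commit */
    const unsigned char *data;    /* recovery data */
    size_t len;
    uint64_t csum;                /* checksum over counter + data */
};

struct header {
    struct bundle slot[2];
};

/* FNV-1a step, used here only as a stand-in for a strong checksum. */
static uint64_t fnv1a(const unsigned char *p, size_t n, uint64_t h)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

static uint64_t bundle_csum(const struct bundle *b)
{
    unsigned char c[8];
    for (int i = 0; i < 8; i++)
        c[i] = (unsigned char)(b->txn_counter >> (8 * i));
    return fnv1a(b->data, b->len, fnv1a(c, 8, UINT64_C(0xcbf29ce484222325)));
}

/* Commit side: overwrite the older of the two slots. */
static int slot_to_overwrite(const struct header *h)
{
    if (!h->slot[0].present)
        return 0;
    if (!h->slot[1].present)
        return 1;
    return h->slot[0].txn_counter <= h->slot[1].txn_counter ? 0 : 1;
}

/* Open side: newest slot whose checksum verifies, or -1 if there is none. */
static int latest_valid_slot(const struct header *h)
{
    int best = -1;
    for (int i = 0; i < 2; i++) {
        const struct bundle *b = &h->slot[i];
        if (!b->present || bundle_csum(b) != b->csum)
            continue;             /* empty or torn write: ignore */
        if (best < 0 || b->txn_counter > h->slot[best].txn_counter)
            best = i;
    }
    return best;
}
```

Keeping two slots means a bundle torn by a crash fails its checksum and the previous, intact bundle is still found.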
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Deferred.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "sub:TDB-Does-Not"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-TDB Does Not Have Snapshot Support
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None.
|
|
|
- At some point you say
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-use a real database
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- (but see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "replay-attribute"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-).
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-But as a thought experiment, if we implemented transactions to only overwrite
|
|
|
- free entries (this is tricky: there must not be a header in each entry
|
|
|
- which indicates whether it is free; freeness must instead be recorded in metadata elsewhere),
|
|
|
- and a pointer to the hash table, we could create an entirely new commit
|
|
|
- without destroying existing data.
|
|
|
- Then it would be easy to implement snapshots in a similar way.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This would not allow arbitrary changes to the database, such as tdb_repack
|
|
|
- does, and would require more space (since we have to preserve the current
|
|
|
- and future entries at once).
|
|
|
- If we used hash trees rather than one big hash table, we might only have
|
|
|
- to rewrite some sections of the hash, too.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We could then implement snapshots using a similar method, using multiple
|
|
|
- different hash tables/free tables.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Deferred.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Transactions Cannot Operate in Parallel
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This would be useless for ldb, as it hits the index records with just about
|
|
|
- every update.
|
|
|
- It would add significant complexity in resolving clashes, and cause
|
|
|
- all transaction callers to write their code to loop in the case where the
|
|
|
- transactions spuriously failed.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None (but see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "replay-attribute"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-).
|
|
|
- We could solve a small part of the problem by providing read-only transactions.
|
|
|
- These would allow one write transaction to begin, but it could not commit
|
|
|
- until all r/o transactions are done.
|
|
|
- This would require a new RO_TRANSACTION_LOCK, which would be upgraded on
|
|
|
- commit.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Deferred.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Default Hash Function Is Suboptimal
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially
|
|
|
- if we expand it to 64 bits), and works best when the hash bucket size is
|
|
|
- a prime number (which also means a slow modulus).
|
|
|
- In addition, it is highly predictable which could potentially lead to a
|
|
|
- Denial of Service attack in some TDB uses.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The Jenkins lookup3 hash
|
|
|
-\begin_inset Foot
|
|
|
-status open
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-http://burtleburtle.net/bob/c/lookup3.c
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
- is a fast and superbly-mixing hash.
|
|
|
- It's used by the Linux kernel and almost everything else.
|
|
|
- This has the particular properties that it takes an initial seed, and produces
|
|
|
- two 32 bit hash numbers, which we can combine into a 64-bit hash.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The seed should be created at tdb-creation time from some random source,
|
|
|
- and placed in the header.
|
|
|
- This is far from foolproof, but adds a little bit of protection against
|
|
|
- hash bombing.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "Reliable-Traversal-Adds"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-Reliable Traversal Adds Complexity
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We lock a record during traversal iteration, and try to grab that lock in
|
|
|
- the delete code.
|
|
|
- If that grab on delete fails, we simply mark it deleted and continue onwards;
|
|
|
- traversal checks for this condition and does the delete when it moves off
|
|
|
- the record.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-If traversal terminates, the dead record may be left indefinitely.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Remove reliability guarantees; see
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "traverse-Proposed-Solution"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Complete.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Fcntl Locking Adds Overhead
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Placing a fcntl lock means a system call, as does removing one.
|
|
|
- This is actually one reason why transactions can be faster (everything
|
|
|
- is locked once at transaction start).
|
|
|
- In the uncontended case, this overhead can theoretically be eliminated.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We tried this before with spinlock support, in the early days of TDB, and
|
|
|
- it didn't make much difference except in manufactured benchmarks.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-We could use spinlocks (with futex kernel support under Linux), but it means
|
|
|
- that we lose automatic cleanup when a process dies with a lock.
|
|
|
- There is a method of auto-cleanup under Linux, but it's not supported by
|
|
|
- other operating systems.
|
|
|
- We could reintroduce a clear-if-first-style lock and sweep for dead futexes
|
|
|
- on open, but that wouldn't help the normal case of one concurrent opener
|
|
|
- dying.
|
|
|
- Increasingly elaborate repair schemes could be considered, but they require
|
|
|
- an ABI change (everyone must use them) anyway, so there's no need to do
|
|
|
- this at the same time as everything else.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Some Transactions Don't Require Durability
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast)
|
|
|
- usage, and occasionally empties the results into a transactional TDB.
|
|
|
- This kind of usage prioritizes performance over durability: as long as
|
|
|
- we are consistent, data can be lost.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-This would be more neatly implemented inside tdb: a
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-soft
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- transaction commit (i.e.
|
|
|
- syncless) which means that data may be reverted on a crash.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Unfortunately any transaction scheme which overwrites old data requires
|
|
|
- a sync before that overwrite to avoid the possibility of corruption.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-It seems possible to use a scheme similar to that described in
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "sub:TDB-Does-Not"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-, where transactions are committed without overwriting existing data, and
|
|
|
- an array of top-level pointers is available in the header.
|
|
|
- If the transaction is
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-soft
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- then we would not need a sync at all: existing processes would pick up
|
|
|
- the new hash table and free list and work with that.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-At some later point, a sync would allow recovery of the old data into the
|
|
|
- free lists (perhaps when the array of top-level pointers filled).
|
|
|
- On crash, tdb_open() would examine the array of top levels, and apply the
|
|
|
- transactions until it encountered an invalid checksum.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-Tracing Is Fragile, Replay Is External
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The current TDB has compile-time-enabled tracing code, but it often breaks
|
|
|
- as it is not enabled by default.
|
|
|
- In a similar way, the ctdb code has an external wrapper which does replay
|
|
|
- tracing so it can coordinate cluster-wide transactions.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Proposed Solution
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "replay-attribute"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Tridge points out that an attribute can be later added to tdb_open (see
|
|
|
-
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "attributes"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-) to provide replay/trace hooks, which could become the basis for this and
|
|
|
- future parallel transactions and snapshot support.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsubsection
|
|
|
-Status
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-Deferred.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_body
|
|
|
-\end_document
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.12
|
|
|
-log
|
|
|
-@Add status, some fixes, linked freelists.
|
|
|
-@
|
|
|
-text
|
|
|
-@d53 1
|
|
|
-a53 7
|
|
|
-
|
|
|
-\change_deleted 0 1291204535
|
|
|
-14-September
|
|
|
-\change_inserted 0 1291204533
|
|
|
-1-December
|
|
|
-\change_unchanged
|
|
|
--2010
|
|
|
-a580 2
|
|
|
-\change_inserted 0 1291204563
|
|
|
-
|
|
|
-a583 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204572
|
|
|
-a587 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204573
|
|
|
-a588 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a629 2
|
|
|
-\change_inserted 0 1291204588
|
|
|
-
|
|
|
-a632 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204588
|
|
|
-a636 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204631
|
|
|
-a639 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a693 2
|
|
|
-\change_inserted 0 1291204639
|
|
|
-
|
|
|
-a696 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204640
|
|
|
-d702 1
|
|
|
-a702 1
|
|
|
-\change_inserted 0 1291204665
|
|
|
-d704 2
|
|
|
-a728 2
|
|
|
-\change_inserted 0 1291204671
|
|
|
-
|
|
|
-a731 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204671
|
|
|
-a735 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204673
|
|
|
-a736 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a780 2
|
|
|
-\change_inserted 0 1291204731
|
|
|
-
|
|
|
-a783 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204732
|
|
|
-a787 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204779
|
|
|
-a790 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a842 2
|
|
|
-\change_inserted 0 1291204830
|
|
|
-
|
|
|
-a845 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204831
|
|
|
-a849 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204834
|
|
|
-a850 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d879 9
|
|
|
-a887 2
|
|
|
- deal of churn; we are better to guarantee that the tdb_errcode is per-thread
|
|
|
- so the current programming model can be maintained.
|
|
|
-d891 9
|
|
|
-d903 2
|
|
|
-a922 2
|
|
|
-\change_inserted 0 1291204847
|
|
|
-
|
|
|
-a925 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204847
|
|
|
-d930 5
|
|
|
-a934 3
|
|
|
-
|
|
|
-\change_inserted 0 1291204852
|
|
|
-Incomplete.
|
|
|
-a1051 2
|
|
|
-\change_inserted 0 1291204881
|
|
|
-
|
|
|
-a1054 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204881
|
|
|
-a1058 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204885
|
|
|
-a1059 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1140 2
|
|
|
-\change_inserted 0 1291204898
|
|
|
-
|
|
|
-a1143 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204898
|
|
|
-a1147 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204901
|
|
|
-a1148 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1224 2
|
|
|
-\change_inserted 0 1291204908
|
|
|
-
|
|
|
-a1227 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204908
|
|
|
-a1231 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204908
|
|
|
-a1232 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1271 2
|
|
|
-\change_inserted 0 1291204917
|
|
|
-
|
|
|
-a1274 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204917
|
|
|
-a1278 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204920
|
|
|
-a1279 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1316 2
|
|
|
-\change_inserted 0 1291204927
|
|
|
-
|
|
|
-a1319 2
|
|
|
-
|
|
|
-\change_inserted 0 1291204928
|
|
|
-d1325 1
|
|
|
-a1325 1
|
|
|
-\change_inserted 0 1291204942
|
|
|
-d1327 2
|
|
|
-a1381 2
|
|
|
-\change_inserted 0 1291205003
|
|
|
-
|
|
|
-a1384 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205004
|
|
|
-a1388 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205007
|
|
|
-a1411 2
|
|
|
-\change_inserted 0 1291205019
|
|
|
-
|
|
|
-a1414 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205019
|
|
|
-a1418 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205023
|
|
|
-a1419 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1465 2
|
|
|
-\change_inserted 0 1291205029
|
|
|
-
|
|
|
-a1468 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205029
|
|
|
-a1472 2
|
|
|
-
|
|
|
-\change_inserted 0 1291206020
|
|
|
-a1473 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1528 2
|
|
|
-\change_inserted 0 1291205043
|
|
|
-
|
|
|
-a1531 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205043
|
|
|
-d1537 1
|
|
|
-a1537 1
|
|
|
-\change_inserted 0 1291205057
|
|
|
-d1539 2
|
|
|
-a1589 2
|
|
|
-\change_inserted 0 1291205062
|
|
|
-
|
|
|
-a1592 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205062
|
|
|
-a1596 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205062
|
|
|
-a1597 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1626 2
|
|
|
-\change_inserted 0 1291205072
|
|
|
-
|
|
|
-a1629 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205073
|
|
|
-a1633 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205073
|
|
|
-a1634 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1674 4
|
|
|
-
|
|
|
-\change_deleted 0 1291204504
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-a1699 2
|
|
|
-\change_inserted 0 1291205079
|
|
|
-
|
|
|
-a1702 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205080
|
|
|
-a1706 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205080
|
|
|
-a1707 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1833 2
|
|
|
-\change_inserted 0 1291205090
|
|
|
-
|
|
|
-d1869 2
|
|
|
-a1870 7
|
|
|
- is to divide the file into zones, and using a free list (or
|
|
|
-\change_inserted 0 1291205498
|
|
|
-table
|
|
|
-\change_deleted 0 1291205497
|
|
|
-set
|
|
|
-\change_unchanged
|
|
|
- of free lists) for each.
|
|
|
-a1871 2
|
|
|
-\change_inserted 0 1291205203
|
|
|
-
|
|
|
-a1874 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205358
|
|
|
-a1890 21
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1291205198
|
|
|
-Note that this means we need to split the free lists when we expand the
|
|
|
- file; this is probably acceptable when we double the hash table size, since
|
|
|
- that is such an expensive operation already.
|
|
|
- In the case of increasing the file size, there is an optimization we can
|
|
|
- use: if we use M in the formula above as the file size rounded up to the
|
|
|
- next power of 2, we only need reshuffle free lists when the file size crosses
|
|
|
- a power of 2 boundary,
|
|
|
-\emph on
|
|
|
-and
|
|
|
-\emph default
|
|
|
-reshuffling the free lists is trivial: we simply merge every consecutive
|
|
|
- pair of free lists.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1899 1
|
|
|
-a1899 7
|
|
|
-Identify the correct
|
|
|
-\change_inserted 0 1291205366
|
|
|
-free list
|
|
|
-\change_deleted 0 1291205364
|
|
|
-zone
|
|
|
-\change_unchanged
|
|
|
-.
|
|
|
-d1907 2
|
|
|
-a1908 7
|
|
|
-Re-check the
|
|
|
-\change_inserted 0 1291205372
|
|
|
-list
|
|
|
-\change_deleted 0 1291205371
|
|
|
-zone
|
|
|
-\change_unchanged
|
|
|
- (we didn't have a lock, sizes could have changed): relock if necessary.
|
|
|
-d1912 1
|
|
|
-a1912 5
|
|
|
-Place the freed entry in the list
|
|
|
-\change_deleted 0 1291205382
|
|
|
- for that zone
|
|
|
-\change_unchanged
|
|
|
-.
|
|
|
-d1921 1
|
|
|
-a1921 15
|
|
|
-Pick a
|
|
|
-\change_deleted 0 1291205403
|
|
|
-zone either the zone we last freed into, or based on a
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-random
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- number.
|
|
|
-\change_inserted 0 1291205411
|
|
|
-free table; usually the previous one.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1925 10
|
|
|
-\change_deleted 0 1291205432
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1291205428
|
|
|
-Re-check the zone: relock if necessary.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1934 1
|
|
|
-a1934 7
|
|
|
- unlock the list and try the next
|
|
|
-\change_inserted 0 1291205455
|
|
|
-largest list
|
|
|
-\change_deleted 0 1291205452
|
|
|
-zone.
|
|
|
-\change_inserted 0 1291205457
|
|
|
-
|
|
|
-a1937 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205476
|
|
|
-a1938 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1966 2
|
|
|
-\change_inserted 0 1291205542
|
|
|
-
|
|
|
-a1969 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205591
|
|
|
-a1971 70
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1291205539
|
|
|
-I anticipate that the number of entries in each free zone would be small,
|
|
|
- but it might be worth using one free entry to hold pointers to the others
|
|
|
- for cache efficiency.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1291205534
|
|
|
-\begin_inset CommandInset label
|
|
|
-LatexCommand label
|
|
|
-name "freelist-in-zone"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-If we want to avoid locking complexity (enlarging the free lists when we
|
|
|
- enlarge the file) we could place the array of free lists at the beginning
|
|
|
- of each zone.
|
|
|
- This means existing array lists never move, but means that a record cannot
|
|
|
- be larger than a zone.
|
|
|
- That in turn implies that zones should be variable sized (say, power of
|
|
|
- 2), which makes the question
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-what zone is this record in?
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- much harder (and
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-pick a random zone
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
-, but that's less common).
|
|
|
- It could be done with as few as 4 bits from the record header.
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-Using
|
|
|
-\begin_inset Formula $2^{16+N*3}$
|
|
|
-\end_inset
|
|
|
-
|
|
|
-means 0 gives a minimal 65536-byte zone, 15 gives the maximal
|
|
|
-\begin_inset Formula $2^{61}$
|
|
|
-\end_inset
|
|
|
-
|
|
|
- byte zone.
|
|
|
- Zone sizes increase in steps of a factor of 8.
|
|
|
- Given the zone size for the zone the current record is in, we can determine
|
|
|
- the start of the zone.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-
|
|
|
-\change_inserted 0 1291205139
|
|
|
-
|
|
|
-d2218 1
|
|
|
-a2218 5
|
|
|
- uint32_t
|
|
|
-\change_inserted 0 1291205758
|
|
|
-used_
|
|
|
-\change_unchanged
|
|
|
-magic : 16,
|
|
|
-a2222 4
|
|
|
-\change_deleted 0 1291205693
|
|
|
- prev_is_free: 1,
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d2230 1
|
|
|
-a2230 7
|
|
|
- top_hash: 1
|
|
|
-\change_inserted 0 1291205704
|
|
|
-1
|
|
|
-\change_deleted 0 1291205704
|
|
|
-0
|
|
|
-\change_unchanged
|
|
|
-;
|
|
|
-d2254 1
|
|
|
-a2254 9
|
|
|
- uint
|
|
|
-\change_inserted 0 1291205725
|
|
|
-64
|
|
|
-\change_deleted 0 1291205723
|
|
|
-32
|
|
|
-\change_unchanged
|
|
|
-_t
|
|
|
-\change_inserted 0 1291205753
|
|
|
-free_magic: 8,
|
|
|
-a2257 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205746
|
|
|
-a2262 24
|
|
|
-\change_deleted 0 1291205749
|
|
|
-free_magic;
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
- uint64_t
|
|
|
-\change_inserted 0 1291205786
|
|
|
-free_table: 8,
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout LyX-Code
|
|
|
-
|
|
|
-\change_inserted 0 1291205788
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-total_length
|
|
|
-\change_inserted 0 1291205792
|
|
|
- : 56
|
|
|
-\change_deleted 0 1291205790
|
|
|
-;
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d2266 1
|
|
|
-a2266 7
|
|
|
- uint64_t
|
|
|
-\change_deleted 0 1291205801
|
|
|
-prev,
|
|
|
-\change_unchanged
|
|
|
-next;
|
|
|
-\change_deleted 0 1291205811
|
|
|
-
|
|
|
-d2270 1
|
|
|
-a2270 3
|
|
|
-
|
|
|
-\change_deleted 0 1291205811
|
|
|
- ...
|
|
|
-d2274 1
|
|
|
-a2274 5
|
|
|
-
|
|
|
-\change_deleted 0 1291205808
|
|
|
- uint64_t tailer
|
|
|
-\change_unchanged
|
|
|
-;
|
|
|
-d2283 5
|
|
|
-a2287 16
|
|
|
-\change_deleted 0 1291205827
|
|
|
-We might want to take some bits from the used record's top_hash (and the
|
|
|
- free record which has 32 bits of padding to spare anyway) if we use variable
|
|
|
- sized zones.
|
|
|
- See
|
|
|
-\begin_inset CommandInset ref
|
|
|
-LatexCommand ref
|
|
|
-reference "freelist-in-zone"
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-.
|
|
|
-
|
|
|
-\change_inserted 0 1291205885
|
|
|
- Note that by limiting valid offsets to 56 bits, we can pack everything
|
|
|
- we need into 3 64-bit words, meaning our minimum record size is 8 bytes.
|
|
|
-a2290 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205886
|
|
|
-a2294 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205886
|
|
|
-a2295 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2385 2
|
|
|
-\change_inserted 0 1291205894
|
|
|
-
|
|
|
-a2388 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205894
|
|
|
-a2392 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205902
|
|
|
-a2393 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2415 4
|
|
|
-
|
|
|
-\change_deleted 0 1291204504
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-a2445 2
|
|
|
-\change_inserted 0 1291205910
|
|
|
-
|
|
|
-a2448 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205910
|
|
|
-a2452 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205914
|
|
|
-a2453 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2485 2
|
|
|
-\change_inserted 0 1291205919
|
|
|
-
|
|
|
-a2488 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205919
|
|
|
-a2492 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205922
|
|
|
-a2493 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2533 2
|
|
|
-\change_inserted 0 1291205929
|
|
|
-
|
|
|
-a2536 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205929
|
|
|
-a2540 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205929
|
|
|
-a2541 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2578 2
|
|
|
-\change_inserted 0 1291205932
|
|
|
-
|
|
|
-a2581 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205933
|
|
|
-a2585 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205933
|
|
|
-a2586 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2724 2
|
|
|
-\change_inserted 0 1291205944
|
|
|
-
|
|
|
-a2727 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205945
|
|
|
-a2731 2
|
|
|
-
|
|
|
-\change_inserted 0 1291205948
|
|
|
-a2732 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.11
|
|
|
-log
|
|
|
-@Merge changes
|
|
|
-@
|
|
|
-text
|
|
|
-@d53 7
|
|
|
-a59 1
|
|
|
-14-September-2010
|
|
|
-d587 16
|
|
|
-d644 18
|
|
|
-d716 16
|
|
|
-d753 16
|
|
|
-d813 18
|
|
|
-d883 16
|
|
|
-d953 16
|
|
|
-d1084 16
|
|
|
-d1181 16
|
|
|
-d1273 16
|
|
|
-d1328 16
|
|
|
-d1381 16
|
|
|
-d1447 19
|
|
|
-a1465 2
|
|
|
- if older code (which doesn't understand the feature) writes to the database.Reco
|
|
|
-rd Headers Are Not Expandable
|
|
|
-d1484 16
|
|
|
-d1546 16
|
|
|
-d1617 16
|
|
|
-d1680 16
|
|
|
-d1725 16
|
|
|
-d1810 16
|
|
|
-d1951 8
|
|
|
-a1958 3
|
|
|
-Proposed SolutionThe first step is to remove all the current heuristics,
|
|
|
- as they obviously interact, then examine them once the lock contention
|
|
|
- is addressed.
|
|
|
-d1989 7
|
|
|
-a1995 2
|
|
|
- is to divide the file into zones, and using a free list (or set of free
|
|
|
- lists) for each.
|
|
|
-d1997 2
|
|
|
-d2002 25
|
|
|
-d2039 2
|
|
|
-d2049 7
|
|
|
-a2055 1
|
|
|
-Identify the correct zone.
|
|
|
-d2063 7
|
|
|
-a2069 2
|
|
|
-Re-check the zone (we didn't have a lock, sizes could have changed): relock
|
|
|
- if necessary.
|
|
|
-d2073 5
|
|
|
-a2077 1
|
|
|
-Place the freed entry in the list for that zone.
|
|
|
-d2086 3
|
|
|
-a2088 1
|
|
|
-Pick a zone either the zone we last freed into, or based on a
|
|
|
-d2097 4
|
|
|
-d2105 2
|
|
|
-d2110 2
|
|
|
-d2113 2
|
|
|
-d2123 15
|
|
|
-a2137 1
|
|
|
- unlock the list and try the next zone.
|
|
|
-d2166 11
|
|
|
-d2180 2
|
|
|
-d2185 2
|
|
|
-d2190 2
|
|
|
-d2223 1
|
|
|
-a2223 1
|
|
|
-status open
|
|
|
-d2243 2
|
|
|
-d2491 5
|
|
|
-a2495 1
|
|
|
- uint32_t magic : 16,
|
|
|
-d2499 2
|
|
|
-d2502 2
|
|
|
-d2511 7
|
|
|
-a2517 1
|
|
|
- top_hash: 10;
|
|
|
-d2541 29
|
|
|
-a2569 1
|
|
|
- uint32_t free_magic;
|
|
|
-d2573 11
|
|
|
-a2583 1
|
|
|
- uint64_t total_length;
|
|
|
-d2587 7
|
|
|
-a2593 1
|
|
|
- uint64_t prev, next;
|
|
|
-d2597 2
|
|
|
-d2603 5
|
|
|
-a2607 1
|
|
|
- uint64_t tailer;
|
|
|
-d2615 2
|
|
|
-d2628 18
|
|
|
-d2736 16
|
|
|
-d2808 16
|
|
|
-d2856 16
|
|
|
-d2912 16
|
|
|
-d2965 16
|
|
|
-d3119 16
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.10
|
|
|
-log
|
|
|
-@Tracing attribute, talloc support.
|
|
|
-@
|
|
|
-text
|
|
|
-@d1 1
|
|
|
-a1 1
|
|
|
-#LyX 1.6.5 created this file. For more info see http://www.lyx.org/
|
|
|
-d53 1
|
|
|
-a53 7
|
|
|
-
|
|
|
-\change_deleted 0 1283307542
|
|
|
-26-July
|
|
|
-\change_inserted 0 1284423485
|
|
|
-14-September
|
|
|
-\change_unchanged
|
|
|
--2010
|
|
|
-a472 2
|
|
|
-\change_inserted 0 1284422789
|
|
|
-
|
|
|
-a479 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a838 2
|
|
|
-
|
|
|
-\change_inserted 0 1284016998
|
|
|
-a846 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1194 2
|
|
|
-\change_inserted 0 1284015637
|
|
|
-
|
|
|
-a1197 2
|
|
|
-
|
|
|
-\change_inserted 0 1284015716
|
|
|
-a1201 2
|
|
|
-
|
|
|
-\change_inserted 0 1284015906
|
|
|
-a1210 2
|
|
|
-
|
|
|
-\change_inserted 0 1284015637
|
|
|
-a1214 2
|
|
|
-
|
|
|
-\change_inserted 0 1284016114
|
|
|
-a1227 2
|
|
|
-
|
|
|
-\change_inserted 0 1284016149
|
|
|
-a1232 2
|
|
|
-
|
|
|
-\change_inserted 0 1284016639
|
|
|
-a1237 2
|
|
|
-
|
|
|
-\change_inserted 0 1284016821
|
|
|
-a1243 2
|
|
|
-
|
|
|
-\change_inserted 0 1284016803
|
|
|
-d1245 2
|
|
|
-a1246 9
|
|
|
- if older code (which doesn't understand the feature) writes to the database.
|
|
|
-\change_deleted 0 1284016101
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Subsection
|
|
|
-
|
|
|
-\change_inserted 0 1284015634
|
|
|
-Record Headers Are Not Expandable
|
|
|
-a1249 2
|
|
|
-
|
|
|
-\change_inserted 0 1284015634
|
|
|
-a1254 2
|
|
|
-
|
|
|
-\change_inserted 0 1284015634
|
|
|
-a1258 2
|
|
|
-
|
|
|
-\change_inserted 0 1284422552
|
|
|
-a1267 2
|
|
|
-
|
|
|
-\change_inserted 0 1284422568
|
|
|
-a1271 2
|
|
|
-
|
|
|
-\change_inserted 0 1284422646
|
|
|
-a1276 2
|
|
|
-
|
|
|
-\change_inserted 0 1284422656
|
|
|
-a1280 2
|
|
|
-
|
|
|
-\change_inserted 0 1284423065
|
|
|
-a1305 2
|
|
|
-
|
|
|
-\change_inserted 0 1284423042
|
|
|
-a1310 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1457 2
|
|
|
-
|
|
|
-\change_inserted 0 1283336713
|
|
|
-a1463 2
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-d1482 2
|
|
|
-d1485 1
|
|
|
-a1485 51
|
|
|
-\change_deleted 0 1283307675
|
|
|
-There are three details which become important:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1283307675
|
|
|
-On encountering a full bucket, we use the next bucket.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1283307675
|
|
|
-Extra hash bits are stored with the offset, to reduce comparisons.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1283307675
|
|
|
-A marker entry is used on deleting an entry.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1283307675
|
|
|
-The doubling of the table must be done under a transaction; we will not
|
|
|
- reduce it on deletion, so it will be an unusual case.
|
|
|
- It will either be placed at the head (other entries will be moved out of the
|
|
|
- way so we can expand).
|
|
|
- We could have a pointer in the header to the current hashtable location,
|
|
|
- but that pointer would have to be read frequently to check for hashtable
|
|
|
- moves.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1283307675
|
|
|
-The locking for this is slightly more complex than the chained case; we
|
|
|
- currently have one lock per bucket, and that means we would need to expand
|
|
|
- the lock if we overflow to the next bucket.
|
|
|
- The frequency of such collisions will affect our locking heuristics: we
|
|
|
- can always lock more buckets than we need.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1283307675
|
|
|
-One possible optimization is to only re-check the hash size on an insert
|
|
|
- or a lookup miss.
|
|
|
-
|
|
|
-\change_inserted 0 1283307770
|
|
|
-a1492 2
|
|
|
-
|
|
|
-\change_inserted 0 1283336187
|
|
|
-a1500 2
|
|
|
-
|
|
|
-\change_inserted 0 1283336586
|
|
|
-a1510 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1636 3
|
|
|
-a1638 8
|
|
|
-Proposed Solution
|
|
|
-\change_deleted 0 1283336858
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-The first step is to remove all the current heuristics, as they obviously
|
|
|
- interact, then examine them once the lock contention is addressed.
|
|
|
-a1647 2
|
|
|
-\change_inserted 0 1283336910
|
|
|
-
|
|
|
-a1650 2
|
|
|
-
|
|
|
-\change_inserted 0 1283337052
|
|
|
-a1655 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1776 2
|
|
|
-\change_inserted 0 1283309850
|
|
|
-
|
|
|
-a1779 2
|
|
|
-
|
|
|
-\change_inserted 0 1283337216
|
|
|
-a1813 2
|
|
|
-
|
|
|
-\change_inserted 0 1284424151
|
|
|
-a1825 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1830 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2031 2
|
|
|
-
|
|
|
-\change_inserted 0 1283336739
|
|
|
-a2040 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2117 2
|
|
|
-\change_inserted 0 1283337133
|
|
|
-
|
|
|
-a2120 2
|
|
|
-
|
|
|
-\change_inserted 0 1283337139
|
|
|
-a2121 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2136 2
|
|
|
-
|
|
|
-\change_inserted 0 1283337235
|
|
|
-a2147 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d2251 1
|
|
|
-a2251 7
|
|
|
-Proposed Solution
|
|
|
-\change_deleted 0 1284423472
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-None.
|
|
|
-d2261 1
|
|
|
-a2261 1
|
|
|
-\change_inserted 0 1284423891
|
|
|
-d2263 1
|
|
|
-a2263 4
|
|
|
-\change_deleted 0 1284423891
|
|
|
-.
|
|
|
-
|
|
|
-\change_inserted 0 1284423901
|
|
|
-a2271 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2293 2
|
|
|
-\change_inserted 0 1284423495
|
|
|
-
|
|
|
-a2312 2
|
|
|
-
|
|
|
-\change_inserted 0 1284424201
|
|
|
-d2321 1
|
|
|
-a2321 3
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-We could solve a small part of the problem by providing read-only transactions.
|
|
|
-a2505 2
|
|
|
-\change_inserted 0 1284423555
|
|
|
-
|
|
|
-a2508 2
|
|
|
-
|
|
|
-\change_inserted 0 1284423617
|
|
|
-a2512 2
|
|
|
-
|
|
|
-\change_inserted 0 1284423719
|
|
|
-a2519 2
|
|
|
-
|
|
|
-\change_inserted 0 1284423864
|
|
|
-a2530 2
|
|
|
-
|
|
|
-\change_inserted 0 1284423850
|
|
|
-a2540 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.9
|
|
|
-log
|
|
|
-@Extension mechanism.
|
|
|
-@
|
|
|
-text
|
|
|
-@d56 2
|
|
|
-a57 2
|
|
|
-\change_inserted 0 1284016854
|
|
|
-9-September
|
|
|
-d479 11
|
|
|
-d1303 1
|
|
|
-a1303 1
|
|
|
-\change_inserted 0 1284016847
|
|
|
-d1310 56
|
|
|
-d1945 1
|
|
|
-a1945 1
|
|
|
-\change_inserted 0 1283310945
|
|
|
-d1956 2
|
|
|
-d2402 2
|
|
|
-d2416 4
|
|
|
-d2421 12
|
|
|
-d2455 2
|
|
|
-d2476 12
|
|
|
-d2673 47
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.8
|
|
|
-log
|
|
|
-@Remove bogus footnote
|
|
|
-@
|
|
|
-text
|
|
|
-@d56 2
|
|
|
-a57 2
|
|
|
-\change_inserted 0 1283307544
|
|
|
-1-September
|
|
|
-d838 12
|
|
|
-d1198 103
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.7
|
|
|
-log
|
|
|
-@Moving hash table does not work.
|
|
|
-@
|
|
|
-text
|
|
|
-@a1436 12
|
|
|
-\begin_inset Foot
|
|
|
-status collapsed
|
|
|
-
|
|
|
-\begin_layout Plain Layout
|
|
|
-
|
|
|
-\change_inserted 0 1283336450
|
|
|
-If we make the hash offsets zone-relative, then this only restricts the
|
|
|
- zone size, not the overall database size.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_inset
|
|
|
-
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.6
|
|
|
-log
|
|
|
-@Commit changes
|
|
|
-@
|
|
|
-text
|
|
|
-@d38 1
|
|
|
-a38 1
|
|
|
-\author ""
|
|
|
-d53 7
|
|
|
-a59 1
|
|
|
-26-July-2010
|
|
|
-d1333 10
|
|
|
-d1361 3
|
|
|
-a1363 1
|
|
|
- There are three details which become important:
|
|
|
-d1367 2
|
|
|
-d1373 2
|
|
|
-d1379 2
|
|
|
-d1385 2
|
|
|
-d1397 2
|
|
|
-d1407 2
|
|
|
-d1411 45
|
|
|
-d1582 2
|
|
|
-d1598 14
|
|
|
-d1733 62
|
|
|
-d1996 13
|
|
|
-d2086 10
|
|
|
-d2110 15
|
|
|
-a2124 1
|
|
|
-\begin_layout LyX-Code
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.5
|
|
|
-log
|
|
|
-@Soft transaction commit
|
|
|
-@
|
|
|
-text
|
|
|
-@d38 1
|
|
|
-a38 1
|
|
|
-\author "Rusty Russell,,,"
|
|
|
-a52 4
|
|
|
-
|
|
|
-\change_deleted 0 1280141199
|
|
|
-10-May-2010
|
|
|
-\change_inserted 0 1280141202
|
|
|
-a53 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2028 2
|
|
|
-
|
|
|
-\change_inserted 0 1280140902
|
|
|
-a2034 2
|
|
|
-
|
|
|
-\change_unchanged
|
|
|
-a2212 2
|
|
|
-\change_inserted 0 1280140661
|
|
|
-
|
|
|
-a2215 2
|
|
|
-
|
|
|
-\change_inserted 0 1280140703
|
|
|
-a2219 2
|
|
|
-
|
|
|
-\change_inserted 0 1280708312
|
|
|
-a2226 2
|
|
|
-
|
|
|
-\change_inserted 0 1280708400
|
|
|
-a2239 2
|
|
|
-
|
|
|
-\change_inserted 0 1280140836
|
|
|
-a2243 2
|
|
|
-
|
|
|
-\change_inserted 0 1280708255
|
|
|
-a2247 2
|
|
|
-
|
|
|
-\change_inserted 0 1280708374
|
|
|
-a2252 2
|
|
|
-
|
|
|
-\change_inserted 0 1280141181
|
|
|
-a2274 2
|
|
|
-
|
|
|
-\change_inserted 0 1280141345
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.4
|
|
|
-log
|
|
|
-@Merge changes
|
|
|
-@
|
|
|
-text
|
|
|
-@d38 1
|
|
|
-a38 1
|
|
|
-\author ""
|
|
|
-d53 2
|
|
|
-d56 4
|
|
|
-d2035 10
|
|
|
-d2223 84
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.3
|
|
|
-log
|
|
|
-@Transaction and freelist rethink.
|
|
|
-@
|
|
|
-text
|
|
|
-@d38 1
|
|
|
-a38 1
|
|
|
-\author "Rusty Russell,,,"
|
|
|
-d53 1
|
|
|
-a53 1
|
|
|
-27-April-2010
|
|
|
-d662 1
|
|
|
-a662 5
|
|
|
- behavior of disallowing
|
|
|
-\change_inserted 0 1272940179
|
|
|
-nested
|
|
|
-\change_unchanged
|
|
|
-transactions should become the default.
|
|
|
-a1210 2
|
|
|
-\change_inserted 0 1272944650
|
|
|
-
|
|
|
-a1214 2
|
|
|
-
|
|
|
-\change_inserted 0 1272944763
|
|
|
-a1218 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1223 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1301 2
|
|
|
-
|
|
|
-\change_inserted 0 1273478114
|
|
|
-a1310 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1515 1
|
|
|
-a1515 11
|
|
|
-The free list
|
|
|
-\change_deleted 0 1273469807
|
|
|
-should
|
|
|
-\change_inserted 0 1273469810
|
|
|
-must
|
|
|
-\change_unchanged
|
|
|
- be split
|
|
|
-\change_deleted 0 1273469815
|
|
|
-into multiple lists
|
|
|
-\change_unchanged
|
|
|
-to reduce contention.
|
|
|
-a1520 2
|
|
|
-\change_inserted 0 1273470006
|
|
|
-
|
|
|
-a1523 2
|
|
|
-
|
|
|
-\change_inserted 0 1273492055
|
|
|
-a1539 2
|
|
|
-
|
|
|
-\change_inserted 0 1273483888
|
|
|
-a1551 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1554 8
|
|
|
-
|
|
|
-\change_deleted 0 1272942055
|
|
|
-There are various ways to organize these lists, but because we want to be
|
|
|
- able to quickly identify which free list an entry is in, and reduce the
|
|
|
- number of locks required for merging, we will use zoning (e.g.
|
|
|
- each free list covers some fixed fraction of the file).
|
|
|
-
|
|
|
-\change_inserted 0 1273484187
|
|
|
-d1556 1
|
|
|
-a1556 7
|
|
|
-
|
|
|
-\change_deleted 0 1273484194
|
|
|
-The algorithm for f
|
|
|
-\change_inserted 0 1273484194
|
|
|
-F
|
|
|
-\change_unchanged
|
|
|
-reeing is simple:
|
|
|
-d1560 1
|
|
|
-a1560 7
|
|
|
-Identify the correct
|
|
|
-\change_deleted 0 1273482856
|
|
|
-free list
|
|
|
-\change_inserted 0 1273482857
|
|
|
-zone
|
|
|
-\change_unchanged
|
|
|
-.
|
|
|
-d1564 1
|
|
|
-a1564 7
|
|
|
-Lock the
|
|
|
-\change_inserted 0 1273482895
|
|
|
-corresponding
|
|
|
-\change_unchanged
|
|
|
-list
|
|
|
-\change_inserted 0 1273482863
|
|
|
-.
|
|
|
-a1567 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482909
|
|
|
-d1573 1
|
|
|
-a1573 13
|
|
|
-
|
|
|
-\change_deleted 0 1273482885
|
|
|
-, and p
|
|
|
-\change_inserted 0 1273482888
|
|
|
-P
|
|
|
-\change_unchanged
|
|
|
-lace the freed entry
|
|
|
-\change_deleted 0 1273492415
|
|
|
-at the head
|
|
|
-\change_inserted 0 1273492415
|
|
|
-in the list for that zone
|
|
|
-\change_unchanged
|
|
|
-.
|
|
|
-d1577 2
|
|
|
-a1578 7
|
|
|
-Allocation is a little more complicated, as we
|
|
|
-\change_deleted 0 1273483240
|
|
|
-merge entries as we walk the list:
|
|
|
-\change_inserted 0 1273484250
|
|
|
-perform delayed coalescing at this point:
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1582 1
|
|
|
-a1582 19
|
|
|
-Pick a
|
|
|
-\change_deleted 0 1273482955
|
|
|
-free list;
|
|
|
-\change_inserted 0 1273482957
|
|
|
-zone
|
|
|
-\change_unchanged
|
|
|
- either the
|
|
|
-\change_deleted 0 1273482962
|
|
|
-list
|
|
|
-\change_inserted 0 1273482962
|
|
|
-zone
|
|
|
-\change_unchanged
|
|
|
- we last freed
|
|
|
-\change_deleted 0 1273482966
|
|
|
-o
|
|
|
-\change_inserted 0 1273482966
|
|
|
-i
|
|
|
-\change_unchanged
|
|
|
-nto, or based on a
|
|
|
-d1594 1
|
|
|
-a1594 9
|
|
|
-Lock th
|
|
|
-\change_inserted 0 1273482980
|
|
|
-e corresponding
|
|
|
-\change_deleted 0 1273482973
|
|
|
-at
|
|
|
-\change_unchanged
|
|
|
- list.
|
|
|
-\change_inserted 0 1273482982
|
|
|
-
|
|
|
-a1597 2
|
|
|
-
|
|
|
-\change_inserted 0 1273483084
|
|
|
-a1598 53
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-If the top entry is
|
|
|
-\change_deleted 0 1273492155
|
|
|
-well-sized,
|
|
|
-\change_inserted 0 1273492159
|
|
|
--large enough,
|
|
|
-\change_unchanged
|
|
|
-remove it from the list and return it.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-Otherwise,
|
|
|
-\change_inserted 0 1273492206
|
|
|
-coalesce entries in the list.
|
|
|
-\change_deleted 0 1273492200
|
|
|
-examine the entry to the right of it in the file.
|
|
|
- If it is free:
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_deeper
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1273492200
|
|
|
-If that entry is in a different list, lock that list too.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1273492200
|
|
|
-If we had to place a new lock, re-check that the entry is free.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1273492200
|
|
|
-Remove that entry from its free list and expand this entry to cover it.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_deleted 0 1273485554
|
|
|
-Goto step 3.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\end_deeper
|
|
|
-\begin_layout Enumerate
|
|
|
-
|
|
|
-\change_inserted 0 1273485311
|
|
|
-If there was no entry large enough, unlock the list and try the next zone.
|
|
|
-d1602 1
|
|
|
-a1602 5
|
|
|
-
|
|
|
-\change_deleted 0 1273483646
|
|
|
-Repeat step 3 with each entry in the list.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1606 2
|
|
|
-a1607 5
|
|
|
-
|
|
|
-\change_deleted 0 1273483668
|
|
|
-Unlock the list and repeat step 2 with the next list.
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-d1611 1
|
|
|
-a1611 7
|
|
|
-If no
|
|
|
-\change_deleted 0 1273483671
|
|
|
-list
|
|
|
-\change_inserted 0 1273483671
|
|
|
-zone
|
|
|
-\change_unchanged
|
|
|
- satisfies, expand the file.
|
|
|
-d1615 2
|
|
|
-a1616 9
|
|
|
-This optimizes rapid insert/delete of free list entries
|
|
|
-\change_inserted 0 1273485794
|
|
|
- by not coalescing them all the time.
|
|
|
-\change_deleted 0 1273483685
|
|
|
-, and allows us to get rid of the tailer altogether
|
|
|
-\change_unchanged
|
|
|
-.
|
|
|
-
|
|
|
-\change_inserted 0 1273492299
|
|
|
-a1638 39
|
|
|
-
|
|
|
-\change_deleted 0 1273476840
|
|
|
-The question of
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-well-sized
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- free entries is more difficult: the 25% overhead works in practice for
|
|
|
- ldb because indexes tend to expand by one record at a time.
|
|
|
- This can be resolved by having an
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-expanded
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- bit in the header to note entries that have previously expanded, and allocating
|
|
|
- more space for them.
|
|
|
- Whether the
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-increasing slack
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
- algorithm should be implemented or first-fit used is still unknown: we
|
|
|
- will determine this once these other ideas are implemented.
|
|
|
-\change_inserted 0 1273483750
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_inserted 0 1273492450
|
|
|
-a1644 2
|
|
|
-
|
|
|
-\change_inserted 0 1273470441
|
|
|
-a1654 2
|
|
|
-
|
|
|
-\change_inserted 0 1273476556
|
|
|
-a1659 2
|
|
|
-
|
|
|
-\change_inserted 0 1273470423
|
|
|
-a1661 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1672 2
|
|
|
-
|
|
|
-\change_inserted 0 1273476847
|
|
|
-a1676 2
|
|
|
-
|
|
|
-\change_inserted 0 1273476886
|
|
|
-a1691 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477233
|
|
|
-a1699 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477534
|
|
|
-a1706 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482700
|
|
|
-a1712 2
|
|
|
-
|
|
|
-\change_inserted 0 1273478079
|
|
|
-a1722 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477839
|
|
|
-a1726 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1730 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1734 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1738 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1742 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1746 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1750 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1754 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1758 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1762 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1766 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1770 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1774 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1778 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1782 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1786 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1790 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1794 2
|
|
|
-
|
|
|
-\change_inserted 0 1273477925
|
|
|
-a1798 2
|
|
|
-
|
|
|
-\change_inserted 0 1273492522
|
|
|
-a1802 2
|
|
|
-
|
|
|
-\change_inserted 0 1273492530
|
|
|
-a1806 2
|
|
|
-
|
|
|
-\change_inserted 0 1273492546
|
|
|
-a1810 2
|
|
|
-
|
|
|
-\change_inserted 0 1273478239
|
|
|
-a1814 2
|
|
|
-
|
|
|
-\change_inserted 0 1273479960
|
|
|
-a1821 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480265
|
|
|
-a1830 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480354
|
|
|
-a1845 2
|
|
|
-
|
|
|
-\change_inserted 0 1273478968
|
|
|
-a1851 2
|
|
|
-
|
|
|
-\change_inserted 0 1273492604
|
|
|
-a1859 2
|
|
|
-
|
|
|
-\change_inserted 0 1273479572
|
|
|
-a1862 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a1870 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480282
|
|
|
-a1874 2
|
|
|
-
|
|
|
-\change_inserted 0 1273478931
|
|
|
-a1878 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481549
|
|
|
-a1882 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481557
|
|
|
-a1886 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480307
|
|
|
-a1890 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480335
|
|
|
-a1894 2
|
|
|
-
|
|
|
-\change_inserted 0 1273479897
|
|
|
-a1898 2
|
|
|
-
|
|
|
-\change_inserted 0 1273479653
|
|
|
-a1902 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480371
|
|
|
-a1906 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480464
|
|
|
-a1910 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480399
|
|
|
-a1914 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480425
|
|
|
-a1918 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480453
|
|
|
-a1922 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480455
|
|
|
-a1926 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480450
|
|
|
-a1930 2
|
|
|
-
|
|
|
-\change_inserted 0 1273480452
|
|
|
-a1935 2
|
|
|
-\change_inserted 0 1273478830
|
|
|
-
|
|
|
-a1942 5
|
|
|
-
|
|
|
-\change_deleted 0 1273481604
|
|
|
-In theory, we could get away with 2: one after we write the new data, and
|
|
|
- one to somehow atomically change over to it.
|
|
|
-\change_inserted 0 1273481632
|
|
|
-a1946 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481724
|
|
|
-a1950 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481713
|
|
|
-a1954 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481717
|
|
|
-a1958 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481730
|
|
|
-a1962 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481736
|
|
|
-a1966 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481744
|
|
|
-a1970 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481748
|
|
|
-a1974 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482185
|
|
|
-a1978 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482259
|
|
|
-a1989 50
|
|
|
-
|
|
|
-\change_deleted 0 1273481848
|
|
|
-None.
|
|
|
- Trying to rewrite the transaction code is a separate experiment, which
|
|
|
- I encourage someone else to do.
|
|
|
- At some point you say
|
|
|
-\begin_inset Quotes eld
|
|
|
-\end_inset
|
|
|
-
|
|
|
-use a real database
|
|
|
-\begin_inset Quotes erd
|
|
|
-\end_inset
|
|
|
-
|
|
|
-.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1273481848
|
|
|
-But as a thought experiment:
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1273481788
|
|
|
-Say there was a pointer in the header which said where the hash table and
|
|
|
- free list tables were, and that no blocks were labeled with whether they
|
|
|
- were free or not (it had to be derived from what list they were in).
|
|
|
- We could create new hash table and free list in some free space, and populate
|
|
|
- it as we want the post-committed state to look.
|
|
|
- Then we sync, then we switch the offset in the header, then we sync again.
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_deleted 0 1273481788
|
|
|
-This would not allow arbitrary changes to the database, such as tdb_repack
|
|
|
- does, and would require more space (since we have to preserve the current
|
|
|
- and future entries at once).
|
|
|
- If we used hash trees rather than one big hash table, we might only have
|
|
|
- to rewrite some sections of the hash, too.
|
|
|
-\change_inserted 0 1273481854
|
|
|
-
|
|
|
-\end_layout
|
|
|
-
|
|
|
-\begin_layout Standard
|
|
|
-
|
|
|
-\change_inserted 0 1273482102
|
|
|
-a1993 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482061
|
|
|
-a1998 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482063
|
|
|
-a2002 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482072
|
|
|
-a2006 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482139
|
|
|
-a2011 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482364
|
|
|
-a2015 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482163
|
|
|
-a2019 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482493
|
|
|
-a2037 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482536
|
|
|
-a2046 2
|
|
|
-\change_unchanged
|
|
|
-
|
|
|
-a2049 2
|
|
|
-
|
|
|
-\change_inserted 0 1273482641
|
|
|
-a2058 2
|
|
|
-
|
|
|
-\change_inserted 0 1273481827
|
|
|
-d2067 2
|
|
|
-a2068 11
|
|
|
-We could
|
|
|
-\change_inserted 0 1273481829
|
|
|
-then
|
|
|
-\change_unchanged
|
|
|
-implement snapshots using a similar method
|
|
|
-\change_deleted 0 1273481838
|
|
|
- to the above, only
|
|
|
-\change_inserted 0 1273481840
|
|
|
-,
|
|
|
-\change_unchanged
|
|
|
- using multiple different hash tables/free tables.
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.2
|
|
|
-log
|
|
|
-@After first feedback (Ronnie & Volker)
|
|
|
-@
|
|
|
-text
|
|
|
-@d1314 13
|
|
|
-d1531 11
|
|
|
-a1541 1
|
|
|
-The free list should be split into multiple lists to reduce contention.
|
|
|
-d1547 39
|
|
|
-d1596 7
|
|
|
-d1604 1
|
|
|
-a1604 1
|
|
|
-The algorithm for freeing is simple:
|
|
|
-d1608 7
|
|
|
-a1614 1
|
|
|
-Identify the correct free list.
|
|
|
-d1618 30
|
|
|
-a1647 1
|
|
|
-Lock the list, and place the freed entry at the head.
|
|
|
-d1651 7
|
|
|
-a1657 2
|
|
|
-Allocation is a little more complicated, as we merge entries as we walk
|
|
|
- the list:
|
|
|
-d1661 19
|
|
|
-a1679 1
|
|
|
-Pick a free list; either the list we last freed onto, or based on a
|
|
|
-d1691 17
|
|
|
-a1707 1
|
|
|
-Lock that list.
|
|
|
-d1711 7
|
|
|
-a1717 1
|
|
|
-If the top entry is well-sized, remove it from the list and return it.
|
|
|
-d1721 5
|
|
|
-a1725 1
|
|
|
-Otherwise, examine the entry to the right of it in the file.
|
|
|
-d1731 2
|
|
|
-d1737 2
|
|
|
-d1743 2
|
|
|
-d1749 2
|
|
|
-d1756 8
|
|
|
-d1765 2
|
|
|
-d1770 2
|
|
|
-d1773 2
|
|
|
-d1778 7
|
|
|
-a1784 1
|
|
|
-If no list satisfies, expand the file.
|
|
|
-d1788 28
|
|
|
-a1815 2
|
|
|
-This optimizes rapid insert/delete of free list entries, and allows us to
|
|
|
- get rid of the tailer altogether.
|
|
|
-d1819 2
|
|
|
-d1851 1
|
|
|
-a1851 1
|
|
|
-\change_inserted 0 1272941474
|
|
|
-d1857 303
|
|
|
-a2159 18
|
|
|
-\change_inserted 0 1272942759
|
|
|
-There are various ways to organize these lists, but because we want to be
|
|
|
- able to quickly identify which free list an entry is in, and reduce the
|
|
|
- number of locks required for merging, we will use zoning (e.g.
|
|
|
- each of the N free lists in a tdb file of size M covers a fixed fraction
|
|
|
- M/N).
|
|
|
- Note that this means we need to reshuffle the free lists when we expand
|
|
|
- the file; this is probably acceptable when we double the hash table size,
|
|
|
- since that is such an expensive operation already.
|
|
|
- In the case of increasing the file size, there is an optimization we can
|
|
|
- use: if we use M in the formula above as the file size rounded up to the
|
|
|
- next power of 2, we only need reshuffle free lists when the file size crosses
|
|
|
- a power of 2 boundary,
|
|
|
-\emph on
|
|
|
-and
|
|
|
-\emph default
|
|
|
-reshuffling the free lists is trivial: we simply merge every consecutive
|
|
|
- pair of free lists.
|
|
|
-d2164 107
|
|
|
-d2276 2
|
|
|
-d2280 59
|
|
|
-d2346 2
|
|
|
-d2363 2
|
|
|
-d2366 2
|
|
|
-d2371 2
|
|
|
-d2382 2
|
|
|
-d2389 57
|
|
|
-d2458 13
|
|
|
-d2474 32
|
|
|
-a2505 2
|
|
|
-We could implement snapshots using a similar method to the above, only using
|
|
|
- multiple different hash tables/free tables.
|
|
|
-@
|
|
|
-
|
|
|
-
|
|
|
-1.1
|
|
|
-log
|
|
|
-@Initial revision
|
|
|
-@
|
|
|
-text
|
|
|
-@d1 1
|
|
|
-a1 1
|
|
|
-#LyX 1.6.4 created this file. For more info see http://www.lyx.org/
|
|
|
-d36 3
|
|
|
-a38 3
|
|
|
-\tracking_changes false
|
|
|
-\output_changes false
|
|
|
-\author ""
|
|
|
-d662 5
|
|
|
-a666 1
|
|
|
- behavior of disallowing transactions should become the default.
|
|
|
-d1215 21
|
|
|
-d1527 2
|
|
|
-d1533 3
|
|
|
-a1535 1
|
|
|
- The algorithm for freeing is simple:
|
|
|
-d1642 26
|
|
|
-@
|