Materialized Path trees

This is an efficient implementation of Materialized Path trees for Django, as described by Vadim Tropashko in SQL Design Patterns. Materialized Path is probably the fastest way of working with trees in SQL without the need of extra work in the database, like Oracle’s CONNECT BY or sprocs and triggers for nested intervals.

In a materialized path approach, every node in the tree will have a path attribute, where the full path from the root to the node will be stored. This has the advantage of needing very simple and fast queries, at the risk of inconsistency because of the denormalization of parent/child foreign keys. This can be prevented with transactions.

django-treebeard uses a particular approach: every step in the path has a fixed width and has no separators. This makes queries predictable and faster at the cost of using more characters to store a step. To address this problem, every step number is encoded.

Also, two extra fields are stored in every node: depth and numchild. This makes the read operations faster, at the cost of a little more maintenance on tree updates/inserts/deletes. Don’t worry, even with these extra steps, materialized path is more efficient than other approaches.

Warning

As with all tree implementations, please be aware of the Known Caveats.

Note

The materialized path approach makes heavy use of LIKE in your database, with clauses like WHERE path LIKE '002003%'. If you think that LIKE is too slow, you’re right, but in this case the path field is indexed in the database, and all LIKE clauses that don’t start with a % character will use the index. This is what makes the materialized path approach so fast.

Inheritance diagram of MP_Node
class treebeard.mp_tree.MP_Node(*args, **kwargs)

Bases: Node

Abstract model to create your own Materialized Path Trees.

Warning

Do not change the values of path, depth or numchild directly: use one of the included methods instead. Consider these values read-only.

Warning

Do not change the values of the steplen, alphabet or node_order_by after saving your first object. Doing so will corrupt the tree.

Warning

If you need to define your own Manager class, you’ll need to subclass MP_NodeManager.

Also, if in your manager you need to change the default queryset handler, you’ll need to subclass MP_NodeQuerySet.

Example:

class SortedNode(MP_Node):
   node_order_by = ['numval', 'strval']

   numval = models.IntegerField()
   strval = models.CharField(max_length=255)

Read the API reference of treebeard.models.Node for info on methods available in this class, or read the following section for methods with particular arguments or exceptions.

steplen

Attribute that defines the length of each step in the path of a node. The default value of 4 allows a maximum of 1679615 children per node. Increase this value if you plan to store large trees (a steplen of 5 allows more than 60M children per node). Note that increasing this value, while increasing the number of children per node, will decrease the max depth of the tree (by default: 63). To increase the max depth, increase the max_length attribute of the path field in your model.

alphabet

Attribute: the alphabet that will be used in base conversions when encoding the path steps into strings. The default value, 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ is the most optimal possible value that is portable between the supported databases (which means: their default collation will order the path field correctly).

Note

In case you know what you are doing, there is a test that is disabled by default that will attempt to suggest an optimal default alphabet in your enviroment. To run the test you must enable the TREEBEARD_TEST_ALPHABET enviroment variable:

$ TREEBEARD_TEST_ALPHABET=1 py.test -k test_alphabet

In OS X Mavericks, good readable values for the three supported databases in their default configuration:

Database

Optimal Alphabet

Base

MySQL 5.6.17

0-9A-Z

36

PostgreSQL 9.3.4

0-9A-Za-z

62

Sqlite3

0-9A-Z

36

The default value is MySQL’s since it will work for all DBs, but when working with a better database, changing the alphabet value is recommended in order to increase the density of the paths.

For an even better approach, change the collation of the path column in the database to handle raw ASCII, and use the printable ASCII characters (0x20 to 0x7E) as the alphabet.

Warning

If you use a custom alphabet, you must ensure that the order of characters in the alphabet matches the sort order of your database collation.

Also note that PostgreSQL relies on the collation provided by the underlying operating system, yielding inconsistent results on different systems if both upper and lower case characters are included in the alphabet. See https://github.com/PostgresApp/PostgresApp/issues/216 and https://dba.stackexchange.com/questions/106964/why-is-my-postgresql-order-by-case-insensitive.

node_order_by

Attribute: a list of model fields that will be used for node ordering. When enabled, all tree operations will assume this ordering. This takes precedence over drag and drop ordering in the Django admin.

Example:

node_order_by = ['field1', 'field2', 'field3']

Warning

node_order_by values are used to determine correct node ordering before an object is inserted/moved. This means any fields that are auto-populated at a database level, e.g., AutoField(), or DateTimeField(auto_now=True) will be ignored for the purpose of ordering if a value isn’t provided manually.

path

CharField, stores the full materialized path for each node. The default value of it’s max_length, 255, is the max efficient and portable value for a varchar. Increase it to allow deeper trees (max depth by default: 63)

Note

django-treebeard uses Django’s abstract model inheritance, so to change the max_length value of the path in your model, you have to redeclare the path field in your model:

class MyNodeModel(MP_Node):
    path = models.CharField(max_length=1024, unique=True)

Note

For performance, and if your database allows it, you can safely define the path column as ASCII (not utf-8/unicode/iso8859-1/etc) to keep the index smaller (and faster). Also note that some databases (mysql) have a small index size limit. InnoDB for instance has a limit of 765 bytes per index, so that would be the limit if your path is ASCII encoded. If your path column in InnoDB is using unicode, the index limit will be 255 characters since in MySQL’s indexes, unicode means 3 bytes per character.

Note

django-treebeard uses numconv for path encoding.

depth

PositiveIntegerField, depth of a node in the tree. A root node has a depth of 1.

numchild

PositiveIntegerField, the number of children of the node.

classmethod add_root(**kwargs)

Adds a root node to the tree.

This method saves the node in database. The object is populated as if via:

` obj = cls(**kwargs) `

Raises:

PathOverflow – when no more root objects can be added

See: treebeard.models.Node.add_root()

add_child(**kwargs)

Adds a child to the node.

This method saves the node in database. The object is populated as if via:

` obj = self.__class__(**kwargs) `

Raises:

PathOverflow – when no more child nodes can be added

See: treebeard.models.Node.add_child()

add_sibling(pos=None, **kwargs)

Adds a new node as a sibling to the current node object.

This method saves the node in database. The object is populated as if via:

` obj = self.__class__(**kwargs) `

Raises:

PathOverflow – when the library can’t make room for the node’s new position

See: treebeard.models.Node.add_sibling()

move(target, pos=None)

Moves the current node and all it’s descendants to a new position relative to another node.

Raises:

PathOverflow – when the library can’t make room for the node’s new position

See: treebeard.models.Node.move()

classmethod get_tree(parent=None)
Returns:

A queryset of nodes ordered as DFS, including the parent. If no parent is given, the entire tree is returned.

See: treebeard.models.Node.get_tree()

Note

This method returns a queryset.

classmethod find_problems(parent=None)

Checks for problems in the tree structure, problems can occur when:

  1. your code breaks and you get incomplete transactions (always use transactions!)

  2. changing the steplen value in a model (you must dump_bulk() first, change steplen and then load_bulk()

Parameters:

parent – If provided, limits the check to the descendants of this node. If not provided, the entire tree will be checked.

Returns:

A tuple of five lists:

  1. a list of ids of nodes with characters not found in the alphabet

  2. a list of ids of nodes when a wrong path length according to steplen

  3. a list of ids of orphaned nodes

  4. a list of ids of nodes with the wrong depth value for their path

  5. a list of ids nodes that report a wrong number of children

Note

A node won’t appear in more than one list, even when it exhibits more than one problem. This method stops checking a node when it finds a problem and continues to the next node.

Note

Problems 1, 2 and 3 can’t be solved automatically.

Example:

MyNodeModel.find_problems()
classmethod fix_tree(fix_paths=False, parent=None)

Solves some problems that can appear when transactions are not used and a piece of code breaks, leaving the tree in an inconsistent state.

The problems this method solves are:

  1. Nodes with an incorrect depth or numchild values due to incorrect code and lack of database transactions.

  2. “Holes” in the tree. This is normal if you move/delete nodes a lot. Holes in a tree don’t affect performance,

  3. Incorrect ordering of nodes when node_order_by is enabled. Ordering is enforced on node insertion, so if an attribute in node_order_by is modified after the node is inserted, the tree ordering will be inconsistent.

Parameters:
  • fix_paths – A boolean value. If True, a slower, more complex fix_tree method will be attempted. If False (the default), it will use a safe (and fast!) fix approach, but it will only solve the depth and numchild nodes, it won’t fix the tree holes or broken path ordering.

  • parent

    If provided, limits the operation to descendants of the given node. If not provided, the entire tree will be fixed.

    Fixing only part of a tree will only work if the parent itself is valid.

Example:

MyNodeModel.fix_tree()
classmethod load_bulk(bulk_data, parent=None, keep_ids=False, bulk_create=False, batch_size=1000) list

Loads a list/dictionary structure to the tree.

Parameters:
  • bulk_data

    The data that will be loaded, the structure is a list of dictionaries with 2 keys:

    • data: will store arguments that will be passed for object

      creation, and

    • children: a list of dictionaries, each one has it’s own

      data and children keys (a recursive structure)

  • parent – The node that will receive the structure as children, if not specified the first level of the structure will be loaded as root nodes

  • keep_ids – If enabled, loads the nodes with the same primary keys that are given in the structure. Will error if there are nodes without primary key info or if the primary keys are already used.

  • bulk_create – Whether to bulk create objects using Django’s bulk_create() method. Only works for models without multi-table inheritance. Also does not work with MySQL and MSSQL database backends.

  • batch_size – The batch size for bulk_create() when creating descendant nodes. Default is 1000.

Returns:

A list of the added node ids.

The ordering of nodes in the loaded data is preserved. If this needs to be corrected (e.g., to cater for a new node_order_by) then fix_tree() can be run separately on the imported subtree.

class treebeard.mp_tree.MP_NodeManager(*args, **kwargs)

Bases: Manager

Custom manager for nodes in a Materialized Path tree.

class treebeard.mp_tree.MP_NodeQuerySet(model=None, query=None, using=None, hints=None)

Bases: QuerySet

Custom queryset for the tree node manager.

Needed only for the custom delete method.

Signals

The treebeard.mp_tree module defines several signals that are sent when bulk updates are made to the tree. Along with the standard Django post_save and post_delete signals that track changes to individual node instances, these can be used to keep external data stores such as search indexes in sync with the tree.

treebeard.mp_tree.path_updated

Sent after a bulk update has been performed to update the paths of a node and its descendants, with the following arguments:

sender

The model class where the update occurred.

old_path

The old path of the topmost node before the update. The update operation applies to all nodes whose path starts with this value.

new_path

The new path of the topmost node after the update. All nodes in the update operation have had the prefix old_path replaced with new_path. Note that if these paths are of different lengths, the depth of the nodes has also been updated accordingly.

using

The database alias being used.

treebeard.mp_tree.nodes_deleted

Sent after one or more nodes are deleted, with the following arguments:

sender

The model class where the deletion occurred.

pks_to_remove

A list of primary keys of leaf nodes that were deleted. These are passed as a separate list to the non-leaf nodes as it is more efficient to delete these by primary key rather than by path.

paths_to_remove

A list of paths of non-leaf nodes that were deleted (along with their descendants). All nodes with a path that starts with any of these values were deleted.

using

The database alias being used.