Developer Guide

ytree is developed using the same conventions as yt. The yt Developer Guide is a good reference for code style, communication with other developers, working with git, and issuing pull requests. Below is a brief guide to the aspects that are specific to ytree.

Contributing in a Nutshell

Step zero, get out of that nutshell!

After that, the process for making contributions to ytree is roughly as follows:

  1. Fork the main ytree repository.
  2. Create a new branch.
  3. Make changes.
  4. Run tests. Return to step 3, if needed.
  5. Issue pull request.

The yt Developer Guide and GitHub documentation will help with the mechanics of git and pull requests.

Testing

The ytree source comes with a series of tests that can be run to ensure nothing unexpected happens after changes have been made. These tests will automatically run when a pull request is issued or updated, but they can also be run locally very easily. At present, the suite of tests for ytree takes about three minutes to run.

Testing Data

The first order of business is to obtain the sample datasets. See Sample Data for how to do so. Next, ytree must be configured to know the location of this data. This is done by creating a configuration file in your home directory at the location ~/.config/ytree/ytreerc.

$ mkdir -p ~/.config/ytree
$ echo "[ytree]" > ~/.config/ytree/ytreerc
$ echo test_data_dir = /Users/britton/ytree_data >> ~/.config/ytree/ytreerc
$ cat ~/.config/ytree/ytreerc
[ytree]
test_data_dir = /Users/britton/ytree_data

This path should point to the outer directory containing all the sample datasets.
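
As a quick sanity check, the file can be read back with Python's standard configparser module. This is only a sketch for verifying the file by hand. It does not show how ytree itself reads the file, and the helper name read_test_data_dir is hypothetical.

```python
import configparser
import os


def read_test_data_dir(path="~/.config/ytree/ytreerc"):
    """Return the test_data_dir value from a ytreerc-style file, or None.

    Hypothetical helper for checking the file by hand; not part of ytree.
    """
    parser = configparser.ConfigParser()
    # parser.read silently skips files that do not exist.
    parser.read(os.path.expanduser(path))
    return parser.get("ytree", "test_data_dir", fallback=None)


if __name__ == "__main__":
    data_dir = read_test_data_dir()
    if data_dir is None:
        print("test_data_dir is not set")
    else:
        print(data_dir, "exists:", os.path.isdir(data_dir))
```

If the printed path does not exist, the tests will not be able to find the sample datasets.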

Installing Development Dependencies

A number of additional packages are required for testing. These can be installed with pip from within the ytree source by doing:

$ pip install -e ".[dev]"

To see how these dependencies are defined, have a look at the extras_require keyword argument in the setup.py file.

Run the Tests

The tests are run from the top level of the ytree source.

$ pytest tests
============================= test session starts ==============================
platform darwin -- Python 3.6.0, pytest-3.0.7, py-1.4.32, pluggy-0.4.0
rootdir: /Users/britton/Documents/work/yt/extensions/ytree/ytree, inifile:
collected 16 items

tests/test_arbors.py ........
tests/test_flake8.py .
tests/test_saving.py ...
tests/test_treefarm.py ..
tests/test_ytree_1x.py ..

========================= 16 passed in 185.03 seconds ==========================

Adding Support for a New Format

The Arbor class is reasonably generalized such that adding support for a new file format should be relatively straightforward. The existing frontends also provide guidance for what must be done. Below is a brief guide for how to proceed. If you are interested in doing this, we will be more than happy to help!

Where do the files go?

As in yt, the code specific to one file format is referred to as a “frontend”. Within the ytree source, each frontend is located in its own directory within ytree/frontends. Name your directory using lowercase and underscores and put it in there.

To allow your frontend to be directly importable at run-time, add the name to the _frontends list in ytree/frontends/api.py.

Building Your Frontend

A very good way to build a new frontend is to start with an existing frontend for a similar type of dataset. To see the variety of examples, consult the Internal Classes section of the API Reference.

To build a new frontend, you will need to make frontend-specific subclasses for a few components. A straightforward way to do this is to start with the script below, loading your data with it. Each line will run correctly after a distinct phase of the implementation is completed. As you progress, the next function that needs to be implemented will raise a NotImplementedError exception, indicating what should be done next.

import ytree

# Arbor subclass with working _is_valid function
# (replace the placeholder string with the path to your data)
a = ytree.load("<your data>")

# Recognizing the available fields
print(a.field_list)

# Calculate the number of trees in the dataset
print(a.size)

# Create root TreeNode objects
my_tree = a[0]
print(my_tree)

# Query fields for individual trees
print(my_tree['mass'])

# Query fields for a whole tree
print(my_tree['tree', 'mass'])

# Create TreeNodes for whole tree
for node in my_tree['tree']:
    print(node)

# Query fields for all root nodes
print(a['mass'])

# Putting it all together
a.save_arbor()

The components and the files in which they belong are:

  1. The Arbor itself (arbor.py).
  2. The file i/o (io.py).
  3. Recognizing frontend-specific fields (fields.py).

In addition to this, you will need to add a file called __init__.py, which will allow your code to be imported. This file should minimally import the frontend-specific Arbor class. For example, the consistent-trees __init__.py looks like this:

from ytree.frontends.consistent_trees.arbor import \
    ConsistentTreesArbor

The _is_valid Function

Within every Arbor subclass should appear a method called _is_valid. This function is used by load to determine if the provided file is the correct type. This function can examine the file’s naming convention and/or open it and inspect its contents, whatever is required to uniquely identify your frontend. Have a look at the various examples.
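
The two kinds of checks described above (naming convention, then file contents) can be sketched as follows. Everything here is an assumption made for illustration: the NewCodeArbor name, the ".newcode" suffix, and the header string belong to an imaginary format, and the class is a plain stand-in so the snippet runs without ytree installed. A real frontend would subclass Arbor, and the classmethod signature should be checked against the existing frontends.

```python
class NewCodeArbor:
    """Stand-in for a frontend-specific Arbor subclass (hypothetical)."""

    # Hypothetical conventions for an imaginary file format.
    _suffix = ".newcode"
    _magic = "# NewCode merger tree"

    @classmethod
    def _is_valid(cls, *args, **kwargs):
        """Return True only if the first argument looks like this format."""
        if not args:
            return False
        filename = str(args[0])
        # Cheap check first: the naming convention.
        if not filename.endswith(cls._suffix):
            return False
        # Then open the file and inspect its first line.
        try:
            with open(filename, "r") as f:
                first_line = f.readline()
        except (OSError, UnicodeDecodeError):
            return False
        return first_line.startswith(cls._magic)
```

Doing the inexpensive filename test before opening the file keeps the check fast when load has to try several frontends in turn.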

Two Types of Arbors

There are generally two types of merger tree data that ytree ingests:

1. All merger tree data (full trees, halos, etc.) contained within a single file. These include the consistent-trees, consistent-trees-hdf5, lhalotree, and ytree frontends.

2. Halos in files grouped by redshift (halo catalogs) that contain the halo ID for the descendant halo, which lives in the next catalog. An example of this is the rockstar frontend.

Depending on your case, different base classes should be subclassed. This is discussed below. There are also hybrid formats that use both merger tree and halo catalog files together. An example of this is the ahf (Amiga Halo Finder) frontend.

Merger Tree Data in One File (or a few)

If this is your case, then the consistent-trees and “ytree” frontends are the best examples to follow.

In arbor.py, your subclass of Arbor should implement two functions, _parse_parameter_file and _plant_trees.

_parse_parameter_file: This is the first thing called when your dataset is loaded. It is responsible for determining things like box size, cosmological parameters, and the list of fields.

_plant_trees: This function is responsible for creating arrays of the data required to build all the root TreeNode objects in the Arbor. The names of these attributes are declared in the _node_io_attrs attribute. For example, the ConsistentTreesHDF5Arbor class names three required attributes: _fi, the data file number in which this tree lives; _si, the starting index of the section in the data array corresponding to this tree; and _ei, the ending index in the data array.
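
To make the start and end indices concrete, here is a minimal sketch of how they could be derived when trees are stored back to back in one data array. It assumes the number of halos in each tree is already known and that the end index is exclusive; both are assumptions for illustration, as is the function name tree_index_ranges. Plain lists are used to keep the snippet dependency-free, whereas a real _plant_trees would store arrays.

```python
from itertools import accumulate


def tree_index_ranges(tree_sizes):
    """Return (starts, ends) for trees stored contiguously in one array.

    Tree i is assumed to occupy data[starts[i]:ends[i]], with an
    exclusive end index (an assumption made for this sketch).
    """
    # Running total of halos gives the end of each tree's section.
    ends = list(accumulate(tree_sizes))
    # Each section starts where it ends minus its own length.
    starts = [end - size for end, size in zip(ends, tree_sizes)]
    return starts, ends


if __name__ == "__main__":
    si, ei = tree_index_ranges([4, 1, 3])
    for i, (start, end) in enumerate(zip(si, ei)):
        print("tree", i, "occupies rows", start, "to", end)
```

With the indices stored per root node, reading a tree's fields later reduces to a single slice of the on-disk array.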

In io.py, you will implement the machinery responsible for reading field data from disk. You must create a subclass of the TreeFieldIO class and implement the _read_fields function. This function accepts a single root node (a TreeNode that is the root of a tree) and a list of fields, and should return a dictionary containing a NumPy array for each field.

Halo Catalog-style Data

If this is your case, then the rockstar and treefarm frontends are the best examples to follow.

For this type of data, you will subclass the CatalogArbor class, which is itself a subclass of Arbor designed for this type of data.

In arbor.py, your subclass should implement two functions, _parse_parameter_file and _get_data_files. The purpose of _parse_parameter_file is described above.

_get_data_files: This type of data is usually loaded by providing one of the set of files. This function needs to figure out how many other files there are and their names and construct a list to be saved.
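
One possible approach to this file discovery is sketched below. It assumes a hypothetical naming scheme of the form prefix + integer index + extension (for example halos_12.list) with all catalogs in the same directory; the function name find_catalog_files and the naming scheme are illustrative and not taken from any existing frontend.

```python
import glob
import os
import re

# Hypothetical naming scheme: <prefix><integer index><.extension>
_NAME_RE = re.compile(r"^(?P<prefix>.*?)(?P<index>\d+)(?P<suffix>\.[A-Za-z]\w*)$")


def find_catalog_files(filename):
    """Return all sibling catalogs of filename, ordered by integer index."""
    directory, basename = os.path.split(filename)
    match = _NAME_RE.match(basename)
    if match is None:
        # Name does not follow the assumed scheme; nothing to discover.
        return [filename]
    prefix, suffix = match.group("prefix"), match.group("suffix")
    pattern = os.path.join(
        glob.escape(directory), glob.escape(prefix) + "*" + glob.escape(suffix)
    )
    found = []
    for path in glob.glob(pattern):
        m = _NAME_RE.match(os.path.basename(path))
        # Keep only files with the same prefix and suffix and a numeric index.
        if m and m.group("prefix") == prefix and m.group("suffix") == suffix:
            found.append((int(m.group("index")), path))
    # Sort numerically so that index 10 comes after index 2.
    return [path for _, path in sorted(found)]
```

Sorting on the integer index rather than the filename avoids the usual pitfall of halos_10 sorting before halos_2.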

In io.py, you will create a subclass of CatalogDataFile and implement two functions: _parse_header and _read_fields.

_parse_header: This function reads any metadata specific to this halo catalog. For example, you might get the current redshift here.
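
As an illustration of the redshift case, the sketch below parses a hypothetical text header in which metadata appears as "#key = value" comment lines and the scale factor is stored under the key "a". The header layout and both function names are assumptions made for this example; your format will likely differ.

```python
def parse_header(lines):
    """Collect 'key = value' pairs from the leading '#' comment lines."""
    header = {}
    for line in lines:
        if not line.startswith("#"):
            # The header is over once the data rows begin.
            break
        key, sep, value = line[1:].partition("=")
        if sep:
            header[key.strip()] = value.strip()
    return header


def redshift_from_header(header):
    """Convert the scale factor a into a redshift, z = 1/a - 1."""
    a = float(header["a"])
    return 1.0 / a - 1.0


if __name__ == "__main__":
    example = ["#id mass\n", "#a = 0.5\n", "1 1e12\n"]
    print(redshift_from_header(parse_header(example)))
```

Stopping at the first non-comment line means only the top of each catalog is read, which matters when there are many large files.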

_read_fields: This function is responsible for reading field data from disk. This should minimally take a list of fields and return a dictionary with NumPy arrays for each field for all halos contained in the file. It should also, optionally, take a list of TreeNode instances and return fields only for them.

Field Units and Aliases (fields.py)

The FieldInfoContainer class holds information about field names and units. Your subclass can define two tuples, known_fields and alias_fields. The known_fields tuple is used to set units for fields on disk. This is especially useful if there is no way to get this information from the file. The convention for each entry is (name on disk, units).

By creating aliases to standardized names, scripts can be run on multiple types of data with little or no alteration for frontend-specific field names. This is done with the alias_fields tuple. The convention for each entry is (alias name, name on disk, field units).

from ytree.data_structures.fields import \
     FieldInfoContainer

class NewCodeFieldInfo(FieldInfoContainer):
    known_fields = (
        # name on disk, units
        ("Mass", "Msun/h"),
        ("PX", "kpc/h"),
    )

    alias_fields = (
        # alias name, name on disk, units for alias
        ("mass", "Mass", "Msun"),
        ("position_x", "PX", "Mpc/h"),
        ...
    )

You made it!

That’s all there is to it! Now you too can do whatever it is people do with merger trees. There are probably important things that were left out of this document. If you find any, please consider making an addition or opening an issue. If you’re stuck anywhere, don’t hesitate to ask for help. If you’ve gotten this far, we really want to see you make it to the finish!

Everyone Loves Samples

It would be especially great if you could provide a small sample dataset with your new frontend, something less than a few hundred MB if possible. This will ensure that your new frontend never gets broken and will also help new users get started. Once you have some data, make an addition to the arbor tests by following the example in tests/test_arbors.py. Then, contact Britton Smith to arrange for your sample data to be added to the ytree data collection on the yt Hub.

Ok, now you’re totally done. Take the rest of the afternoon off.