Skip to content

Latest commit

 

History

History
146 lines (118 loc) · 7.91 KB

file_format.md

File metadata and controls

146 lines (118 loc) · 7.91 KB

File Format

Cell tracking data are loaded from a collection of Zarr arrays. This allows efficient access to data on static hosting, without the need for server-side processing. Necessary queries are baked-in to the file format at the cost of some data redundancy.

These queries are encoded in sparse CSR-format arrays, represented by 2 or 3 Zarr arrays. Sparse arrays in CSR format allow relatively efficient retrieval (typically 2 requests) of complete rows in the array. Conveniently, these Zarr arrays can be constructed by saving the indices and indptr attributes (numpy arrays) of a scipy.sparse array. The data array is only necessary if you have additional data to encode - it's used in the tracks_to_points where it stores the point coordinates. If present, the data array should have shape (len(indices), N), where N is the number of associated pieces of data (in the tracks_to_points example, this is 3).

Here is the directory layout including approximate sizes for the arrays the example Zebrafish dataset. Note the relatively small size of the indptr arrays, providing an opportunity for efficient caching and reducing the total number of requests when fetching tracks.

tracks_zebrafish_bundle.zarr (~550M)
├── points (198M)
├── points_to_tracks (62M)
│   ├── indices (61M)
│   └── indptr (1M)
├── tracks_to_points (259M)
│   ├── data (207M)
│   ├── indices (50M)
│   └── indptr (2M)
└── tracks_to_tracks (28M)
    ├── data (19M)
    ├── indices (7M)
    └── indptr (2M)

Arrays are stored as chunked zarr arrays. Chunk sizes should be optimized for your specific data.

A conversion script is provided to generate this zarr bundle arrays from a CSV file.

points

The points array is a dense, ragged array of 32-bit floats with shape (n_timepoints, num_values_per_point * max_points_per_timepoint). The number of rows is the total number of time points (n_timepoints). The number of columns is num_values_per_point * max_points_in_timepoint where: num_values_per_point is 3 (for z, y, x) or 4 (for z, y, x, radius) depending on whether radius is included. max_points_in_timepoint is the maximum number of points observed in any timepoint. So each row is a timepoint, and within that row are point coordinates as [x0, y0, z0, (radius1), x1, y1, z1, (radius2)...]. Rows with fewer points are padded at the end with -9999.9. Each point is then given a unique ID calculated by t * max_points_per_timepoint + n where t is the timepoint (row in the points array), and n is the index of the point within the timepoint. That is, the point ID is just a flat index value into the ragged array. This array is used for fetching all the points in a given timeframe.

points_to_tracks

The points_to_tracks array encodes the relationship between a given point ID (row) and any track IDs (columns) it is part of. The total shape of the array is then (n_points, n_tracks). Thus, fetching a row is used to get all tracks associated with a given point. This is effectively an adjacency matrix. In the sample data, each point belongs to at most one track. Values in the array don't matter, just the presence of a nonzero element which indicates a connection. This is because the data we're interested in (the track ID) is encoded by the column index.

This is the first query run when points are selected.

tracks_to_tracks

This array, shape (n_tracks, n_tracks) allows us to retrieve lineage (ancestors and descendents) for a given track. Each row is a track, and the columns are the tracks it is connected to. How this is computed will determine the lineage returned when selecting data. See the section on lineage computation in the included conversion script for more details on how it is currently computed.

The presence of a nonzero element in this graph represents a connection between two tracks (adjacency matrix). The value of each nonzero element encodes the direct parent track id for the corresponding column.

Figure 1 - structure of the tracks_to_tracks matrix. Diagonal elements are always present, meaning a track is always part of its own lineage. The value of each sparse element is the parent_track_id (or -1) for the track indicated by that column.

This is run for each track returned from the initial points_to_tracks query.

tracks_to_points

This is just a transpose of the points_to_tracks array, but it also contains the point locations. This redundancy is an optimization so each point location does not have to be re-fetched from the points array.

This is the last query run when points are selected, and is run for each track in the lineage.

Conversion script

A script (convert_tracks_csv_to_sparse_zarr.py) is provided to generate these arrays from a CSV file. The CSV file should be generated using Ultrack or a similar tracking software. To use this script, columns in the CSV file should be ordered as follows:

  • track_id - unique identifier for the track this point belongs to
  • t - timepoint
  • z, y, x - 3D coordinates
  • parent_track_id - reference to the parent track, if any (-1 otherwise)

The script requires numpy, scipy, and zarr libraries. This script is not fully optimized, and takes about 1 minute (for the largest Zebrafish dataset) to convert the example dataset on an Apple M1 Pro.

Lineage computation

Most of the logic in the script is pretty straightforward, but the lineage computation may require some explanation.

The lineage we're after for a given tracklet is all of its ancestors and descendents. The goal is to encode this in a single row of the tracks_to_tracks array. This means the lineage we fetch depends on which tracklet in the lineage corresponds to the selected point. Figure 2 shows two examples to illustrate this. To load the complete lineage for a given tracklet, find its oldest ancestor tracklet and select any point in it.


Figure 2 - lineage depends on which point is selected. (top) a cartoon depecting a group of related tracklets. (left) selecting points at earlier timepoints will load more descendent tracks. (right) selecting points at later timepoints will load all ancestors, but fewer descendents.

To do this, we need to pre-compute something like the transitive closure of the directed graph of track connections. The complete transitive closure would produce a graph where each cluster is a fully-connected graph of all the tracks in a lineage. Instead we start by first creating adjacency matrices for two undirected graphs: tracks_to_children and tracks_to_parents. Iterative squaring of these matrices converges on the transitive closure of each - this gives each tracklet a connection to all of its descendents (tracks_to_children) or all of its ancestors (tracks_to_parents).

The sum of these matrices produces the directed adjacency matrix we want, where a track is connected directly to all of its ancestors and descendents. Values in this matrix are overwritten by parent track IDs. This allows efficient lineage retrieval form a CSR matrix by fetching a single row.

The iterative squaring algorithm is not very efficient, but it's simple and works for the sizes of data we're dealing with. For larger datasets or more complex lineage (e.g. full lineage including "cousins"), a more efficient algorithm (Floyd–Warshall) may be necessary.