macrosdk package¶
Submodules¶
macrosdk.context module¶
Context module.
Contains logic for various context managers.
- class macrosdk.context.CacheContext(disable_read=False, disable_write=False)¶
Bases: object
Resources computed within this context will follow the specified caching rules.
disable_read: Set this to True to disable reading from the cache. Does not affect write behavior.
disable_write: Set this to True to disable writing to the cache. Does not affect read behavior.
- class macrosdk.context.PackageContext(package_name)¶
Bases: object
Set the package name within a context block.
- macrosdk.context.get_package_context()¶
- macrosdk.context.is_cache_read_disabled()¶
Whether to disable reading of cached computed results.
- macrosdk.context.is_cache_write_disabled()¶
Whether to disable writing of cached computed results.
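The flag-setting pattern behind CacheContext and the accessor functions above can be sketched as a small context manager that sets module-level flags for the duration of a block and restores them afterwards. This is an illustrative reimplementation, not the SDK's actual code; the name cache_context and the module-level flags are hypothetical:

```python
import contextlib

# Hypothetical module-level flags, mirroring what macrosdk.context might track.
_cache_read_disabled = False
_cache_write_disabled = False

def is_cache_read_disabled():
    return _cache_read_disabled

def is_cache_write_disabled():
    return _cache_write_disabled

@contextlib.contextmanager
def cache_context(disable_read=False, disable_write=False):
    # Set the flags for the duration of the block, then restore them,
    # even if the block raises.
    global _cache_read_disabled, _cache_write_disabled
    prev = (_cache_read_disabled, _cache_write_disabled)
    _cache_read_disabled, _cache_write_disabled = disable_read, disable_write
    try:
        yield
    finally:
        _cache_read_disabled, _cache_write_disabled = prev

with cache_context(disable_read=True):
    assert is_cache_read_disabled()
    assert not is_cache_write_disabled()
assert not is_cache_read_disabled()  # restored on exit
```

Note that each flag is independent, matching the documented behavior that disable_read does not affect writes and vice versa.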
macrosdk.data_blob module¶
- class macrosdk.data_blob.DataBlob(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent a data blob.
A data blob encapsulates and mediates access to large binary data objects used as data input to a machine learning pipeline. Once retrieved from the cloud or local disk, it is preserved as a snapshot to ensure reproducible and download-efficient runs of downstream compute tasks.
To create a data blob, define a function that performs the necessary data fetches and instantiate it with a DataBlob.Definition:
def download_fn(handle):
    with handle.write_opener(tarfile.open, "data.tar.gz", "w:gz") as f:
        f.write(requests.get('https://macro.ai').text)
    return dict(metadata_key='metadata_value')

data_blob = DataBlob.create("my_data_blob", DataBlob.Definition(download_fn))
To use a previously-defined data blob, load it in the following way:
data_blob = DataBlob.load("my_data_blob")
with data_blob.read_opener(tarfile.open, "data.tar.gz", "r:gz") as f:
    print(f.read())  # retrieve data that was stored
print(data_blob.get_metadata())  # {'metadata_key': 'metadata_value'}
To leverage parallel processing for data blob operations, use the following API:
def download_fn(handle, *args):
    for data in datapoints:
        yield (handle, data,) + args  # yield tuple of data

def process_input(handle, data, arg1, arg2):
    with handle.write_opener(...) as f:
        ...  # download and write based on input
    return dict(metadata_key='metadata_value')  # gets combined into a single dict

data_blob = DataBlob.create(
    "my_data_blob",
    DataBlob.Definition(download_fn, parallelized_iter=process_input))
This will spin up a pool of worker processes under the hood automatically to call process_input for each item returned by the generator of the download function.
- class Definition(download_fn, *args, parallelized_iter=None, **kwargs)¶
Bases: object
Definition object used to define a DataBlob.
By default, this is constructed with a download function (function that performs the necessary download of data for a blob). If additional functionality is needed, consider subclassing this class and overriding the get_download_function method.
If parallelized_iter is passed in, we expect download_fn to be a generator for tasks, and we will apply the parallelized_iter function to each task in a multiprocess pool.
- get_download_function()¶
Accessor for the download function.
Subclasses may override this with custom behavior.
- get_resource_dependencies()¶
- class Handle(data_blob)¶
Bases: object
Handle to encapsulate the data blob, primarily the data dir and all writes.
- add_metadata(key, value)¶
Add metadata to the data blob.
- copy_file(src_path, dest_path)¶
Copy file from other location into this data blob.
- download_url(url, dest_path, use_cache=True)¶
Convenience function to download contents of a url into the data blob.
- from_relative_path(other_path)¶
Convert a relative path within the data blob to an absolute path.
- get_data_dir()¶
Get the absolute data directory used to store data for this resource.
- read_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file read opener to a relative path within the data blob.
Unlike DataBlob.read_opener, this does not ensure the resource is computed. This is meant to be used if the download function of a DataBlob needs to read some of the data that has been written to it.
- remove_file(dest_path)¶
Remove a file from the data blob.
- to_relative_path(other_path)¶
Convert an absolute path to a relative path within the data blob.
- write_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file write opener to a relative path within the data blob.
Ensures the data directory is created before attempting to open the file.
- get_metadata()¶
Get a copy of the metadata that was stored within the data blob.
- handle()¶
Return a handle object that can be used to store data into the data blob.
- hydrate(obj)¶
Hydrate the state of a data blob from a serializable representation of it.
- init()¶
- read_opener(opener, dest_path, *args, **kwargs)¶
Convenience function to apply a file read opener to a relative path within the data blob.
Ensures the data blob is computed before attempting to open the file.
- resource_type = 'data_blob'¶
- serialize()¶
Return the serializable representation of a data blob’s state.
macrosdk.data_schema module¶
- class macrosdk.data_schema.DataSchema(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent a data schema.
A data schema is used to represent a class ontology or a collection of attributes. It is defined as an enumeration over a set of entries, with optional attributes for each entry.
It enables data sets and model trainers to be parameterized to handle different predictive tasks without needing to customize them repeatedly. The resulting resource can also be easily identified and used appropriately with the correct output schema.
To create a data schema, define a function that specifies the entries within the schema and instantiate it with a DataSchema.Definition:
def definition_fn(handle):
    handle.add_primary_key('class_name')
    handle.add_field('class_description')  # optional attribute(s)
    handle.add_entry(
        class_name='cat',
        class_description='an animal also known as a feline',
    )
    handle.add_entry(
        class_name='dog',
        class_description='an animal also known as a canine',
    )

data_schema = DataSchema.create(
    "my_data_schema", DataSchema.Definition(definition_fn))
To use a previously-defined data schema, load it in the following way:
data_schema = DataSchema.load("my_data_schema")
for entry in data_schema.entries():
    print(entry.primary_key)  # 'cat' or 'dog'
    print(entry.as_dict())  # {'class_name': ..., 'class_description': ...}
- class Definition(definition_fn, *args, **kwargs)¶
Bases: object
Definition object used to define a DataSchema.
By default, this is constructed with a definition function (function that defines the fields and records of a data schema). If additional functionality is needed, consider subclassing this class and overriding the get_definition_function method.
- get_definition_function()¶
Accessor for the definition function.
Subclasses may override this with custom behavior.
- get_resource_dependencies()¶
- class Entry(data_schema, primary_key)¶
Bases: object
Object representing a data entry in a data schema.
This is typically used as an identifying key for the data it represents.
- as_dict()¶
Return all fields of the data schema entry as a dictionary.
- as_tuple()¶
Return the info uniquely identifying this data schema entry as a tuple.
- classmethod deserialize(obj)¶
Reconstruct a data schema entry from the serializable representation of its state.
- classmethod from_tuple(tup)¶
Reconstruct a data schema entry from the tuple representation of its state.
- serialize()¶
Return a serializable representation of an entry that can be used to reconstruct it.
- class Handle(data_schema)¶
Bases: object
Handle to encapsulate the data schema object, for writes.
- add_entry(**kwargs)¶
Add an entry to the data schema via its handle.
- add_field(field_name)¶
Add a field to the data schema via its handle.
- add_primary_key(primary_key)¶
Add the primary key field to the data schema via its handle.
- add_entry(**kwargs)¶
Add an entry to this data schema.
The primary key and all the field names must be specified for each entry.
- add_field(field_name)¶
Add a field to this data schema.
Each field name must be unique.
- add_primary_key(field_name)¶
Add the primary key field for this data schema.
There must be exactly one primary key field set.
- entries()¶
Returns an iterable over the entries of this data schema.
Ensures the data schema is computed before doing so.
- entry(primary_key)¶
Returns a DataSchema.Entry object for a given primary key.
Ensures the data schema is computed and that primary key is defined before returning it.
- field_names()¶
Returns a list of field names for this data schema, starting with the primary key.
Ensures the data schema is computed before doing so.
- handle()¶
Return a handle object that can be used to add fields and entries to the data schema.
- has_entry(primary_key)¶
Determine if this data schema has an entry with the specified primary key.
Ensures the data schema is computed before checking.
- hydrate(obj)¶
Hydrate the state of a data schema from a serializable representation of it.
- init()¶
- resource_type = 'data_schema'¶
- serialize()¶
Return the serializable representation of a data schema’s state.
- tables()¶
Returns the entries and their fields for this data schema in table form.
Ensures the data schema is computed before doing so.
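The invariants documented above (unique field names, exactly one primary key, every entry specifying the primary key plus all fields, and field_names() listing the primary key first) can be sketched with a tiny stand-in class. This is illustrative only; TinySchema is a hypothetical name and not part of the SDK:

```python
class TinySchema:
    """Illustrative sketch of the DataSchema field/primary-key invariants."""

    def __init__(self):
        self._primary_key = None
        self._fields = []
        self._entries = {}

    def add_primary_key(self, field_name):
        # There must be exactly one primary key field set.
        if self._primary_key is not None:
            raise ValueError("exactly one primary key may be set")
        self._primary_key = field_name

    def add_field(self, field_name):
        # Each field name must be unique.
        if field_name == self._primary_key or field_name in self._fields:
            raise ValueError("field names must be unique")
        self._fields.append(field_name)

    def add_entry(self, **kwargs):
        # The primary key and all field names must be specified for each entry.
        expected = {self._primary_key, *self._fields}
        if set(kwargs) != expected:
            raise ValueError("entry must specify primary key and all fields")
        self._entries[kwargs[self._primary_key]] = kwargs

    def field_names(self):
        # Primary key first, matching the documented field_names() ordering.
        return [self._primary_key] + self._fields

    def has_entry(self, primary_key):
        return primary_key in self._entries

schema = TinySchema()
schema.add_primary_key('class_name')
schema.add_field('class_description')
schema.add_entry(class_name='cat', class_description='a feline')
assert schema.field_names() == ['class_name', 'class_description']
assert schema.has_entry('cat')
```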
macrosdk.data_schema_mapper module¶
- class macrosdk.data_schema_mapper.DataSchemaMapper(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent a data schema mapper.
A data schema mapper defines a mapping between two data schemas. Given a mapping between an input and output schema, the mapper allows the caller to use a set of convenience functions to transform resources from one data schema to the other.
To create a data schema mapper, subclass the DataSchemaMapper.Definition class and override the get_mapping method to return a dictionary that maps between an input and output data schema, and instantiate a DataSchemaMapper with it:
class InputToOutput(DataSchemaMapper.Definition):
    def get_mapping(self):
        return {
            self.input_schema.entry('cat'): self.output_schema.entry('feline'),
            self.input_schema.entry('dog'): self.output_schema.entry('canine'),
        }

data_schema_mapper = DataSchemaMapper.create(
    "my_data_schema_mapper", InputToOutput(input_schema, output_schema))
To use a previously-defined data schema mapper, load it in the following way:
data_schema_mapper = DataSchemaMapper.load("my_data_schema_mapper")
output_entry = data_schema_mapper.transform_data_schema_entry(input_entry)
output_field_set = data_schema_mapper.transform_field_set(input_field_set)
output_data_set = data_schema_mapper.transform_data_set(input_data_set)
output_model_wrapper = data_schema_mapper.transform_model_wrapper(input_model_wrapper)
- class Definition(input_schema, output_schema)¶
Bases: object
Definition object used to define a DataSchemaMapper.
- get_mapping()¶
Return a dictionary mapping entries from the input schema to the output schema.
- get_resource_dependencies()¶
- hydrate(obj)¶
Hydrate the state of a data schema mapper from a serializable representation of it.
- init()¶
- resource_type = 'data_schema_mapper'¶
- serialize()¶
Return the serializable representation of a data schema mapper’s state.
- transform_data_schema_entry(data_schema_entry)¶
Return the output data schema entry associated with the input entry in the mapping.
If the input entry is not found in the mapping, return None.
- transform_data_set(data_set, skip_empty=True, output_name=None)¶
Return a data set transformed with this data schema mapper.
Each field set in the data set will be transformed by the data schema mapper.
If skip_empty is True, we skip any field sets that do not have the output schema after the transform step. If skip_empty is numeric, we skip these field sets with that probability.
- transform_field_set(field_set)¶
Return a field set transformed with this data schema mapper.
If the field set has a field descriptor which involves the input schema, we map it to the output schema and add the mapped descriptor to the field set.
- transform_model_wrapper(model_wrapper, output_name=None)¶
Return a model wrapper transformed with this data schema mapper.
Each field set returned by the model wrapper will be transformed.
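The skip_empty semantics of transform_data_set described above (skip untransformed records always when True, or with a given probability when numeric) can be sketched on plain Python records. This is an illustrative reimplementation with hypothetical names, not the SDK's transform:

```python
import random

def transform_records(records, mapping, skip_empty=True, rng=None):
    """Illustrative sketch of transform_data_set's skip_empty behavior.

    `mapping` plays the role of the schema mapper: records with no mapped
    output stand in for field sets that lack the output schema."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    out = []
    for record in records:
        mapped = mapping.get(record)  # None means no output-schema entry
        if mapped is not None:
            out.append(mapped)
            continue
        if skip_empty is True:
            continue  # always skip untransformed records
        # Numeric skip_empty: skip with that probability (bool excluded,
        # since isinstance(True, int) is True in Python).
        if isinstance(skip_empty, (int, float)) and not isinstance(skip_empty, bool):
            if rng.random() < skip_empty:
                continue
        out.append(record)
    return out

mapping = {'cat': 'feline'}
assert transform_records(['cat', 'dog'], mapping) == ['feline']
assert transform_records(['cat', 'dog'], mapping, skip_empty=0.0) == ['feline', 'dog']
assert transform_records(['cat', 'dog'], mapping, skip_empty=1.0) == ['feline']
```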
- class macrosdk.data_schema_mapper.TransformedModelWrapperDefinition(data_schema_mapper, model_wrapper)¶
Bases: macrosdk.model_wrapper.ModelWrapper.Definition
- compute_model(handle, model_params)¶
Method to compute the model if it is required for inference.
Subclasses may override this with custom behavior.
- get_load_function()¶
Accessor for the load function.
Subclasses may override this with custom behavior.
- property model_params¶
macrosdk.data_set module¶
- class macrosdk.data_set.DataSet(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent a data set.
A data set is used to manage ordered collections of FieldSets, which represent rows or records in a data set.
The data set object handles efficient reads and writes under the hood in a compressed data format, and exposes an iterator interface to access records. There are also accessor methods for querying metadata about the data set, such as the histogram of data schema entries represented.
To create a data set, define a generator function that yields FieldSets and instantiate it with a DataSet.Definition:
def generator_fn():
    for data in datapoints:
        record = FieldSet.create()
        ...  # add fields from data to field set
        yield record

data_set = DataSet.create("my_data_set", DataSet.Definition(generator_fn))
To use a previously-defined data set, load it in the following way:
data_set = DataSet.load("my_data_set")
for record in data_set.get_records():
    ...  # process record
To leverage parallel processing for data set operations, use the following API:
def generator_fn(*args):
    for data in datapoints:
        yield (data,) + args  # yield tuple of data

def process_input(data, arg1, arg2):
    record = FieldSet.create()
    ...  # add fields from data to field set
    return [record]  # return 0 or more records for each

data_set = DataSet.create(
    "my_data_set",
    DataSet.Definition(generator_fn, parallelized_iter=process_input))
This will spin up a pool of worker processes under the hood automatically to call process_input for each record to be processed.
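The generator-plus-worker-pool pattern can be sketched as follows. The SDK spins up worker processes; this self-contained sketch uses a thread pool instead so it runs without pickling concerns, and run_parallel is a hypothetical name, not an SDK API:

```python
from multiprocessing.pool import ThreadPool

def generator_fn(*args):
    # Yield one task tuple per data point; these are fanned out to workers.
    for data in [1, 2, 3]:
        yield (data,) + args

def process_input(data, offset):
    # Runs on a worker; returns zero or more records for each task.
    return [data + offset]

def run_parallel(generator_fn, process_input, *args, workers=4):
    """Illustrative sketch of the generator + parallelized_iter pattern.

    The SDK uses a pool of worker *processes* under the hood; threads are
    used here only to keep the sketch self-contained."""
    with ThreadPool(workers) as pool:
        # starmap preserves task order, so output order matches the generator.
        results = pool.starmap(process_input, generator_fn(*args))
    # Flatten the per-task record lists into one ordered sequence.
    return [record for records in results for record in records]

assert run_parallel(generator_fn, process_input, 10) == [11, 12, 13]
```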
- static Concat(*data_sets, output_name=None, should_precompute=True)¶
Concatenate 2 or more data sets together, in order of input.
- class Definition(generator_fn, *args, parallelized_iter=None, **kwargs)¶
Bases: object
Definition object used to define a DataSet.
By default, this is constructed with a generator function (function that returns a generator of data points). If additional functionality is needed, consider subclassing this class and overriding the get_generator_function method.
Specify args or kwargs if there are parameters to pass into the generator function.
If should_precompute=True is passed in, we assume this data set cannot be streamed on the fly (e.g. involves gpu-based computation, or feeds into multiple consumers), and we will first iterate to compute and then cache the results for subsequent use.
If parallelized_iter is passed in, we will apply this function to each data point in a multiprocess pool. This function should return an iterable of data points. This will set should_precompute to True if it was not specified.
- get_generator_function()¶
Accessor for the generator function.
Subclasses may override this with custom behavior.
The generator function has to return a generator that yields field sets.
- get_resource_dependencies()¶
- static Take(data_set, num_rows, output_name=None, should_precompute=True)¶
Take num_rows worth of records from data set.
- static Zip(*data_sets, output_name=None, should_precompute=True)¶
Zip 2 or more data sets together. The later data sets take precedence.
- concat_with(*data_sets, output_name=None, should_precompute=True)¶
- get_records()¶
- hydrate(obj)¶
Hydrate the state of a data set from a serializable representation of it.
- init()¶
- resource_type = 'data_set'¶
- serialize()¶
Return the serializable representation of a data set’s state.
- tables()¶
- take(num_rows, output_name=None, should_precompute=True)¶
- zip_with(*data_sets, output_name=None, should_precompute=True)¶
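The Zip precedence rule above ("the later data sets take precedence") can be sketched on plain dict records, assuming records are merged positionally field by field. This merge-by-field interpretation is an assumption of the sketch, and zip_data_sets is a hypothetical name:

```python
def zip_data_sets(*data_sets):
    """Illustrative sketch of DataSet.Zip semantics: records are combined
    positionally, with later data sets overriding conflicting fields."""
    for records in zip(*data_sets):
        merged = {}
        for record in records:  # later records overwrite earlier keys
            merged.update(record)
        yield merged

a = [{'image': 'img0.png', 'label': 'cat'}]
b = [{'label': 'feline'}]
assert list(zip_data_sets(a, b)) == [{'image': 'img0.png', 'label': 'feline'}]
```

Note that like the built-in zip, this sketch stops at the shortest input; whether the SDK requires equal-length data sets is not specified here.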
macrosdk.data_set_mapper module¶
- class macrosdk.data_set_mapper.DataSetMapper(name, *, _allowed=False, **kwargs)¶
Bases: macrosdk.resource.Resource
Class used to represent a data set mapper.
A data set mapper defines a mapping between two data sets. It can do more than just a one-to-one mapping between records of a data set. You could combine multiple input records into one output record (e.g. combine individual images into a video), or you could convert one input record into multiple output records (e.g. the reverse operation of converting a video into individual images).
To create a data set mapper, subclass the DataSetMapper.Definition class and provide implementations for its interface methods. Then pass that definition instance to the DataSetMapper.create method:
class ConvertDataSet(DataSetMapper.Definition):
    def initialize(self):
        # Returns one or more field sets prior to iterating through data set.
        return []

    def finalize(self):
        # Returns one or more field sets after iterating through data set.
        return []

    def ingest(self, field_set):
        # Takes a field set as input and returns one or more output field sets.
        return [field_set]

    def should_continue(self, num_iterations, num_ingested, num_output):
        # Returns true if we should continue processing the input data set.
        return num_iterations == 0

data_set_mapper = DataSetMapper.create(
    "my_data_set_mapper", ConvertDataSet())
To use a previously-defined data set mapper, load it in the following way:
data_set_mapper = DataSetMapper.load("my_data_set_mapper")
output_data_set = data_set_mapper.transform_data_set(input_data_set)
- class Definition(resource_dependencies, should_precompute=True)¶
Bases: object
Definition object used to define a DataSetMapper.
In particular, it takes input field sets and returns optional output field set(s).
- finalize()¶
Returns one or more field sets after iterating through data set.
- get_resource_dependencies()¶
- ingest(field_set)¶
Takes a field set as input and returns one or more output field sets.
- initialize()¶
Returns one or more field sets prior to iterating through data set.
- should_continue(num_iterations, num_ingested, num_output)¶
Returns true if we should continue processing the input data set.
The number of iterations, number of ingested records, and number of output records so far are given as input to this method.
- init()¶
- resource_type = 'data_set_mapper'¶
- transform_data_set(data_set, should_precompute=True, output_name=None)¶
Transform an input data set using this data set mapper, and return the output data set.
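The Definition lifecycle described above (initialize, then ingest per record while should_continue allows, then finalize) can be sketched as a simple driver loop. The exact order in which the SDK checks should_continue is an assumption here, and apply_mapper is a hypothetical name:

```python
def apply_mapper(definition, records):
    """Illustrative driver for the DataSetMapper.Definition lifecycle:
    initialize -> ingest each record (gated by should_continue) -> finalize."""
    out = list(definition.initialize())
    num_iterations = num_ingested = 0
    for record in records:
        # Assumed ordering: check should_continue before each ingest.
        if not definition.should_continue(num_iterations, num_ingested, len(out)):
            break
        out.extend(definition.ingest(record))
        num_iterations += 1
        num_ingested += 1
    out.extend(definition.finalize())
    return out

class TakeTwo:
    """Pass-through mapper that stops after ingesting two records."""
    def initialize(self):
        return []
    def finalize(self):
        return []
    def ingest(self, field_set):
        return [field_set]
    def should_continue(self, num_iterations, num_ingested, num_output):
        return num_iterations < 2

assert apply_mapper(TakeTwo(), ['a', 'b', 'c']) == ['a', 'b']
```

The num_iterations / num_ingested / num_output arguments let a definition stop early, e.g. to cap the output size.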
macrosdk.data_visualization module¶
- class macrosdk.data_visualization.DataVisualization(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent a data visualization.
A data visualization is a resource that allows interactive exploration of a sequence of data. It is created by a data visualizer acting on a sequence of field sets from a data set.
To create one, you would typically use a data visualizer in the following way:
data_visualizer = DataVisualizer.load("my_data_visualizer")
data_visualization = data_visualizer.create_data_visualization_from_data_set(
    data_set, variable_schemas)
- class Definition(data_set, data_visualizer, variable_schemas, num_rows)¶
Bases: object
Definition object used to define a DataVisualization.
- get_resource_dependencies()¶
- class Handle(data_visualization)¶
Bases: object
Handle to encapsulate the data visualization, primarily the data dir and all writes.
- get_data_dir()¶
Get the absolute data directory used to store data for this resource.
- get_num_render_sets()¶
Return the number of render sets within this data visualization.
Ensures the data visualization is computed first.
- get_render_set(i)¶
Return the i-th render set read from files within this data visualization.
Ensures the data visualization is computed first.
- get_render_sets()¶
Return an iterable of render sets read from files within this data visualization.
Ensures the data visualization is computed first.
- handle()¶
Return a handle object that can be used to store data into the data visualization.
- hydrate(obj)¶
Hydrate the state of a data visualization from a serializable representation of it.
- init()¶
- resource_type = 'data_visualization'¶
- serialize()¶
Return the serializable representation of a data visualization’s state.
macrosdk.data_visualizer module¶
- class macrosdk.data_visualizer.DataVisualizer(name, *, _allowed=False, **kwargs)¶
Bases: macrosdk.resource.Resource
Class used to represent a data visualizer.
A data visualizer takes a data set as input, and produces a data visualization as its output.
To create a data visualizer, subclass the DataVisualizer.Definition class and provide implementations for its interface methods. Then pass that definition instance to the DataVisualizer.create() method:

class CustomVisualizer(DataVisualizer.Definition):
    @classmethod
    def create_render_set_from_field_set(cls, field_set, fixed_schemas, variable_schemas):
        # Return a render set from a given field set.
        render_set = Render.Set.create()
        ...
        return render_set

data_visualizer = DataVisualizer.create(
    "my_data_visualizer", CustomVisualizer(fixed_schemas))
To use a previously-defined data visualizer, load it in the following way:
data_visualizer = DataVisualizer.load("my_data_visualizer")
data_visualization = data_visualizer.create_data_visualization_from_data_set(
    data_set, variable_schemas)
See macro.Render() for the rendering primitives available for data visualizers to use.
- class Definition(*fixed_schemas)¶
Bases: object
Definition object used to define a DataVisualizer.
- classmethod create_render_set_from_field_set(field_set, fixed_schemas, variable_schemas)¶
Return a render set from a given field set.
See macro.Render.Set() for how to create one.
- get_resource_dependencies()¶
- create_data_visualization_from_data_set(data_set, variable_schemas=None, num_rows=20, output_name=None)¶
Create data visualization from data set using this data visualizer’s logic.
- init()¶
- resource_type = 'data_visualizer'¶
macrosdk.deptrack module¶
Dependency tracking for Resources and Packages
This is an unoptimized prototype implementation that should remain sufficiently performant as long as the number of Dependables does not grow too large (roughly 1M or more).
- macrosdk.deptrack.deptrack_package(package_name, package_dependencies)¶
Deptrack a package by its string name.
We do this by its name instead of ImportedPackage because we might need to query the graph for package dependencies whilst a package is being imported, and the ImportedPackage is only available after the import is done.
- macrosdk.deptrack.deptrack_resource(resource, resource_definition)¶
Deptrack a Resource
This registers the Resource’s dependencies to support dependency querying.
- macrosdk.deptrack.get_dependency_packages_for(package_name, transitive=True)¶
Returns list of names of all packages that the given package depends on.
If transitive is True, we include transitive dependencies.
- macrosdk.deptrack.get_dependency_resources_for(resource, transitive=True)¶
Returns list of all Resources that the given Resource depends on.
If transitive is True, we include indirect dependencies: If Resource A depends on B, and B depends on C, then get_dependency_resources_for(A) will include both B and C.
- macrosdk.deptrack.get_dependent_packages_for(package_name, transitive=True)¶
Returns list of names of all packages that depend on the given package.
If transitive is True, we include transitive dependents.
- macrosdk.deptrack.get_dependent_resources_for(resource, transitive=True)¶
Returns list of all Resources that depend on the given Resource.
If transitive is True, we include indirect dependents: If Resource A depends on B, and B depends on C, then get_dependent_resources_for(C) will include both A and B.
- macrosdk.deptrack.reset_deptrack()¶
- macrosdk.deptrack.traverse_dependency_packages_for(package_name, visit_fn, **kwargs)¶
Traverses the package dependency graph, calling each package with the visit function.
The visit function has the signature:
    def visit_fn(package, depth=None, index=None, **kwargs) -> Any:
        ...
We return a list of the outputs returned by visit_fn at each package, in topologically sorted order.
Any kwargs passed into this call will be relayed to the visit function. Accepting kwargs on the visit function also allows additional recursion state, similar to depth or index, to be passed through in the future.
- macrosdk.deptrack.traverse_dependency_resources_for(resource, visit_fn, **kwargs)¶
Traverses the resource dependency graph, calling each resource with the visit function.
The visit function has the signature:
    def visit_fn(resource, depth=None, index=None, **kwargs) -> Any:
        ...
We return a list of the outputs returned by visit_fn at each resource, in topologically sorted order.
Any kwargs passed into this call will be relayed to the visit function. Accepting kwargs on the visit function also allows additional recursion state, similar to depth or index, to be passed through in the future.
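The transitive traversal behind these functions can be sketched as a depth-first walk that visits dependencies before dependents and collects visit_fn outputs. This is an illustrative reimplementation (traverse_dependencies is a hypothetical name, and the index argument is omitted for brevity), not the SDK's actual graph code:

```python
def traverse_dependencies(graph, start, visit_fn, **kwargs):
    """Illustrative sketch of traverse_dependency_*_for: visit every transitive
    dependency of `start`, returning visit_fn outputs with dependencies
    ordered before their dependents (one valid topological order)."""
    outputs, seen = [], set()

    def visit(node, depth):
        if node in seen:
            return  # visit each dependable at most once
        seen.add(node)
        for dep in graph.get(node, []):
            visit(dep, depth + 1)  # recurse into dependencies first
        outputs.append(visit_fn(node, depth=depth, **kwargs))

    visit(start, 0)
    return outputs

# If A depends on B, and B depends on C, a traversal from A visits C, B, A.
graph = {'A': ['B'], 'B': ['C']}
names = traverse_dependencies(graph, 'A', lambda node, depth=None: node)
assert names == ['C', 'B', 'A']  # dependencies first
```

The same walk answers get_dependency_*_for queries: with transitive=True, the result from A includes both B and C, exactly as documented above.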
macrosdk.download_utils module¶
- macrosdk.download_utils.download_url(url, dest_path, use_cache=True)¶
Download an external url to dest_path.
We will use gdown for Google drive links, requests.get for others. If use_cache is True, we check our global cache for existence before initiating download, and store afterwards.
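The cache-check-then-download behavior can be sketched with a content-addressed cache directory. This is illustrative only: fetch_fn stands in for the real gdown/requests transfer, and cached_download is a hypothetical name, not the SDK's function:

```python
import hashlib
import os
import shutil
import tempfile

def cached_download(url, dest_path, fetch_fn, cache_dir, use_cache=True):
    """Illustrative sketch of download_url's caching: check the cache before
    initiating a download, and store the result afterwards."""
    key = hashlib.sha256(url.encode()).hexdigest()  # cache key from the URL
    cached = os.path.join(cache_dir, key)
    if use_cache and os.path.exists(cached):
        shutil.copyfile(cached, dest_path)  # cache hit: skip the download
        return dest_path
    fetch_fn(url, dest_path)  # cache miss: perform the actual transfer
    if use_cache:
        shutil.copyfile(dest_path, cached)  # store for future calls
    return dest_path

# Exercise the sketch with a fake fetcher that records its calls.
calls = []
def fake_fetch(url, path):
    calls.append(url)
    with open(path, 'w') as f:
        f.write('payload')

cache_dir = tempfile.mkdtemp()
dest = os.path.join(cache_dir, 'out.bin')
cached_download('https://macro.ai/data', dest, fake_fetch, cache_dir)
cached_download('https://macro.ai/data', dest, fake_fetch, cache_dir)
assert calls == ['https://macro.ai/data']  # second call served from cache
```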
macrosdk.evaluation_metric module¶
- class macrosdk.evaluation_metric.EvaluationMetric(name, *, _allowed=False, **kwargs)¶
Bases: macrosdk.resource.Resource
Class used to represent an evaluation metric.
An evaluation metric contains the logic to score a model prediction with its ground truth labels. Given a data set with labels and a data set with predictions, this resource computes one or more custom metrics for each data point and returns a summarized set of metrics across the whole data set.
To create an evaluation metric, subclass the EvaluationMetric.Definition class and provide implementations for its interface methods. Then pass that definition instance to the EvaluationMetric.create() method:

class CustomMetric(EvaluationMetric.Definition):
    @classmethod
    def compute_metrics(cls, class_schema, label_field_set, prediction_field_set):
        # Invoke metric computation logic for a given pair of label and prediction field sets.
        ...

    @classmethod
    def summarize_metrics(cls, metrics):
        # Summarize the metrics computed for field sets across the whole data set.
        ...

    @classmethod
    def generate_artifacts(cls, metrics):
        # Generate artifacts of the metrics computed for field sets across the whole data set.
        ...

evaluation_metric = EvaluationMetric.create(
    "my_evaluation_metric", CustomMetric(label_entry, prediction_entry))
To use a previously-defined evaluation metric, load it in the following way:
evaluation_metric = EvaluationMetric.load("my_evaluation_metric")
computed_metrics = evaluation_metric.compute_metrics(
    class_schema, label_field_set, prediction_field_set)
evaluation_data_set, evaluation_result = evaluation_metric.create_evaluation_result(
    model_wrapper, data_set, class_schema)
- class Definition(label_entry, prediction_entry)¶
Bases: object
Definition object used to define an EvaluationMetric.
- compute_metrics(class_schema, label_field_set, prediction_field_set)¶
Invoke metric computation logic for a given pair of label and prediction field sets.
Subclasses should implement this method.
- classmethod generate_artifacts(metrics)¶
Generate artifacts of the metrics computed for field sets across the whole data set.
Returns a mapping of filename to data to be stored in the evaluation result resource.
Subclasses may optionally override this method.
- get_resource_dependencies()¶
- classmethod summarize_metrics(metrics)¶
Summarize the metrics computed for field sets across the whole data set.
Subclasses should implement this method.
- compute_metrics(class_schema, label_field_set, prediction_field_set)¶
Invoke metric computation logic for a given pair of label and prediction field sets.
- create_evaluation_result(model_wrapper, data_set, class_schema, label_data_set=None, include_original_fields=False, num_processes=1, num_rows=20, output_name=None, sorting_key=None, ascending=True)¶
Create an evaluation result by applying the specified model wrapper to a given input data set.
If label_data_set is specified, use that as the data set with labels; otherwise, the input data set is used. If include_original_fields is True, the original input field set is unioned with the output field set during eval.
Returns a tuple of the evaluation data set and the evaluation result.
- create_sorted_evaluation_data_set(eval_result, eval_data_set, sorting_key, ascending=True, num_rows=20, output_name=None)¶
Create an evaluation data set where field sets are returned in the order given by the sorting key.
- generate_artifacts(metrics)¶
Generate artifacts of the metrics computed for field sets across the whole data set.
- init()¶
- resource_type = 'evaluation_metric'¶
- summarize_metrics(metrics)¶
Summarize the metrics computed for field sets across the whole data set.
macrosdk.evaluation_ranking module¶
- class macrosdk.evaluation_ranking.EvaluationRanking(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent an evaluation ranking.
An evaluation ranking is a list of evaluation results (scored model predictions), sorted by a specified key.
To create an evaluation ranking, first collect all evaluation results that make sense to rank together, choose a sorting key, and use these to instantiate an EvaluationRanking.Definition instance. Then pass that definition instance to the EvaluationRanking.create() method:

evaluation_metric = EvaluationMetric.load("my_evaluation_metric")
data_set = DataSet.load('my_data_set')
evaluation_results = [
    evaluation_result
    for evaluation_result in get_resources_by_type('evaluation_result')
    if evaluation_result.definition.evaluation_metric is evaluation_metric
    and evaluation_result.definition.label_data_set is data_set
]
evaluation_ranking = EvaluationRanking.create(
    "my_evaluation_ranking",
    EvaluationRanking.Definition(
        evaluation_results,
        sorting_key=lambda summary_metrics: summary_metrics['score'],
        ascending=False,
    ),
)
To use a previously-defined evaluation ranking, load it in the following way:
evaluation_ranking = EvaluationRanking.load("my_evaluation_ranking")
rankings = evaluation_ranking.get_rankings()
- class Definition(evaluation_results, sorting_key=None, ascending=True)¶
Bases: object
Definition object used to define an EvaluationRanking.
- get_resource_dependencies()¶
- get_rankings()¶
Return list of evaluation results, ranked based on sorting key in definition.
Each item in this list is a tuple of the form (summary_metrics, evaluation_result.name, evaluation_result.package).
- hydrate(obj)¶
Hydrate the state of an evaluation ranking from a serializable representation of it.
- init()¶
- resource_type = 'evaluation_ranking'¶
- serialize()¶
Return the serializable representation of an evaluation ranking’s state.
- tables()¶
Returns the entries and their fields for this evaluation ranking in table form.
Ensures the evaluation ranking is computed before doing so.
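The get_rankings behavior described above (a list of (summary_metrics, name, package) tuples, ordered by the sorting key applied to each result's summary metrics) can be sketched with plain tuples. This is illustrative only; rank_results is a hypothetical name:

```python
def rank_results(results, sorting_key=None, ascending=True):
    """Illustrative sketch of EvaluationRanking.get_rankings: sort
    (summary_metrics, name, package) tuples by sorting_key(summary_metrics)."""
    if sorting_key is None:
        return list(results)  # no key: preserve input order
    return sorted(results,
                  key=lambda item: sorting_key(item[0]),
                  reverse=not ascending)

results = [
    ({'score': 0.72}, 'eval_a', 'pkg'),
    ({'score': 0.91}, 'eval_b', 'pkg'),
]
ranked = rank_results(results,
                      sorting_key=lambda m: m['score'],
                      ascending=False)
assert [name for _, name, _ in ranked] == ['eval_b', 'eval_a']
```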
macrosdk.evaluation_result module¶
- class macrosdk.evaluation_result.EvaluationResult(name, *args, **kwargs)¶
Bases: macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent an evaluation result.
An evaluation result is the output of an evaluation metric on a pair of data sets containing the ground truth labels and model predictions respectively.
To create one, you would typically use an evaluation metric in the following way:
evaluation_metric = EvaluationMetric.load("my_evaluation_metric")
evaluation_data_set, evaluation_result = evaluation_metric.create_evaluation_result(
    model_wrapper, data_set, class_schema)
- class Definition(evaluation_metric, model_wrapper, label_data_set, prediction_data_set, class_schema, num_rows)¶
Bases:
object
Definition object used to define an EvaluationResult.
- get_generator()¶
Return a generator function, args and kwargs, which creates a generator that yields evaluation outputs.
- get_resource_dependencies()¶
- get_all_metrics()¶
Get all computed metrics for this evaluation result, one for each field set.
- get_summary_metrics()¶
Get summary metrics for this evaluation result.
- html()¶
- hydrate(obj)¶
Hydrate the state of an evaluation result from a serializable representation of it.
- init()¶
- resource_type = 'evaluation_result'¶
- serialize()¶
Return the serializable representation of an evaluation result’s state.
- tables()¶
Get summary metrics for this evaluation result, in table form.
macrosdk.field module¶
Fields module that contains code for working with data fields.
FieldDescriptor: a struct that describes a piece of data.
FieldSet: a set of data (each datum described by a FieldDescriptor).
- class macrosdk.field.FieldDescriptor(schema_entries)¶
Bases:
object
Struct that describes a piece of data.
It is an immutable container of schema entries. Order of schema entries does not matter.
- get_entries_of_schema(data_schema)¶
Return all entries of a given schema, if any.
Convenience function to more easily access the schema entries associated with this descriptor.
- class macrosdk.field.FieldSet¶
Bases:
object
Struct that represents a set of fields.
It is a mutable collection of fields, where a field is some data that is described by a descriptor.
- add_field(field_descriptor, data)¶
Add data for a field to this field set by its descriptor.
Raises an AssertionError if the field set already contains this field descriptor.
- copy()¶
Make a shallow copy of this field set.
- classmethod create()¶
Create an empty field set.
- get_field(field_descriptor)¶
Get field data for the specified field descriptor.
Raises an AssertionError if the field descriptor does not exist in the field set.
- has_field(field_descriptor)¶
Returns true if this field set has the specified field descriptor.
- iter_fields()¶
Returns an iterable of (field descriptor, field data) tuples over the fields of this field set.
- remove_field(field_descriptor)¶
Remove data for a field from this field set by its descriptor.
Raises an AssertionError if the field set does not contain this field descriptor.
- union_with(*field_sets)¶
Take union of this field set with other field set(s).
If there is an overlap of field descriptors, the later field sets take precedence.
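The precedence rule can be illustrated standalone, with plain dicts standing in for field sets (keys as field descriptors, values as field data):

```python
def union(*field_sets):
    # Later field sets overwrite earlier ones on overlapping descriptors.
    merged = {}
    for field_set in field_sets:
        merged.update(field_set)
    return merged

base = {'descriptor_a': 1, 'descriptor_b': 2}
override = {'descriptor_b': 20, 'descriptor_c': 3}
merged = union(base, override)  # descriptor_b comes from the later set
```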
macrosdk.inspect_utils module¶
- macrosdk.inspect_utils.capture_local_variables(fn)¶
Function decorator that captures local variables of a function’s execution.
For example:
@capture_local_variables
def foo():
    a = 1
    b = 2
foo() will return {'a': 1, 'b': 2}
Note that the wrapped function’s return value is completely discarded.
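One way such a decorator can work (a sketch using a profiling hook; not necessarily the SDK's actual implementation):

```python
import sys

def capture_local_variables(fn):
    # Sketch: capture fn's local variables at the moment its frame returns.
    def wrapper(*args, **kwargs):
        captured = {}

        def profiler(frame, event, arg):
            if event == 'return' and frame.f_code is fn.__code__:
                captured.update(frame.f_locals)

        previous = sys.getprofile()
        sys.setprofile(profiler)
        try:
            fn(*args, **kwargs)  # the return value is discarded
        finally:
            sys.setprofile(previous)
        return captured
    return wrapper

@capture_local_variables
def foo():
    a = 1
    b = 2

result = foo()  # {'a': 1, 'b': 2}
```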
- macrosdk.inspect_utils.get_locals_from_module(module)¶
Returns non-builtin locals within a Python module.
Currently unused.
macrosdk.model_server module¶
- class macrosdk.model_server.ModelServer(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.Resource
Class used to represent a model server.
A model server takes some resources (typically a model wrapper) as input and defines the request endpoint (a string) and associated handler logic for serving requests.
To create a model server, choose an endpoint string and define a factory function that returns an async request handler, and use those to instantiate a ModelServer.Definition instance. Then pass that definition instance to the
ModelServer.create() method:

def request_handler_maker(*args):
    async def request_handler(request):
        data = await request.post()
        ...  # process data
        resp = web.StreamResponse(
            status=200, reason='OK',
            headers={'Content-Type': 'text/json'})
        await resp.prepare(request)
        await resp.write(json.dumps(output_json).encode())
        return resp
    return request_handler

model_server = ModelServer.create(
    "my_model_server",
    ModelServer.Definition(
        "my_endpoint", request_handler_maker, *args))
To use a previously-defined model server, load it in the following way:
model_server = ModelServer.load("my_model_server")

# add route to an app server like one using aiohttp.web
app.router.add_route(
    'POST',
    "/" + model_server.get_request_endpoint(),
    model_server.get_request_handler())
- class Definition(request_endpoint, request_handler_fn, *args, **kwargs)¶
Bases:
object
Definition object used to define a ModelServer.
- get_request_endpoint()¶
Accessor function for request endpoint string.
- get_request_handler()¶
Factory function for a request handler.
Request handler is an async function that takes a request and returns a response.
- get_resource_dependencies()¶
- get_request_endpoint()¶
Return request endpoint.
- get_request_handler()¶
Return an async request handler.
- init()¶
- resource_type = 'model_server'¶
macrosdk.model_trainer module¶
- class macrosdk.model_trainer.ModelTrainer(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.Resource
Class used to represent a model trainer.
A model trainer takes a data set as input, runs training, and produces a trained model in the form of a model wrapper.
To create a model trainer, subclass the ModelTrainer.Definition class and provide implementations for its interface methods. Then pass that definition instance to the
ModelTrainer.create() method:

class CustomTrainer(ModelTrainer.Definition):
    @classmethod
    def build_local_data_set(cls, handle, data_set, input_descriptors, output_schemas):
        # Write data set to disk in a format used by the training script
        ...

    @classmethod
    def run_training(cls, handle, model_params, data_set_params, **kwargs):
        # Run training and store model artifacts via the handle
        ...

    @classmethod
    def get_model_evaluator(cls, handle, training_params):
        # Load model and return a model evaluator function that fulfills
        # the following interface:
        def model_evaluator(inputs_dict, *output_schemas):
            output = FieldSet.create()
            ...
            return output
        return model_evaluator

model_trainer = ModelTrainer.create(
    "my_model_trainer", CustomTrainer(model_params))
To use a previously-defined model trainer, load it in the following way:
model_trainer = ModelTrainer.load("my_model_trainer")
_, model_wrapper = model_trainer.create_model_wrapper(
    data_set, input_descriptors, output_schemas)
- class Definition(model_params)¶
Bases:
object
Definition object used to define a ModelTrainer.
- classmethod build_local_data_set(handle, data_set, input_descriptors, output_schemas)¶
Build a representation of the input data set on local disk for training.
Subclasses should implement this method.
- get_generate_compute_function()¶
Returns a function that computes the model.
This would run training using the input data blob.
- get_generate_download_function()¶
Returns a function that produces a download function for a training data blob.
This is the step that takes a data set and builds a representation of it on local disk for training, as a data blob.
- get_generate_load_function()¶
Returns a function that loads a model and returns a model evaluator function.
- classmethod get_model_evaluator(handle, training_params)¶
Load model and return a model evaluator function that fulfills the following interface:
def model_evaluator(inputs_dict, *output_schemas):
    output = FieldSet.create()
    ...
    return output

return model_evaluator
Subclasses should override this method to specify a custom function.
- get_resource_dependencies()¶
- classmethod initialize_with_model_wrapper(handle, initialization_model_wrapper)¶
Initialize training with an existing model wrapper.
Subclasses should implement this method.
- classmethod run_training(handle, model_params, data_set_params, **kwargs)¶
Run training of the model given an input data set and model parameters.
Subclasses should implement this method.
- create_data_blob(data_set, input_descriptors, output_schemas, output_name=None)¶
Create data blob containing data set that has been processed into a format suitable for training.
- create_model_wrapper(data_set, input_descriptors, output_schemas, initialization_model_wrapper=None, validation_data_set=None, validation_output_name=None, replace_invalid_with_none=False, output_name=None)¶
Create model wrapper from data set, creating an intermediary data blob containing processed training data.
- create_model_wrapper_from_data_blob(data_blob, input_descriptors, output_schemas, initialization_model_wrapper=None, validation_data_blob=None, replace_invalid_with_none=False, output_name=None)¶
Create model wrapper from training on data blob containing processed data set.
If initialization_model_wrapper is specified, use it to initialize training.
- init()¶
- resource_type = 'model_trainer'¶
macrosdk.model_wrapper module¶
- class macrosdk.model_wrapper.ModelWrapper(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable, macrosdk.resource.Resource
Class used to represent a model wrapper.
A model wrapper represents a standalone algorithm or the trained model output of a model trainer.
It provides an initialization context to load the model into memory, and an API that takes an input field set or data set and returns the corresponding output field set or data set.
To create a model wrapper from a model trainer, see
ModelTrainer.create_model_wrapper(). To create a model wrapper for a standalone function, subclass the ModelWrapper.Definition class and provide implementations for its interface methods. Then pass that definition instance to the
ModelWrapper.create() method:

class CustomModel(ModelWrapper.Definition):
    @classmethod
    def get_model_evaluator(cls, handle, model_params):
        # Load model and return a model evaluator function that fulfills
        # the following interface:
        def model_evaluator(inputs_dict, *output_schemas):
            output = FieldSet.create()
            ...
            return output
        return model_evaluator

model_wrapper = ModelWrapper.create(
    "my_model_wrapper",
    CustomModel(model_params, input_descriptors, output_schemas))
To use a previously-defined model wrapper, load it in the following way:
model_wrapper = ModelWrapper.load("my_model_wrapper")
output_field_set = model_wrapper.create_field_set_from_eval(input_field_set)
output_data_set = model_wrapper.create_data_set_from_eval(input_data_set)
- class Definition(model_params, input_descriptors, output_schemas, replace_invalid_with_none=False)¶
Bases:
object
Definition object used to define a ModelWrapper.
If replace_invalid_with_none is True, then invalid input field sets to the model (i.e. those with missing input descriptors) will be replaced entirely with a None value instead of raising an assertion. The model evaluation function has to be able to handle None values as input accordingly.
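The behavior can be sketched as a standalone input guard; the function and names here are hypothetical illustrations, not SDK code:

```python
def guard_field_set(field_set, input_descriptors, replace_invalid_with_none=False):
    # A field set is invalid if any required input descriptor is missing.
    missing = [d for d in input_descriptors if d not in field_set]
    if not missing:
        return field_set
    if replace_invalid_with_none:
        return None  # the model evaluator must then accept None inputs
    raise AssertionError('missing input descriptors: %r' % missing)

valid = guard_field_set({'image': 'data', 'label': 'cat'}, ['image'])
replaced = guard_field_set({'label': 'cat'}, ['image'],
                           replace_invalid_with_none=True)
```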
- classmethod compute_model(handle, model_params)¶
Method to compute the model if it is required for inference.
Subclasses may override this with custom behavior.
- get_load_function()¶
Accessor for the load function.
Subclasses may override this with custom behavior.
- classmethod get_model_evaluator(handle, model_params)¶
Load model and return a model evaluator function that fulfills the following interface:
def model_evaluator(inputs_dict, *output_schemas):
    output = FieldSet.create()
    ...
    return output

return model_evaluator
Subclasses should override this method to specify custom functionality.
- get_resource_dependencies()¶
- class Handle(model_wrapper)¶
Bases:
object
Handle to encapsulate the model wrapper, primarily for writes to its data dir.
- copy_file(src_path, dest_path)¶
Copy a file from another location into this model wrapper.
- download_url(url, dest_path, use_cache=True)¶
Convenience function to download contents of a url into the model wrapper.
- from_relative_path(other_path)¶
Convert a relative path within the model wrapper to an absolute path.
- get_data_dir()¶
Get the absolute data directory used to store data for this resource.
- read_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file read opener to a relative path within the model wrapper.
- to_relative_path(other_path)¶
Convert an absolute path to a relative path within the model wrapper.
- write_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file write opener to a relative path within the model wrapper.
Ensures the data directory is created before attempting to open the file.
- create_data_set_from_eval(data_set, include_original_fields=False, num_rows=None, num_processes=1, should_precompute=True, output_name=None)¶
Create an output data set by evaluating this model wrapper on an input data set.
If include_original_fields is True, the original input field set is unioned with the output field set. If should_precompute is False, we don’t precompute the data set. If num_processes is None, use all available cores (default is 1).
- create_field_set_from_eval(field_set, include_original_fields=False)¶
Create an output field set by evaluating this model wrapper on an input field set.
If include_original_fields is True, the original input field set is unioned with the output field set.
- handle()¶
Return a handle object that can be used to read and write data for the model wrapper.
- hydrate(obj)¶
Hydrate the state of a model wrapper from a serializable representation of it.
- init()¶
- load_model()¶
Context manager to load a model so that it is ready for repeated inference.
Ensures the model wrapper is computed prior to yielding.
- resource_type = 'model_wrapper'¶
- serialize()¶
Return the serializable representation of a model wrapper’s state.
macrosdk.mp_utils module¶
- macrosdk.mp_utils.run()¶
Decorator to run function in separate process.
The decorated function has to be at the module level (i.e. not nested) for this to work.
macrosdk.package module¶
Package module. A package is a version of a project, identified by the project name and a version hash.
This module handles aspects of importing and running the code in the package. Turning a project into a package with a specific version hash is left to a separate (CLI) library.
This module also maintains global state of any imported packages, and exports some accessors for inspecting that state.
- class macrosdk.package.ImportedPackage(name, config, timestamp, module, module_dir, data_dir, log_dir, macro_project_locals)¶
Bases:
object
Representation of an imported Macro package.
name: package name, e.g. v0/data_schemas/macro/structure==596d9e3bcd
config: an instance of ProjectConfig representing the project’s config.
module: Python module of the package.
module_dir: Directory of the module.
macro_project_locals: dict containing the local variables within the package’s macro_project() function invocation, keyed by name.
- property dependencies¶
- property description¶
- directory_content(rel_path)¶
- file_content(rel_path)¶
- has_directory(rel_path)¶
- has_file(rel_path)¶
- property metadata¶
- property project_name¶
- exception macrosdk.package.PackageImportError¶
Bases:
Exception
- macrosdk.package.clean_slate_context()¶
Ensure code in the context block runs with a clean slate, and clean up sys.path afterwards.
- macrosdk.package.data_cache_dir()¶
Fetch the global data cache directory. Only applicable for local project runs.
- macrosdk.package.get_package_data_dir()¶
Return the data directory for the current package we’re in.
- macrosdk.package.get_package_data_dir_for_resource(resource_type, name)¶
Return the data directory for a resource in a package.
- macrosdk.package.get_package_log_dir()¶
Return the log directory for the current package we’re in.
- macrosdk.package.get_package_log_dir_for_resource(resource_type, name)¶
Return the log directory for a resource in a package.
- macrosdk.package.get_package_module_dir()¶
Return the module directory for the current package we’re in.
- macrosdk.package.get_package_names()¶
Return sorted list of distinct package names that have been imported.
- macrosdk.package.get_package_names_by_recency()¶
Return list of distinct package names that have been imported, sorted by recency.
- macrosdk.package.get_package_names_for_project_by_recency(project_name)¶
Return list of package names for a project, sorted by recency.
- macrosdk.package.get_package_with_name(package_name)¶
Return the ImportedPackage instance for a given package name.
- macrosdk.package.get_project_config()¶
Return the project config of the current project.
- macrosdk.package.get_project_names()¶
Return sorted list of distinct project names that have been imported.
- macrosdk.package.import_package(dep)¶
Import a package by its qualified name.
Example: 'v0/data_schemas/macro/structure==596d9e3bcd'
Returns an ImportedPackage containing information such as the package’s python module and metadata.
This needs to be executed within the run_script_context so that python module paths can be correctly resolved.
- macrosdk.package.log_cache_dir()¶
Fetch the global log cache directory. Only applicable for local project runs.
- macrosdk.package.override_data_cache_dir(cache_dir)¶
Override the global data cache directory. Used for testing.
- macrosdk.package.override_log_cache_dir(cache_dir)¶
Override the global log cache directory. Used for testing.
- macrosdk.package.package_execution_context(package_name, deploy_dir=None, local_root_dir=None)¶
Package execution context.
Allows us to import and run a package without exiting the run script context. Yields an ImportedPackage instance which encapsulates information about the package.
If deploy_dir is specified, we use that as the base directory for our imports instead of going to the root/parent directory containing our project’s directory.
If local_root_dir is specified, we use that as the base directory for our imports, and we import local projects instead of packages. There are no version hashes and link dependencies are allowed.
- macrosdk.package.register_imported_package(package_name, imported_package)¶
Register the ImportedPackage instance to a given package name.
- macrosdk.package.reset_imported_packages()¶
Clear all state about which packages have been imported.
- macrosdk.package.reset_state()¶
Reset all state.
This includes imported projects, dependency tracking, and registered resources.
- macrosdk.package.run_inline_script(package_name)¶
Context manager for running an inline script within a package’s context.
Sometimes, e.g. in a notebook, you want to execute some code within a package’s context without exiting the run_script_context.
- macrosdk.package.run_package(package_name, deploy_dir=None, local_root_dir=None)¶
Run the given package.
This convenience helper sets up the run_script_context and runs the package by importing it.
If deploy_dir is specified, we use that as the base directory for our imports instead of going to the root/parent directory containing our project’s directory.
If local_root_dir is specified, we use that as the base directory for our imports, and we import local projects instead of packages. There are no version hashes and link dependencies are allowed.
- macrosdk.package.run_script_context(package_name, deploy_dir=None, local_root_dir=None)¶
Context manager to run a script within a package context.
This sets up the necessary pre-requisites for importing the python module of a package (e.g. being in the correct directory so that module paths will resolve properly).
This is a helper so we can run a package from both the CLI and also a notebook or script.
If deploy_dir is specified, we execute the package within that directory. Otherwise, we are working locally within the project directory.
If local_root_dir is specified, we use that as the base directory for our imports, and we import local projects instead of packages. There are no version hashes and link dependencies are allowed.
macrosdk.remote module¶
Remote layer for caching.
- macrosdk.remote.RemoteCacheContext(remote_cache)¶
Context manager to set the specified remote cache instance as the one to be used.
- class macrosdk.remote.RemoteCacheInterface¶
Bases:
object
Generic interface for any implementation of remote cache.
- abstract contains(unique_key)¶
- abstract download(unique_key, data_dir)¶
- abstract upload(unique_key, data_dir)¶
- macrosdk.remote.get_remote_cache()¶
Return the instance of the remote cache being used.
- macrosdk.remote.is_remote_cache_available()¶
Returns true if we have configured a remote cache to be used.
macrosdk.render module¶
Render module that contains code for working with rendering.
- class macrosdk.render.Render¶
Bases:
object
- class Animation(images, label, include_static_thumbnail=False, max_height=160, max_width=240)¶
Bases:
macrosdk.render.Render._Item
Representation of a thumbnail animation, a rendering primitive.
To create one, do the following:
# images are (H, W, 3) numpy arrays.
render_animation = Render.Animation(images, label='text')
render_set.add_thumbnail(render_animation)
- MAX_HEIGHT = 160¶
- MAX_WIDTH = 240¶
- get_rendered_html()¶
- class Image(image, label=None, max_height=160, max_width=240)¶
Bases:
macrosdk.render.Render._Item
Representation of a thumbnail image, a rendering primitive.
To create one, do the following:
# image is a (H, W, 3) numpy array.
render_image = Render.Image(image, label='text')
render_set.add_thumbnail(render_image)
- MAX_HEIGHT = 160¶
- MAX_WIDTH = 240¶
- get_rendered_html()¶
- class Set(lazy_load_filename=None)¶
Bases:
object
Object to encapsulate set of data to be passed to the view for rendering.
We use lazy_load_filename if we want to defer loading of its contents to later.
To create one, do the following:
render_set = Render.Set.create()
thumbnail = ...  # pick one of the render primitives
render_set.add_thumbnail(thumbnail)
To add additional details for a given datapoint, do the following:
render_set.set_details(dict(
    component=...,
    title=...,
    path=...,
))
- ELEMENT_BEGIN = '<article class="renderedHtmlItem" style="border: 1px solid #E3E7EE; border-radius: 0.375rem; display: flex; flex-direction: column; gap: 0.25rem; padding: 0.5rem; position: relative;">'¶
- ELEMENT_END = '</article>'¶
- add_thumbnail(thumbnail)¶
Add a thumbnail to this render set.
- classmethod create()¶
Create an empty render set.
- ensure_loaded()¶
Ensure that this render set is loaded, if we are doing lazy loading.
- get_details()¶
Get the details for the detailed view of this render set.
- get_rendered_html()¶
Get the rendered html for displaying this render set.
- get_thumbnails()¶
Get all thumbnails of this render set.
- classmethod read_from_file(filename)¶
Read the contents of this render set from the given filename.
- set_details(details)¶
Set the details for the detailed view of this render set.
- write_to_file(filename)¶
Write the contents of this render set to the given filename.
- class Table(data, label=None, column_labels=None, row_labels=None)¶
Bases:
macrosdk.render.Render._Item
Representation of a thumbnail table, a rendering primitive.
To create one, do the following:
# data is a list of rows, each containing a list of column data.
render_table = Render.Table(data, label='text')
render_set.add_thumbnail(render_table)
- get_rendered_html()¶
- class TagList(texts, text_colors=None, background_colors=None)¶
Bases:
macrosdk.render.Render._Item
Representation of a thumbnail tag list, a rendering primitive.
To create one, do the following:
# text_colors and background_colors are either lists of colors
# with same size as texts, or a dictionary mapping text to a color string.
render_tag_list = Render.TagList(texts, text_colors, background_colors)
render_set.add_thumbnail(render_tag_list)
- get_rendered_html()¶
macrosdk.resource module¶
Resource module.
Contains logic to deal with resources, which are usually stateful first-class objects used in a project as part of a machine learning pipeline.
This contains state for a global registry of resources, and exposes some accessors to inspect that state.
- class macrosdk.resource.Computable(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Storable
Mix-in that handles aspects of making a resource stateful and computable.
- compute()¶
- is_computed()¶
- is_disk_cached()¶
Returns true if computed state is available on disk.
- is_remote_cached()¶
Returns true if computed state is available on remote.
- load_from_disk_cache()¶
Load persisted state from disk to memory.
Note this is only for the portion of a resource’s computed state that was in memory in the first place. Some of the computed state might be directly written to disk as part of _compute().
- load_from_remote_cache()¶
Load persisted state from remote to disk.
- set_computed()¶
- set_disk_cached()¶
Set a flag that indicates computing and writing state is done.
- write_to_disk_cache()¶
Write state in memory to disk for persistence.
Note that not all of a resource’s computed state might be in memory. Some might already be written to disk as part of _compute().
- write_to_remote_cache()¶
Write computed state on disk to remote.
- class macrosdk.resource.Registry¶
Bases:
object
Global registry of resources that have been registered in memory.
A resource is registered when it is declared via the create() method.
- add_resource(resource)¶
Add a resource object to our registry.
- get_all_resource_types()¶
Return a sorted list of distinct resource types that have been registered.
- get_all_resources()¶
Return a list of all resource objects that have been registered.
- get_resource(resource_type, name, package=None)¶
Get the singleton representing a given resource by its type and name.
If there are multiple packages with the same resource type and name, disambiguate it by passing in the package name. Otherwise, we do inference based on the package dependency tree and the current package context.
- get_resource_names_by_package_and_type(leaf_package=None)¶
Get all resource names, indexed by package name, and resource type.
If leaf_package is a specified package, limit the search to packages in the ancestor graph of that package. If leaf_package is None, infer the leaf package from the current package context. If leaf_package is False, include resources from all packages.
- get_resources_by_type(resource_type, leaf_package=None)¶
Get all resources by type.
If leaf_package is a specified package, limit the search to packages in the ancestor graph of that package. If leaf_package is None, infer the leaf package from the current package context. If leaf_package is False, include resources from all packages.
- get_resources_of_package(package)¶
Get all resources of a package, in topological sort order.
- has_resource(resource)¶
Returns true if the registry has a given resource object.
- remove_resource(resource)¶
Remove a resource object from our registry.
This is rarely used, usually only when we want to recreate a resource on-demand.
- reset()¶
- class macrosdk.resource.Resource(name, *, _allowed=False, **kwargs)¶
Bases:
object
Base class that all Resource objects derive from.
- classmethod create(name, definition, *args, **kwargs)¶
Create an instance of a resource with given name and definition.
- init()¶
- classmethod load(name, package=None)¶
Load an existing singleton instance for a resource given its name and package.
- resource_type = None¶
- unique_key()¶
- exception macrosdk.resource.ResourceAlreadyExists¶
Bases:
Exception
- class macrosdk.resource.Storable(name, *args, **kwargs)¶
Bases:
object
Mix-in that handles some aspects of data storage for a resource.
- CACHE_YAML = 'cache.yaml'¶
- macrosdk.resource.get_all_resource_types()¶
- macrosdk.resource.get_all_resources()¶
- macrosdk.resource.get_resource(resource_type, name, package=None)¶
- macrosdk.resource.get_resource_names_by_package_and_type(leaf_package=None)¶
- macrosdk.resource.get_resources_by_type(resource_type, leaf_package=None)¶
- macrosdk.resource.get_resources_of_package(package)¶
- macrosdk.resource.has_resource(resource)¶
- macrosdk.resource.remove_resource(resource)¶
- macrosdk.resource.reset_registry()¶
- macrosdk.resource.set_resource_compute_logging_context(logging_ctx)¶
Set context factory function to be used for logging resource computation.
This context factory function should take a resource object as input, and return a context manager. When entered, this context manager typically should redirect stdout/stderr to log files.
This will only apply when calling the _compute() inner logic, and exclude any caching operations.
We instantiate a new context each time because some resources might have nested calls to compute other resources, in which case, a different instantiated context manager would be used for each resource.
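A minimal factory following this contract; here the resource is stood in by its name, and logs go to in-memory buffers instead of log files:

```python
import contextlib
import io

log_buffers = {}

def compute_logging_context(resource_name):
    # Factory: build a fresh context manager per resource so nested resource
    # computations each capture into their own buffer.
    buffer = io.StringIO()
    log_buffers[resource_name] = buffer
    return contextlib.redirect_stdout(buffer)

with compute_logging_context('my_model_wrapper'):
    print('compute step 1')

captured = log_buffers['my_model_wrapper'].getvalue()  # 'compute step 1\n'
```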
- macrosdk.resource.set_resource_log_upload_function(upload_fn)¶
Set upload function to be used for uploading resource logs.
This function should take a resource object as input, and perform uploading of the resource’s log to remote cache.
macrosdk.serde_utils module¶
- macrosdk.serde_utils.deserialize_dict(d, deserializer=None)¶
Return an OrderedDict from a serialized list of items.
If optional deserializer is specified, apply it to both keys and values as part of deserialization.
- macrosdk.serde_utils.serialize_dict(d, serializer=None)¶
Given dictionary d which might be an OrderedDict, return a serialization with primitive types.
An OrderedDict can’t be serialized into YAML directly. Instead, convert its items to a list to preserve key order.
If optional serializer is specified, apply it to both keys and values as part of serialization.
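The documented behavior can be sketched as a standalone re-implementation (for illustration only, not the SDK source):

```python
from collections import OrderedDict

def serialize_dict(d, serializer=None):
    # Emit a list of [key, value] pairs so key order survives YAML.
    serializer = serializer or (lambda x: x)
    return [[serializer(k), serializer(v)] for k, v in d.items()]

def deserialize_dict(items, deserializer=None):
    deserializer = deserializer or (lambda x: x)
    return OrderedDict((deserializer(k), deserializer(v)) for k, v in items)

original = OrderedDict([('b', 2), ('a', 1)])
round_tripped = deserialize_dict(serialize_dict(original))  # same key order
```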
macrosdk.storage module¶
Storage module.
Contains primitives to handle efficient read/write storage for resources. These wrap a Storable object, which exposes methods for storing data on disk with that object. (Currently Storable is defined in resource module for legacy reasons.)
Written records are sharded into different files on disk based on the num_shards parameter. Reading and writing with multiple processes is also supported. Shards are mapped to processes, so the two counts are independent, but it is helpful for the number of shards to be larger than the number of processes.
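A hypothetical modulo assignment of shards to processes (an assumption for illustration, not necessarily the SDK's mapping) shows why having more shards than processes keeps per-process load balanced:

```python
num_shards, num_processes = 10, 4

# Hypothetical assignment: shard i is handled by process i % num_processes.
shards_per_process = {p: [] for p in range(num_processes)}
for shard in range(num_shards):
    shards_per_process[shard % num_processes].append(shard)

loads = [len(shards) for shards in shards_per_process.values()]
# loads differ by at most one shard: [3, 3, 2, 2]
```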
- class macrosdk.storage.NullWriter¶
Bases:
object
Equivalent of a StreamWriter that does nothing.
- write_record(record)¶
- class macrosdk.storage.StreamReader(storable, stream_name, num_processes=None)¶
Bases:
object
Read interface to stream store.
Usage:
with StreamReader(
    storable,
    stream_name,
    num_processes=4,
) as stream_reader:
    for record in stream_reader:  # iterator
        ...
- get_records()¶
- class macrosdk.storage.StreamWriter(storable, stream_name, num_processes=None, num_shards=100)¶
Bases:
object
Write interface to stream store.
Usage:
with StreamWriter(
    storable,
    stream_name,
    num_processes=4,
    num_shards=10,
) as stream_writer:
    record = ...
    stream_writer.write_record(record)
- write_record(record)¶
macrosdk.sys_utils module¶
- macrosdk.sys_utils.restore_sys_path_and_modules()¶
Context manager that ensures sys.path and sys.modules are restored.
We want to ensure importing/running a project doesn’t leave behind any side-effects (which can affect importing/running another project).
Module contents¶
- class macrosdk.Computable(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Storable
Mix-in that handles aspects of making a resource stateful and computable.
- compute()¶
- is_computed()¶
- is_disk_cached()¶
Returns true if computed state is available on disk.
- is_remote_cached()¶
Returns true if computed state is available on remote.
- load_from_disk_cache()¶
Load persisted state from disk to memory.
Note this is only for the portion of a resource’s computed state that was in memory in the first place. Some of the computed state might be directly written to disk as part of _compute().
- load_from_remote_cache()¶
Load persisted state from remote to disk.
- set_computed()¶
- set_disk_cached()¶
Set a flag that indicates computing and writing state is done.
- write_to_disk_cache()¶
Write state in memory to disk for persistence.
Note that not all of a resource’s computed state might be in memory. Some might already be written to disk as part of _compute().
- write_to_remote_cache()¶
Write computed state on disk to remote.
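The methods above suggest a cheapest-source-first lookup order when computing a resource. The toy class below sketches one plausible way they could compose (hypothetical driver logic, not macrosdk's actual compute()):

```python
# Toy model of a Computable's cache fallback: memory <- disk <- remote
# <- fresh compute. All names and storage fields here are stand-ins.
class ToyComputable:
    def __init__(self):
        self.disk = None      # stands in for the on-disk cache
        self.remote = 42      # stands in for the remote cache
        self.state = None
        self.computed_from = None

    def is_disk_cached(self):
        return self.disk is not None

    def is_remote_cached(self):
        return self.remote is not None

    def compute(self):
        # Cheapest source first: disk, then remote, then a fresh compute.
        if self.is_disk_cached():
            self.state = self.disk          # load_from_disk_cache()
            self.computed_from = 'disk'
        elif self.is_remote_cached():
            self.disk = self.remote         # load_from_remote_cache()
            self.state = self.disk          # load_from_disk_cache()
            self.computed_from = 'remote'
        else:
            self.state = self._compute()
            self.disk = self.state          # write_to_disk_cache()
            self.remote = self.state        # write_to_remote_cache()
            self.computed_from = 'fresh'
        return self.state

    def _compute(self):
        return 42

c = ToyComputable()
result = c.compute()
```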
- class macrosdk.DataBlob(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent a data blob.
A data blob encapsulates and mediates access to large binary data objects used as data input to a machine learning pipeline. Once retrieved from the cloud or local disk, it is preserved as a snapshot to ensure reproducible and download-efficient runs of downstream compute tasks.
To create a data blob, define a function that performs the necessary data fetches and instantiate it with a DataBlob.Definition:
def download_fn(handle):
    with handle.write_opener(tarfile.open, "data.tar.gz", "w:gz") as f:
        f.write(requests.get('https://macro.ai').text)
    return dict(metadata_key='metadata_value')

data_blob = DataBlob.create("my_data_blob", DataBlob.Definition(download_fn))
To use a previously-defined data blob, load it in the following way:
data_blob = DataBlob.load("my_data_blob")
with data_blob.read_opener(tarfile.open, "data.tar.gz", "r:gz") as f:
    print(f.read())  # retrieve data that was stored
print(data_blob.get_metadata())  # {'metadata_key': 'metadata_value'}
To leverage parallel processing for data blob operations, use the following API:
def download_fn(handle, *args):
    for data in datapoints:
        yield (handle, data,) + args  # yield tuple of data

def process_input(handle, data, arg1, arg2):
    with handle.write_opener(...) as f:
        ...  # download and write based on input
    return dict(metadata_key='metadata_value')  # gets combined into a single dict

data_blob = DataBlob.create(
    "my_data_blob",
    DataBlob.Definition(download_fn, parallelized_iter=process_input))
This will spin up a pool of worker processes under the hood automatically to call process_input for each item returned by the generator of the download function.
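The fan-out-and-merge pattern described above can be sketched with a thread pool standing in for macrosdk's worker processes. All names here are illustrative, including how the per-task metadata dicts get combined:

```python
from multiprocessing.pool import ThreadPool

# Sketch of the parallel pattern: a generator yields one task tuple per
# item, workers process each task, and per-task metadata dicts are
# merged into a single dict at the end.
def download_fn(handle, suffix):
    for data in ['a', 'b', 'c']:
        yield (handle, data, suffix)  # one task tuple per item

def process_input(task):
    handle, data, suffix = task
    # A real task would download and write files via the handle here.
    return {data + suffix: 'done'}  # per-task metadata dict

with ThreadPool(2) as pool:
    partial_metadata = pool.map(process_input, download_fn('handle', '_x'))

# The per-task metadata dicts get combined into a single dict.
metadata = {}
for part in partial_metadata:
    metadata.update(part)
```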
- class Definition(download_fn, *args, parallelized_iter=None, **kwargs)¶
Bases:
objectDefinition object used to define a DataBlob.
By default, this is constructed with a download function (function that performs the necessary download of data for a blob). If additional functionality is needed, consider subclassing this class and overriding the get_download_function method.
If parallelized_iter is passed in, we expect download_fn to be a generator for tasks, and we will apply the parallelized_iter function to each task in a multiprocess pool.
- get_download_function()¶
Accessor for the download function.
Subclasses may override this with custom behavior.
- get_resource_dependencies()¶
- class Handle(data_blob)¶
Bases:
objectHandle to encapsulate the data blob, primarily the data dir and all writes.
- add_metadata(key, value)¶
Add metadata to the data blob.
- copy_file(src_path, dest_path)¶
Copy file from other location into this data blob.
- download_url(url, dest_path, use_cache=True)¶
Convenience function to download contents of a url into the data blob.
- from_relative_path(other_path)¶
Convert a relative path within the data blob to an absolute path.
- get_data_dir()¶
Get the absolute data directory used to store data for this resource.
- read_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file read opener to a relative path within the data blob.
Unlike DataBlob.read_opener, this does not ensure the resource is computed. This is meant to be used if the download function of a DataBlob needs to read some of the data that has been written to it.
- remove_file(dest_path)¶
Remove a file from the data blob.
- to_relative_path(other_path)¶
Convert an absolute path to a relative path within the data blob.
- write_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file write opener to a relative path within the data blob.
Ensures the data directory is created before attempting to open the file.
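The two path helpers above plausibly behave like os.path conversions against the blob's data directory. A sketch, with a hypothetical data dir:

```python
import os

# Plausible behavior of the Handle path helpers, sketched with os.path.
# DATA_DIR is hypothetical; the real value comes from get_data_dir().
DATA_DIR = '/data/my_data_blob'

def from_relative_path(other_path):
    # Relative path inside the blob -> absolute path on disk.
    return os.path.join(DATA_DIR, other_path)

def to_relative_path(other_path):
    # Absolute path on disk -> path relative to the blob's data dir.
    return os.path.relpath(other_path, DATA_DIR)

abs_path = from_relative_path('images/cat.png')
rel_path = to_relative_path(abs_path)
```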
- get_metadata()¶
Get a copy of the metadata that was stored within the data blob.
- handle()¶
Return a handle object that can be used to store data into the data blob.
- hydrate(obj)¶
Hydrate the state of a data blob from a serializable representation of it.
- init()¶
- read_opener(opener, dest_path, *args, **kwargs)¶
Convenience function to apply a file read opener to a relative path within the data blob.
Ensures the data blob is computed before attempting to open the file.
- resource_type = 'data_blob'¶
- serialize()¶
Return the serializable representation of a data schema’s state.
- class macrosdk.DataSchema(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent a data schema.
A data schema is used to represent a class ontology or a collection of attributes. It is defined as an enumeration over a set of entries, with optional attributes for each entry.
It enables data sets and model trainers to be parameterized to handle different predictive tasks without needing to customize them repeatedly. The resulting resource can also be easily identified and used appropriately with the correct output schema.
To create a data schema, define a function that specifies the entries within the schema and instantiate it with a DataSchema.Definition:
def definition_fn(handle):
    handle.add_primary_key('class_name')
    handle.add_field('class_description')  # optional attribute(s)
    handle.add_entry(
        class_name='cat',
        class_description='an animal also known as a feline',
    )
    handle.add_entry(
        class_name='dog',
        class_description='an animal also known as a canine',
    )

data_schema = DataSchema.create(
    "my_data_schema", DataSchema.Definition(definition_fn))
To use a previously-defined data schema, load it in the following way:
data_schema = DataSchema.load("my_data_schema")
for entry in data_schema.entries():
    print(entry.primary_key)  # 'cat' or 'dog'
    print(entry.as_dict())  # {'class_name': ..., 'class_description': ...}
- class Definition(definition_fn, *args, **kwargs)¶
Bases:
objectDefinition object used to define a DataSchema.
By default, this is constructed with a definition function (function that defines the fields and records of a data schema). If additional functionality is needed, consider subclassing this class and overriding the get_definition_function method.
- get_definition_function()¶
Accessor for the definition function.
Subclasses may override this with custom behavior.
- get_resource_dependencies()¶
- class Entry(data_schema, primary_key)¶
Bases:
objectObject representing a data entry in a data schema.
This is typically used as an identifying key for the data it represents.
- as_dict()¶
Return all fields of the data schema entry as a dictionary.
- as_tuple()¶
Return the info uniquely identifying this data schema entry as a tuple.
- classmethod deserialize(obj)¶
Reconstruct a data schema entry from the serializable representation of its state.
- classmethod from_tuple(tup)¶
Reconstruct a data schema entry from the tuple representation of its state.
- serialize()¶
Return a serializable representation of an entry that can be used to reconstruct it.
- class Handle(data_schema)¶
Bases:
objectHandle to encapsulate the data schema object, for writes.
- add_entry(**kwargs)¶
Add an entry to the data schema via its handle.
- add_field(field_name)¶
Add a field to the data schema via its handle.
- add_primary_key(primary_key)¶
Add the primary key field to the data schema via its handle.
- add_entry(**kwargs)¶
Add an entry to this data schema.
The primary key and all the field names must be specified for each entry.
- add_field(field_name)¶
Add a field to this data schema.
Each field name must be unique.
- add_primary_key(field_name)¶
Add the primary key field for this data schema.
There must be exactly one primary key field set.
- entries()¶
Returns an iterable over the entries of this data schema.
Ensures the data schema is computed before doing so.
- entry(primary_key)¶
Returns a DataSchema.Entry object for a given primary key.
Ensures the data schema is computed and that primary key is defined before returning it.
- field_names()¶
Returns a list of field names for this data schema, starting with the primary key.
Ensures the data schema is computed before doing so.
- handle()¶
Return a handle object that can be used to add fields and entries to the data schema.
- has_entry(primary_key)¶
Determine if this data schema has an entry with the specified primary key.
Ensures the data schema is computed before checking.
- hydrate(obj)¶
Hydrate the state of a data schema from a serializable representation of it.
- init()¶
- resource_type = 'data_schema'¶
- serialize()¶
Return the serializable representation of a data schema’s state.
- tables()¶
Returns the entries and their fields for this data schema in table form.
Ensures the data schema is computed before doing so.
- class macrosdk.DataSchemaMapper(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent a data schema mapper.
A data schema mapper defines a mapping between two data schemas. Given a mapping between an input and output schema, the mapper allows the caller to use a set of convenience functions to transform resources from one data schema to the other.
To create a data schema mapper, subclass the DataSchemaMapper.Definition class and override the get_mapping method to return a dictionary that maps between an input and output data schema, and instantiate a DataSchemaMapper with it:
class InputToOutput(DataSchemaMapper.Definition):
    def get_mapping(self):
        return {
            self.input_schema.entry('cat'): self.output_schema.entry('feline'),
            self.input_schema.entry('dog'): self.output_schema.entry('canine'),
        }

data_schema_mapper = DataSchemaMapper.create(
    "my_data_schema_mapper", InputToOutput(input_schema, output_schema))
To use a previously-defined data schema mapper, load it in the following way:
data_schema_mapper = DataSchemaMapper.load("my_data_schema_mapper")
output_entry = data_schema_mapper.transform_data_schema_entry(input_entry)
output_field_set = data_schema_mapper.transform_field_set(input_field_set)
output_data_set = data_schema_mapper.transform_data_set(input_data_set)
output_model_wrapper = data_schema_mapper.transform_model_wrapper(input_model_wrapper)
- class Definition(input_schema, output_schema)¶
Bases:
objectDefinition object used to define a DataSchemaMapper.
- get_mapping()¶
Return a dictionary mapping entries from the input schema to the output schema.
- get_resource_dependencies()¶
- hydrate(obj)¶
Hydrate the state of a data schema mapper from a serializable representation of it.
- init()¶
- resource_type = 'data_schema_mapper'¶
- serialize()¶
Return the serializable representation of a data schema mapper’s state.
- transform_data_schema_entry(data_schema_entry)¶
Return the output data schema entry associated with the input entry in the mapping.
If the input entry is not found in the mapping, return None.
- transform_data_set(data_set, skip_empty=True, output_name=None)¶
Return a data set transformed with this data schema mapper.
Each field set in the data set will be transformed by the data schema mapper.
If skip_empty is True, we skip any field sets that do not have the output schema after the transform step. If skip_empty is numeric, we skip these field sets with that probability.
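The skip_empty rule above can be sketched as a small helper. This is a hypothetical restatement of the documented behavior, not macrosdk's code; checking probabilities 0.0 and 1.0 keeps the behavior deterministic:

```python
import random

# Sketch of the skip_empty rule: True always skips field sets that lack
# the output schema after the transform; a number skips them with that
# probability; False keeps everything.
def should_skip(has_output_schema, skip_empty, rng=random):
    if has_output_schema:
        return False
    if skip_empty is True:
        return True
    if isinstance(skip_empty, (int, float)):
        return rng.random() < skip_empty
    return False

# Probability 0.0 never skips; probability 1.0 always skips.
never = [should_skip(False, 0.0) for _ in range(5)]
always = [should_skip(False, 1.0) for _ in range(5)]
```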
- transform_field_set(field_set)¶
Return a field set transformed with this data schema mapper.
If the field set has a field descriptor which involves the input schema, we map it to the output schema and add the mapped descriptor to the field set.
- transform_model_wrapper(model_wrapper, output_name=None)¶
Return a model wrapper transformed with this data schema mapper.
Each field set returned by the model wrapper will be transformed.
- class macrosdk.DataSet(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent a data set.
A data set is used to manage ordered collections of FieldSets, which represent rows or records in a data set.
The data set object handles efficient reads and writes under the hood in a compressed data format, and exposes an iterator interface to access records. There are also accessor methods for querying metadata about the data set, such as the histogram of data schema entries represented.
To create a data set, define a generator function that yields FieldSets and instantiate it with a DataSet.Definition:
def generator_fn():
    for data in datapoints:
        record = FieldSet.create()
        ...  # add fields from data to field set
        yield record

data_set = DataSet.create("my_data_set", DataSet.Definition(generator_fn))
To use a previously-defined data set, load it in the following way:
data_set = DataSet.load("my_data_set")
for record in data_set.get_records():
    ...  # process record
To leverage parallel processing for data set operations, use the following API:
def generator_fn(*args):
    for data in datapoints:
        yield (data,) + args  # yield tuple of data

def process_input(data, arg1, arg2):
    record = FieldSet.create()
    ...  # add fields from data to field set
    return [record]  # return 0 or more records for each

data_set = DataSet.create(
    "my_data_set",
    DataSet.Definition(generator_fn, parallelized_iter=process_input))
This will spin up a pool of worker processes under the hood automatically to call process_input for each record to be processed.
- static Concat(*data_sets, output_name=None, should_precompute=True)¶
Concatenate 2 or more data sets together, in order of input.
- class Definition(generator_fn, *args, parallelized_iter=None, **kwargs)¶
Bases:
objectDefinition object used to define a DataSet.
By default, this is constructed with a generator function (function that returns a generator of data points). If additional functionality is needed, consider subclassing this class and overriding the get_generator_function method.
Specify args or kwargs if there are parameters to pass into the generator function.
If should_precompute=True is passed in, we assume this data set cannot be streamed on the fly (e.g. involves gpu-based computation, or feeds into multiple consumers), and we will first iterate to compute and then cache the results for subsequent use.
If parallelized_iter is passed in, we will apply this function to each data point in a multiprocess pool. This function should return an iterable of data points. This will set should_precompute to True if it was not specified.
- get_generator_function()¶
Accessor for the generator function.
Subclasses may override this with custom behavior.
The generator function has to return a generator that yields field sets.
- get_resource_dependencies()¶
- static Take(data_set, num_rows, output_name=None, should_precompute=True)¶
Take num_rows worth of records from data set.
- static Zip(*data_sets, output_name=None, should_precompute=True)¶
Zip 2 or more data sets together. The later data sets take precedence.
- concat_with(*data_sets, output_name=None, should_precompute=True)¶
- get_records()¶
- hydrate(obj)¶
Hydrate the state of a data set from a serializable representation of it.
- init()¶
- resource_type = 'data_set'¶
- serialize()¶
Return the serializable representation of a data set’s state.
- tables()¶
- take(num_rows, output_name=None, should_precompute=True)¶
- zip_with(*data_sets, output_name=None, should_precompute=True)¶
- class macrosdk.DataSetMapper(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.ResourceClass used to represent a data set mapper.
A data set mapper defines a mapping between two data sets. It can do more than just a one-to-one mapping between records of a data set. You could combine multiple input records into one output record (e.g. combine individual images into a video), or you could convert one input record into multiple output records (e.g. the reverse operation of converting a video into individual images).
To create a data set mapper, subclass the DataSetMapper.Definition class and provide implementations for its interface methods. Then pass that definition instance to the DataSetMapper.create method:
class ConvertDataSet(DataSetMapper.Definition):
    def initialize(self):
        # Returns one or more field sets prior to iterating through data set.
        return []

    def finalize(self):
        # Returns one or more field sets after iterating through data set.
        return []

    def ingest(self, field_set):
        # Takes a field set as input and returns one or more output field sets.
        return [field_set]

    def should_continue(self, num_iterations, num_ingested, num_output):
        # Returns true if we should continue processing the input data set.
        return num_iterations == 0

data_set_mapper = DataSetMapper.create(
    "my_data_set_mapper", ConvertDataSet())
To use a previously-defined data set mapper, load it in the following way:
data_set_mapper = DataSetMapper.load("my_data_set_mapper")
output_data_set = data_set_mapper.transform_data_set(input_data_set)
- class Definition(resource_dependencies, should_precompute=True)¶
Bases:
objectDefinition object used to define a DataSetMapper.
In particular, it takes input field sets and returns optional output field set(s).
- finalize()¶
Returns one or more field sets after iterating through data set.
- get_resource_dependencies()¶
- ingest(field_set)¶
Takes a field set as input and returns one or more output field sets.
- initialize()¶
Returns one or more field sets prior to iterating through data set.
- should_continue(num_iterations, num_ingested, num_output)¶
Returns true if we should continue processing the input data set.
The number of iterations, number of ingested records, and number of output records so far are given as input to this method.
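How transform_data_set invokes these four methods is not shown here, but a plausible driver loop looks like the following. This is a hypothetical sketch, not macrosdk's actual code:

```python
# Hypothetical driver loop for the Definition interface above:
# initialize, then ingest while should_continue allows, then finalize.
def run_mapper(definition, input_records):
    output = list(definition.initialize())
    num_iterations = num_ingested = 0
    for record in input_records:
        if not definition.should_continue(
                num_iterations, num_ingested, len(output)):
            break
        output.extend(definition.ingest(record))
        num_iterations += 1
        num_ingested += 1
    output.extend(definition.finalize())
    return output

class TakeFirst:
    # Example mapper that keeps only the first record.
    def initialize(self):
        return []
    def finalize(self):
        return []
    def ingest(self, field_set):
        return [field_set]
    def should_continue(self, num_iterations, num_ingested, num_output):
        return num_iterations == 0

result = run_mapper(TakeFirst(), ['r0', 'r1', 'r2'])
```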
- init()¶
- resource_type = 'data_set_mapper'¶
- transform_data_set(data_set, should_precompute=True, output_name=None)¶
Transform an input data set using this data set mapper, and return the output data set.
- class macrosdk.DataVisualization(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent a data visualization.
A data visualization is a resource that allows interactive exploration of a sequence of data. It is created by a data visualizer acting on a sequence of field sets from a data set.
To create one, you would typically use a data visualizer in the following way:
data_visualizer = DataVisualizer.load("my_data_visualizer")
data_visualization = data_visualizer.create_data_visualization_from_data_set(
    data_set, variable_schemas)
- class Definition(data_set, data_visualizer, variable_schemas, num_rows)¶
Bases:
objectDefinition object used to define a DataVisualizer.
- get_resource_dependencies()¶
- class Handle(data_visualization)¶
Bases:
objectHandle to encapsulate the data visualization, primarily the data dir and all writes.
- get_data_dir()¶
Get the absolute data directory used to store data for this resource.
- get_num_render_sets()¶
Return the number of render sets within this data visualization.
Ensures the data visualization is computed first.
- get_render_set(i)¶
Return the i-th render set read from files within this data visualization.
Ensures the data visualization is computed first.
- get_render_sets()¶
Return an iterable of render sets read from files within this data visualization.
Ensures the data visualization is computed first.
- handle()¶
Return a handle object that can be used to store data into the data visualization.
- hydrate(obj)¶
Hydrate the state of a data visualization from a serializable representation of it.
- init()¶
- resource_type = 'data_visualization'¶
- serialize()¶
Return the serializable representation of a data visualization’s state.
- class macrosdk.DataVisualizer(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.ResourceClass used to represent a data visualizer.
A data visualizer takes a data set as input, and produces a data visualization as its output.
To create a data visualizer, subclass the DataVisualizer.Definition class and provide implementations for its interface methods. Then pass that definition instance to the DataVisualizer.create() method:

class CustomVisualizer(DataVisualizer.Definition):
    @classmethod
    def create_render_set_from_field_set(cls, field_set, fixed_schemas, variable_schemas):
        # Return a render set from a given field set.
        render_set = Render.Set.create()
        ...
        return render_set

data_visualizer = DataVisualizer.create(
    "my_data_visualizer", CustomVisualizer(fixed_schemas))
To use a previously-defined data visualizer, load it in the following way:
data_visualizer = DataVisualizer.load("my_data_visualizer")
data_visualization = data_visualizer.create_data_visualization_from_data_set(
    data_set, variable_schemas)
See macro.Render() for the rendering primitives available for data visualizers to use.
- class Definition(*fixed_schemas)¶
Bases:
objectDefinition object used to define a DataVisualizer.
- classmethod create_render_set_from_field_set(field_set, fixed_schemas, variable_schemas)¶
Return a render set from a given field set.
See macro.Render.Set() for how to create one.
- get_resource_dependencies()¶
- create_data_visualization_from_data_set(data_set, variable_schemas=None, num_rows=20, output_name=None)¶
Create data visualization from data set using this data visualizer’s logic.
- init()¶
- resource_type = 'data_visualizer'¶
- class macrosdk.EvaluationMetric(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.ResourceClass used to represent an evaluation metric.
An evaluation metric contains the logic to score a model prediction with its ground truth labels. Given a data set with labels and a data set with predictions, this resource computes one or more custom metrics for each data point and returns a summarized set of metrics across the whole data set.
To create an evaluation metric, subclass the EvaluationMetric.Definition class and provide implementations for its interface methods. Then pass that definition instance to the EvaluationMetric.create() method:

class CustomMetric(EvaluationMetric.Definition):
    def compute_metrics(self, class_schema, label_field_set, prediction_field_set):
        # Invoke metric computation logic for a given pair of label and prediction field sets.
        ...

    @classmethod
    def summarize_metrics(cls, metrics):
        # Summarize the metrics computed for field sets across the whole data set.
        ...

    @classmethod
    def generate_artifacts(cls, metrics):
        # Generate artifacts of the metrics computed for field sets across the whole data set.
        ...

evaluation_metric = EvaluationMetric.create(
    "my_evaluation_metric", CustomMetric(label_entry, prediction_entry))
To use a previously-defined evaluation metric, load it in the following way:
evaluation_metric = EvaluationMetric.load("my_evaluation_metric")
computed_metrics = evaluation_metric.compute_metrics(
    class_schema, label_field_set, prediction_field_set)
evaluation_data_set, evaluation_result = evaluation_metric.create_evaluation_result(
    model_wrapper, data_set, class_schema)
- class Definition(label_entry, prediction_entry)¶
Bases:
objectDefinition object used to define an EvaluationMetric.
- compute_metrics(class_schema, label_field_set, prediction_field_set)¶
Invoke metric computation logic for a given pair of label and prediction field sets.
Subclasses should implement this method.
- classmethod generate_artifacts(metrics)¶
Generate artifacts of the metrics computed for field sets across the whole data set.
Returns a mapping of filename to data to be stored in the evaluation result resource.
Subclasses may optionally override this method.
- get_resource_dependencies()¶
- classmethod summarize_metrics(metrics)¶
Summarize the metrics computed for field sets across the whole data set.
Subclasses should implement this method.
- compute_metrics(class_schema, label_field_set, prediction_field_set)¶
Invoke metric computation logic for a given pair of label and prediction field sets.
- create_evaluation_result(model_wrapper, data_set, class_schema, label_data_set=None, include_original_fields=False, num_processes=1, num_rows=20, output_name=None, sorting_key=None, ascending=True)¶
Create an evaluation result by applying the specified model wrapper to a given input data set.
If label_data_set is specified, use that as the data set with labels; otherwise, it defaults to the input data set. If include_original_fields is True, the original input field set is unioned with the output field set during evaluation.
Returns a tuple of the evaluation data set and the evaluation result.
- create_sorted_evaluation_data_set(eval_result, eval_data_set, sorting_key, ascending=True, num_rows=20, output_name=None)¶
Create an evaluation data set where field sets are returned in the order given by the sorting key.
- generate_artifacts(metrics)¶
Generate artifacts of the metrics computed for field sets across the whole data set.
- init()¶
- resource_type = 'evaluation_metric'¶
- summarize_metrics(metrics)¶
Summarize the metrics computed for field sets across the whole data set.
- class macrosdk.EvaluationRanking(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent an evaluation ranking.
An evaluation ranking is a list of evaluation results (scored model predictions), sorted by a specified key.
To create an evaluation ranking, first collect all evaluation results that make sense to rank together, choose a sorting key, and use these to instantiate an EvaluationRanking.Definition instance. Then pass that definition instance to the EvaluationRanking.create() method:

evaluation_metric = EvaluationMetric.load("my_evaluation_metric")
data_set = DataSet.load('my_data_set')
evaluation_results = [
    evaluation_result
    for evaluation_result in get_resources_by_type('evaluation_result')
    if evaluation_result.definition.evaluation_metric is evaluation_metric
    and evaluation_result.definition.label_data_set is data_set
]
evaluation_ranking = EvaluationRanking.create(
    "my_evaluation_ranking",
    EvaluationRanking.Definition(
        evaluation_results,
        sorting_key=lambda summary_metrics: summary_metrics['score'],
        ascending=False,
    ),
)
To use a previously-defined evaluation ranking, load it in the following way:
evaluation_ranking = EvaluationRanking.load("my_evaluation_ranking")
rankings = evaluation_ranking.get_rankings()
- class Definition(evaluation_results, sorting_key=None, ascending=True)¶
Bases:
objectDefinition object used to define an EvaluationRanking.
- get_resource_dependencies()¶
- get_rankings()¶
Return list of evaluation results, ranked based on sorting key in definition.
Each item in this list is a tuple of the form (summary_metrics, evaluation_result.name, evaluation_result.package).
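The ranking step is essentially a sort over those tuples by the user-supplied key. A sketch of the idea (not macrosdk's implementation), with made-up summary metrics:

```python
# Sketch of the ranking logic: sort (summary_metrics, name, package)
# tuples by a user-supplied sorting key over the summary metrics.
def rank(results, sorting_key, ascending=True):
    return sorted(
        results,
        key=lambda item: sorting_key(item[0]),  # item[0] is summary_metrics
        reverse=not ascending)

results = [
    ({'score': 0.7}, 'eval_a', 'pkg_a'),
    ({'score': 0.9}, 'eval_b', 'pkg_b'),
    ({'score': 0.8}, 'eval_c', 'pkg_c'),
]
rankings = rank(results, lambda m: m['score'], ascending=False)
```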
- hydrate(obj)¶
Hydrate the state of an evaluation ranking from a serializable representation of it.
- init()¶
- resource_type = 'evaluation_ranking'¶
- serialize()¶
Return the serializable representation of an evaluation ranking’s state.
- tables()¶
Returns the ranked evaluation results for this evaluation ranking in table form.
Ensures the evaluation ranking is computed before doing so.
- class macrosdk.EvaluationResult(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable,macrosdk.resource.ResourceClass used to represent an evaluation result.
An evaluation result is the output of an evaluation metric on a pair of data sets containing the ground truth labels and model predictions respectively.
To create one, you would typically use an evaluation metric in the following way:
evaluation_metric = EvaluationMetric.load("my_evaluation_metric")
evaluation_data_set, evaluation_result = evaluation_metric.create_evaluation_result(
    model_wrapper, data_set, class_schema)
- class Definition(evaluation_metric, model_wrapper, label_data_set, prediction_data_set, class_schema, num_rows)¶
Bases:
objectDefinition object used to define an EvaluationResult.
- get_generator()¶
Return a generator function, args and kwargs, which creates a generator that yields evaluation outputs.
- get_resource_dependencies()¶
- get_all_metrics()¶
Get all computed metrics for this evaluation result, one for each field set.
- get_summary_metrics()¶
Get summary metrics for this evaluation result.
- html()¶
- hydrate(obj)¶
Hydrate the state of an evaluation result from a serializable representation of it.
- init()¶
- resource_type = 'evaluation_result'¶
- serialize()¶
Return the serializable representation of an evaluation result’s state.
- tables()¶
Get summary metrics for this evaluation result, in table form.
- class macrosdk.FieldDescriptor(schema_entries)¶
Bases:
objectStruct that describes a piece of data.
It is an immutable container of schema entries. Order of schema entries does not matter.
- get_entries_of_schema(data_schema)¶
Return all entries of a given schema, if any.
Convenience function to more easily access the schema entries associated with this descriptor.
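Immutability plus order-insensitivity can be modeled with a frozenset of entries. A toy sketch (entry representation and class below are hypothetical, not macrosdk's):

```python
# Toy model of a FieldDescriptor: an immutable, order-insensitive
# container of schema entries, here represented as
# (schema_name, primary_key) tuples.
class ToyFieldDescriptor:
    def __init__(self, schema_entries):
        self._entries = frozenset(schema_entries)

    def __eq__(self, other):
        return self._entries == other._entries

    def __hash__(self):
        return hash(self._entries)

    def get_entries_of_schema(self, data_schema):
        # Return all entries of a given schema, if any.
        return {e for e in self._entries if e[0] == data_schema}

# Same entries in a different order describe the same field.
d1 = ToyFieldDescriptor([('classes', 'cat'), ('attrs', 'furry')])
d2 = ToyFieldDescriptor([('attrs', 'furry'), ('classes', 'cat')])
```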
- class macrosdk.FieldSet¶
Bases:
objectStruct that represents a set of fields.
It is a mutable collection of fields, where a field is some data that is described by a descriptor.
- add_field(field_descriptor, data)¶
Add data for a field to this field set by its descriptor.
Raises an AssertionError if the field set already contains this field descriptor.
- copy()¶
Make a shallow copy of this field set.
- classmethod create()¶
Create an empty field set.
- get_field(field_descriptor)¶
Get field data for the specified field descriptor.
Raises an AssertionError if the field descriptor does not exist in the field set.
- has_field(field_descriptor)¶
Returns true if this field set has the specified field descriptor.
- iter_fields()¶
Returns an iterable of (field descriptor, field data) tuples over the fields of this field set.
- remove_field(field_descriptor)¶
Remove data for a field from this field set by its descriptor.
Raises an AssertionError if the field set does not contain this field descriptor.
- union_with(*field_sets)¶
Take union of this field set with other field set(s).
If there is an overlap of field descriptors, the later field sets take precedence.
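The union precedence rule can be illustrated with dicts standing in for field sets. A sketch of the semantics, not macrosdk's implementation:

```python
# Sketch of union_with: start from a shallow copy (like FieldSet.copy())
# and merge in the other field sets; on overlapping descriptors the
# later field set wins.
def union_with(first, *field_sets):
    merged = dict(first)
    for fs in field_sets:
        merged.update(fs)  # later sets overwrite earlier ones
    return merged

base = {'image': 'img0', 'label': 'cat'}
override = {'label': 'feline', 'score': 0.9}
combined = union_with(base, override)
```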
- class macrosdk.ImportedPackage(name, config, timestamp, module, module_dir, data_dir, log_dir, macro_project_locals)¶
Bases:
objectRepresentation of an imported Macro package.
name: package name, e.g. v0/data_schemas/macro/structure==596d9e3bcd
config: an instance of ProjectConfig representing the project's config.
module: Python module of the package.
module_dir: Directory of the module.
macro_project_locals: dict containing the local variables within the package's macro_project() function invocation, keyed by name.
- property dependencies¶
- property description¶
- directory_content(rel_path)¶
- file_content(rel_path)¶
- has_directory(rel_path)¶
- has_file(rel_path)¶
- property metadata¶
- property project_name¶
- class macrosdk.ModelServer(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.ResourceClass used to represent a model server.
A model server takes some resources (typically a model wrapper) as input and defines the request endpoint (a string) and associated handler logic for serving requests.
To create a model server, choose an endpoint string and define a factory function that returns an async request handler, and use those to instantiate a ModelServer.Definition instance. Then pass that definition instance to the ModelServer.create() method:

def request_handler_maker(*args):
    async def request_handler(request):
        data = await request.post()
        ...  # process data
        resp = web.StreamResponse(
            status=200,
            reason='OK',
            headers={'Content-Type': 'text/json'})
        await resp.prepare(request)
        await resp.write(json.dumps(output_json).encode())
        return resp
    return request_handler

model_server = ModelServer.create(
    "my_model_server",
    ModelServer.Definition(
        "my_endpoint", request_handler_maker, *args))
To use a previously-defined model server, load it in the following way:
model_server = ModelServer.load("my_model_server")

# add route to an app server like one using aiohttp.web
app.router.add_route(
    'POST',
    "/" + model_server.get_request_endpoint(),
    model_server.get_request_handler())
- class Definition(request_endpoint, request_handler_fn, *args, **kwargs)¶
Bases:
object

Definition object used to define a ModelServer.
- get_request_endpoint()¶
Accessor function for request endpoint string.
- get_request_handler()¶
Factory function for a request handler.
Request handler is an async function that takes a request and returns a response.
- get_resource_dependencies()¶
- get_request_endpoint()¶
Return request endpoint.
- get_request_handler()¶
Return an async request handler.
- init()¶
- resource_type = 'model_server'¶
- class macrosdk.ModelTrainer(name, *, _allowed=False, **kwargs)¶
Bases:
macrosdk.resource.Resource

Class used to represent a model trainer.
A model trainer takes a data set as input, runs training, and produces a trained model in the form of a model wrapper.
To create a model trainer, subclass the ModelTrainer.Definition class and provide implementations for its interface methods. Then pass that definition instance to the ModelTrainer.create() method:

class CustomTrainer(ModelTrainer.Definition):
    @classmethod
    def build_local_data_set(cls, handle, data_set, input_descriptors, output_schemas):
        # Write data set to disk in a format used by the training script
        ...

    @classmethod
    def run_training(cls, handle, model_params, data_set_params, **kwargs):
        # Run training and store model artifacts via the handle
        ...

    @classmethod
    def get_model_evaluator(cls, handle, training_params):
        # Load model and return a model evaluator function that fulfills
        # the following interface:
        def model_evaluator(inputs_dict, *output_schemas):
            output = FieldSet.create()
            ...
            return output
        return model_evaluator

model_trainer = ModelTrainer.create(
    "my_model_trainer", CustomTrainer(model_params))
To use a previously-defined model trainer, load it in the following way:
model_trainer = ModelTrainer.load("my_model_trainer")
_, model_wrapper = model_trainer.create_model_wrapper(
    data_set, input_descriptors, output_schemas)
- class Definition(model_params)¶
Bases:
object

Definition object used to define a ModelTrainer.
- classmethod build_local_data_set(handle, data_set, input_descriptors, output_schemas)¶
Build a representation of the input data set on local disk for training.
Subclasses should implement this method.
- get_generate_compute_function()¶
Returns a function that computes the model.
This would run training using the input data blob.
- get_generate_download_function()¶
Returns a function that produces a download function for a training data blob.
This is the step that takes a data set and builds a representation of it on local disk for training, as a data blob.
- get_generate_load_function()¶
Returns a function that loads a model and returns a model evaluator function.
- classmethod get_model_evaluator(handle, training_params)¶
Load model and return a model evaluator function that fulfills the following interface:
def model_evaluator(inputs_dict, *output_schemas):
    output = FieldSet.create()
    return output

return model_evaluator
Subclasses should override this method to provide a custom function.
- get_resource_dependencies()¶
- classmethod initialize_with_model_wrapper(handle, initialization_model_wrapper)¶
Initialize training with an existing model wrapper.
Subclasses should implement this method.
- classmethod run_training(handle, model_params, data_set_params, **kwargs)¶
Run training of the model given an input data set and model parameters.
Subclasses should implement this method.
- create_data_blob(data_set, input_descriptors, output_schemas, output_name=None)¶
Create data blob containing data set that has been processed into a format suitable for training.
- create_model_wrapper(data_set, input_descriptors, output_schemas, initialization_model_wrapper=None, validation_data_set=None, validation_output_name=None, replace_invalid_with_none=False, output_name=None)¶
Create model wrapper from data set, creating an intermediary data blob containing processed training data.
- create_model_wrapper_from_data_blob(data_blob, input_descriptors, output_schemas, initialization_model_wrapper=None, validation_data_blob=None, replace_invalid_with_none=False, output_name=None)¶
Create model wrapper from training on data blob containing processed data set.
If initialization_model_wrapper is specified, use it to initialize training.
- init()¶
- resource_type = 'model_trainer'¶
- class macrosdk.ModelWrapper(name, *args, **kwargs)¶
Bases:
macrosdk.resource.Computable, macrosdk.resource.Resource

Class used to represent a model wrapper.
A model wrapper represents a standalone algorithm or the trained model output of a model trainer.
It provides an initialization context to load the model into memory, and an API that takes an input field set or data set and returns the corresponding output field set or data set.
To create a model wrapper from a model trainer, see
ModelTrainer.create_model_wrapper().

To create a model wrapper for a standalone function, subclass the ModelWrapper.Definition class and provide implementations for its interface methods. Then pass that definition instance to the ModelWrapper.create() method:

class CustomModel(ModelWrapper.Definition):
    @classmethod
    def get_model_evaluator(cls, handle, model_params):
        # Load model and return a model evaluator function that fulfills
        # the following interface:
        def model_evaluator(inputs_dict, *output_schemas):
            output = FieldSet.create()
            ...
            return output
        return model_evaluator

model_wrapper = ModelWrapper.create(
    "my_model_wrapper",
    CustomModel(model_params, input_descriptors, output_schemas))
To use a previously-defined model wrapper, load it in the following way:
model_wrapper = ModelWrapper.load("my_model_wrapper")
output_field_set = model_wrapper.create_field_set_from_eval(input_field_set)
output_data_set = model_wrapper.create_data_set_from_eval(input_data_set)
- class Definition(model_params, input_descriptors, output_schemas, replace_invalid_with_none=False)¶
Bases:
object

Definition object used to define a ModelWrapper.
If replace_invalid_with_none is True, then invalid input field sets to the model (i.e. those with missing input descriptors) will be replaced entirely with a None value instead of raising an assertion error. The model evaluation function must be able to handle None values as input accordingly.
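A minimal sketch of an evaluator that tolerates None inputs, using a plain dict in place of the SDK's FieldSet type (an assumption made here purely for illustration):

```python
def model_evaluator(inputs_dict, *output_schemas):
    # Invalid input field sets arrive as None when replace_invalid_with_none=True.
    if inputs_dict is None:
        # Emit a default/empty prediction rather than failing.
        return {"score": None}
    # Normal path: compute an output from the input fields.
    return {"score": sum(inputs_dict.values())}
```

A real evaluator would build its output via FieldSet.create() as shown in the interface above; the essential point is the explicit None branch.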
- classmethod compute_model(handle, model_params)¶
Method to compute the model if it is required for inference.
Subclasses may override this with custom behavior.
- get_load_function()¶
Accessor for the load function.
Subclasses may override this with custom behavior.
- classmethod get_model_evaluator(handle, model_params)¶
Load model and return a model evaluator function that fulfills the following interface:
def model_evaluator(inputs_dict, *output_schemas):
    output = FieldSet.create()
    return output

return model_evaluator
Subclasses should override this method to specify custom functionality.
- get_resource_dependencies()¶
- class Handle(model_wrapper)¶
Bases:
object

Handle to encapsulate the model wrapper, primarily for writes to its data dir.
- copy_file(src_path, dest_path)¶
Copy file from other location into this model wrapper.
- download_url(url, dest_path, use_cache=True)¶
Convenience function to download contents of a url into the model wrapper.
- from_relative_path(other_path)¶
Convert a relative path within the model wrapper to an absolute path.
- get_data_dir()¶
Get the absolute data directory used to store data for this resource.
- read_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file read opener to a relative path within the model wrapper.
- to_relative_path(other_path)¶
Convert an absolute path to a relative path within the model wrapper.
- write_opener(opener, dest_path, *args, **kwargs)¶
Convenience wrapper function to apply a file write opener to a relative path within the model wrapper.
Ensures the data directory is created before attempting to open the file.
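The create-then-open behavior described here can be sketched in isolation. This is a hypothetical standalone helper, not the SDK implementation (the real Handle resolves its data directory internally):

```python
import os

def write_opener(data_dir, opener, dest_path, *args, **kwargs):
    # Ensure the data directory exists before attempting to open the file,
    # then apply the given opener to the path resolved inside data_dir.
    os.makedirs(data_dir, exist_ok=True)
    return opener(os.path.join(data_dir, dest_path), *args, **kwargs)
```

The opener can be any callable with an open()-like signature, e.g. tarfile.open as in the DataBlob example above.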
- create_data_set_from_eval(data_set, include_original_fields=False, num_rows=None, num_processes=1, should_precompute=True, output_name=None)¶
Create an output data set by evaluating this model wrapper on an input data set.
If include_original_fields is True, the original input field set is unioned with the output field set. If should_precompute is False, we don’t precompute the data set. If num_processes is None, use all available cores (default is 1).
- create_field_set_from_eval(field_set, include_original_fields=False)¶
Create an output field set by evaluating this model wrapper on an input field set.
If include_original_fields is True, the original input field set is unioned with the output field set.
- handle()¶
Return a handle object that can be used to read and write data for the model wrapper.
- hydrate(obj)¶
Hydrate the state of a model wrapper from a serializable representation of it.
- init()¶
- load_model()¶
Context manager to load a model so that it is ready for repeated inference.
Ensures the model wrapper is computed prior to yielding.
- resource_type = 'model_wrapper'¶
- serialize()¶
Return the serializable representation of a model wrapper’s state.
- exception macrosdk.PackageImportError¶
Bases:
Exception
- class macrosdk.Render¶
Bases:
object

- class Animation(images, label, include_static_thumbnail=False, max_height=160, max_width=240)¶
Bases:
macrosdk.render.Render._Item

Representation of a thumbnail animation, a rendering primitive.
To create one, do the following:
# images are (H, W, 3) numpy arrays.
render_animation = Render.Animation(images, label='text')
render_set.add_thumbnail(render_animation)
- MAX_HEIGHT = 160¶
- MAX_WIDTH = 240¶
- get_rendered_html()¶
- class Image(image, label=None, max_height=160, max_width=240)¶
Bases:
macrosdk.render.Render._Item

Representation of a thumbnail image, a rendering primitive.
To create one, do the following:
# image is a (H, W, 3) numpy array.
render_image = Render.Image(image, label='text')
render_set.add_thumbnail(render_image)
- MAX_HEIGHT = 160¶
- MAX_WIDTH = 240¶
- get_rendered_html()¶
- class Set(lazy_load_filename=None)¶
Bases:
object

Object to encapsulate a set of data to be passed to the view for rendering.
We use lazy_load_filename if we want to defer loading of its contents to later.
To create one, do the following:
render_set = Render.Set.create()
thumbnail = ...  # pick one of the render primitives
render_set.add_thumbnail(thumbnail)
To add additional details for a given datapoint, do the following:
render_set.set_details(dict(
    component=...,
    title=...,
    path=...,
))
- ELEMENT_BEGIN = '<article class="renderedHtmlItem" style="border: 1px solid #E3E7EE; border-radius: 0.375rem; display: flex; flex-direction: column; gap: 0.25rem; padding: 0.5rem; position: relative;">'¶
- ELEMENT_END = '</article>'¶
- add_thumbnail(thumbnail)¶
Add a thumbnail to this render set.
- classmethod create()¶
Create an empty render set.
- ensure_loaded()¶
Ensure that this render set is loaded, if we are doing lazy loading.
- get_details()¶
Get the details for the detailed view of this render set.
- get_rendered_html()¶
Get the rendered html for displaying this render set.
- get_thumbnails()¶
Get all thumbnails of this render set.
- classmethod read_from_file(filename)¶
Read the contents of this render set from the given filename.
- set_details(details)¶
Set the details for the detailed view of this render set.
- write_to_file(filename)¶
Write the contents of this render set to the given filename.
- class Table(data, label=None, column_labels=None, row_labels=None)¶
Bases:
macrosdk.render.Render._Item

Representation of a thumbnail table, a rendering primitive.
To create one, do the following:
# data is a list of rows, each containing a list of column data.
render_table = Render.Table(data, label='text')
render_set.add_thumbnail(render_table)
- get_rendered_html()¶
- class TagList(texts, text_colors=None, background_colors=None)¶
Bases:
macrosdk.render.Render._Item

Representation of a thumbnail tag list, a rendering primitive.
To create one, do the following:
# text_colors and background_colors are either lists of colors
# with same size as texts, or a dictionary mapping text to a color string.
render_tag_list = Render.TagList(texts, text_colors, background_colors)
render_set.add_thumbnail(render_tag_list)
- get_rendered_html()¶
- class macrosdk.Resource(name, *, _allowed=False, **kwargs)¶
Bases:
object

Base class that all Resource objects derive from.
- classmethod create(name, definition, *args, **kwargs)¶
Create an instance of a resource with given name and definition.
- init()¶
- classmethod load(name, package=None)¶
Load an existing singleton instance for a resource given its name and package.
- resource_type = None¶
- unique_key()¶
- exception macrosdk.ResourceAlreadyExists¶
Bases:
Exception
- class macrosdk.Storable(name, *args, **kwargs)¶
Bases:
object

Mix-in that handles some aspects of data storage for a resource.
- CACHE_YAML = 'cache.yaml'¶
- macrosdk.clean_slate_context()¶
Ensure we handle code in the context block with a clean slate, and clean up sys.path afterwards.
- macrosdk.data_cache_dir()¶
Fetch the global data cache directory. Only applicable for local project runs.
- macrosdk.get_all_resource_types()¶
- macrosdk.get_all_resources()¶
- macrosdk.get_dependency_packages_for(package_name, transitive=True)¶
Returns list of names of all packages that the given package depends on.
If transitive is True, we include transitive dependencies.
- macrosdk.get_dependency_resources_for(resource, transitive=True)¶
Returns list of all Resources that the given Resource depends on.
If transitive is True, we include indirect dependencies: If Resource A depends on B, and B depends on C, then get_dependency_resources_for(A) will include both B and C.
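The transitive semantics can be illustrated with a toy graph. Plain strings stand in for Resource objects, and dependency_closure is a hypothetical helper written for this sketch, not part of the SDK:

```python
def dependency_closure(graph, node):
    # Depth-first walk collecting both direct and indirect dependencies.
    seen = set()
    stack = [node]
    while stack:
        for dep in graph[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# A depends on B, B depends on C: the closure of A includes both B and C.
graph = {"A": {"B"}, "B": {"C"}, "C": set()}
```

With transitive=False, only the direct edge ({"B"} for A) would be returned.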
- macrosdk.get_dependent_packages_for(package_name, transitive=True)¶
Returns list of names of all packages that depend on the given package.
If transitive is True, we include transitive dependents.
- macrosdk.get_dependent_resources_for(resource, transitive=True)¶
Returns list of all Resources that depend on the given Resource.
If transitive is True, we include indirect dependents: If Resource A depends on B, and B depends on C, then get_dependent_resources_for(C) will include both A and B.
- macrosdk.get_package_data_dir()¶
Return the data directory for the current package we’re in.
- macrosdk.get_package_data_dir_for_resource(resource_type, name)¶
Return the data directory for a resource in a package.
- macrosdk.get_package_log_dir()¶
Return the log directory for the current package we’re in.
- macrosdk.get_package_log_dir_for_resource(resource_type, name)¶
Return the log directory for a resource in a package.
- macrosdk.get_package_module_dir()¶
Return the module directory for the current package we’re in.
- macrosdk.get_package_names()¶
Return sorted list of distinct package names that have been imported.
- macrosdk.get_package_names_by_recency()¶
Return list of distinct package names that have been imported, sorted by recency.
- macrosdk.get_package_names_for_project_by_recency(project_name)¶
Return list of package names for a project, sorted by recency.
- macrosdk.get_package_with_name(package_name)¶
Return the ImportedPackage instance for a given package name.
- macrosdk.get_project_config()¶
Return the project config of the current project.
- macrosdk.get_project_names()¶
Return sorted list of distinct project names that have been imported.
- macrosdk.get_resource(resource_type, name, package=None)¶
- macrosdk.get_resource_names_by_package_and_type(leaf_package=None)¶
- macrosdk.get_resources_by_type(resource_type, leaf_package=None)¶
- macrosdk.get_resources_of_package(package)¶
- macrosdk.has_resource(resource)¶
- macrosdk.import_package(dep)¶
Import a package by its qualified name.
Example: ‘v0/data_schemas/macro/structure==596d9e3bcd’
Returns an ImportedPackage containing information such as the package’s python module and metadata.
This needs to be executed within the run_script_context so that python module paths can be correctly resolved.
- macrosdk.import_project(*args, **kwargs)¶
Dummy import project to handle legacy project code.
- macrosdk.log_cache_dir()¶
Fetch the global log cache directory. Only applicable for local project runs.
- macrosdk.override_data_cache_dir(cache_dir)¶
Override the global data cache directory. Used for testing.
- macrosdk.override_log_cache_dir(cache_dir)¶
Override the global log cache directory. Used for testing.
- macrosdk.package_execution_context(package_name, deploy_dir=None, local_root_dir=None)¶
Package execution context.
Allows us to import and run a package without exiting the run script context. Yields an ImportedPackage instance which encapsulates information about the package.
If deploy_dir is specified, we use that as the base directory for our imports instead of going to the root/parent directory containing our project’s directory.
If local_root_dir is specified, we use that as the base directory for our imports, and we import local projects instead of packages. There are no version hashes and link dependencies are allowed.
- macrosdk.remove_resource(resource)¶
- macrosdk.run_inline_script(package_name)¶
Context manager for running an inline script within a package’s context.
Sometimes, e.g. in a notebook, you want to execute some code within a package’s context without exiting the run_script_context.
- macrosdk.run_package(package_name, deploy_dir=None, local_root_dir=None)¶
Run the given package.
This convenience helper sets up the run_script_context and runs the package by importing it.
If deploy_dir is specified, we use that as the base directory for our imports instead of going to the root/parent directory containing our project’s directory.
If local_root_dir is specified, we use that as the base directory for our imports, and we import local projects instead of packages. There are no version hashes and link dependencies are allowed.
- macrosdk.run_script_context(package_name, deploy_dir=None, local_root_dir=None)¶
Context manager to run a script within a package context.
This sets up the necessary prerequisites for importing the Python module of a package (e.g. being in the correct directory so that module paths will resolve properly).
This is a helper so we can run a package from both the CLI and also a notebook or script.
If deploy_dir is specified, we execute the package within that directory. Otherwise, we are working locally within the project directory.
If local_root_dir is specified, we use that as the base directory for our imports, and we import local projects instead of packages. There are no version hashes and link dependencies are allowed.
- macrosdk.set_resource_compute_logging_context(logging_ctx)¶
Set context factory function to be used for logging resource computation.
This context factory function should take a resource object as input, and return a context manager. When entered, this context manager typically should redirect stdout/stderr to log files.
This will only apply when calling the _compute() inner logic, and exclude any caching operations.
We instantiate a new context each time because some resources might have nested calls to compute other resources, in which case, a different instantiated context manager would be used for each resource.
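A minimal sketch of such a context factory, assuming the resource exposes a name attribute and that logging to a local <name>.log file is acceptable (both are assumptions for illustration):

```python
import contextlib

def logging_context_factory(resource):
    # Return a freshly instantiated context manager for this resource.
    # While entered, stdout is redirected to a per-resource log file.
    @contextlib.contextmanager
    def _ctx():
        with open(f"{resource.name}.log", "w") as log_file:
            with contextlib.redirect_stdout(log_file):
                yield
    return _ctx()
```

This factory would then be registered via set_resource_compute_logging_context(logging_context_factory); because a new context manager is built per call, nested resource computations each get their own redirection.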
- macrosdk.set_resource_log_upload_function(upload_fn)¶
Set upload function to be used for uploading resource logs.
This function should take a resource object as input, and perform uploading of the resource’s log to remote cache.
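A minimal sketch of such an upload function, assuming the resource exposes a hypothetical log_path attribute and treating a local directory as the "remote" cache (a real implementation might upload to object storage instead):

```python
import os
import shutil
import tempfile

# Local stand-in for remote log storage in this sketch.
REMOTE_LOG_CACHE = os.path.join(tempfile.gettempdir(), "remote_log_cache")

def upload_fn(resource):
    # Copy the resource's local log file into the remote cache,
    # creating the cache directory on first use.
    os.makedirs(REMOTE_LOG_CACHE, exist_ok=True)
    shutil.copy(resource.log_path, REMOTE_LOG_CACHE)
```

The function would then be registered via set_resource_log_upload_function(upload_fn).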
- macrosdk.traverse_dependency_packages_for(package_name, visit_fn, **kwargs)¶
Traverses the package dependency graph, calling each package with the visit function.
The visit function has the signature:

def visit_fn(package, depth=None, index=None, **kwargs) -> Any:
    ...

We return a list of the outputs returned by the visit_fn at each package, in topologically sorted order.

Any kwargs passed into this call are relayed to the visit function. Accepting kwargs on the visit function also allows additional recursion state, similar to depth or index, to be passed through in the future.
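A visit function matching this signature might look like the following sketch. A plain string stands in for the ImportedPackage object, and the prefix kwarg is a hypothetical example of relayed state:

```python
def visit_fn(package, depth=None, index=None, **kwargs):
    # Record each visited package along with its recursion depth; extra
    # kwargs relayed by the traversal (here, a prefix) are picked up here.
    prefix = kwargs.get("prefix", "")
    return f"{prefix}{package}@depth={depth}"
```

Calling traverse_dependency_packages_for(package_name, visit_fn, prefix="pkg:") would then return one such string per package, in topologically sorted order.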
- macrosdk.traverse_dependency_resources_for(resource, visit_fn, **kwargs)¶
Traverses the resource dependency graph, calling each resource with the visit function.
The visit function has the signature:

def visit_fn(resource, depth=None, index=None, **kwargs) -> Any:
    ...

We return a list of the outputs returned by the visit_fn at each resource, in topologically sorted order.

Any kwargs passed into this call are relayed to the visit function. Accepting kwargs on the visit function also allows additional recursion state, similar to depth or index, to be passed through in the future.