Introduction
------------

In a nutshell, Macro.ai is a framework for writing reusable scripts (macros) for computer vision machine learning tasks and publishing them to a shared registry. Adopting this framework for your CV/ML problems allows you to make rapid progress toward delivering business outcomes, without the technical burdens commonly encountered when taking research projects to production.

The sequence for using the Macro.ai framework:

#. Frame the data and components of the machine learning pipeline with a declarative syntax
#. Flesh out implementations of these components against a standardized interface, or fetch existing implementations from a public registry
#. Focus on your desired output deliverables and let the framework take care of the technical details of executing the pipeline

Data Model
----------

There are four main concepts in the data model for Macro.ai:

Resources
^^^^^^^^^

Resources refer to components of a machine learning pipeline. They are declared abstractly and composed together to form a computational graph for the platform to execute. Each resource type maps to a known component in the ML pipeline and has a specific, standardized interface. (See Fig 1 below for examples.)

Projects
^^^^^^^^

Projects represent a workspace where resources are declared and used. A project may simply declare resources or describe how to compose them together to fulfill an ML problem. (See Fig 2 below.) A project is the unit for publishing to and pulling from a registry.

Each project is placed in a folder based on its name and is structured like a Python module (it has an ``__init__.py`` file and can import other Python modules in the folder). Projects also have a ``Macro.yaml`` config file that describes project-specific configuration. Projects may add other projects as dependencies to automatically import their declarations and build on top of them. These dependencies are imported as individual Python modules at run-time.

Packages
^^^^^^^^

Packages are versioned snapshots of published projects. Whenever a project is published to a registry, we create a package with a unique identifier (a hash of the contents of the package). This allows you to explore how a project has evolved over time, since a distinct package is created whenever a project is modified and published or executed.

Registries
^^^^^^^^^^

A registry is a collection of projects together with their published packages.

Project/Package Versioning
--------------------------

We automatically version each project as a function of the contents of its directory (we create a gzipped tarball and take a prefix of the sha256 hash, currently 10 characters long) and of its dependencies. If either the code or the dependencies change, the project's version hash changes. We use this version hash to determine the project's cache folder, so that data is cached as a function of the project code. As a project or its dependencies evolve, we are always able to track the provenance of the code and data that go into producing a model artifact. (A minimal sketch of this hashing scheme appears below.)

Organizing your ML workflow into projects and resource components increases your agility in collaboration while enhancing the reproducibility of results. We expect usage of Macro.ai's versioning to be complemented by tracking project code in GitHub (perhaps with CI/CD executions of pipelines triggered by a project update via ``git push``), although this isn't strictly needed.
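To make the versioning scheme above concrete, here is a minimal sketch of content-addressed versioning in Python: tar and gzip the project directory, fold in the dependency versions, and keep the first 10 characters of the sha256 digest. The function name, its arguments, and the exact way dependency versions are mixed into the hash are illustrative assumptions, not Macro.ai's actual implementation:

.. code-block:: python

   import gzip
   import hashlib
   import io
   import tarfile
   from pathlib import Path

   def project_version(project_dir: str, dep_versions: list[str]) -> str:
       """Illustrative sketch only: sha256 prefix (10 chars) of a gzipped
       tarball of the project directory, folding in the version hashes of
       its dependencies. Not Macro.ai's actual implementation."""

       def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
           # Zero out timestamps and ownership so identical contents
           # always produce an identical archive (and hence hash).
           info.mtime = 0
           info.uid = info.gid = 0
           info.uname = info.gname = ""
           return info

       buf = io.BytesIO()
       # mtime=0 keeps the gzip header deterministic as well.
       with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
           with tarfile.open(fileobj=gz, mode="w") as tar:
               # Add files in sorted order for a deterministic archive.
               for path in sorted(Path(project_dir).rglob("*")):
                   if path.is_file():
                       tar.add(path,
                               arcname=str(path.relative_to(project_dir)),
                               filter=normalize)

       digest = hashlib.sha256(buf.getvalue())
       # A dependency change should change this project's version too
       # (how dependencies are folded in is an assumption here).
       for dep in sorted(dep_versions):
           digest.update(dep.encode())
       return digest.hexdigest()[:10]  # 10-character prefix, per the docs

Normalizing timestamps and ownership, as in the sketch, is one way to keep the archive deterministic so that identical contents always map to the same version hash.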
One benefit of decoupling model and pipeline versioning from git is that you don't need to make a git commit every time you want to execute something. You can also easily recover (and re-use) the code that went into a successful execution by installing that project as a dependency of future projects; doing so with a git repository is difficult if you need to execute code from two separate branches simultaneously.
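As a hedged illustration of that last point, the sketch below assumes two published packages of the same detector project have been installed as dependencies under distinct module names. Every name here (the module names, the ``build_model`` function) is hypothetical and only illustrates the idea; Macro.ai's actual dependency-aliasing mechanics are not specified in this document:

.. code-block:: python

   # Hypothetical sketch -- module and function names are illustrative,
   # not Macro.ai's documented API. Because packages are immutable,
   # content-addressed snapshots, two versions of the same project can
   # be imported side by side in a single process, which a single git
   # checkout of two branches cannot easily do.
   import detector_3f2a9c01de as baseline    # package published from branch A
   import detector_87b0e4d2aa as candidate   # package published from branch B

   # Both packages expose the same standardized resource interface
   # (assumed here), so their outputs can be compared directly.
   baseline_model = baseline.build_model()
   candidate_model = candidate.build_model()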