Building on poprox-recommender#
Warning
This guide is still under development. It is possible that essential steps have been omitted. Use at your own risk!
Prerequisites#
The POPROX recommender code has several software dependencies, as well as requirements for data storage, deployment, and hardware.
Software Dependencies#
To begin working with the code, you need the following:
- git
- Pixi, our dependency manager

All other dependencies (including DVC) are specified in our Pixi dependency file.
Installing prerequisites
On macOS, you can install the prerequisites with Homebrew:
brew install git pixi
On Windows, prerequisites are available with winget (included in recent Windows versions):
winget install Git.Git
winget install prefix-dev.pixi
On Linux, you should:

1. Install git from your distribution’s package collection.
2. Fetch the latest Pixi binary, or install with:

curl -fsSL https://pixi.sh/install.sh | bash
Once you have Pixi installed and the repository cloned, you can start a shell with access to the development dependencies:
pixi shell -e dev
Data Dependencies#
To build and evaluate a recommender, you need a repository to store the data (training data, checkpoints, and output files) and share them with your team. We ship the code with read-only access to a repository we provide with MIND outputs, but you will need a repository for your own outputs (unless you will only be working on a single machine and not deploying).
Any DVC remote type can be used for this repository; the easiest is an S3 bucket. Once you have created your bucket, you can add it as the default remote to DVC (run from within a Pixi shell, noted above):
dvc remote add cloud s3://my-bucket-name
dvc config core.remote cloud
If you have the AWS CLI installed and log in with aws sso login, DVC will use that authentication session automatically. If you want to use access keys, you can store your AWS credentials in ~/.aws/credentials, in environment variables, or in the file .dvc/config.local.
.dvc/config.local example

.dvc/config.local is a local (not shared via Git) configuration file for DVC. You can use it to store credentials (a shared access key):
['remote "cloud"']
access_key_id = "<ACCESS KEY ID>"
secret_access_key = "<SECRET ACCESS KEY>"
Deployment Requirements#
To deploy your recommender using our template, you will need AWS credentials capable of creating lambdas, containers, and CloudFormation deployments, and accessing S3 (if that is where you have your repository stored); see Deploying the Recommender below.
Deployment is handled via Serverless v3, and is automated by the deploy.sh script and the .github/workflows/deploy.yml GitHub Actions workflow.
Hardware Requirements#
For working on the code and evaluating recommender outputs, the hardware requirements are relatively modest: a reasonable laptop with sufficient CPU, memory, etc. for software development and basic Python analytics computing. The software dependencies and data take 2–5 GB (5–10 GB on Linux, for the CUDA-based components).
Disk usage

By default, Pixi uses hard links to share space between your installed software environments and the Pixi cache, so long as both are on the same filesystem (Pixi stores its cache in your home directory; ~/.cache/pixi on Linux or a similar location on macOS and Windows). If your project environment is on a different filesystem, then Pixi cannot share space. You can use the Pixi “detached environments” feature to put the environment in your home directory.
To ensure data safety (preventing programs from modifying data stored in the content-addressable storage cache), Data Version Control may make two copies of the data files it tracks. When possible, it uses copy-on-write or reflinks to present two copies while only using one copy’s worth of space on disk. This is supported by APFS on macOS (the default for the last several versions), and by XFS and BTRFS on Linux. ext4 and ZFS do not support reflinks, and neither does any Windows filesystem. You can reduce the space needed by configuring DVC to use alternate link strategies, at the risk of some safety:
dvc config --local cache.type 'reflink,hardlink'
For batch-generating recommendations over the test data, a GPU is very helpful; both A40 and A4000 chips significantly accelerate this process.
Getting Started#
1. Fork the poprox-recommender repository into your personal or organizational account. If you do not want to make your customizations public yet, fork it as a private repository.
2. Install the Software Dependencies above.
3. Clone your fork of the repository with git clone (or gh repo clone).
4. Start a Pixi shell:

pixi shell -e dev
Note

pixi shell starts a new shell with the specified environment active and on its $PATH. This is a good way to use the repository and its dependencies for development and testing. You can also run individual commands within dev (or any other environment) with pixi run:

pixi run -e dev dvc pull
5. Obtain a copy of the MIND data set, specifically the Validation and Test sets. Save these files in the data directory of the repository.
6. Obtain our public data, model checkpoints, and evaluation results (in a pixi shell):

dvc pull -r public
dvc pull

dvc pull is the DVC command to pull tracked artifacts (data files, etc.) from the DVC repository into the local tree. -r public tells it to pull from the public remote, which is the remote we have set up to share the publicly-distributable POPROX data and model files.

DVC remotes
DVC remotes are distinct from Git remotes; they are used by Data Version Control to store, share, and retrieve data artifacts. DVC supports S3, Azure Blob Storage, WebDAV, HDFS, and other repository access methods.
Repository Layout#
The recommender repo is organized into several directories for ease of navigation and modification:
src/
Contains the source code for the recommender pipeline components, evaluation logic, and other supporting code.
data/
Data used to evaluate (or train, if training is integrated into the repository) the recommender pipeline and components.
models/
Model checkpoints; this includes both pre-trained third-party models from sources like HuggingFace, and checkpoints for custom models. Training for those checkpoints can be integrated into the poprox-recommender repo and automated with DVC, or it can be done in a separate project or repository and the checkpoint files copied to this directory.
tests/
Test suite for the code in src/.
outputs/
Evaluation outputs.
Running the Evaluation#
You can re-run our evaluations with DVC:
dvc repro
This will ensure the entire chain of generating recommendations and measuring them against the test data is up-to-date with current files and code. As you add new configurations to test, you can connect them into the evaluation pipeline to reproducibly test them.
Tip

If you are using a CUDA-enabled Linux system, you can use the eval-cuda or dev-cuda Pixi environment, and set the environment variable POPROX_REC_DEVICE=cuda to use your GPU for batch inference.
Writing Components#
Todo
Document how to write new components.
The pipeline documentation describes how the POPROX recommendation pipelines are configured. To write new recommendation logic for POPROX, you will create or modify components to fit into these pipelines.
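Until this section is written, here is an illustrative sketch of the general shape of a pipeline component: a small callable object that transforms its input. The class, method signatures, and the (article id, score) input format below are hypothetical stand-ins for illustration, not the actual POPROX component API; consult the pipeline documentation for the real interfaces.

```python
from dataclasses import dataclass


@dataclass
class TopKSelector:
    """Hypothetical component: keep the k highest-scoring candidates.

    The (article_id, score) pair format is an assumption for this
    sketch; real POPROX components use the interfaces described in
    the pipeline documentation.
    """

    k: int = 10

    def __call__(self, scored):
        # Sort candidates by descending score and keep the top k ids.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return [article_id for article_id, _score in ranked[: self.k]]
```

Components written in this style are easy to configure (via constructor fields) and to unit-test in isolation before wiring them into a pipeline.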
Adding Dependencies#
Seeing Results#
For offline evaluation, you can view the current set of metrics with dvc:

dvc metrics show

The outputs are in outputs/:

- mind-val-metrics.csv contains the summary metrics (same as dvc shows), one value per algorithm
- mind-val-user-metrics.csv.gz contains user-level metrics for more in-depth analysis (e.g. variance or statistical inference)
- mind-val-recommendations.parquet contains all of the recommendation lists produced, from multiple pipeline stages (e.g. both top-K and final reranked or sampled recommendations)

dvc repro reproduces or updates these files.
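For analysis beyond dvc metrics show, the user-level file can be loaded directly with pandas. A minimal sketch, assuming the file has a column identifying the recommender plus one numeric column per metric (the column names here are illustrative; inspect the actual file's columns first):

```python
import pandas as pd


def summarize_user_metrics(path: str) -> pd.DataFrame:
    """Aggregate per-user metrics into mean/std per recommender.

    Assumes a `recommender` grouping column plus numeric metric
    columns; check `df.columns` against the real file, as these
    names are assumptions for this sketch.
    """
    df = pd.read_csv(path)  # pandas decompresses .csv.gz transparently
    return df.groupby("recommender").agg(["mean", "std"])


# e.g. summarize_user_metrics("outputs/mind-val-user-metrics.csv.gz")
```

The resulting frame has one row per recommender with mean and standard deviation for each metric, a convenient starting point for variance checks or statistical inference.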
Testing Code#
The POPROX recommender code includes a range of unit and integration tests to ensure that the code is functional and deployable. Most of these tests are run with pytest.

Some of the integration tests depend on serverless, which is installed via npm:

npm ci

You can run the tests with pytest:

pytest tests
Note
Currently, the integration tests only fully work on macOS and Linux. Some tests will be skipped on Windows.
We strongly encourage you to write tests for your own components.
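A component test can be a plain pytest function. This sketch tests a toy stand-in function; the top_k function and file name are hypothetical, not part of the POPROX codebase:

```python
# test_example_component.py -- illustrative pytest sketch; top_k is a
# toy stand-in for a real component, used only to show the test shape.


def top_k(scored, k):
    """Return the ids of the k highest-scoring (id, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _score in ranked[:k]]


def test_top_k_orders_by_score():
    scored = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
    assert top_k(scored, 2) == ["b", "c"]
```

Placed under tests/, files and functions named test_* are discovered automatically by pytest tests.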
Continuous Integration Testing#
Our repository is configured with GitHub Actions to test the code:

- Run the unit tests.
- Run the integration tests (with serverless).
- Run an integration test for deployment with the Docker image. This creates a Docker image with the recommender code and checkpoints, as it would be deployed to AWS Lambda, and tests that this image runs and correctly returns recommendation results.
This continuous integration depends on access to the DVC repository.
Todo
Document how to set up these credentials.
Deploying the Recommender#
To deploy manually, log in to the AWS CLI (aws sso login) with a user who has access to read from your DVC repo and to create container images, lambda functions, and CloudFormation deployments. Then run:
./deploy.sh
Automatic deployment from GitHub Actions is also possible.
Todo
Document automatic deployment.