Building on poprox-recommender#
Warning
This guide is still under development. It is possible that essential steps have been omitted. Use at your own risk!
Prerequisites#
The POPROX recommender code has several software dependencies, as well as requirements for data storage, deployment, and hardware.
Software Dependencies#
To begin working with the code, you need the following:
- git
- Pixi, our dependency manager

All other dependencies (including DVC) are specified in our Pixi dependency file.
Installing prerequisites
On macOS, you can install the prerequisites with Homebrew:
brew install git pixi
On Windows, prerequisites are available with winget (included in recent Windows versions):
winget install Git.Git
winget install prefix-dev.pixi
On Linux, you should:

1. Install git from your distribution’s package collection.
2. Fetch the latest Pixi binary, or install with:

curl -fsSL https://pixi.sh/install.sh | bash
Once you have Pixi installed and the repository cloned, you can start a shell with access to the development dependencies:
pixi shell -e dev
Data Dependencies#
To build and evaluate a recommender, you need a repository to store the data (training data, checkpoints, and output files) and share them with your team. We ship the code with read-only access to a repository we provide with MIND outputs, but you will need a repository for your own outputs (unless you will only be working on a single machine and not deploying).
Any DVC remote type can be used for this repository; the easiest is an S3 bucket. Once you have created your bucket, you can add it as the default remote to DVC (run from within a Pixi shell, noted above):
dvc remote add cloud s3://my-bucket-name
dvc config core.remote cloud
If you have the AWS CLI installed and log in with aws sso login, DVC will use that authentication session automatically. If you want to use access keys, you can store your AWS credentials in ~/.aws/credentials, in environment variables, or in the file .dvc/config.local.
.dvc/config.local example

.dvc/config.local is a local (not shared via Git) configuration file for DVC. You can use it to store credentials (a shared access key):
['remote "cloud"']
access_key_id = "<ACCESS KEY ID>"
secret_access_key = "<SECRET ACCESS KEY>"
Deployment Requirements#
To deploy your recommender using our template, you will need AWS credentials capable of creating lambdas, containers, and CloudFormation deployments, and accessing S3 (if that is where you have your repository stored); see Deploying the Recommender below.
Deployment is handled via Serverless v3, and is automated by the deploy.sh script and the .github/workflows/deploy.yml GitHub Actions workflow.
Hardware Requirements#
For working on the code and evaluating recommender outputs, the hardware requirements are relatively modest: a reasonable laptop with sufficient CPU, memory, etc. for software development and basic Python analytics computing. The software dependencies and data take 2–5 GB (5–10 GB on Linux, for the CUDA-based components).
Disk usage

By default, Pixi uses hard links to share space between your installed software environments and the Pixi cache, so long as both are on the same filesystem (Pixi stores its cache in your home directory; ~/.cache/pixi on Linux or a similar location on macOS and Windows). If your project environment is on a different filesystem, then Pixi cannot share space. You can use the Pixi “detached environments” feature to put the environment in your home directory.
To ensure data safety (preventing programs from modifying data stored in the content-addressable storage cache), Data Version Control may make two copies of the data files it tracks. When possible, it uses copy-on-write or reflinks to present two copies while only using one copy’s worth of space on disk. This is supported by APFS on macOS (the default for the last several versions), and by XFS and BTRFS on Linux. ext4 and ZFS do not support reflinks, and neither does any Windows filesystem. You can reduce the space needed by configuring DVC to use alternate link strategies, at the risk of some safety:
dvc config --local cache.type 'reflink,hardlink'
For batch-generating recommendations over the test data, a GPU is very helpful; both A40 and A4000 chips significantly accelerate this process.
Getting Started#
1. Fork the poprox-recommender repository into your personal or organizational account. If you do not want to make your customizations public yet, fork it as a private repository.
2. Install the Software Dependencies above.
3. Clone your fork of the repository with git clone (or gh repo clone).
4. Start a Pixi shell:

pixi shell -e dev
Note

pixi shell starts a new shell with the specified environment active and on its $PATH. This is a good way to use the repository and its dependencies for development and testing. You can also run individual commands within dev (or any other environment) with pixi run:

pixi run -e dev dvc pull
5. Obtain a copy of the MIND data set, specifically the Validation and Test sets. Save these files in the data directory of the repository.
6. Obtain our public data, model checkpoints, and evaluation results (in a pixi shell):

dvc pull -r public
dvc pull

dvc pull is the DVC command to pull tracked artifacts (data files, etc.) from the DVC repository into the local tree. -r public tells it to pull from the public remote, which is the remote we have set up to share the publicly-distributable POPROX data and model files.

DVC remotes
DVC remotes are distinct from Git remotes; they are used by Data Version Control to store, share, and retrieve data artifacts. DVC supports S3, Azure Blob Storage, WebDAV, HDFS, and other repository access methods.
Repository Layout#
The recommender repo is organized into several directories for ease of navigation and modification:
src/
Contains the source code for the recommender pipeline components, evaluation logic, and other supporting code.
data/
Data used to evaluate (or train, if training is integrated into the repository) the recommender pipeline and components.
models/
Model checkpoints; this includes both pre-trained third-party models from sources like HuggingFace, and checkpoints for custom models. Training for those checkpoints can be integrated into the poprox-recommender repo and automated with DVC, or it can be done in a separate project or repository and the checkpoint files copied to this directory.
tests/
Test suite for the code in src/.
outputs/
Evaluation outputs.
Running the Evaluation#
You can re-run our evaluations with DVC:
dvc repro
This will ensure the entire chain of generating recommendations and measuring them against the test data is up-to-date with current files and code. As you add new configurations to test, you can connect them into the evaluation pipeline to reproducibly test them.
Tip

If you are using a CUDA-enabled Linux system, you can use the eval-cuda or dev-cuda Pixi environment, and set the environment variable POPROX_REC_DEVICE=cuda to use your GPU for batch inference.
Writing Components#
Todo
Document how to write new components.
The pipeline documentation describes how the POPROX recommendation pipelines are configured. To write new recommendation logic for POPROX, you will create or modify components to fit into these pipelines.
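Until this section is written, here is an illustrative sketch of the general shape of a pipeline component: a small callable object that transforms its input. The class, method signatures, and the (article id, score) input format below are hypothetical stand-ins for illustration, not the actual POPROX component API; consult the pipeline documentation for the real interfaces.

```python
from dataclasses import dataclass


@dataclass
class TopKSelector:
    """Hypothetical component: keep the k highest-scoring candidates.

    The (article_id, score) pair format is an assumption for this
    sketch; real POPROX components use the interfaces described in
    the pipeline documentation.
    """

    k: int = 10

    def __call__(self, scored):
        # Sort candidates by descending score and keep the top k ids.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return [article_id for article_id, _score in ranked[: self.k]]
```

Components written in this style are easy to configure (via constructor fields) and to unit-test in isolation before wiring them into a pipeline.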
Adding Dependencies#
Seeing Results#
For offline evaluation, you can view the current set of metrics with dvc:

dvc metrics show

The outputs are in outputs/:

- mind-val-metrics.csv contains the summary metrics (same as dvc shows), one value per algorithm
- mind-val-user-metrics.csv.gz contains user-level metrics for more in-depth analysis (e.g. variance or statistical inference)
- mind-val-recommendations.parquet contains all of the recommendation lists produced, from multiple pipeline stages (e.g. both top-K and final reranked or sampled recommendations)

dvc repro reproduces or updates these files.
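For analysis beyond dvc metrics show, the user-level file can be loaded directly with pandas. A minimal sketch, assuming the file has a column identifying the recommender plus one numeric column per metric (the column names here are illustrative; inspect the actual file's columns first):

```python
import pandas as pd


def summarize_user_metrics(path: str) -> pd.DataFrame:
    """Aggregate per-user metrics into mean/std per recommender.

    Assumes a `recommender` grouping column plus numeric metric
    columns; check `df.columns` against the real file, as these
    names are assumptions for this sketch.
    """
    df = pd.read_csv(path)  # pandas decompresses .csv.gz transparently
    return df.groupby("recommender").agg(["mean", "std"])


# e.g. summarize_user_metrics("outputs/mind-val-user-metrics.csv.gz")
```

The resulting frame has one row per recommender with mean and standard deviation for each metric, a convenient starting point for variance checks or statistical inference.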
Testing Code#
The POPROX recommender code includes a range of unit and integration tests to ensure that the code is functional and deployable. Most of these tests are run with pytest.

Some of the integration tests depend on serverless, which is installed via npm:

npm ci

You can run the tests with pytest:

pytest tests
Note
Currently, the integration tests only fully work on macOS and Linux. Some tests will be skipped on Windows.
We strongly encourage you to write tests for your own components.
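A component test can be a plain pytest function. This sketch tests a toy stand-in function; the top_k function and file name are hypothetical, not part of the POPROX codebase:

```python
# test_example_component.py -- illustrative pytest sketch; top_k is a
# toy stand-in for a real component, used only to show the test shape.


def top_k(scored, k):
    """Return the ids of the k highest-scoring (id, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _score in ranked[:k]]


def test_top_k_orders_by_score():
    scored = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
    assert top_k(scored, 2) == ["b", "c"]
```

Placed under tests/, files and functions named test_* are discovered automatically by pytest tests.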
Continuous Integration Testing#
Our repository is configured with GitHub Actions to test the code:

- Run the unit tests.
- Run the integration tests (with serverless).
- Run an integration test for deployment with the Docker image. This creates a Docker image with the recommender code and checkpoints, as it would be deployed to AWS Lambda, and tests that this image runs and correctly returns recommendation results.
This continuous integration depends on access to the DVC repository.
Todo
Document how to set up these credentials.
Deploying the Recommender#
To deploy manually, log in to the AWS CLI (aws sso login) with a user who has access to read from your DVC repo and to create container images, lambda functions, and CloudFormation deployments. Then run:
./deploy.sh
Automatic deployment from GitHub Actions is also possible.
Todo
Document automatic deployment.