Flyte is a powerful workflow automation platform designed for machine learning and data engineering tasks. This guide will walk you through the setup process, including configuring a virtual environment, installing dependencies, setting up Docker, and running Flyte in a sandbox environment.

1. Setting Up a Virtual Environment

Before installing Flyte, it’s recommended to use a virtual environment to manage dependencies and ensure an isolated setup.

Check Python Version and Install Virtual Environment

Run the following commands to ensure you have Python 3 installed and set up a virtual environment:

python3 --version
sudo apt update
sudo apt install python3-venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install poetry

Install Required Packages

If you are starting a new project, you can initialize a Poetry environment with:

poetry init

If you are adopting Flyte in an existing project, ensure that your dependencies are properly managed within Poetry. Run:

poetry install

This will install all dependencies defined in your project’s pyproject.toml file. If you haven’t defined Flyte dependencies yet, you may need to add them manually:

poetry add flytekit

Poetry will manage package versions and dependencies, making it easier to integrate Flyte into your existing workflow.
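
To confirm that the setup took effect, a short Python check can report whether the interpreter is running inside a virtual environment and whether flytekit can be imported. This is a minimal sketch using only the standard library; the helper names in_virtualenv and is_installed are illustrative and are not part of Flyte or Poetry.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(package: str) -> bool:
    """Return True when a top-level package can be located for import."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print(f"Python version : {sys.version.split()[0]}")
    print(f"Virtual env    : {in_virtualenv()}")
    print(f"flytekit found : {is_installed('flytekit')}")
```

Saved under a name of your choice and run with the activated environment's python, the script should report True on both of the last two lines once the steps above have completed.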

2. Installing Docker

Flyte requires Docker to run in a sandbox environment. Follow these steps to install Docker on your system.

Install Docker Dependencies

sudo apt-get update
sudo apt-get install ca-certificates curl

Add Docker’s Official GPG Key

sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

Add Docker Repository to Apt Sources

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Install Docker Engine and CLI

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Verify Docker Installation

Confirm that the Docker daemon has created its socket by listing it. The output also shows the socket's owner and group, which matters for the next step:

ls -l /var/run/docker.sock
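
The same check can be scripted, which is handy when the setup is automated. The sketch below uses only the standard library and assumes the default socket path shown above; docker_socket_status is a hypothetical helper name. An "accessible" value of False usually means the current user cannot talk to the daemon without sudo yet.

```python
import os
import stat


def docker_socket_status(path: str = "/var/run/docker.sock") -> dict:
    """Report whether the path is a Unix socket the current user can use."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, or no permission to inspect it
        return {"exists": False, "is_socket": False, "accessible": False}
    return {
        "exists": True,
        "is_socket": stat.S_ISSOCK(mode),
        # Talking to the daemon needs read and write access on the socket
        "accessible": os.access(path, os.R_OK | os.W_OK),
    }


if __name__ == "__main__":
    print(docker_socket_status())
```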

Add User to Docker Group

sudo usermod -aG docker $(whoami)
newgrp docker

Reboot your system to apply these changes:

sudo reboot

3. Installing flytectl (Flyte CLI)

Flyte’s command-line tool, flytectl, allows you to manage projects, workflows, and tasks.

Install jq and flytectl

sudo apt install jq
curl -sL https://ctl.flyte.org/install | sudo bash -s -- -b /usr/local/bin

4. Starting the Flyte Sandbox

flytectl demo start

After running this command, you should see output indicating that Flyte is running. The Flyte UI will be available at:

http://localhost:30080/console

Additional services such as MinIO (for storage) will also be available:

http://localhost:30080/minio
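
The sandbox can take a little while to come up, so it may help to poll the console URL rather than refresh the browser. The following is a minimal sketch using only the standard library; is_reachable and wait_until_reachable are hypothetical helper names, and the attempt count and delay are arbitrary values chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar
        return False


def wait_until_reachable(url: str, attempts: int = 10, delay: float = 3.0) -> bool:
    """Poll the URL a fixed number of times, sleeping between attempts."""
    for attempt in range(attempts):
        if is_reachable(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    console = "http://localhost:30080/console"
    reachable = wait_until_reachable(console, attempts=3, delay=1.0)
    print("Flyte console reachable:", reachable)
```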

5. Example Flyte Project Directory Structure

When setting up a Flyte project, your directory structure may look something like this:

flyte_project/
├── pyproject.toml
├── poetry.lock
├── dataset
└── workflow_codes/
    ├── imageSpec.yaml
    ├── project_definition.yaml
    └── codes/
        ├── __init__.py
        ├── workflow.py
        ├── tasks.py
        └── utils.py

Explanation of Files and Directories

  • pyproject.toml – Configuration file for Poetry dependencies.
  • poetry.lock – Auto-generated file that locks dependency versions.
  • dataset – Directory containing the dataset. It is copied into the container image when we run the example workflow.
  • workflow_codes/ – Main directory containing Flyte workflows.
    • imageSpec.yaml – Defines the Docker image specification for Flyte workflows.
    • project_definition.yaml – Defines the Flyte project metadata.
    • codes/ – Contains Python files defining Flyte workflows, tasks, and utilities.

6. Creating a New Flyte Project

Once Flyte is running, you can create a new project to organize your workflows.

Define Project Configuration

Ensure you have a project definition file (e.g., workflow_codes/project_definition.yaml) and populate it with necessary metadata.

Create Project in Flyte

export FLYTECTL_CONFIG=~/.flyte/config-sandbox.yaml
flytectl create project --file ./workflow_codes/project_definition.yaml

project_definition.yaml example

id: "flyteproject"
name: "flyteproject"
description: "The pipeline that orchestrates the data processing and training of my project."

Example workflow

Once your project is set up, you can define and execute workflows. Here’s an example of a simple Flyte workflow:

# workflow.py
from pathlib import Path

from flytekit import task, workflow

# Define the dataset path based on the copied directory inside the container
DATASET_DIR = Path("/dataset")

# Parameterized types such as list[str] let Flyte type-check the data
# passed between tasks.

@task
def list_files_in_dataset() -> list[str]:
    """Lists all entries inside the dataset directory."""
    if not DATASET_DIR.exists():
        raise FileNotFoundError(f"Dataset directory {DATASET_DIR} not found!")

    return [str(file) for file in DATASET_DIR.iterdir()]

@task
def say_hello(name: str, dataset_files: list[str]) -> str:
    """Returns a greeting along with the dataset files."""
    dataset_info = f"Dataset contains {len(dataset_files)} files: {', '.join(dataset_files)}" if dataset_files else "Dataset is empty."
    return f"Hello, {name}! {dataset_info}"

@workflow
def greeting_workflow(name: str) -> str:
    """Workflow that lists dataset files and returns a greeting."""
    dataset_files = list_files_in_dataset()
    return say_hello(name=name, dataset_files=dataset_files)
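
The greeting logic can be checked before anything is sent to the cluster. The sketch below mirrors the body of say_hello as a plain function, so it runs without flytekit installed; format_greeting is a hypothetical helper name, and the file paths are made-up sample values.

```python
def format_greeting(name: str, dataset_files: list[str]) -> str:
    """Plain-Python mirror of the say_hello task body."""
    if dataset_files:
        dataset_info = (
            f"Dataset contains {len(dataset_files)} files: "
            f"{', '.join(dataset_files)}"
        )
    else:
        dataset_info = "Dataset is empty."
    return f"Hello, {name}! {dataset_info}"


if __name__ == "__main__":
    # prints: Hello, Flyte! Dataset is empty.
    print(format_greeting("Flyte", []))
    # prints: Hello, Flyte! Dataset contains 2 files: /dataset/a.csv, /dataset/b.csv
    print(format_greeting("Flyte", ["/dataset/a.csv", "/dataset/b.csv"]))
```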

7. Running workflows in Flyte Sandbox Cluster

To execute the workflow on the Flyte sandbox cluster, use the pyflyte run command. This ensures the workflow is executed in a Flyte-managed environment.

Execute the workflow in Flyte Sandbox

pyflyte run --remote --image ./workflow_codes/imageSpec.yaml -p flyteproject -d development ./workflow_codes/codes/workflow.py greeting_workflow --name Flyte

Explanation of Command Arguments

  • --remote → Runs the workflow in the Flyte sandbox cluster instead of locally.
  • --image ./workflow_codes/imageSpec.yaml → Specifies the Docker image to use for execution.
  • -p flyteproject → Specifies the Flyte project name.
  • -d development → Specifies the domain (development/staging/production).
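  • ./workflow_codes/codes/workflow.py greeting_workflow → Specifies the script and the workflow function to run. Workflow inputs follow as options named after the workflow's parameters (for example, --name for the name parameter).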
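
When the same workflow is launched repeatedly with different inputs, it can be convenient to assemble the command in Python. The sketch below only builds and prints the argument list from the flags explained above; build_pyflyte_run is a hypothetical helper, and it assumes each workflow input is passed as an option named after the corresponding parameter.

```python
import shlex
from typing import Dict, List, Optional


def build_pyflyte_run(
    image: str,
    project: str,
    domain: str,
    script: str,
    workflow_name: str,
    inputs: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a remote pyflyte run command as an argument list."""
    command = [
        "pyflyte", "run", "--remote",
        "--image", image,
        "-p", project,
        "-d", domain,
        script, workflow_name,
    ]
    # Workflow inputs come last, one option per parameter
    for key, value in (inputs or {}).items():
        command += [f"--{key}", str(value)]
    return command


if __name__ == "__main__":
    command = build_pyflyte_run(
        image="./workflow_codes/imageSpec.yaml",
        project="flyteproject",
        domain="development",
        script="./workflow_codes/codes/workflow.py",
        workflow_name="greeting_workflow",
        inputs={"name": "Flyte"},
    )
    # Print a copy-pasteable shell command
    print(shlex.join(command))
```

The returned list can be handed to subprocess.run(command, check=True) to submit the execution from a script.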

imageSpec.yaml example

python_version: 3.12.3
registry: localhost:30000
packages:
  - beautifulsoup4
  - pandas
  - tqdm
  - chardet
  - ujson
  - lxml
  - jupyter
  - tokenizers
  - loguru
  - pytz
  - spacy
  - plotly
  - datasets
  - csvkit
  - lz4
  - dask[dataframe]
  - distributed
  - cloudpickle
  - flytekit
  - https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
copy:
  - dataset
env:
  Debug: "True"

Explanation

  • python_version: Specifies the Python version.
  • registry: Uses a local Docker registry.
  • packages: List of dependencies for Flyte workflows.
  • copy: Directories to copy into the container.
  • env: Environment variables.