
Feature Guide

MLOps aims to make AI model engineering transparent and manageable. The workspace is organized by algorithm application category, and every category shares the same core functional domains. The sections below break down the MLOps module's architecture and core capabilities.

1. Unified Algorithm Scenario Management

Scenario management is the backbone of the module: it gives each business scenario its own soft-isolated space on top of the shared physical and logical infrastructure.

  • Multi-source heterogeneous model integration: All six built-in algorithm scenarios (anomaly detection, time series prediction, log clustering, text classification, image classification, object detection) share one information architecture and workflow, which closes the governance gap users face when managing different scenarios and vendor models separately.
  • Pluggable Algorithm Configuration: Each algorithm in a scenario (such as ECOD, Prophet, or Spell) is managed dynamically through the algorithm configuration table in the database. Configuration items include the algorithm identifier (name), display name (display_name), scenario description (scenario_description), Docker training/inference image (image), and dynamic form definition (form_config). Administrators can enable or disable individual algorithms through the interface. When an administrator tries to disable an algorithm, the system checks whether any active training task uses it and rejects the request if so, which keeps in-use algorithms from being taken offline by accident. All built-in algorithm presets can be imported in one step with an initialization command at system startup.
  • Built-in Algorithm List:
    • Anomaly Detection: ECOD (multi-dimensional anomaly point detection), EWMA (progressive time series anomaly detection), PELT (changepoint/state transition detection)
    • Time Series Prediction: Prophet (trend and seasonality forecasting)
    • Log Clustering: Spell (online log template mining)
    • Text Classification: XGBoost, GradientBoosting, RandomForest
    • Image Classification: YOLOClassification
    • Object Detection: YOLODetection
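
The disable guard described under "Pluggable Algorithm Configuration" can be sketched roughly as follows. This is an illustrative in-memory model, not the platform's code: the field names name, display_name, scenario_description, image, and form_config come from the list above, while is_active, AlgorithmInUseError, disable_algorithm, and the image path are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


class AlgorithmInUseError(Exception):
    """Raised when someone tries to disable an algorithm that is still in use."""


@dataclass
class AlgorithmConfig:
    # These five fields mirror the configuration items named in the guide.
    name: str
    display_name: str
    scenario_description: str
    image: str
    form_config: dict = field(default_factory=dict)
    # Hypothetical flag: the guide does not name the real enable/disable column.
    is_active: bool = True


def disable_algorithm(config: AlgorithmConfig, algorithms_in_active_jobs: set) -> AlgorithmConfig:
    """Disable an algorithm unless an active training task still uses it."""
    if config.name in algorithms_in_active_jobs:
        raise AlgorithmInUseError(f"{config.name} is used by an active training task")
    config.is_active = False
    return config


ecod = AlgorithmConfig(
    name="ECOD",
    display_name="ECOD",
    scenario_description="Multi-dimensional anomaly point detection",
    image="registry.example.com/mlops/ecod:latest",  # placeholder image path
)
disable_algorithm(ecod, algorithms_in_active_jobs=set())
print(ecod.is_active)  # False
```

Passing a set that contains "ECOD" instead would raise AlgorithmInUseError and leave the configuration enabled.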

2. Dataset Management

"A model becomes what it eats." This module is more than file storage: it is where the standards for the data fed to models are set.

  • Structured sample management and pre-labeling: Beyond basic upload and CRUD operations for files and images across scenarios, each entry can be marked for use as "training" (is_train_data), "validation" (is_val_data), or "test" (is_test_data) data. The three flags are not mutually exclusive and can be combined. Without this delineation after data ingestion, the rest of the training cycle becomes uncontrollable.
  • Versioned snapshot releases: Prepared data can be published as a version baseline snapshot, which can then be archived, restored, or downloaded in full as a compressed package at any time. Version status moves through pending publication → publishing → published. Manual archiving and restoring from the archive are also supported, helping teams manage the lifecycle of historical versions. This prevents ongoing manual annotation from overwriting the data that earlier high-value experiment models depend on.
  • Organization isolation: Datasets and their sub-resources (training data entries, published versions) are bound to specific organizations through the team field. Child resource permissions automatically inherit from the parent dataset's organization without additional configuration. When starting a training task, the platform validates that the associated dataset version's organization matches the task's organization, preventing cross-organization data misuse.
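
The version lifecycle above can be read as a small state machine. The sketch below is one possible interpretation: the guide states the pending publication → publishing → published chain and says archiving and restoring are supported, but the assumption that archiving starts from "published" and that restoring returns to "published" is ours, as are the names VersionStatus, ALLOWED_TRANSITIONS, and transition.

```python
from enum import Enum


class VersionStatus(Enum):
    PENDING = "pending"
    PUBLISHING = "publishing"
    PUBLISHED = "published"
    ARCHIVED = "archived"


# Allowed moves. The last two entries (archive and restore) are assumptions
# about where archiving starts and where restoring ends.
ALLOWED_TRANSITIONS = {
    VersionStatus.PENDING: {VersionStatus.PUBLISHING},
    VersionStatus.PUBLISHING: {VersionStatus.PUBLISHED},
    VersionStatus.PUBLISHED: {VersionStatus.ARCHIVED},
    VersionStatus.ARCHIVED: {VersionStatus.PUBLISHED},
}


def transition(current: VersionStatus, target: VersionStatus) -> VersionStatus:
    """Return the new status, or raise ValueError for a disallowed move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = VersionStatus.PENDING
for target in (
    VersionStatus.PUBLISHING,
    VersionStatus.PUBLISHED,
    VersionStatus.ARCHIVED,   # manual archive
    VersionStatus.PUBLISHED,  # restore from archive
):
    status = transition(status, target)
print(status.value)  # published
```

Keeping the allowed moves in a single table makes it easy to see, and to test, that a version cannot jump straight from pending to published.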

3. Training Task Orchestration and Observation (Training)

Turns training that previously ran as Python scripts from the console into a visual, controllable state machine on the platform.

  • Controlled periodic engine orchestration: Every task must be bound to a specific historical dataset and a unique model algorithm parameter form, and can be started and stopped entirely from the UI. The platform uses optimistic locking (the claim_train_job_running atomic lock) to prevent concurrent duplicate launches of the same task, and automatically rolls back the task state if the launch fails. Old training containers are cleaned up before each start so that every run begins from a clean state.
  • Celery asynchronous status polling: After training starts, Celery Beat schedules an asynchronous task (poll_train_job_status) that queries the latest run status from MLflow every 30 seconds (for up to approximately 3 hours). If 10 or more consecutive MLflow queries raise exceptions, the platform checks the actual state of the training container via webhookd. If the container is still running, polling drops to a low frequency (every 5 minutes); if the container has disappeared, the task is marked as failed. This way a broken monitoring chain neither fails a healthy training run nor leaves the task state deadlocked.
  • MLflow Run management: Each "Start Training" creates a new run record in MLflow. The platform lists all historical runs for the current training task (including duration, start/end times, and status), supports viewing a single run's parameters (run_params) and metric history curves (metrics_history), and allows unwanted historical runs to be soft-deleted (marked as deleted without data loss, so they can be recovered). Runs that are still in progress are protected and cannot be deleted.
  • Model artifact download: For any completed run, the platform can pull the model files from MLflow artifacts and package them as a ZIP for direct browser download, for offline analysis or external deployment (available for the image classification and object detection scenarios as well as the traditional ML scenarios).
  • Hyperparameter configuration sync to MinIO: When creating or modifying a training task, the platform merges hyperparameter configuration (hyperopt_config) with the complete MLflow/model configuration, generates a JSON file, and uploads it to MinIO as the training container's configuration source. If MinIO upload fails, the entire save operation rolls back transactionally to prevent database-file inconsistency.
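
The fallback logic of the status polling described above can be sketched as a pure decision function. This is a simplified illustration rather than the platform's poll_train_job_status task: the 30-second interval, the 5-minute degraded interval, and the threshold of 10 consecutive errors come from the text, while PollDecision, decide_next_poll, and the action names are hypothetical. The cap of roughly 3 hours is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

NORMAL_INTERVAL_S = 30     # regular MLflow polling interval
DEGRADED_INTERVAL_S = 300  # low-frequency polling (5 minutes)
ERROR_THRESHOLD = 10       # consecutive MLflow query exceptions


@dataclass
class PollDecision:
    action: str              # "poll" or "mark_failed" (hypothetical labels)
    delay_s: Optional[int]   # seconds until the next poll, None if polling stops


def decide_next_poll(consecutive_errors: int, container_running: bool) -> PollDecision:
    """Decide what to do after one polling attempt.

    container_running stands for the result of the webhookd container check;
    it is only consulted once the error threshold has been reached.
    """
    if consecutive_errors < ERROR_THRESHOLD:
        return PollDecision("poll", NORMAL_INTERVAL_S)
    if container_running:
        # MLflow is unreachable but training is alive: keep watching, less often.
        return PollDecision("poll", DEGRADED_INTERVAL_S)
    # MLflow is unreachable and the container is gone: stop and fail the task.
    return PollDecision("mark_failed", None)


print(decide_next_poll(0, True))
print(decide_next_poll(10, True))
print(decide_next_poll(12, False))
```

Separating the decision from the Celery scheduling keeps the fallback rules testable without a broker, MLflow, or webhookd.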

4. Capability Publishing and Real-Time Inference (Serving)

Takes trained models and exposes them as service endpoints that external systems can call continuously.

  • One-click model deployment: The platform is integrated with the actual runtime, so a new inference service can be published and its container resources allocated in one click. This removes the slow process in which data scientists must ask operations engineers to set up K8s applications and Nginx rules. Container orchestration is dispatched through the platform's underlying webhookd executor and is compatible with both docker and kubernetes runtimes. The inference image is specified dynamically by each scenario's "Algorithm Configuration," so the platform is not tied to any fixed packaging framework.
  • Full lifecycle service hosting:
    • Create: When creating a service, the platform automatically calls webhookd to launch the inference container. If a same-named container already exists, it syncs the container's current state.
    • Start: Manually start a stopped service, handling the case where a container already exists.
    • Stop: Stop and delete the container (releasing its port) while retaining the service record.
    • Remove: Force delete the container (can handle running containers).
    • Delete Record: Clean up the container first, then delete the database record only on success, preventing zombie container residue.
    • Configuration change auto-restart: When updating model version, associated training task, or specified port, if the container is running, the platform deletes the old container and launches a new one to apply changes.
  • Container state real-time sync: Each time a list or detail page is opened, the platform batch-queries the actual container state (state, port, etc.) from webhookd and writes it back to the database. If the webhookd request fails, the interface falls back to the last state stored in the database and shows an error indicator, so the page remains usable.
  • Inference address dynamic construction: The inference URL is not persisted. Instead, it is assembled on each request according to the runtime mode:
    • Docker mode: http://{serving_id}:3000/predict (container name addressing)
    • Kubernetes mode: http://{service_name}.{namespace}.svc.cluster.local:3000/predict (in-cluster DNS)
    • Host mode: http://{DEFAULT_ZONE_VAR_NODE_SERVER_URL_host}:{port}/predict
  • Sandbox visual inference experience: The web interface includes a built-in online inference workspace where you can submit data and instantly view model results without writing code or integrating APIs.
  • Online inference request scale limit protection: To protect the stability and response quality of inference services, the platform enforces default limits on the data volume of a single online inference (Predict) request. For text and structured data scenarios (anomaly detection, time series prediction, log clustering, text classification), a request can contain up to 10,000 records. For image scenarios (image classification, object detection), a request can contain up to 100 images, each no larger than 10 MB. Requests that exceed these limits are rejected with a clear error message. If your business scenario requires higher throughput, platform administrators can adjust the limit in the deployment configuration through the MLOPS_PREDICT_MAX_BATCH_SIZE environment variable.
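
The URL patterns and request limits above can be illustrated with a short sketch. The three URL templates and the numeric limits are taken from the text; the function names, the mode strings, and the reading of "10 MB" as 10 × 1024 × 1024 bytes are assumptions made for the example, and the MLOPS_PREDICT_MAX_BATCH_SIZE override is not modeled.

```python
from typing import Optional, Sequence

INFERENCE_PORT = 3000                # fixed port in the Docker/Kubernetes patterns
MAX_RECORDS = 10_000                 # text and structured data scenarios
MAX_IMAGES = 100                     # image scenarios
MAX_IMAGE_BYTES = 10 * 1024 * 1024   # assumption: "10 MB" read as 10 MiB


def _require(**values) -> None:
    """Raise ValueError if any named argument is missing."""
    missing = [key for key, value in values.items() if value is None]
    if missing:
        raise ValueError(f"missing arguments: {', '.join(missing)}")


def build_predict_url(mode: str, *, serving_id: Optional[str] = None,
                      service_name: Optional[str] = None,
                      namespace: Optional[str] = None,
                      host: Optional[str] = None,
                      port: Optional[int] = None) -> str:
    """Assemble the inference URL for the given runtime mode."""
    if mode == "docker":
        _require(serving_id=serving_id)
        return f"http://{serving_id}:{INFERENCE_PORT}/predict"
    if mode == "kubernetes":
        _require(service_name=service_name, namespace=namespace)
        return f"http://{service_name}.{namespace}.svc.cluster.local:{INFERENCE_PORT}/predict"
    if mode == "host":
        _require(host=host, port=port)
        return f"http://{host}:{port}/predict"
    raise ValueError(f"unknown runtime mode: {mode}")


def check_record_batch(records: Sequence) -> None:
    """Reject text/structured requests above the default record limit."""
    if len(records) > MAX_RECORDS:
        raise ValueError(f"{len(records)} records exceeds the limit of {MAX_RECORDS}")


def check_image_batch(images: Sequence[bytes]) -> None:
    """Reject image requests with too many images or an oversized image."""
    if len(images) > MAX_IMAGES:
        raise ValueError(f"{len(images)} images exceeds the limit of {MAX_IMAGES}")
    for index, image in enumerate(images):
        if len(image) > MAX_IMAGE_BYTES:
            raise ValueError(f"image {index} is larger than {MAX_IMAGE_BYTES} bytes")


print(build_predict_url("docker", serving_id="serving-1"))
print(build_predict_url("kubernetes", service_name="serving-1", namespace="mlops"))
```

Because nothing is persisted, a change of runtime or port takes effect the next time the URL is built.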

5. Multi-Organization Permission Isolation

MLOps root resources (datasets, training tasks, capability publishing) support binding to one or multiple organizations (teams) for hard isolation of cross-department data and models.

  • Resource ownership binding: Datasets, training tasks, and capability publishing must specify their organization when created. List queries are automatically filtered by the organization of the current login, so non-superusers see only the resources that belong to their organization.
  • Child resource auto-inheritance: Training data entries and dataset published versions do not hold separate team fields; permissions are inherited from the parent dataset's organization (ORM filters via dataset__team).
  • Pre-training organization validation: Before starting a training task, the platform validates that the associated dataset version's organization matches the training task's organization, preventing cross-organization data misuse.
  • Inference service organization validation: When creating capability publishing, the platform validates that the associated training task's organization matches the current organization, preventing cross-organization model service launching.
  • MLflow run scope isolation: Metric history, run parameters, run deletion, model download, and other side-channel interfaces first validate that the target run belongs to a training task the current user can access, so a forged run_id cannot expose another organization's training data.
  • Superuser exception: Superusers (is_superuser) are not subject to organization filters and can view full resources across organizations, suitable for platform O&M and troubleshooting scenarios.
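
The filtering and validation rules in this section can be approximated with plain Python. This is a sketch under assumptions, not the platform's ORM code: Resource, User, visible_resources, and organizations_match are hypothetical names, and "matches" is interpreted here as "shares at least one organization", which the guide does not spell out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Resource:
    name: str
    team: Tuple[int, ...]  # organization IDs the resource is bound to


@dataclass(frozen=True)
class User:
    current_team: int          # organization selected at login
    is_superuser: bool = False


def visible_resources(user: User, resources: Iterable[Resource]) -> List[Resource]:
    """Apply the organization filter; superusers bypass it."""
    if user.is_superuser:
        return list(resources)
    return [r for r in resources if user.current_team in r.team]


def organizations_match(task: Resource, dataset: Resource) -> bool:
    """Pre-training check. Assumption: a match means at least one shared organization."""
    return bool(set(task.team) & set(dataset.team))


datasets = [
    Resource("ops-metrics", team=(1,)),
    Resource("shared-logs", team=(1, 2)),
    Resource("finance-docs", team=(3,)),
]
analyst = User(current_team=2)
admin = User(current_team=1, is_superuser=True)

print([r.name for r in visible_resources(analyst, datasets)])  # ['shared-logs']
print(len(visible_resources(admin, datasets)))                 # 3
```

Child resources such as training data entries would reuse the same check through their parent dataset rather than holding a team value of their own.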

6. NATS Data Interface

MLOps exposes two standardized interfaces through the NATS message bus for other BK-Lite modules (such as alerts and monitoring) to query resource lists:

  • get_mlops_module_list: Returns an enumeration list of all modules and their sub-modules (Datasets / Training Tasks / Capability Publishing × 6 algorithm scenarios = 18 sub-modules).
  • get_mlops_module_data: Queries corresponding resource id/name lists by module, sub-module, pagination parameters, and organization ID, returning up to MLOPS_NATS_MAX_PAGE_SIZE (default 500) records per query.
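
The shape of the two responses can be sketched without a NATS client. The counts (3 modules × 6 scenarios = 18 sub-modules) and the default page size of 500 come from the text; the identifier strings, the dictionary layout, and the choice to clamp an oversized page size instead of rejecting it are assumptions made for illustration.

```python
# Hypothetical identifiers; the guide names the modules and scenarios
# but not the keys actually used on the wire.
MODULES = ["dataset", "train_job", "serving"]
SCENARIOS = [
    "anomaly_detection",
    "timeseries_predict",
    "log_clustering",
    "text_classification",
    "image_classification",
    "object_detection",
]
DEFAULT_MAX_PAGE_SIZE = 500  # default of MLOPS_NATS_MAX_PAGE_SIZE


def build_module_list() -> list:
    """Enumerate every module with one sub-module per algorithm scenario."""
    return [
        {"name": module, "children": [{"name": f"{scenario}_{module}"} for scenario in SCENARIOS]}
        for module in MODULES
    ]


def clamp_page_size(requested: int, max_page_size: int = DEFAULT_MAX_PAGE_SIZE) -> int:
    """Keep the page size within 1..max_page_size (clamping is an assumption)."""
    return max(1, min(requested, max_page_size))


module_list = build_module_list()
print(sum(len(module["children"]) for module in module_list))  # 18
print(clamp_page_size(2000))  # 500
```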

🖼️ UI Guide:

Capability Publishing and Online Inference Workspace

  • Configuration logic: This page is where you try out a model service after deploying it. Whatever the underlying algorithm scenario, from CV image classification to log analysis, you enter correctly formatted input in the input area and click the submit button. The platform then forwards the request through its built-in service firewall to the inference service in real time and displays the results in the result panel on the right.

⚠️ Warning / Security Best Practices:

  1. Once data has been used for training or published as an immutable version baseline, its original mappings cannot be forcibly deleted.
  2. Publishing a model service allocates real containers from the platform's underlying container pool and occupies host ports. After periodic batch prediction jobs finish, administrators should go to the "Capability Publishing" area in the backend and manually stop or remove services that are still running, so that idle containers do not keep consuming platform resources.