ENGINEERING · Published April 22, 2026

Shipping a PyTorch model into a Flutter app

From ONNX export and ARM64 build flags to isolate-based inference: the full path EyeRace Pro took to move its models from the cloud onto the device.

3 min read

EyeRace Pro originally ran all ML inference in the cloud — Flask + GPU instances. The first demo at a pigeon-racing club destroyed that assumption: their lofts are in the mountains, 4G is flaky, and each photo took 10s to upload + 5s to come back. Unusable.

So we shipped the model to mobile.

Why ONNX instead of TFLite

To get a PyTorch-trained EfficientNet-B0 onto mobile, the common paths are:

  1. PyTorch Mobile: Lite build, but iOS packaging is awkward and community momentum has faded
  2. TFLite: Google's flagship, but the path is PyTorch → ONNX → TFLite, one extra and potentially lossy conversion step
  3. ONNX Runtime: PyTorch exports ONNX directly, ONNX Runtime mobile runs it

I went with option 3. Reasons:

  • Fewest conversion steps: a single torch.onnx.export
  • Mature runtime: Microsoft maintains it; shape inference and quantization are first-class
  • Cross-platform: same .onnx file for iOS and Android — no dual maintenance

Export

export_to_onnx.py
import torch
from model import EyeClassifier
 
m = EyeClassifier()
m.load_state_dict(torch.load("checkpoints/best.pt"))
m.eval()  # switch to inference mode
 
# Run a dummy forward at the actual input shape to trace
dummy = torch.randn(1, 3, 224, 224)
 
torch.onnx.export(
    m,
    dummy,
    "eye_classifier.onnx",
    input_names=["pixel_values"],
    output_names=["scores"],
    dynamic_axes={
        "pixel_values": {0: "batch"},
        "scores": {0: "batch"},
    },
    opset_version=17,  # supported by ONNX Runtime mobile 1.16+
)
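
Before touching Flutter, it's worth confirming the exported graph actually matches the PyTorch model. A minimal sanity check with the onnxruntime Python package (the script name, dummy input, and tolerances are illustrative):

verify_export.py
import numpy as np
import onnxruntime as ort
import torch

from model import EyeClassifier

m = EyeClassifier()
m.load_state_dict(torch.load("checkpoints/best.pt"))
m.eval()

# Same dummy shape that was used for the export
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    torch_out = m(x).numpy()

sess = ort.InferenceSession("eye_classifier.onnx")
onnx_out = sess.run(["scores"], {"pixel_values": x.numpy()})[0]

# Small numerical drift is expected; anything larger usually means an export problem.
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("ONNX output matches PyTorch within tolerance")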

Flutter integration

There's no first-party Dart binding for ONNX Runtime, but the community package onnxruntime_flutter works well. Three landmines:

Landmine 1: ARM64 vs x86_64

The iOS Simulator is x86_64 on Intel Macs but arm64 on Apple Silicon Macs. If the plugin only ships x86_64 simulator binaries, simulator builds on Apple Silicon won't launch, so exclude arm64 for the simulator SDK explicitly (the Podfile hook below assumes the stock Flutter-generated Podfile):

pubspec.yaml
dependencies:
  onnxruntime_flutter: ^1.16.0

ios/Podfile
post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_ios_build_settings(target)
    target.build_configurations.each do |config|
      # Without this, Apple Silicon simulators won't launch.
      config.build_settings['EXCLUDED_ARCHS[sdk=iphonesimulator*]'] = 'arm64'
    end
  end
end

Landmine 2: Isolate inference

ONNX Runtime inference is CPU-bound. Running on the Flutter main thread freezes the UI. Push it to an isolate:

lib/inference/eye_inference.dart
Future<Map<String, double>> inferEye(Uint8List imageBytes) async {
  // compute() auto-spawns an isolate
  return compute(_runInference, imageBytes);
}
 
// Must be top-level (or static) so compute() can run it in another isolate.
Map<String, double> _runInference(Uint8List bytes) {
  // Isolates don't share globals with the main isolate, so _modelBytes must be
  // initialized here (or passed in alongside the image bytes).
  final session = OrtSession.fromBytes(_modelBytes);
  final input = _preprocess(bytes); // resize 224x224 + normalize
  final outputs = session.run({'pixel_values': input});
  return _decodeScores(outputs['scores']);
}

Landmine 3: Model size & quantization

The raw EfficientNet-B0 export is ~50MB. That's well under the App Store's 200MB "huge app" over-cellular threshold, but UX-wise it's still too big for a first install.

I applied dynamic int8 quantization:

import onnxruntime.quantization as q
 
q.quantize_dynamic(
    "eye_classifier.onnx",
    "eye_classifier_int8.onnx",
    weight_type=q.QuantType.QInt8,
)
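
A random-input smoke test says nothing about accuracy, but it's a quick way to confirm the int8 file loads and stays numerically close to fp32; a minimal sketch with the onnxruntime Python API:

import os

import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

fp32 = ort.InferenceSession("eye_classifier.onnx")
int8 = ort.InferenceSession("eye_classifier_int8.onnx")

out_fp32 = fp32.run(["scores"], {"pixel_values": x})[0]
out_int8 = int8.run(["scores"], {"pixel_values": x})[0]

# File size and worst-case output drift on one random input
print("fp32:", os.path.getsize("eye_classifier.onnx") // 2**20, "MB")
print("int8:", os.path.getsize("eye_classifier_int8.onnx") // 2**20, "MB")
print("max abs diff:", np.abs(out_fp32 - out_int8).max())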

Result:

  • Size: 50MB → 12MB
  • Inference: 380ms → 220ms (M1 iPad)
  • Accuracy: F1 0.917 → 0.875

A four-point F1 drop is unacceptable for pigeon grading. We landed on a hybrid: cloud fp32 when online, on-device int8 as the offline fallback.

Lessons

GIGO (garbage in, garbage out) matters more than model tuning. EyeRace's "smart capture engine" (a pre-shot stability check plus a post-shot quality check) improved accuracy roughly twice as much as backend model optimization did.
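
The production checks live in the Flutter capture flow; purely as an illustration of the idea (not EyeRace's actual implementation), a post-shot sharpness gate can be as simple as a Laplacian-variance threshold:

import cv2

def sharp_enough(image_path: str, threshold: float = 100.0) -> bool:
    # Variance of the Laplacian is a cheap focus measure:
    # low variance means few edges, which usually means blur.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= threshold

Anything that fails the gate gets re-captured instead of being sent to the model.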

If you're shipping cloud ML to mobile, the order should be:

  1. Get the flow working end-to-end (50MB model? fine for now)
  2. Profile what's actually slow (preprocessing? inference? postprocessing?)
  3. Then try quantization / pruning
  4. Always keep a cloud fallback

Don't optimize for size or speed up front — you'll lose the ability to iterate on the model.
