Shipping a PyTorch model into a Flutter app
From ONNX export and ARM64 build flags to isolate-based inference threads: the full path EyeRace Pro took to move its models from cloud to mobile.
EyeRace Pro originally ran all ML inference in the cloud (Flask + GPU instances), on the assumption that users would have a decent connection. The first demo at a pigeon-racing club destroyed that assumption: their lofts are in the mountains, 4G is flaky, and each photo took 10s to upload plus 5s to come back. Unusable.
So we shipped the model to mobile.
Why ONNX instead of TFLite
To get a PyTorch-trained EfficientNet-B0 onto mobile, the common paths are:
- PyTorch Mobile: Lite build, but iOS packaging is awkward and community momentum has faded
- TFLite: Google's flagship, but you go PyTorch → ONNX → TFLite — one extra conversion loss
- ONNX Runtime: PyTorch exports ONNX directly, ONNX Runtime mobile runs it
I went with option 3. Reasons:
- Fewest conversion steps: a single torch.onnx.export call
- Mature runtime: Microsoft maintains it; shape inference and quantization are first-class
- Cross-platform: the same .onnx file serves iOS and Android, no dual maintenance
Export
import torch
from model import EyeClassifier

m = EyeClassifier()
m.load_state_dict(torch.load("checkpoints/best.pt"))
m.eval()  # switch to inference mode

# Run a dummy forward at the actual input shape to trace the graph
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    m,
    dummy,
    "eye_classifier.onnx",
    input_names=["pixel_values"],
    output_names=["scores"],
    dynamic_axes={
        "pixel_values": {0: "batch"},
        "scores": {0: "batch"},
    },
    opset_version=17,  # supported by ONNX Runtime mobile 1.16+
)
Flutter integration
There's no first-party Dart binding for ONNX Runtime, but the community package onnxruntime_flutter works well. Three landmines:
Landmine 1: ARM64 vs x86_64
The iOS Simulator builds for x86_64 on Intel Macs but for arm64 on Apple Silicon, which the prebuilt ONNX Runtime pod doesn't handle. Make the workaround explicit in pubspec / Podfile:
dependencies:
onnxruntime_flutter: ^1.16.0
# Add to Podfile post_install:
# config.build_settings['EXCLUDED_ARCHS[sdk=iphonesimulator*]'] = 'arm64'
# Without this, Apple Silicon simulators won't launch.
Landmine 2: Isolate inference
ONNX Runtime inference is CPU-bound. Running on the Flutter main thread freezes the UI. Push it to an isolate:
Future<Map<String, double>> inferEye(Uint8List imageBytes) async {
  // compute() spawns a short-lived isolate. Isolates don't share memory,
  // so pass the cached model bytes along instead of reading them from a global.
  return compute(_runInference, [_modelBytes, imageBytes]);
}

Map<String, double> _runInference(List<Uint8List> args) {
  final session = OrtSession.fromBytes(args[0]); // model bytes
  final input = _preprocess(args[1]);            // resize 224x224 + normalize
  final outputs = session.run({'pixel_values': input});
  return _decodeScores(outputs['scores']);
}
Landmine 3: Model size & quantization
The raw EfficientNet-B0 is ~50MB. The App Store's "huge app" cellular download threshold is 200MB, but UX-wise it's still too big.
I did int8 quantization:
import onnxruntime.quantization as q

q.quantize_dynamic(
    "eye_classifier.onnx",
    "eye_classifier_int8.onnx",
    weight_type=q.QuantType.QInt8,
)
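Before trusting the int8 file, a quick look at its size and its output drift against the fp32 model catches broken quantization early. A rough sketch on a single random input, purely illustrative; the F1 numbers below of course require re-running a proper validation set:

import os
import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
fp32 = ort.InferenceSession("eye_classifier.onnx")
int8 = ort.InferenceSession("eye_classifier_int8.onnx")
scores_fp32 = fp32.run(["scores"], {"pixel_values": x})[0]
scores_int8 = int8.run(["scores"], {"pixel_values": x})[0]

print(f"fp32: {os.path.getsize('eye_classifier.onnx') / 1e6:.1f} MB")
print(f"int8: {os.path.getsize('eye_classifier_int8.onnx') / 1e6:.1f} MB")
print("max output diff:", np.abs(scores_fp32 - scores_int8).max())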
Result:
- Size: 50MB → 12MB
- Inference: 380ms → 220ms (M1 iPad)
- Accuracy: F1 0.917 → 0.875
A drop of over 4 F1 points is unacceptable for pigeon grading. We landed on a hybrid: cloud fp32 when online, on-device int8 as the fallback.
Lessons
GIGO matters more than model tuning. EyeRace's "smart capture engine" (a pre-shot stability check plus a post-shot quality check) improved accuracy roughly twice as much as any backend model optimization did.
If you're shipping cloud ML to mobile, the order should be:
- Get the flow working end-to-end (50MB model? fine for now)
- Profile what's actually slow (preprocessing? inference? postprocessing?); a rough timing sketch follows this list
- Then try quantization / pruning
- Always keep a cloud fallback
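For the profiling step, a desktop-side breakdown is often enough to tell whether preprocessing or the model itself dominates; on-device, the same idea is a Stopwatch around _preprocess and session.run in the Dart code. A sketch, with a hypothetical sample image path:

import time
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("eye_classifier_int8.onnx")

t0 = time.perf_counter()
img = Image.open("sample_eye.jpg").convert("RGB").resize((224, 224))  # hypothetical sample
x = (np.asarray(img, dtype=np.float32) / 255.0).transpose(2, 0, 1)[None]
t1 = time.perf_counter()
sess.run(["scores"], {"pixel_values": x})
t2 = time.perf_counter()

print(f"preprocess: {(t1 - t0) * 1000:.0f} ms, inference: {(t2 - t1) * 1000:.0f} ms")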
Don't optimize for size or speed up front — you'll lose the ability to iterate on the model.