Use ML to decide whether to pre-generate file previews.
Annual savings in preview generation compute: $1.7M. ML Infra: $9k.
This seems like a great trade-off, although I'd be curious about the $$$ now required to operate the ML infra.
Riviera, our internal system for securely generating file previews, handles the hundreds of file types we support.
Riviera pre-generates and caches preview assets (a process we call pre-warming). The CPU and storage costs of pre-warming are considerable for the volume of files we support.
In general, there is a complexity vs. interpretability tradeoff in ML: more complex models usually make more accurate predictions, at the cost of less interpretability of why certain predictions are made and possibly increased deployment complexity.
The v1 model was a gradient-boosted classifier trained on input features including file extension, the type of Dropbox account the file was stored in, and the most recent 30 days of activity in that account. On an offline holdout set, we found this model could predict previews up to 60 days after the time of pre-warm with >70% accuracy.
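To make the setup concrete, here is a minimal, self-contained sketch of the idea: a gradient-boosted classifier (built from depth-1 decision stumps, pure standard library) over stand-ins for the features described above, gating pre-warm on the predicted preview probability. The feature encoding, label rule, threshold, and all names are invented for illustration; this is not Dropbox's actual model or code.

```python
import random
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_stump(X, residuals):
    """Find the single split (feature, threshold) best fitting the residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left  = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] >  t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]

class GradientBoostedStumps:
    """Gradient boosting with depth-1 trees for binary classification."""
    def __init__(self, n_rounds=25, learning_rate=0.3):
        self.n_rounds, self.lr = n_rounds, learning_rate

    def fit(self, X, y):
        pos = sum(y) / len(y)
        self.f0 = log(pos / (1 - pos))  # log-odds of the base rate
        F = [self.f0] * len(X)
        self.stumps = []
        for _ in range(self.n_rounds):
            # Negative gradient of log-loss: y - predicted probability.
            resid = [yi - sigmoid(fi) for yi, fi in zip(y, F)]
            j, t, lv, rv = fit_stump(X, resid)
            self.stumps.append((j, t, lv, rv))
            F = [fi + self.lr * (lv if row[j] <= t else rv)
                 for row, fi in zip(X, F)]
        return self

    def predict_proba(self, row):
        f = self.f0 + sum(self.lr * (lv if row[j] <= t else rv)
                          for j, t, lv, rv in self.stumps)
        return sigmoid(f)

# Synthetic stand-ins for the described features: an integer id for the file
# extension, an account-type code, and a 30-day activity count. The label rule
# (active accounts get previewed) is invented for this sketch.
random.seed(0)
rows = [(random.randrange(20), random.randrange(3), random.randrange(15))
        for _ in range(400)]
labels = [1 if activity >= 6 else 0 for _, _, activity in rows]

model = GradientBoostedStumps().fit(rows, labels)

# Pre-warm only when the predicted preview probability clears a threshold.
THRESHOLD = 0.5
prewarm = [model.predict_proba(row) > THRESHOLD for row in rows]
accuracy = sum(int(p) == y for p, y in zip(prewarm, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

In production one would reach for a tuned library implementation rather than hand-rolled stumps, but the thresholded `predict_proba` gate captures the core decision: skip pre-warming when the model predicts the preview is unlikely to be requested.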
Cannes is now deployed to almost all Dropbox traffic. As a result, we replaced an estimated $1.7 million in annual pre-warm costs with $9,000 in ML infrastructure per year.