"We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning.

How ML engineers operationalize machine learning, their workflow stages (data preparation, experimentation, evaluation and deployment, monitoring and response), and the challenges they face in each stage.

arxiv.org

Stage: Data Collection; Experimentation; Evaluation and Deployment; Monitoring and Response
Metadata: Data catalogs, Amundsen, AWS Glue, Hive metastores; Weights & Biases, MLFlow, train/test set parameter configs; A/B test tracking tools; Dashboards, SQL, metric functions and window sizes
Unit: Data cleaning tools; Tensorflow, MLlib, PyTorch, Scikit-learn, X...

Shreya Shankar • "We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning.

Nicolay Gerold

Several engineers also maintained fallback models for reverting to: either older or simpler versions (Lg2, Lg3, Md6, Lg5, Lg6). Lg5 mentioned that it was important to always keep some model up and running, even if they “switched to a less economic model and had to just cut the losses.” Similarly, when doing data science work, both Passi and Jackson...
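A minimal sketch of the fallback idea, assuming models are plain callables tried in priority order. The names `predict_with_fallback`, `primary`, and `simple_baseline` are hypothetical and do not come from the paper; the point is only that some model always answers.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Features = Sequence[Optional[float]]
Model = Callable[[Features], float]


def predict_with_fallback(
    features: Features, models: List[Tuple[str, Model]]
) -> Tuple[str, float]:
    """Try each model in priority order and return (model_name, prediction)."""
    last_error: Optional[Exception] = None
    for name, model in models:
        try:
            return name, model(features)
        except Exception as exc:  # any failure hands over to the next model
            last_error = exc
    raise RuntimeError("all models failed") from last_error


def primary(features: Features) -> float:
    # Hypothetical "main" model: refuses inputs with missing values.
    if any(x is None for x in features):
        raise ValueError("missing feature")
    return sum(features) / len(features)


def simple_baseline(features: Features) -> float:
    # Hypothetical simpler fallback: a constant prediction.
    return 0.0


MODELS = [("primary", primary), ("baseline", simple_baseline)]
```

Returning the model name alongside the prediction makes it possible to count how often the fallback is serving traffic.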

Over time, ML pipelines may turn into “jungles” of rules and models. Sculley et al. [87] introduce the phrase “pipeline jungles” (i.e., different versions of data transformations and models glued together), which was later adopted by participants in our study. While prior work demonstrates their existence and maintenance challenges, we provide in...

business is open. [It might say] “9 am,” but the model doesn’t know that. So if we detect time, then we filter that [reply]. We have a lot of filters.
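The kind of filter described in this quote could look roughly like the sketch below. The regular expression and the function names are assumptions made for illustration; the participant's actual detection logic is not given in the source.

```python
import re
from typing import List

# Hypothetical pattern: clock times such as "9 am", "9:30 pm", or "17:00".
TIME_PATTERN = re.compile(
    r"\b(?:\d{1,2}(?::\d{2})?\s*(?:am|pm)|\d{1,2}:\d{2})\b",
    re.IGNORECASE,
)


def mentions_time(reply: str) -> bool:
    """Return True if the reply appears to state a clock time."""
    return TIME_PATTERN.search(reply) is not None


def filter_replies(candidates: List[str]) -> List[str]:
    """Drop candidate replies that mention a time the model cannot verify."""
    return [reply for reply in candidates if not mentions_time(reply)]
```

A rule like this trades recall for safety: it may discard some harmless replies, but it never lets an unverifiable opening time through.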

engineers continuously monitored features for and predictions from production models (Lg1, Md1, Lg3, Sm3, Md4, Sm6, Md6, Lg5, Lg6): Md1 discussed hard constraints for feature columns (e.g., bounds on values), Lg3 talked about monitoring completeness (i.e., fraction of non-null values) for features, Sm6 mentioned embedding their pipelines with "comm...
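Two of the checks named here, bounds on values and completeness, can be sketched as follows. The function names and the 0.95 completeness threshold are assumptions chosen for the example, not values reported by the participants.

```python
from typing import List, Optional, Sequence


def completeness(values: Sequence[Optional[float]]) -> float:
    """Fraction of non-null values in a feature column."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def bound_violations(
    values: Sequence[Optional[float]], lower: float, upper: float
) -> int:
    """Count non-null values that fall outside [lower, upper]."""
    return sum(1 for v in values if v is not None and not (lower <= v <= upper))


def check_feature(
    name: str,
    values: Sequence[Optional[float]],
    lower: float,
    upper: float,
    min_completeness: float = 0.95,  # assumed threshold, tune per feature
) -> List[str]:
    """Return human-readable alerts for one feature column."""
    alerts = []
    frac = completeness(values)
    if frac < min_completeness:
        alerts.append(f"{name}: completeness {frac:.2f} below {min_completeness:.2f}")
    n_bad = bound_violations(values, lower, upper)
    if n_bad:
        alerts.append(f"{name}: {n_bad} value(s) outside [{lower}, {upper}]")
    return alerts
```

For instance, `check_feature("age", [30, None, 400, 41], 0, 120)` yields one completeness alert and one bounds alert, while a clean column yields an empty list.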

You have this classic issue where most researchers are evaluat[ing] against fixed data sets [. . . but] most industry methods change their datasets. We found that these dynamic validation sets served two purposes: (1) the obvious goal of making sure the validation set stays current with live data as much as possible, given new knowledge about the p...
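One simple way to keep a validation set current with live data is a fixed-size rolling window over the most recently labeled examples. The sketch below illustrates only that first purpose; the class name `RollingValidationSet` and the window policy are assumptions, not the mechanism any participant described.

```python
from collections import deque
from typing import Any, Callable, Deque, Tuple


class RollingValidationSet:
    """Keeps only the most recent `max_size` labeled examples."""

    def __init__(self, max_size: int) -> None:
        # A deque with maxlen silently evicts the oldest example when full.
        self._examples: Deque[Tuple[Any, Any]] = deque(maxlen=max_size)

    def add(self, features: Any, label: Any) -> None:
        """Append a freshly labeled live example."""
        self._examples.append((features, label))

    def accuracy(self, model: Callable[[Any], Any]) -> float:
        """Accuracy of `model` on the current window."""
        if not self._examples:
            raise ValueError("validation set is empty")
        correct = sum(model(x) == y for x, y in self._examples)
        return correct / len(self._examples)

    def __len__(self) -> int:
        return len(self._examples)
```

Because the window moves, scores from different dates are measured on different data and are not directly comparable.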

Amershi et al. [3] state that software teams “flight” changes or updates to ML models, often by testing them on a few cases prior to live deployment. Our work provides further context into the evaluation and deployment process for production ML pipelines: we found that several organizations, particularly those with many customers, employed a multi...

MLEs are happy to delegate experiment tracking and execution work to ML experiment execution frameworks, such as Weights & Biases, but prefer to choose subsequent experiments themselves. To be able to make informed choices of subsequent experiments to run, MLEs must maintain awareness of what they have tried and what they haven’t (Lg2 calls ...

learnings from one experiment into the next, like a guided search to find the best idea (Lg2, Sm4, Lg5). Lg5 described their ideological shift from random search to guided search:

Previously, I tried to do a lot of parallelization. If I focus on one idea, a week at a time, then it boosts my productivity a lot more.

By following a guided search, engineers are, essentially, significantly pruning a large subset of experiment ideas without executing them. While it may seem like there are unlimited computational resources, the search space is much larger, and developer time and energy is limited. At the end of the day, experiments are human-validated and deployed. Mature ML engineers know their personal tradeoff between parallelizing disjoint experiment ideas and pipelining ideas that build on top of each other, ultimately yielding successful deployments.

“I look for features from data scientists, [who have ideas of] things that are correlated with what I’m trying to predict.” We found that organizations explicitly prioritized cross-team collaboration as part of their ML culture. Md3 said: We really think it’s important to bridge that gap between what’s often, you know, a [subject matter expert] in ...

Participants noted that the impact on models was hard to assess when the ground truth involved live data; for example, Sm2 felt strongly about the negative impact of feedback delays on their ML pipelines: I have no idea how well [models] actually perform on live data. Feedback is always delayed by at least 2 weeks. Sometimes we might not have feedba...
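Delayed feedback means live accuracy can only be computed on the predictions whose labels have already arrived. The sketch below separates scored, waiting, and overdue predictions; the function name, the dictionary layout, and the 14-day default (taken loosely from the "at least 2 weeks" remark) are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Any, Dict, Tuple


def summarize_feedback(
    predictions: Dict[str, Tuple[date, Any]],  # id -> (date made, predicted value)
    labels: Dict[str, Any],                    # id -> ground truth, once it arrives
    today: date,
    expected_delay: timedelta = timedelta(days=14),  # assumed feedback delay
) -> Dict[str, Any]:
    """Score labeled predictions and count those still missing feedback."""
    scored = correct = waiting = overdue = 0
    for pred_id, (made_on, predicted) in predictions.items():
        if pred_id in labels:
            scored += 1
            correct += int(labels[pred_id] == predicted)
        elif today - made_on > expected_delay:
            overdue += 1  # feedback should have arrived by now but has not
        else:
            waiting += 1  # still inside the expected delay window
    accuracy = correct / scored if scored else None
    return {"accuracy": accuracy, "scored": scored, "waiting": waiting, "overdue": overdue}
```

Reporting the waiting and overdue counts next to the accuracy shows how much of the live traffic the number actually covers.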
