You're not wrong, though I think the prominence of pedagogical and throw-away situations in your work may not be a universal experience.
For me, at least, pedagogic and throw-away situations aren't a tiny subset. They're most of what I do. It's exploratory work, figuring out how the data behaves, if the data behaves, where it needs to be cleaned, churning through great heaps of experiments and iterations before hitting on the ultimate plan, and putting together a presentation to help explain what I finally settled on to colleagues and stakeholders.
Only after sinking a whole lot of sweat into that process do I go on to start building anything that we intend to keep. At which point, forget Jupyter notebooks, I'm typically not even working in Python anymore for that part of the job.
> At which point, forget Jupyter notebooks, I'm typically not even working in Python anymore for that part of the job.
This is the typical practice, but I'd suggest it breaks the feedback loop between the scientist roles and the developer roles. In rapidly changing environments those feedback loops can be crucial.
It's similar to what Wall Street folks did (still do?): quants write models in Excel/VBA and hand them over to developers who rewrite them in Java for production. There's a natural impedance mismatch, and the back-and-forth is difficult.
I think a better approach would be for data scientists to write somewhat production-ready code, send it to prod (with the help of devs), get feedback from the production environment as well as get a sense of what tricks are needed for prod, and then iterate on that code. It also helps to remove the insulation between data scientists and the real world.
Well, emphasis on the pronouns there. I'm not doing the proverbial "throw it over the wall to engineering", I'm also writing the production version. I also dislike the "2 teams" approach. Even if you have separate roles for data scientists and software engineers, better to mix them onto a single team than force them to communicate across a partition.
For me it's really down to efficiency. Writing somewhat production-ready code is more expensive and time-consuming than blithely hacking. In the early stages of a new project, I know that almost everything I'm doing will get thrown away. For the most interesting projects, there's even a decent chance that it will be a complete failure and everything gets thrown away. So, at that stage in the game, I'm inclined to say that any extra effort spent on production readiness is just a waste of time and money. Fail fast, YAGNI, etc.
Oh, I understand; I'm one of the few people on my team who does devops + data engineering + data science (some people on my team do one or two of those, but not all three). My point is more about the impedance mismatch between roles and the code each role produces, whether or not those roles are carried out by the same person. For instance, I find it difficult to iterate between my own model code and production code, especially when the model code was conceived in an interactive notebook environment.
I do agree that notebooks are good for writing throwaway code, but out of n failed notebooks, there's typically one we'd like to bring to production. That's the one notebook we'd want to be production-ready.
When I say production-readiness, I don't mean writing actual production boilerplate in the first iteration (maybe in later iterations...). I mean writing the code in a way that lends itself to easy productionization by observing certain constraints: being cognizant of environment/scoping/global-state/namespace conflicts, writing model code in modular units (functions or classes, depending on the use case) rather than purely imperative line-by-line code, etc. These tiny disciplines are almost effortless but can lower the friction of iterating between model and production.
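To make that concrete, here's a minimal sketch of the contrast (column names and the cleaning logic are invented for illustration, not from any real pipeline):

```python
# Typical imperative notebook cell: mutates a global df, hard to lift into prod.
#   df = pd.read_csv("data.csv")
#   df = df[df["amount"] > 0]
#   df["log_amount"] = np.log(df["amount"])
#   model.fit(df[["log_amount"]], df["label"])

# The same logic as modular, importable, testable units:
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-positive amounts and add a log-transformed feature."""
    out = df[df["amount"] > 0].copy()
    out["log_amount"] = np.log(out["amount"])
    return out

def make_training_data(df: pd.DataFrame):
    """Split cleaned data into a feature frame and a label series."""
    cleaned = clean(df)
    return cleaned[["log_amount"]], cleaned["label"]
```

Nothing here is production boilerplate; it's the same notebook logic, but a deploy script (or a unit test) can now import `clean` and `make_training_data` instead of re-running cells in order.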
In data science work, the real proof of the pudding is in production, not in unit tests. Most people don't want to admit this, but unit testing doesn't work as well in the mathematical-modeling world as it does in the software-development world: much of the time our inputs aren't discrete or enumerable, and the state space is large or infinite. So it's really important to be able to iterate between production and modeling. If I ever need to go back to my interactive environment to experiment and change the logic, there should be an easy path to flow that back into production. Right now notebook environments don't aid in that; IDE environments, I've observed, do.
> “It's exploratory work, figuring out how the data behaves, if the data behaves, where it needs to be cleaned, churning through great heaps of experiments and iterations before hitting on the ultimate plan”
The thing is though, you should be involving code reviewers even at this stage, to review both the statistical methodology you intend for your experiments, and also the source code you believe implements that methodology. (Even when working alone, but absolutely when part of a team).
Instead of seeing the notebook as a big series of scratch-pad attempts to get something right, you should be using pull requests and code review as that scratch pad.
Additionally, the functions, classes and modules you create to do the work of exploring data fidelity, cleaning pre-treatments, or parameter sweeps through sets of experiment-specific parameter bundles — all that should be written like proper, testable, well-designed code, that lives in separate libraries or packages to facilitate re-using it without reinventing the wheel or copy/pasting from some old notebook, etc.
By that point, the notebook you’d use to explore data behavior or to invoke distributed training across a bunch of parameter values would be a tiny notebook that just imports everything it needs from properly maintained helper libraries you wrote.
And the value of the notebook over the same code just living in an easy-to-review script starts to be extremely questionable.
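As a sketch of that end state (the module name `analysis_lib` and both helper functions are hypothetical, invented for illustration): the sweep machinery lives in a maintained library, and the "notebook" collapses to a couple of calls.

```python
# analysis_lib.py -- proper, tested helper code the notebook merely imports.
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

def param_grid(**axes: Iterable) -> List[Dict]:
    """Expand keyword axes into a list of parameter bundles (cartesian product)."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*axes.values())]

def sweep(run: Callable[..., float], params: List[Dict]) -> List[Tuple[Dict, float]]:
    """Run one experiment per parameter bundle and collect (params, score) pairs."""
    return [(p, run(**p)) for p in params]

# The entire remaining "notebook":
#   from analysis_lib import param_grid, sweep
grid = param_grid(lr=[0.1, 0.01], depth=[2, 3])
results = sweep(lambda lr, depth: lr * depth, grid)  # stand-in for a real experiment
best = max(results, key=lambda r: r[1])
```

At that point the three-line driver could just as easily be a reviewed script, which is the argument above: the notebook itself no longer earns its keep.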