Can you think of another term for ‘data science project’?
Go on. Get a thesaurus out!
My favourite is ‘data science adventure’. Yes, I also struggle to imagine C-Suite executives discussing their new ‘data science adventures’ around a conference table.
But here’s the thing: ‘adventure’ does a much better job of summarising what actually happens in real life than ‘project’ does.
And I bet I can convince you to agree with me…
Projects essentially have only two outcomes.
My main issue with the term ‘project’ is that there are assumptions attached to it already. And these assumptions date much further back in the business world than data science.
So if we stick to the basics, projects can be seen to have two possible outcomes:
- The project succeeds.
- The project fails.
(It’s possible the project managers out there will not be too happy with my simplification above… Sorry project managers!)
But where data science is concerned, the task is often to make accurate predictions about something (given some data, of course). So if we map the typical data science outcomes onto those assumed project outcomes, we get:
- Can make accurate predictions about something — the project succeeds!
- Cannot make accurate predictions about something — the project fails!
Do you see the issue here? People may not even realise this association is happening. And if they do, they might not realise the detrimental effects it could be causing.
For example, this kind of subconscious thought process could lead to a lot of pressure on the data scientist to get state-of-the-art results in all tasks they do. Nobody wants their work to be associated with failure. So for data scientists, either you succeed with good model performance, or you fail with bad performance.
How completely wrong!
It’s no wonder that we sometimes hear horror stories of data scientists massaging data to fit the model, or abandoning best practices just to get good test set performance. They probably feel pressured into doing so, and the name ‘project’ does nothing to mitigate that pressure!
And even more often, we hear about models that hit 99% test accuracy, only to perform dreadfully once put into production. Could it be that the data scientist was under undue pressure to make their ‘project succeed’? And is it possible that they did what they felt they had to do, just so they could report a good test score before passing it on to others to put into production?
Projects should have achievable objectives.
Expanding a little on the above, projects should be set with achievable and realistic objectives.
For example, take the construction industry — a classic project-based industry. A construction project’s objective is to build something that, following rigorous planning and preparation, the company is 100% certain is buildable.
No construction company says: “Our project objective is to construct a floating building. We don’t know if this is possible, but we’ll find out”.
But this is exactly what a data science ‘project’ is! Nobody knows whether we can make accurate predictions until we try! And the fact of the matter is that sometimes accurate predictions are impossible to make at a particular time. It could be because of a lack of data, limited computational resources, too complex a problem, or simply not enough technical expertise.
Whatever the reason though, that stuff is near-impossible to plan. Sometimes our data scientists could be going head-first into a completely unsolvable problem (given the resources and constraints at that time) without knowing it.
Yet still we name it a ‘project’, as if we are certain getting accurate predictions is an achievable outcome? Hmm. Makes you wonder.
If we knew that accurate predictions were perfectly attainable, as long as we follow a specific set of steps — well, we wouldn’t really need data scientists so much, would we? And nor would timescales be such a contentious topic in data science. Which leads me on to my next point…
Projects should have predictable timescales.
A huge aspect of any project is planning how long it will take. Whether you use Agile or Waterfall, you can reasonably estimate how long each task, and consequently the whole project, will take.
That’s because the end result of a task is normally quite clear. Tasks are also often uniform: the person carrying out a task has done similar ones before, and can therefore give accurate timescale estimates.
So how do you estimate the length of time it will take to conduct exploratory data analysis (EDA)? What does the end-result of EDA look like?
How can you estimate how long it will take to reach, say, 90% recall with a model? Setting aside whether that is even possible for a particular problem, there’s no way of knowing how long it will take.
Instead, what happens in data science is that the timescales of individual tasks, and of the whole project, are essentially set arbitrarily. These aren’t really timescales at all; they are time limits and deadlines. “We’ll allocate 1 month for EDA, and do as much as we can in that time”.
And this is precisely why data science ‘projects’ are so often delivered late. It is also the exact reason why, at the end of every data science ‘project’, it always feels like we’re left thinking “Oh if only we had more time, we would have done x, y, z…”
Even with the best data science knowledge in the world, it is near-impossible to accurately estimate how long it will take to achieve your objective. You can suggest how much time would be reasonable to try, but because you don’t know what the end result will look like, you can’t be as specific as you can in other types of ‘project’.
The real outcomes of data science.
As you may have guessed by now, my main problem with attaching the term ‘project’ to data science comes down to the journey and the outcomes.
On the one hand, ‘projects’ were developed in business to improve predictability and gain a firm grasp over the entire timeline of events — down to every single task if need be. This is coherent and logical, and it fits in well with a lot of processes in the world of business. Basically, we want to know what we’ll do during the project, how long those things will take, and what we can and will end up with when it finishes.
Data science, on the other hand, is about discovery; it is about finding truth. We don’t know what we will find in the course of conducting good data science, nor how long it will take, nor what we can and will end up with when the journey is complete.
The outcome of data science, therefore, is neither success nor failure, as it is for a project. It is truth, and just truth alone.
If the discovery of your data science ‘adventure’ (yes, I’m trying to coin the term) is that something is completely unpredictable, that’s not failure — it’s success! Because this is an answer in itself: the answer may be that you require more data; or more computing power; or your staff need more training. Maybe even all three!
The point is, you have your answer. It may not be what everyone was hoping for, but it’s an answer nonetheless.
So, long story short…
‘Project’ is not an effective description of what actually happens in data science. It doesn’t reflect the journey a data scientist takes, how long that journey will take, nor the possible outcomes of data science.
Maybe if we start changing the way we name things, we can also start to address the fundamental dilemmas in data science.
Maybe, just maybe… We should all immediately start using ‘data science adventure’ instead!
By Edward Sims | The man who coined the term ‘Data Science Adventures’