by Emma DAMITIO,
AI Manager - Data Scientist // Solution BI Canada
1 - What is data science?
Data science is a field that combines scientific methods and computer programming with a single objective: to increase knowledge and create added value. This can be done using the goldmine that companies possess: their data. A company holds many distinct types of data; it can be structured or unstructured, internal or external. Examples include sales by product, employees' salaries, or occupancy costs.
Data science uses different methods (see diagram below).
To create added value or business value, one needs to interpret and take advantage of the data’s full potential and that is where the role of the data scientist comes into play.
There is no single, unanimous definition of data science: each case, project, or practice leads to different uses, techniques, and tools. Limiting data science to one definition is always complicated and risky.
One thing is certain: when you talk about data science, one of the first steps is popularization and outreach. You are not oversimplifying the explanation; you are making the overarching principles of data science more accessible to the layperson. The democratization of this field allows companies to gain a competitive edge, optimize their processes, and improve their products or services.
Put simply, data science is not about tools but about good practices and simplicity. This is where it all begins and where we get to the heart of our business: creating business value from data for optimal decision-making.
2 - Who is data science for?
If we generalize about the present state of affairs, there are 3 levels of companies depending on their maturity and their use of data:
The “advanced”
A data-driven roadmap is established; the needs are known, and projects move forward to create added value.
Objectives: defining the right use cases so that the created value is genuinely useful to different businesses.
Challenges: standardizing the level of use and understanding of data in all spheres of the business.
The “seekers”
They have some knowledge of the field and what it can bring to the company, but the terms and methods remain fuzzy and vague, too often imagined as something reserved for the largest companies.
Objectives: defining the first projects that will be quite simple cases and that will bring added value quickly to demonstrate the benefits to management.
Challenges: not skipping any steps. Projects must be built by iteration so as not to miss the goal: developing a tool or solution with real business value.
The “latecomers”
Those who have not yet put data at the heart of their priorities.
Objectives: mobilizing stakeholders to show the importance of centralizing data and checking its quality in order to take advantage of this goldmine.
Challenges: not going too fast and being patient, because this is a long-term undertaking; adequate infrastructure and data analysis are the first steps of this large project. To illustrate the practice: "When building a house, we need to start by building solid foundations before we can put a roof on it."
Examples of uses accessible to any company whose stored data is of decent quality:
3 - What is the process of data science?
There are various stages, and none of them should be played down. All of this must be accompanied by substantial education so that contributors and users understand it; this is essential for the solution to be both useful and usable.
Understanding business issues:
Quantify data quality, identify how the business uses its data, and define the best "use cases" to bring business value. Once the project has been well defined, including its needs, solutions, and delivery, implementation can begin.
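Quantifying data quality can start very simply. As a minimal sketch (the dataset and column names here are hypothetical examples), a per-column report of completeness and cardinality with pandas already gives a first measurable picture:

```python
# Minimal data-quality report with pandas.
# The "sales" dataset and its columns are illustrative, not real company data.
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column type, missing-value rate, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3) * 100,
        "unique_values": df.nunique(),
    })

sales = pd.DataFrame({
    "product": ["A", "B", "A", None],
    "amount": [120.0, 85.5, None, 40.0],
})
print(data_quality_report(sales))
```

Such a report is easy to rerun on every data source, which makes it a natural first deliverable when discussing data quality with the business.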
Implementation:
In 4 large steps, the project moves towards its objective.
Working in iterations here ensures that the first delivery arrives within a reasonable time, verifying that the "end-to-end" chain is valid and that the delivery meets expectations. The solution is then enhanced to increase performance.
1. Exploratory data analysis: 50% of the implementation work - understand the data - clean it to make it more relevant - transform it - build the working dataset - analyze correlations - etc.
2. Selecting and training models: 20% of the work - make the best technical choice for the iteration - find the best balance for the user
3. Validating the models: 5% of the work - presentation and popularization for maximum user uptake
4. Deployment to production: 25% of the work - ensure proper delivery and continued availability to users - deploy an MLOps process to ensure the solution's durability
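The four steps above can be sketched end to end in a few lines. This is only an illustration on a synthetic, linearly separable dataset with scikit-learn (all names and data are invented), not a production pipeline:

```python
# End-to-end sketch of the four steps on synthetic data (illustrative only).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Exploratory analysis: build the working dataset and inspect correlations.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)
print(df.corr())

# 2. Selecting and training a model on a train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["target"], random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# 3. Validating the model on held-out data before showing it to users.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")

# 4. Deployment: persist the trained model for serving
# (e.g. with joblib.dump(model, "model.joblib")) and monitor it over time.
```

Even this toy version respects the iteration principle: a first end-to-end chain is delivered quickly, then each step is enriched in later iterations.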
Suggested reading:
Artificial Intelligence, where are we really?
Data glossary: Let’s clear things up!