Aptude has served as an expert data consultant to some of the most well-known companies in the world, including some we can’t name. Our clients span nearly every industry and ask us to help them with a variety of projects. While we can do almost anything, some of our best work involves data science, especially data science with Python.

Yet we will be the first to admit that many companies do not need a data scientist, let alone a whole team. Not yet.

In this article, we’ll discuss why we believe that your first data hire (whether internal or external) should not be a data scientist. We’ll also give you questions to ask to determine your readiness for data science projects.

You Might Not Need a Data Scientist. Here’s Why.

As we discussed in another blog post comparing Data Engineering, Data Analysis, and Data Science, advanced data science such as machine learning requires large amounts of data – “big data” – stored in data warehouses where it can be consumed easily.

To do this, the data must not just be available. It must be cleaned, structured, and put into pipelines that your analytics teams can access. This involves a lot of work, because just owning the historical data isn’t enough. The data sets must make sense and relate to one another in a way that’s usable.

For example, let’s say you have first name fields in different databases. In one database used by your sales team, First Name is known as “First_Name” and can contain up to 50 alphabetic characters. In another database, say the one used by marketing, First Name is known as FIRSTNAME and can contain 75 alphanumeric characters plus special characters such as hyphens and apostrophes. It should be clear that these two fields are not aligned. What happens when you want to pull first name data out of these two disparate sources?
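To make the mismatch concrete, here is a minimal Python sketch of the first name example. The two validation rules simply restate the field definitions described above; the shared column name `first_name`, the helper functions, and the sample names are our own illustrative assumptions, not taken from any real system.

```python
import re

# Hypothetical rules restating the example: the sales field allows up to
# 50 alphabetic characters; the marketing field allows up to 75
# alphanumeric characters plus hyphens and apostrophes.
SALES_RULE = re.compile(r"[A-Za-z]{1,50}")
MARKETING_RULE = re.compile(r"[A-Za-z0-9'\-]{1,75}")

# Map each source's column name onto one shared (assumed) name.
COLUMN_ALIASES = {"First_Name": "first_name", "FIRSTNAME": "first_name"}


def normalize_record(record):
    """Rename known first-name columns to the shared name."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in record.items()}


def fits_both_sources(first_name):
    """Return True only if the value is valid under both field definitions."""
    return bool(SALES_RULE.fullmatch(first_name)) and bool(
        MARKETING_RULE.fullmatch(first_name)
    )


# Made-up sample rows, one from each source.
sales_row = normalize_record({"First_Name": "Mary"})
marketing_row = normalize_record({"FIRSTNAME": "Mary-Anne"})

for row in (sales_row, marketing_row):
    name = row["first_name"]
    print(name, "fits both" if fits_both_sources(name) else "needs review")
```

In this sketch, “Mary” passes both rules, while “Mary-Anne” is valid for marketing but would be flagged for review because the sales field does not accept hyphens. A real cleanup effort would have many more fields and rules than this.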

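It’s not pretty, especially when you consider that different database systems can handle the same calculation differently. Dividing one integer by another, for instance, may return a truncated whole number in one system and a decimal in another. Two and two will still be four, but five divided by two may not always be 2.5!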
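As a rough illustration of how the same calculation can give different answers under different rules, the Python sketch below compares integer division with decimal division, and two common rounding modes. It uses only Python’s standard library and makes no claim about how any particular database behaves; the function names are our own.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def integer_divide(a, b):
    """Divide and discard the remainder, as integer arithmetic does."""
    return a // b


def decimal_divide(a, b):
    """Divide and keep the fractional part."""
    return Decimal(a) / Decimal(b)


def round_half_up(value):
    """Round a decimal string to a whole number; ties go away from zero."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def round_half_even(value):
    """Round a decimal string to a whole number; ties go to the even neighbor."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


print(integer_divide(5, 2))    # 2
print(decimal_divide(5, 2))    # 2.5
print(round_half_up("2.5"))    # 3
print(round_half_even("2.5"))  # 2
```

If two of your sources apply different rules like these to the same figures, totals that should match will drift apart, which is one reason alignment work has to come before analysis.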

For most companies, the very first step should really involve data engineering and cleanup, rather than trying to engage in cutting-edge predictions.

There’s an even greater reason for this: advanced technologies like machine learning require a large amount of data to even work. If you don’t have enough clean data, the project will fail no matter how experienced and talented your data scientists are.

Which is why we say that you might not need a data scientist – yet.

Questions to Ask Before You Launch a Data Science Project.

While at Aptude we love working on projects involving complex algorithms solving highly challenging and ROI-driven use cases, we know that it’s not always feasible to start with machine learning.

Here are some questions to ask your team:

  • How siloed is our data?
  • How clean is our data?
  • Do we have a large enough data set for the initiative?
  • Do we have a clear use case?
  • Which parts of the project can our internal team handle now?
  • What kind of ROI are we looking for?
  • Do we know which area needs attention more urgently than others?
  • Do we really just need visualizations first before we try ML?
  • Which questions do we want to answer… and which do we need to answer?
  • Which tools would we like to use for this? (Oracle, Hadoop, Python, SQL Server, Power BI, Tableau)

Answering these questions should shed light on the skeletons in your organization’s closet when it comes to data collection, management, and quality assurance. You should also come away with a good sense of which direction you need to head to make progress… even if that direction is getting your stakeholders and decision-makers in a room more often to talk about your organization’s data silos.

If Not Data Science, Then What?

For most data projects, you can’t go wrong with a thorough data audit to determine:

  • All of the sources of historical data in your organization
  • The location and management of each of these data sources
  • The quality and completeness of the data in each of these sources
  • The alignment of data between these sources
  • The questions you can answer with your current data
  • The questions you can’t answer with your current data
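Parts of an audit like this can be scripted. The Python sketch below shows one possible starting point for the completeness and alignment checks: it measures how often each field is filled in, and lists the fields that appear in one source but not the other. The function names and sample rows are hypothetical, and a real audit would cover far more than this.

```python
def completeness(rows):
    """Return the share of non-empty values per field across a list of dict rows."""
    fields = sorted({key for row in rows for key in row})
    total = len(rows)
    report = {}
    for field in fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


def unaligned_fields(source_a, source_b):
    """Return field names that appear in only one of the two sources."""
    fields_a = {key for row in source_a for key in row}
    fields_b = {key for row in source_b for key in row}
    return sorted(fields_a ^ fields_b)


# Made-up sample rows standing in for two departmental databases.
sales = [
    {"First_Name": "Mary", "email": "mary@example.com"},
    {"First_Name": "", "email": "lee@example.com"},
]
marketing = [
    {"FIRSTNAME": "Mary-Anne", "email": None},
    {"FIRSTNAME": "Lee", "email": "lee@example.com"},
]

print(completeness(sales))
print(completeness(marketing))
print(unaligned_fields(sales, marketing))
```

Even a simple report like this tends to surface the naming and quality gaps that need fixing before any machine learning work can begin.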

If you’d like expert guidance in figuring out where to start and what you need in terms of data, manpower, tools, and budget, we can help. Many of our projects involve data-related initiatives, especially since we now have a Python Center of Excellence in Mexico City, Mexico. Getting our help is as easy as contacting us via email, form, or phone.