To Data or Not to Data.



The question is no longer whether we can solve a problem with AI, but to what extent it returns sustainable and reliable results. Good craftsmanship, governance, ethics, and education on AI are what we need now.

Since I was a kid, I have been intrigued by new technology, and today we live in a world where smart, connected devices have become part of our daily lives. If you have an engineering or other technical background, this might be the greatest time to be around. However, there is a catch to this new, smart technology, and it is slowly but surely becoming more visible: a device whose physical parts work perfectly fine can turn into a brick because of a missing internet connection or a bad software update. A key reason is that the logic is no longer necessarily hard-coded on the device itself but lives in a model that is stored and loaded externally. This logic is becoming data-driven and is critical in the transition towards smarter devices, as it should adapt better to our personal preferences. In this blog, I will argue that not every solution requires data and that data does not need to be incorporated into every solution, but that there are solutions where we genuinely do need data.


The Fight For Data Collection with The Rise Of Smart Devices.

I feel very lucky to have been around when a color television with a remote control was the most advanced device in many homes, and it stayed that way for many years. Technological advancement moves at such a high pace today that you can read about a breakthrough in one field or another every single day. Great innovations have led to mass production and lower prices, making the technology affordable for the average consumer, from smartwatches monitoring our health to smart assistants managing our homes. Smart devices were first for the rich but are now becoming cheaper than old-fashioned, traditional devices.

However, the new generation of smart devices has a major drawback: they do not guarantee the luxury of long-term usage because they can depend on incremental software updates. The device's logic is increasingly stored on external servers, which makes it easier and cheaper for manufacturers to ship. This is certainly not always the case, but more and more examples are seen, from doorbells and house lights to washing machines that no longer function without an active internet connection. Meanwhile, device support has shrunk to merely a few years, as security and software updates no longer keep up with the new hardware.

Analog devices will become the high-end products due to their exclusivity, privacy, offline functionality, good craftsmanship, and longevity.

Another reason why manufacturers can likely make devices cheaper today is that we are willing to share our behavior (and thus data) on how we interact with that particular device. In return, we get the functionality to control the device from any location in the world.

Data collection seems to be becoming a key component of many devices, but do everyday devices need to be 'smart'? We have become the data product.

Photo by Jakub Żerdzicki on Unsplash

The Importance Of Data is Again Marked by AI.

Last year was the year of generative AI models, from large language models to imaging and audio models. This may have triggered numerous companies to start collecting data, or to make sure what is theirs remains theirs. The applications that can be built and the processes that can be optimized open up a new set of possibilities. A few remarkable innovations come, for example, from Amazon, Microsoft, Luma, and McAfee, among others.

Amazon incorporated several new third-party experiences for Alexa that leverage AI; everyone can now create a song in any style by simply asking.

Microsoft discovered a promising new battery material in just weeks that could reduce lithium use by up to 70%.

How AI and high-performance computing are speeding up scientific discovery

Luma announced a new text-to-3D model generator capable of rendering HD materials.

Luma AI – Genie

McAfee announced an AI-powered deepfake audio detection tool called Project Mockingbird. They report a 90% success rate in identifying fraudulent audio to protect users from sophisticated online scams. This is quite remarkable in two ways: first, we can now easily mimic anyone's voice from an audio recording of just a few spoken sentences, and second, we can detect whether a recording is fake or not.

Besides these remarkable innovations, there are also innovations in our everyday devices to make them so-called 'smart'. It is especially these innovations that scare the heck out of me, because I know how quickly things can break when working with data and models. For example, it is great to see washing machines with sensors for optimized washing cycles, but I would find it horrifying if my washing machine refused to start for reasons such as the 'wrong' washing powder for the detected clothes, or because of restrictions on brands that it automatically detects.

In my simple view, data-driven solutions are not needed in every device, especially not in everyday home devices such as washing machines, and maybe just all kitchen gear. They just need to work when you press the button. There are, however, plenty of use cases where we do need data and models. But creating them in a reliable manner is a challenge! Let's jump into these in the next section.


Creating Sustainable and Reliable AI Products is a Challenge.

Most large companies nowadays have their own data science team, but despite that, it may not surprise you that many data science projects never see the light of day. With the rise of generative models, I believe this number can drop even further. The gains can be great but short-lived: when you build something new today with, for example, a large language model, it may be replaced, degraded, or left unsupported by something better within weeks or months. I should also note that not every data science project needs to land in production. Some projects are meant to gain knowledge, while others provide insights to steer larger projects. But when we do aim to create data science products, be aware that bringing an idea to production requires more than technical skills alone. As data scientists, we must understand the business context, effectively communicate with stakeholders, and translate their questions into actionable recommendations that drive business value. To create reliable and sustainable AI products, we need to think of several crucial steps:

  1. A data-driven project is much more than a technical solution alone.
  2. Responsibility in AI products.
  3. Manage the technical debt.
  4. Manage data-driven solutions.

In the following sections, I will describe these parts in more detail.


1. A data-driven project is much more than a technical solution alone.

As data scientists, we must not only demonstrate strong technical skills but also understand the business context, effectively communicate with stakeholders, and translate their questions into actionable recommendations that drive business value. The business has changed over the years, and we need to change with it to keep delivering successful data science projects. A summary of the three most important points:

  1. Learn the fundamentals of programming: Write your code in known styles, such as PEP8. Write inline comments that explain what you do and why. Write docstrings. Use sensible variable names. Write unit tests. Write documentation. Programming is one of the major challenges in the data science field; it is heavily underestimated, but it is one of the key components that can make or break a data science project on its way to production (a small sketch follows after this list).
  2. Understand that the success of a project is more than a technical solution alone: Start with the end in mind. Start with minimum verifiable learnings, a strategy that emphasizes finding out what people want first and building a potential solution second (the MVP). It is first about understanding problems and then about delivering the solutions the business needs. As an example, check which platform or infrastructure to collaborate on, and understand the domain.
  3. Be smart, learn, and repeat: Data science is a highly complicated and fast-evolving field. Without continuous learning, you will fall behind in months.
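To make the first point a bit more concrete, below is a minimal, hypothetical sketch of what these fundamentals look like in practice: a small PEP8-styled function with a docstring, sensible names, and a matching unit test. The function and test names are illustrative and not taken from any existing project.

```python
# metrics.py -- a small, documented utility following PEP8 conventions.
from __future__ import annotations


def weighted_average(values: list[float], weights: list[float]) -> float:
    """Compute the weighted average of `values` using `weights`.

    Raises
    ------
    ValueError
        If the inputs differ in length or the weights sum to zero.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# test_metrics.py -- unit tests that document the expected behavior.
import pytest


def test_weighted_average_basic():
    assert weighted_average([1.0, 3.0], [1.0, 1.0]) == pytest.approx(2.0)


def test_weighted_average_rejects_zero_weights():
    with pytest.raises(ValueError):
        weighted_average([1.0, 2.0], [0.0, 0.0])
```

It is a trivial example on purpose: the docstring, the explicit error handling, and the tests are exactly the pieces that tend to be skipped under time pressure and missed once the project heads to production.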

More details can be found here:

The Path to Success in Data Science Is About Your Ability to Learn. But What to Learn?


2. Create responsible AI products.

New machine learning models are created every day and are nowadays accessible to a broader audience through tools like ChatGPT, Copilot, and prompt engineering. We therefore need to build another important part into our AI systems: responsibility. This is a challenging part, which includes moral consultations, ensuring careful commissioning, and informing the stakeholders. However, there are best practices that contribute to a responsible and ethical AI landscape. We can break "responsibility" down into six parts, with ethics entangled throughout all of them:

The Six Parts of Responsible AI (Image by Author)
  1. Privacy: The very first step, before even starting a project, is looking at the privacy aspects. Think about personal data such as age and gender, but also other confidential information such as political and religious preferences. There can also be indirect information that can be linked to individuals, such as car license plates. Such data must be protected and safeguarded throughout the entire project and beyond (a small pseudonymization sketch follows after this list).
  2. Governance: We need to know who is the point of contact, the maintainer, or even the person responsible for, e.g., the data set. This is important because this person can likely tell us more about the ownership, usability, quality, integrity, and security of the data set.
  3. Input quality: Good quality of the input data set is essential to create reliable models. This may sound straightforward, but it needs proper attention in data science projects.
  4. Output quality: Ensuring reliable output is another important task. The type of output depends on the use case and can, for example, take the form of a dashboard, an advisory report, or a presentation. It is important to note that our input data set can be of high quality and our model can be trained reliably, yet the output can still be unpredictable or even unusable.
  5. Model quality: Model quality can be addressed by its accuracy, reliability, reproducibility, and explainability.
  6. Ethics: Ethics should be taken care of across the entire project. The key is awareness and integrity. In addition, it is always important to remain transparent with the involved parties and users regarding the limitations, and to use common sense when developing and reviewing the model.
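As a small illustration of the privacy point above, here is a hedged sketch of one common safeguard: pseudonymizing a direct identifier (such as a license plate) before the data reaches the modeling stage. The column names and salt handling are hypothetical; real projects should follow their own data-protection policies and legal requirements.

```python
import hashlib
import os

import pandas as pd

# Hypothetical example data containing a direct identifier.
df = pd.DataFrame({
    "license_plate": ["AB-123-CD", "XY-987-ZW"],
    "energy_usage_kwh": [12.4, 8.7],
})

# A secret salt prevents simple lookup attacks; store it securely, not in code.
SALT = os.environ.get("PSEUDONYM_SALT", "replace-me")


def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a direct identifier."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]


# Replace the identifier before the data enters the analysis pipeline.
df["license_plate"] = df["license_plate"].map(pseudonymize)
print(df)
```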

The Next Step is Responsible AI. How Do We Get There?

Besides these points, there is another crucial topic: technical debt.


3. Manage Technical Debt.

Remember that building software is expensive, and including a data-driven solution can be even more expensive. Various research articles describe that up to 42% of development time can be lost on technical debt. Interestingly, taking on technical debt can also be a calculated strategy. A paper published by the Software Engineering Institute describes 13 types of technical debt, each with a key indicator [3]. Technical debt can thus be a choice; it follows that it is not always bad, but it must be managed responsibly. Gartner predicts that organizations that manage and reduce technical debt will "achieve at least 50% faster service delivery times to the business" [4]. Having said this, there are some good practices to follow, because short-term wins can become long-term losses.

  • After showcasing the final product, allocate time to improve the code base: While you are in the development phase, code can become messier in order to move faster and to get stakeholder pressure off your back. Making such sprints is not necessarily a bad practice; just make sure you communicate about it and that the code is improved at a later stage.
  • Always leave the place cleaner than you found it: Use good practices when you work in a team or on your code base, and support each other in doing so. Use templates and agree on code styles. The code base should be a joint effort of the team.
  • Workarounds are not solutions in production environments: Quick-and-dirty solutions are not the way to go. However, keep in mind that over-engineering can also lead to higher code complexity. Ask a more senior developer for help if you need it, or do pair programming.
  • Keep it as simple as possible, but not simpler: Not all problems require machine learning. Some can be fixed with simple logic, others with statistics, and some do require a machine learning solution (a small illustration follows below).
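To illustrate the last bullet, here is a small, hypothetical example of keeping it simple: a rule-based check and a statistical threshold on sensor readings, either of which may already solve the problem before any machine learning model is trained. The numbers and thresholds are made up for illustration.

```python
import numpy as np

# Hypothetical sensor readings (e.g., vibration levels of a machine).
readings = np.array([0.8, 0.9, 1.1, 0.7, 5.2, 0.95, 1.0])

# Option 1: simple logic -- a fixed, domain-driven rule.
alerts_rule = readings > 3.0

# Option 2: statistics -- flag values more than two standard deviations
# away from the median, with no model training required.
center = np.median(readings)
spread = np.std(readings)
alerts_stats = np.abs(readings - center) > 2 * spread

print("Rule-based alerts:  ", alerts_rule)
print("Statistical alerts: ", alerts_stats)
# Only if neither approach captures the problem would a trained
# machine learning model be the next step.
```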

4. Manage Data-driven Solutions.

If your use case requires a machine learning solution, try to avoid overly complex solutions, as they can become problematic to get into production. Many problems can be solved with tree-based ML models such as XGBoost. Like every ML model, it requires optimization steps: splitting the dataset into a training set, a testing set, and an independent validation set. Especially for XGBoost models, this can be a time-consuming and error-prone process because of the hyperparameter optimization. Libraries such as HGBoost [5] will greatly help you in creating and selecting reliable models. More details can be found in this blog:

A Guide to Find the Best Boosting Model using Bayesian Hyperparameter Tuning but without…

HGBoost stands for Hyperoptimized Gradient Boosting. It is a Python package for hyperparameter optimization of XGBoost, LightBoost, and CatBoost models. It carefully splits the dataset into a train set, a test set, and an independent validation set. Within the train-test set, an inner loop optimizes the hyperparameters using Bayesian optimization (with hyperopt), and an outer loop scores how well the top-performing models generalize based on k-fold cross-validation. As such, it selects the most robust model with the relatively best performance. It also creates insightful figures that allow you to describe why the particular model was selected and is likely the most reliable.

Workflow of HGBoost (Image by Author)
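Below is a minimal usage sketch of this workflow with the hgboost package. The class and parameter names follow the package documentation as far as I recall them and may differ between versions, so treat this as an illustration and check the documentation for the exact API; the breast-cancer dataset is just a stand-in example.

```python
# pip install hgboost
from hgboost import hgboost
from sklearn.datasets import load_breast_cancer

# Example data; any tabular classification set works here.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Set up the nested scheme: an independent validation split, k-fold
# cross-validation on the remainder, and a capped number of Bayesian
# evaluations for the hyperparameter search.
hgb = hgboost(max_eval=100, cv=5, test_size=0.2, val_size=0.2, random_state=42)

# Run the hyperparameter optimization for an XGBoost classifier.
results = hgb.xgboost(X, y, pos_label=1)

# Inspect how the best model was selected and how well it generalizes
# (plot names may differ slightly per version).
hgb.plot()
hgb.plot_validation()
```

The resulting figures show which hyperparameters were explored and how the selected model performs on the held-out validation set, which is exactly the kind of evidence you need when explaining why a particular model was chosen.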

Final words.

It must be clear by now that not all use cases require a data-driven approach. Certain problems can be addressed using alternative methods or strategies that rely not on data analysis but on plain user logic. If you do need a data-driven approach, then consider all the parts, from responsibility to building reliable models.

Be Safe. Stay Frosty.

Cheers E.


If you found this article interesting, you are welcome to follow me because I write more about such topics. If you are thinking of taking a Medium membership, you can support my work a little bit by using my referral link. It costs the same as a coffee, and in return you can read unlimited articles every month.

Software

Let's connect!


References

  1. E. Taskesen, The Path to Success in Data Science Is About Your Ability to Learn. But What to Learn?, Towards Data Science (Medium), Jun 2023.
  2. E. Taskesen, The Next Step is Responsible AI. How Do We Get There?, Towards Data Science (Medium), Aug 2023.
  3. N. Rios, L. Ribeiro, V. Caires, T. Mendes, "Towards an Ontology of Terms on Technical Debt," ResearchGate, Dec 2014.
  4. "Assessing Technical Debt to Prioritize Modernization Investments," Gartner, Dec 19, 2023.
  5. E. Taskesen, A Guide to Find the Best Boosting Model using Bayesian Hyperparameter Tuning but without Overfitting, Towards Data Science (Medium), Aug 2022.

Tags: AI Data Data Science Deep Dives Responsible AI
