Even today, the power of data is underestimated (source: Getty Images)

Strategic data management: a case study and an interview on DaaS vs. DaaP

Laurent Dossche
8 min read · Dec 6, 2020


Data management strategy has become a decisive factor in today's business environments. I interviewed Emile Cras, a business analyst at Semetis who builds next-generation solutions for clients all over Belgium, on the hype around data as a service and how it differs from data as a product. Choosing the right strategy and infrastructure for large data sets is vital, so the second half of this article walks through a case study on building a production-ready data science product.

DaaP vs. DaaS: interview

This interview grew out of a water cooler conversation with Emile Cras, one of the best business analysts at Semetis, who helps me understand the mechanisms behind data sourcing. Why is there such hype around data as a service (DaaS) versus data as a product (DaaP)?

  • What’s the difference between data as a service (DaaS) and data as a product (DaaP)?

E: DaaP is based on the underlying principle of providing data to the company, translated into reports and data visualizations that managers use to answer the questions and business matters they face on a day-to-day basis. DaaS, on the other hand, is focused on already providing the answers to the questions people and stakeholders in the company have. In other words, instead of handing others the tools to answer their own questions, the data team delivers the answers directly. In a DaaS model there's a much closer relationship between management and the data teams. In that sense, data teams have higher expertise in their domain and can conduct strategic data science. In short, DaaP is focused on providing data to the company, while DaaS is all about providing valuable insights that answer challenging business questions.

It's important to mention that you cannot achieve a DaaS model in your company without first passing through a DaaP phase. First things first: you need to build strong reporting and data models before moving into strategic decision support. In other words, having the fundamentals in place is of vital importance before you can translate your data into strategic tools and insights.

Data Team maturity over time

  • Why is DaaS only now seeing widespread adoption?

E: This is mainly because we now have far more possibilities to handle massive amounts of data in no time. Companies can improve their infrastructure and workloads through smart technologies. Nowadays we can use cloud computing solutions such as AWS and Google Cloud, which make managing and processing data much less time-consuming. These cloud-based platforms are engineered for large-scale data management. Data teams can now focus on the important part, translating data into insights, rather than on storing and handling it.

  • Is the demand for DaaS something that came from within your industry specifically, or was this shift about to happen anyway?

E: The demand for such a model has been initiated, I believe, by big tech companies and small start-ups that applied such a vision from the start. The call for this model has only become louder now that companies have to handle bigger datasets and find competitive edges in today's challenging markets and industries. Nonetheless, this wouldn't have been possible without the emergence of new technologies and cloud computing becoming the new norm in the industry. In that regard, technological change also plays a big role, as data teams and scientists now have the time and capabilities to focus on analyzing the data rather than on processing it.

For example, in my field of digital marketing, we use Google's cloud marketing platform, BigQuery, and customer data platforms, which allow us to generate, store, and clean massive data sets. This process is mostly handled by technology, so our data scientists and analysts can focus on delivering valuable insights for business decisions; that wasn't the case just a few years ago. Data as a service delivers a framework for monetizing big data, turning predictive analytics insights into untapped revenue streams.

  • Is there a trade-off between SaaS and DaaS?

E: Before diving into detail on this question, let me clarify what a SaaS model is. SaaS is a cloud computing model in which applications are delivered to end users over the internet, removing the need to run them locally on your device. DaaS is similar to SaaS in that both rely on cloud computing and remove the need to store software or data on a device. Just as a SaaS model removes the need to install and manage software on a device, DaaS outsources most data storage and processing to the cloud.

The SaaS model has been around for more than a decade now, while DaaS is only beginning to emerge. This is because cloud computing services were not engineered to handle large data sets, and processing that data was difficult when internet bandwidth was all too often limited. Now that we can rely on lower-cost computing services and improved bandwidth, DaaS is becoming more accessible than ever before and will gain a lot of importance in the coming years.

Thank you, Emile! On that note,

Case study

Building out a viable data science product involves much more than just building a machine learning model with scikit-learn, pickling it, and loading it on a server. It requires an understanding of how all the parts of the enterprise's ecosystem work together: where and how the data flows into the data team, the environment where the data is processed and transformed, the enterprise's conventions for visualizing and presenting data, and how the model's output will be converted into input for other enterprise applications. The main goals are to build a process that is easy to maintain, where models can be iterated on and their performance reproduced, and whose output can be easily understood and visualized by other stakeholders so that they can make better informed business decisions. Achieving those goals requires selecting the right tools, as well as an understanding of what others in the industry are doing and the current best practices.
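For contrast, here is a minimal sketch of that naive "train, pickle, serve" step, with synthetic data standing in for real features; everything around this snippet (pipelines, retraining, monitoring, serving) is what the rest of this case study is about.

```python
# A deliberately minimal "just pickle it" sketch with scikit-learn.
# The data, features, and file path are all illustrative placeholders.
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for real customer/transaction features.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.3f}")

# "Pickling it and loading it on a server" -- the easy part.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```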

Let’s illustrate with a scenario: suppose you just got hired as the lead data scientist for a vacation recommendation app startup that is expected to collect hundreds of gigabytes of both structured (customer profiles, temperatures, prices, and transaction records) and unstructured (customers’ posts/comments and image files) data from users daily. Your predictive models would need to be retrained with new data weekly and make recommendations instantaneously on demand. Since you expect your app to be a huge hit, your data collection, storage, and analytics capacity would have to be extremely scalable. How would you design your data science process and productionize your models? What are the tools that you’d need to get the job done? Since this is a startup and you are the lead — and perhaps the only — data scientist, it’s on you to make these decisions.

First, you’d have to figure out how to set up the data pipeline that takes in the raw data from data sources, processes the data, and feeds the processed data to databases. The ideal data pipeline has low event latency (ability to query data as soon as it’s been collected); scalability (able to handle massive amount of data as your product scales); interactive querying (support both batch queries and smaller interactive queries that allow data scientists to explore the tables and schemas); versioning (ability to make changes to the pipeline without bringing down the pipeline and losing data); monitoring (the pipeline should generate alerts when data stops coming in); and testing (ability to test the pipeline without interruptions). Perhaps most importantly, it had better not interfere with daily business operations — e.g. heads will roll if the new model you’re testing causes your operational database to grind to a halt. Building and maintaining the data pipeline is usually the responsibility of a data engineer (for more details, this article has an excellent overview on building the data pipeline for startups), but a data scientist should at least be familiar with the process, its limitations, and the tools needed to access the processed data for analysis.

Next, you’d have to decide if you want to set up on-premises infrastructure or use cloud services. For a startup, the top priority is to scale data collection without scaling operational resources. As mentioned earlier, on-premises infrastructure requires huge upfront and maintenance costs, so cloud services tend to be a better option for startups. Cloud services allow scaling to match demand and require minimal maintenance efforts, so that your small team of staff could focus on the product and analytics instead of infrastructure management.

Choice of cloud provider

In order to choose a cloud service provider, you'd first have to establish what data you need for analytics, and which databases and analytics infrastructure are most suitable for those data types. My interview with Loïc Claeys in a previous episode may shed more light on this subject. Since there'd be both structured and unstructured data in your analytics pipeline, you might want to set up both a Data Warehouse and a Data Lake. An important thing for data scientists to consider is whether the storage layer supports the big data tools needed to build the models, and whether the database provides effective in-database analytics. For example, some ML libraries such as Spark's MLlib cannot be used effectively with databases as the main interface for data: the data has to be unloaded from the database before it can be operated on, which could become extremely time-consuming as data volume grows and turn into a bottleneck when you have to retrain your models regularly (thus causing another "heads-rolling" situation).
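To make the MLlib point concrete, here is a hedged sketch of that unload step: before MLlib can train anything, the table has to be pulled out of the warehouse (here over JDBC) into a Spark DataFrame. The connection string, table, and column names are invented for illustration, and the Postgres JDBC driver is assumed to be on the classpath.

```python
# Sketch: MLlib can't train inside the warehouse; the data must first be
# unloaded into Spark. Connection details and columns are hypothetical.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retrain").getOrCreate()

# Pull the training table out of the warehouse over JDBC -- this transfer
# is the potential bottleneck as data volume grows.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://warehouse.example.com/analytics")
      .option("dbtable", "bookings_training")
      .option("user", "ml_reader")
      .option("password", "...")
      .load())

features = VectorAssembler(
    inputCols=["price", "temperature", "past_bookings"],
    outputCol="features")
model = LogisticRegression(labelCol="booked").fit(features.transform(df))
```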

For data science in the cloud, most cloud providers are working hard to develop native machine learning capabilities that let data scientists build and deploy machine learning models directly on data stored in their platform (Amazon has SageMaker, Google has BigQuery ML, Microsoft has Azure Machine Learning). But these toolsets are still developing and often incomplete: for example, BigQuery ML currently only supports linear regression, binary and multiclass logistic regression, K-means clustering, and TensorFlow model importing. If you decide to use these tools, you'd have to test their capabilities thoroughly to make sure they do what you need them to do.
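As an illustration of the in-database approach, here is a sketch of training one of those supported model types (logistic regression) with BigQuery ML from Python; the project, dataset, table, and column names are placeholders.

```python
# Sketch: training a logistic regression model inside BigQuery with
# BigQuery ML. Project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="vacation-app")

create_model_sql = """
CREATE OR REPLACE MODEL `vacation-app.analytics.booking_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['booked']) AS
SELECT price, temperature, past_bookings, booked
FROM `vacation-app.analytics.bookings_training`
"""

# The model is trained where the data lives; nothing is unloaded.
client.query(create_model_sql).result()  # blocks until training finishes
```

Note the contrast with the MLlib sketch above: the training data never leaves the warehouse, at the cost of being limited to the model types the platform supports.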

Another major consideration when choosing a cloud provider is vendor lock-in. If you choose a proprietary cloud database solution, you most likely won't be able to access the software or the data in your local environment, and switching vendors would require migrating to a different database, which could be costly. One way to address this problem is to choose vendors that support open source technologies (here's Netflix explaining why they use open source software). A further advantage of open source technologies is that they tend to attract a larger community of users, which makes it easier to hire someone with the experience and skills to work within your infrastructure. Another way to address the problem is to choose third-party vendors (such as Pivotal Greenplum and Snowflake) that provide cloud database solutions using the major cloud providers as the storage backend, which also lets you store your data in multiple clouds if that fits your startup's needs.

Finally, since you expect the company to grow, you'd have to put in place robust cloud management practices to secure your cloud and prevent data loss and leakage, such as managing data access and securing interfaces and APIs. You'd also want to implement data governance best practices to maintain data quality and ensure your Data Lake won't turn into a Data Swamp.
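As one small example of managing data access, here is a sketch that grants a hypothetical analyst group read-only access to the Cloud Storage bucket backing the Data Lake; the bucket and group names are made up.

```python
# Sketch: least-privilege, read-only access to a Data Lake bucket on GCS.
# The bucket name and the analyst group are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("vacation-app-data-lake")

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",           # read-only role
    "members": {"group:analysts@vacation-app.com"},
})
bucket.set_iam_policy(policy)
```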

See you next week,

Laurent
