DASC-PM: A Novel Process Model for Data Science Projects
An introduction to an alternative approach to the popular CRISP-DM
With easy access to data, increasing computing power, and user-friendly analytics software, there has been a massive increase in the number of data science projects across various industries. While the early years felt like the Wild West, it is now much more common (and recommended) to follow specific frameworks for data-related projects.
Process models provide a clear and structured approach to defining and organizing tasks, activities, and deliverables throughout the project lifecycle. By implementing a consistent process, teams can ensure that all project objectives are met and that the final deliverables are of high quality. In addition, process models help reduce the risk of delays, errors, and budget overruns, making them an essential part of data science project management.
Please note that this entire article is a quick introduction to and brief summary of the framework, which was produced by a larger group of authors; I contributed to DASC-PM v1.1 as one of the co-authors:
_Schulz et al. (2022): "DASC-PM v1.1 – A Process Model for Data Science Projects", Publisher: NORDAKADEMIE gAG Hochschule der Wirtschaft, ISBN: 978-3-00-064898-4, DOI: 10.25673/32872.2_

The "DAta SCience – Process Model" (DASC-PM) is a novel process model for data science projects that describes the key areas relevant to the project and the phases to be completed. It explains the typical tasks within the phases and depicts the project roles involved and the required competencies. The following article aims to introduce the main concepts and to work out the advantages compared to known concepts such as CRISP-DM, TDSP, KDD, or SEMMA.
The rise of a novel approach
With the cross-industry standard process for data mining (CRISP-DM), there already exists a well-known "framework for carrying out data mining projects which is independent of both the industry sector and the technology used." [4] Beyond that, other relevant concepts such as TDSP, SEMMA, and KDD aim to provide comparable models, each outperforming the others in certain details. However, it was of interest to take a step back and identify (meta-) requirements relevant to process models focusing on data science. These requirements were collected via a survey from April 2019 to February 2020, cover both scientific and practical aspects, and thus address the first research question [2]:
Which theoretical and practical requirements are imposed upon data science process models?
Data collection was conducted in a working group of 22 experts: 9 professors as well as 13 practitioners and scientists with relevant theoretical and practical experience in data science. Based on these requirements, the group then examined the extent to which the related process models fulfill them. The table below offers an overview of the results of the investigation by specific requirement and process model. Filled Harvey balls indicate that a requirement is addressed by the respective process model, half-filled ones that a requirement is at least mentioned, and empty ones that a requirement is neither mentioned nor addressed. [2]

Recognizing that none of the related, well-known process models fulfilled all 17 identified requirements placed upon process models for data science projects, Schulz et al. developed a novel data science process model, the DASC-PM, to address the second research question [2]:
How can a data science process model that is aligned with relevant theoretical and practical requirements be conceptualized?
Brief introduction of the five DASC-PM phases
In the following section, we briefly capture the main ideas of the five core phases of the newly created process model: Project Order, Data Provision, Analysis, Deployment, and Application. [3]

The corresponding areas and tasks are visualized as follows:

Phase 1: Project Order
Problems existing within a domain trigger the development of use cases. Promising use cases are subsequently configured into a data science project outline. All associated tasks are reflected in the project order phase. Because the project is considered early and relatively comprehensively, broad abilities in almost all skill areas are frequently required here. [3]

Phase 2: Data Provision
The data provision phase summarizes all activities that belong to the key area of data, which is why the term is deliberately broad. The phase comprises data preparation (from recording to storage), data management, and exploratory data analysis. It results in a data source that is suited for further analysis. [3]
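To make this phase more tangible, here is a minimal sketch of what data provision might look like in Python with pandas. The file name, column names, and cleaning rules are hypothetical assumptions for illustration; the DASC-PM itself does not prescribe any tooling.

```python
import pandas as pd

# Hypothetical file and column names, chosen purely for illustration.
raw = pd.read_csv("sales_records.csv", parse_dates=["order_date"])

# Exploratory data analysis: understand structure, ranges, and gaps.
raw.info()
print(raw.describe(include="all"))

# Data preparation: remove duplicates and rows missing the key column,
# then derive an analysis-relevant feature.
prepared = (
    raw.drop_duplicates()
       .dropna(subset=["customer_id"])
       .assign(revenue=lambda df: df["units"] * df["unit_price"])
)

# Data management: persist a versioned, analysis-ready data source.
prepared.to_parquet("sales_prepared_v1.parquet", index=False)
```

The point is less the specific calls and more the shape of the phase: explore first, clean and enrich second, and end with a reusable, documented data source.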

Phase 3: Analysis
In a data science project, either existing procedures can be used or a new procedure can be developed; this decision is a separate challenge in itself. The phase therefore includes not only performing the analysis but also related activities. The artifact of the phase is an analysis result that has undergone a methodical and technical evaluation. [3]
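As an illustration of this phase, the following sketch applies an "existing procedure" (an off-the-shelf scikit-learn classifier) and evaluates it on held-out data. The data set, algorithm choice, and hold-out evaluation are assumptions made for demonstration, not something the DASC-PM prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# A built-in data set stands in for the prepared data source from phase 2.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# "Existing procedure": an off-the-shelf classifier rather than a newly
# developed method; making this choice is itself part of the phase.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Methodical and technical evaluation on held-out data.
print(classification_report(y_test, model.predict(X_test)))
```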

Phase 4: Deployment
In the deployment phase, an applicable form of the analysis results is created. Depending on the project, this can entail a comprehensive consideration of technical, methodological, and professional tasks, or it can be handled pragmatically. The analysis artifact can include results as well as models or procedures and is provided to its target recipients in various forms. [3]
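One pragmatic way to produce such an applicable form, assuming the artifact is a trained scikit-learn model, is to persist it so that the target recipients can load and apply it. The joblib-based approach and the file name below are illustrative choices, not part of the framework:

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in artifact; in practice this comes out of the analysis phase.
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Deployment: persist the artifact in an applicable, shareable form
# (file name is illustrative).
joblib.dump(model, "model_v1.joblib")

# Target recipients load the artifact and apply it to new observations.
deployed = joblib.load("model_v1.joblib")
print(deployed.predict(X[:5]))
```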

Phase 5: Application
Using artifacts after project completion is not considered a primary part of a data science project. Monitoring is necessary, however, to check the model's continuing suitability in the application and to obtain findings from the application for ongoing and new developments (including developments for the purposes of iterative approaches). [3]
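A minimal sketch of the monitoring idea: periodically compare live performance against a baseline measured at deployment time and flag degradation so findings flow back into development. The metric, baseline, and threshold below are illustrative assumptions:

```python
from sklearn.metrics import accuracy_score

# Both values are illustrative assumptions, not DASC-PM prescriptions.
BASELINE_ACCURACY = 0.95   # performance measured at deployment time
ALERT_THRESHOLD = 0.05     # tolerated drop before action is taken

def check_model_suitability(y_true, y_pred):
    """Compare live accuracy against the deployment-time baseline."""
    live_accuracy = accuracy_score(y_true, y_pred)
    degraded = (BASELINE_ACCURACY - live_accuracy) > ALERT_THRESHOLD
    if degraded:
        print(f"Alert: accuracy dropped to {live_accuracy:.2f}; feed these "
              "findings back into an iterative development cycle.")
    return live_accuracy, degraded

# Example call with hypothetical labels collected during application:
acc, needs_action = check_model_suitability([1, 0, 1, 1], [1, 0, 0, 1])
```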

Overarching key areas
Besides the five process phases, the model contains three overarching key areas that have to be taken into account in all phases of the project:
Domain
At many points in a data science process, broad background knowledge of the domain is needed. Examples are the identification of the analysis target or the correct understanding of data, its origin, quality, and connections. Other examples include the assessment and classification of analysis results in the application as well as subsequent practical use. The area "Domain" also encompasses rating the strengths and weaknesses of existing solutions, conducting requirements analysis, supporting the parametrization of models, and finally evaluating the success of the project. Legal, social, and ethical aspects of data science projects are also addressed here. [1]
Scientificity
The scientific nature of data science projects does not mean that they claim to be complete, formalized, academic, or consistently research-orientated. Although this may certainly be the case for research projects, scientificity within a business context primarily refers to a solid methodology: a typically expected characteristic, or minimum requirement, of scientific work. [1]
The defined project order must be processed methodically in every project phase. Special mention must be made here of project management and the structured processing that is placed in the foreground by using a process model. Details on the degree of scientificity required must be established while considering the project situation and domain specifics. [1]
IT Infrastructure
All the steps that a data science project traverses depend on the underlying IT infrastructure; the actual extent of IT support, however, should be individually assessed for each project. Even if the use of specific hardware and software is frequently determined within the organization, the limiting and empowering characteristics of the IT infrastructure (as well as the possibility of expanding the infrastructure, if applicable) must be considered in all project phases. [1]
Summary
The DASC-PM is the result of a scientific approach to collecting, structuring, and addressing (meta-) requirements for process models in the area of data science. Since none of the previous, well-known models met all defined requirements, it was of interest to introduce a novel concept that allows researchers as well as practitioners in business and industry to structure data science projects in a phase-orientated way. However, the authors emphasize that the DASC-PM should not be considered a finished deliverable but rather a framework that can be continuously improved through scientific and practical discourse.
I hope you find it interesting. Let me know your thoughts, and feel free to connect on LinkedIn https://www.linkedin.com/in/jonas-dieckmann and/or to follow me here on Medium.
See also some of my other articles:
Case Study: Applying a Data Science Process Model to a Real-World Scenario
References
The whole article represents a brief summary and is based on the framework:
[1] Schulz et al. (2022): "DASC-PM v1.1 – A Process Model for Data Science Projects", Publisher: NORDAKADEMIE gAG Hochschule der Wirtschaft, ISBN: 978-3-00-064898-4, DOI: 10.25673/32872.2
as well as the introduction:
[2] Schulz et al. (2020): "Introducing DASC-PM: A Data Science Process Model". ACIS 2020 Proceedings. 45. https://aisel.aisnet.org/acis2020/45
An additional quotable source is provided in the following book:
[3] Kuehnel, S., Neuhaus, U., Kaufmann, J., Schulz, M., Alekozai, E.M. (2023). "Using the Data Science Process Model Version 1.1 (DASC-PM v1.1) for Executing Data Science Projects: Procedures, Competencies, and Roles." In: Barton, T., Müller, C. (eds) Apply Data Science. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-38798-3_8
Other references:
[4] Wirth, R., Hipp, J. (2000): "CRISP-DM: Towards a Standard Process Model for Data Mining", Proc. 4th Int. Conference on Practical Applications of Knowledge Discovery and Data Mining, pp. 29–39.