What is process discovery?
Organizations today record and store the details on events occurring in
their systems. Such an event log contains chronologically ordered data
describing various operations occurring in their system. A process discovery
technique can learn a process model from the observed behavior. The discovered
model is “representative” for the activities and behaviors seen in the
event log. The general problem of process discovery is to find such an
algorithm. Process discovery is considered one of the most challenging
tasks during process mining.
Why is process discovery necessary?
Nowadays, with the innovations in computing and communication that
dramatically changed organizations' workflow, business processes have
become more complex, heavily rely on information systems, and may span
multiple organizations. Therefore, visualizing how the organization is
running is crucial. A process model can provide insights, be
used to discuss responsibilities, analyze compliance, predict
performance, and much more.
However, creating such process models by hand is a challenging and error-prone
task, and only experienced designers can make valuable models. Typical
errors can be that
the model describes an idealized version of reality,
the model is unable to adequately capture the observed behavior.
Process discovery algorithms use
facts to automatically generate a
better model in less time. It
provides various views on the same reality at different abstraction
levels. With the created process model, people can apply other process
mining techniques such as conformance checking, performance analysis,
etc. to gain deeper insights into the organization.
How to discover a process model?
Discovering a process model is similar to how a child learns a new language.
Upon hearing new words, the child would start forming a mental model of the
language. When the child hears new words or sentences, it starts to refine
the model representing the language and makes it more and more sophisticated.
A process discovery algorithm also works similarly and starts by scanning
through the activities captured in the event log and slowly builds a model
that best represents the observed behavior. An unseen sequence of activities
coming from the event log is considered to be a new sentence. It either fits
the existing process model or requires some adjustment in the model discovered
so far. When the algorithm reaches the end of the event log, it has seen all
possible variations of the behavior exhibited by the system, and thus outputs a
process model representing the system behavior.
α-algorithm was the first process discovery algorithms that could
adequately deal with concurrency. With an event log as the input, the
α-algorithm derives various "relations" between the activities occurring
in the event log. These relations are used to produce a Petri net that
represents the log. Although the α-algorithm should not be considered as
mining technique that can be used in practice, it provides a good introduction
to the topic. The α-algorithm provided the basis for many other process discovery
Heuristic mining algorithms use a representation similar to causal
nets. Moreover, these algorithms take frequencies of events and
sequences into account when constructing a process model. The basic
idea is that infrequent paths should not be incorporated into the
Genetic process mining
The α-algorithm and techniques for heuristic and fuzzy mining
provide process models in a direct and deterministic manner.
Genetic algorithms are a search technique that mimics the natural process
of evolution in biological systems. These algorithms try to find a solution
in the search space, by either testing existing points, or through the
process of mutation or a combination of existing points. Such approaches
are not deterministic and depend on randomization to find new alternatives.
In the context of Petri nets, researchers have been looking at the
i.e., constructing a system model from a description of its
behavior. State-based regions can be used to construct a Petri net
from a transition system. This technique finds "General Excitation Regions"
and builds Petri nets using such regions. Language-based regions can
be used to construct a Petri net from a prefix-closed language. The
language-based region technique uses algebraic constraints modeled
from the event log to determine the places that allow the behavior observed
in the event log.
A range of inductive process discovery techniques exists for process
trees, which ensure soundness from construction. Therefore, the
inductive mining framework is highly extendible and allows for many
variants of the basic approach. It is considered one of the
leading process discovery approaches due to its flexibility, formal
guarantees, and scalability.
Outcomes of Process discovery
The process discovery techniques applied to the event logs provide a
graphical representation of a process. The result of a process discovery
algorithm is generally a process model and statistics of the cases that
are part of the event log. The representation and accuracy of the discovered
model depend both on the technique used for the discovery and the type of
visualization that is chosen.
A Directly-Follows Graph (DFG) is the simplest representation of the process models.
In a directly-follows graph, each node represents an activity and the arcs describe
the relationship between various activities. Typically in a process model, the
directly-follows graph has a source and sink representing the start and end
activities. An arc in the directly-follows graph between any two activities represents
that the source activity is directly followed by the sink activity in the event log.
Petri nets provide a higher-level representation of the process models and
allow for a compact representation of concurrent behaviour in processes.
A Petri net is capable of showing different types of transformations between
the activities. Petri nets are capable of describing sequential, parallel,
choice, and loop execution between various activities in the processes. The
notion of token flows has been adopted by most of the graphical
process modeling languages (BPMN, UML activity diagrams, etc.).
The BPMN 2.0 (Business Process Model and Notation) standard is
widely used and allows building compact and understandable
process models. In addition to the flat control-flow perspective,
subprocesses, data flows, resources can be integrated within one
BPMN diagram. This makes BPMN very attractive for both process miners
and business users since the control flow perspective can be
integrated with data and resource perspectives discovered from event logs.
Process discovery poses several unique challenges. An initial
glance shows that the model discovered using process mining is
different and complex than the ideal process model that was expected.
The event logs used to discover models, only show the behaviour that
has occurred so far, giving us only the partial model of the entire
space of possibilities.
Four challenges during process discovery
Fitness: Can the model explain the observed behavior?
Simplicity: Is the model simple ("Occam's razor")?
Generalization: Does the discovered model overfit the event log?
Precision: Is the model not expressive enough to show the behavior
seen in the logs (underfitting)?