Why is process discovery necessary?
Nowadays, with the innovations in computing and communication that
dramatically changed organizations' workflow, business processes have
become more complex, heavily rely on information systems, and may span
multiple organizations. Therefore, visualizing how the organization is
running is crucial. A process model can provide insights, be used to
discuss responsibilities, analyze compliance, predict performance, and
However, creating such process models by hand is a challenging and
error-prone task, and only experienced designers can make valuable
models. Typical errors can be that
the model describes an idealized version of reality,
the model is unable to adequately capture the observed
Process discovery algorithms use
facts to automatically generate a
better model in less time. It
provides various views on the same reality at different abstraction
levels. With the created process model, people can apply other process
mining techniques such as conformance checking, performance analysis,
etc. to gain deeper insights into the organization.
How to discover a process model?
Discovering a process model is similar to how a child learns a new
language. Upon hearing new words, the child would start forming a
mental model of the language. When the child hears new words or
sentences, it starts to refine the model representing the language and
makes it more and more sophisticated. A process discovery algorithm
also works similarly and starts by scanning through the activities
captured in the event log and slowly builds a model that best
represents the observed behavior. An unseen sequence of activities
coming from the event log is considered to be a new sentence. It
either fits the existing process model or requires some adjustment in
the model discovered so far. When the algorithm reaches the end of the
event log, it has seen all possible variations of the behavior
exhibited by the system, and thus outputs a process model representing
the system behavior.
α-algorithm was the first process discovery algorithms that could
adequately deal with concurrency. With an event log as the input,
the α-algorithm derives various "relations" between the activities
occurring in the event log. These relations are used to produce a
Petri net that represents the log. Although the α-algorithm should
not be considered as mining technique that can be used in practice,
it provides a good introduction to the topic. The α-algorithm
provided the basis for many other process discovery techniques.
Heuristic mining algorithms use a representation similar to causal
nets. Moreover, these algorithms take frequencies of events and
sequences into account when constructing a process model. The basic
idea is that infrequent paths should not be incorporated into the
Genetic process mining
The α-algorithm and techniques for heuristic and fuzzy mining
provide process models in a direct and deterministic manner. Genetic
algorithms are a search technique that mimics the natural process of
evolution in biological systems. These algorithms try to find a
solution in the search space, by either testing existing points, or
through the process of mutation or a combination of existing points.
Such approaches are not deterministic and depend on randomization to
find new alternatives.
In the context of Petri nets, researchers have been looking at the
i.e., constructing a system model from a description of its
behavior. State-based regions can be used to construct a Petri net
from a transition system. This technique finds "General Excitation
Regions" and builds Petri nets using such regions. Language-based
regions can be used to construct a Petri net from a prefix-closed
language. The language-based region technique uses algebraic
constraints modeled from the event log to determine the places that
allow the behavior observed in the event log.
A range of inductive process discovery techniques exists for process
trees, which ensure soundness from construction. Therefore, the
inductive mining framework is highly extendible and allows for many
variants of the basic approach. It is considered one of the leading
process discovery approaches due to its flexibility, formal
guarantees, and scalability.
Outcomes of Process discovery
The process discovery techniques applied to the event logs provide a
graphical representation of a process. The result of a process
discovery algorithm is generally a process model and statistics of the
cases that are part of the event log. The representation and accuracy
of the discovered model depend both on the technique used for the
discovery and the type of visualization that is chosen.
A Directly-Follows Graph (DFG) is the simplest representation of the
process models. In a directly-follows graph, each node represents an
activity and the arcs describe the relationship between various
activities. Typically in a process model, the directly-follows graph
has a source and sink representing the start and end activities. An
arc in the directly-follows graph between any two activities
represents that the source activity is directly followed by the sink
activity in the event log.
Petri nets provide a higher-level representation of the process
models and allow for a compact representation of concurrent
behaviour in processes. A Petri net is capable of showing different
types of transformations between the activities. Petri nets are
capable of describing sequential, parallel, choice, and loop
execution between various activities in the processes. The notion of
token flows has been adopted by most of the graphical process
modeling languages (BPMN, UML activity diagrams, etc.).
The BPMN 2.0 (Business Process Model and Notation) standard is
widely used and allows building compact and understandable process
models. In addition to the flat control-flow perspective,
subprocesses, data flows, resources can be integrated within one
BPMN diagram. This makes BPMN very attractive for both process
miners and business users since the control flow perspective can be
integrated with data and resource perspectives discovered from event
Process discovery poses several unique challenges. An initial glance
shows that the model discovered using process mining is different and
complex than the ideal process model that was expected. The event logs
used to discover models, only show the behaviour that has occurred so
far, giving us only the partial model of the entire space of
Four challenges during process discovery
Fitness: Can the model explain the observed
Simplicity: Is the model simple ("Occam's
Generalization: Does the discovered model
overfit the event log?
Precision: Is the model not expressive enough to
show the behavior seen in the logs (underfitting)?