How do processes work?
To enhance the outcomes of operational processes of an organization, it is important to know the process inside out. Process discovery is used to construct process models using the event logs recorded from the real-life activities.
What is process discovery?
Why is process discovery necessary?

Nowadays, with the innovations in computing and communication that dramatically changed organizations' workflow, business processes have become more complex, heavily rely on information systems, and may span multiple organizations. Therefore, visualizing how the organization is running is crucial. A process model can provide insights, be used to discuss responsibilities, analyze compliance, predict performance, and much more.

However, creating such process models by hand is a challenging and error-prone task, and only experienced designers can make valuable models. Typical errors can be that the model describes an idealized version of reality, or the model is unable to adequately capture the observed behavior.

Process discovery algorithms use facts to automatically generate a better model in less time. It provides various views on the same reality at different abstraction levels. With the created process model, people can apply other process mining techniques such as conformance checking, performance analysis, etc. to gain deeper insights into the organization.

How to discover a process model?

Discovering a process model is similar to how a child learns a new language. Upon hearing new words, the child would start forming a mental model of the language. When the child hears new words or sentences, it starts to refine the model representing the language and makes it more and more sophisticated. A process discovery algorithm also works similarly and starts by scanning through the activities captured in the event log and slowly builds a model that best represents the observed behavior. An unseen sequence of activities coming from the event log is considered to be a new sentence. It either fits the existing process model or requires some adjustment in the model discovered so far. When the algorithm reaches the end of the event log, it has seen all possible variations of the behavior exhibited by the system, and thus outputs a process model representing the system behavior.

α-algorithm was the first process discovery algorithms that could adequately deal with concurrency. With an event log as the input, the α-algorithm derives various "relations" between the activities occurring in the event log. These relations are used to produce a Petri net that represents the log. Although the α-algorithm should not be considered as mining technique that can be used in practice, it provides a good introduction to the topic. The α-algorithm provided the basis for many other process discovery techniques.
Heuristic mining
Heuristic mining algorithms use a representation similar to causal nets. Moreover, these algorithms take frequencies of events and sequences into account when constructing a process model. The basic idea is that infrequent paths should not be incorporated into the model.
Genetic process mining
The α-algorithm and techniques for heuristic and fuzzy mining provide process models in a direct and deterministic manner. Genetic algorithms are a search technique that mimics the natural process of evolution in biological systems. These algorithms try to find a solution in the search space, by either testing existing points, or through the process of mutation or a combination of existing points. Such approaches are not deterministic and depend on randomization to find new alternatives.
Region-Based Mining
In the context of Petri nets, researchers have been looking at the so-called synthesis problem, i.e., constructing a system model from a description of its behavior. State-based regions can be used to construct a Petri net from a transition system. This technique finds "General Excitation Regions" and builds Petri nets using such regions. Language-based regions can be used to construct a Petri net from a prefix-closed language. The language-based region technique uses algebraic constraints modeled from the event log to determine the places that allow the behavior observed in the event log.
Inductive Mining
A range of inductive process discovery techniques exists for process trees, which ensure soundness from construction. Therefore, the inductive mining framework is highly extendible and allows for many variants of the basic approach. It is considered one of the leading process discovery approaches due to its flexibility, formal guarantees, and scalability.
Outcomes of Process discovery

The process discovery techniques applied to the event logs provide a graphical representation of a process. The result of a process discovery algorithm is generally a process model and statistics of the cases that are part of the event log. The representation and accuracy of the discovered model depend both on the technique used for the discovery and the type of visualization that is chosen.

Directly-follows graph
A Directly-Follows Graph (DFG) is the simplest representation of the process models. In a directly-follows graph, each node represents an activity and the arcs describe the relationship between various activities. Typically in a process model, the directly-follows graph has a source and sink representing the start and end activities. An arc in the directly-follows graph between any two activities represents that the source activity is directly followed by the sink activity in the event log.
Petri nets
Petri nets provide a higher-level representation of the process models and allow for a compact representation of concurrent behaviour in processes. A Petri net is capable of showing different types of transformations between the activities. Petri nets are capable of describing sequential, parallel, choice, and loop execution between various activities in the processes. The notion of token flows has been adopted by most of the graphical process modeling languages (BPMN, UML activity diagrams, etc.).
The BPMN 2.0 (Business Process Model and Notation) standard is widely used and allows building compact and understandable process models. In addition to the flat control-flow perspective, subprocesses, data flows, resources can be integrated within one BPMN diagram. This makes BPMN very attractive for both process miners and business users since the control flow perspective can be integrated with data and resource perspectives discovered from event logs.

Process discovery poses several unique challenges. An initial glance shows that the model discovered using process mining is different and complex than the ideal process model that was expected. The event logs used to discover models, only show the behaviour that has occurred so far, giving us only the partial model of the entire space of possibilities.

Four challenges during process discovery
  • Fitness: Can the model explain the observed behavior?
  • Simplicity: Is the model simple ("Occam's razor")?
  • Generalization: Does the discovered model overfit the event log?
  • Precision: Is the model not expressive enough to show the behavior seen in the logs (underfitting)?