MEMICS 2014
Petr Hliněný, Zdeněk Dvořák,
Jiří Jaroš, Jan Kofroň, Jan Kořenek,
Petr Matula and Karel Pala (Eds.)
MEMICS 2014
Ninth Doctoral Workshop on Mathematical
and Engineering Methods in Computer Science
Telč, Czech Republic, October 17–19, 2014
Editors
Petr Hliněný
Faculty of Informatics
Masaryk University
Botanická 68a, Brno, Czech Republic
Zdeněk Dvořák
Faculty of Mathematics and Physics
Charles University
Ke Karlovu 3, Praha, Czech Republic
Jiří Jaroš
Faculty of Information Technology
Brno University of Technology
Božetěchova 2, Brno, Czech Republic
Jan Kofroň
Faculty of Mathematics and Physics
Charles University
Malostranské náměstí 25, Praha, Czech Republic
Jan Kořenek
Faculty of Information Technology
Brno University of Technology
Božetěchova 2, Brno, Czech Republic
Petr Matula
Faculty of Informatics
Masaryk University
Botanická 68a, Brno, Czech Republic
Karel Pala
Faculty of Informatics
Masaryk University
Botanická 68a, Brno, Czech Republic
Subject classification: Information technology
ISBN 978-80-214-5022-6
Company and university graphics are the property of their respective owners and
are published as provided
Preface
This volume contains the local proceedings of the 9th Doctoral Workshop on
Mathematical and Engineering Methods in Computer Science (MEMICS 2014)
held in Telč, Czech Republic, on October 17–19, 2014.
The aim of the MEMICS workshop series is to provide an opportunity for PhD
students to present and discuss their work in an international environment.
The scope of MEMICS is broad and covers many fields of computer science
and engineering. In the year 2014, submissions were invited especially in the following (though not exclusive) areas:
–
–
–
–
–
–
Algorithms, logic, and games,
High performance computing,
Computer aided analysis, verification, and testing,
Hardware design and diagnostics,
Computer graphics and image processing, and
Artificial intelligence and natural language processing.
There were 28 submissions from 10 countries. Each submission was thoroughly
evaluated by at least four Programme Committee members who also provided extensive feedback to the authors. Out of these submissions, 9 papers were selected
for publication in LNCS post-proceedings, and 9 other papers for publication in
these local proceedings.
In addition to regular papers, MEMICS workshops also invite PhD students
to submit a presentation of their recent research results that have already undergone a rigorous peer review process and have been presented at a high-quality international conference or published in a recognized journal. A total of
16 presentations out of 22 submissions from 6 countries were included in the
MEMICS 2014 programme. Short abstracts of the accepted presentations also appear in these local proceedings.
All of the contributed papers were presented by PhD students who received
immediate feedback from their peers and the participating senior researchers.
All students were encouraged to actively take part in the discussions, express
their opinions, exchange ideas and compare methods, traditions and approaches
between groups and institutions whose representatives were participating in the
workshop.
The highlights of the MEMICS 2014 programme included six keynote lectures delivered by internationally recognized researchers. The full papers of
these keynote lectures were also included for publication in the LNCS post-proceedings. The speakers were:
– Gianni Antichi from University of Cambridge who gave a talk on An Open-Source Hardware Approach for High Performance Low-Cost QoS Monitoring of VoIP Traffic,
– Derek Groen from University College London who gave a talk on High-performance multiscale computing for modelling cerebrovascular bloodflow and nanomaterials,
– Jozef Ivanecký from European Media Laboratory who gave a talk on Today's Challenges for Embedded ASR,
– Daniel Lokshtanov from University of Bergen who gave a talk on Tree Decompositions and Graph algorithms,
– Michael Tautschnig from Queen Mary University of London who gave a talk on Automating Software Analysis at Very Large Scale, and
– Stefan Wörz from University of Heidelberg who gave a talk on 3D Model-Based Segmentation of 3D Biomedical Images.
The MEMICS tradition of best paper awards continued in 2014. The
best regular papers were selected during the workshop, taking into account their
scientific and technical contribution together with the quality of presentation.
The awards consisted of a diploma accompanied by a financial prize of roughly
400 Euro. The money was donated by Red Hat Czech Republic and by Y Soft,
two of the MEMICS 2014 Industrial Sponsors.
The successful organization of MEMICS 2014 would not have been possible without
generous help and support from the organizing institutions: Brno University of
Technology and Masaryk University in Brno.
We thank the Programme Committee members and the external reviewers
for their careful and constructive work. We thank the Organizing Committee
members who helped create a unique and relaxed atmosphere which distinguishes
MEMICS from other computer science meetings. We also gratefully acknowledge
the support of the EasyChair system and the great cooperation with the Lecture
Notes in Computer Science team of Springer Verlag.
Brno, October 2014
Petr Hliněný
General Chair of MEMICS 2014
Zdeněk Dvořák, Jiří Jaroš, Jan Kofroň,
Jan Kořenek, Petr Matula and Karel Pala
PC Track Chairs of MEMICS 2014
Organisation
General Chair
Petr Hliněný, Masaryk University, Brno, Czech Republic
Programme Committee Co-Chairs
Zdeněk Dvořák, Charles University, Czech Republic
Jiří Jaroš, Brno University of Technology, Czech Republic
Jan Kofroň, Charles University, Czech Republic
Jan Kořenek, Brno University of Technology, Czech Republic
Petr Matula, Masaryk University, Czech Republic
Karel Pala, Masaryk University, Czech Republic
Programme Committee
Gianni Antichi, University of Cambridge, UK
Tomáš Brázdil, Masaryk University, Czech Republic
Markus Chimani, Osnabrück University, Germany
Jan Černocký, Brno University of Technology, Czech Republic
Eva Dokladalova, ESIEE Paris, France
Jiří Filipovič, Masaryk University, Czech Republic
Robert Ganian, Vienna University of Technology, Austria
Dieter Gollmann, TU Hamburg, Germany
Derek Groen, University College London, UK
Juraj Hromkovič, ETH Zürich, Switzerland
Ondřej Jakl, VŠB-TU Ostrava, Czech Republic
Hidde de Jong, INRIA, France
Zdeněk Kotásek, Brno University of Technology, Czech Republic
Lukasz Kowalik, University of Warsaw, Poland
Hana Kubátová, Czech Technical University in Prague, Czech Republic
Michal Laclavík, Slovak Academy of Sciences, Bratislava, Slovakia
Markéta Lopatková, Charles University in Prague, Czech Republic
Julius Parulek, University of Bergen, Norway
Maciej Piasecki, Wroclaw University of Technology, Poland
Geraint Price, Royal Holloway, University of London, UK
Viktor Puš, CESNET, Czech Republic
Ricardo J. Rodríguez, Technical University of Madrid, Spain
Adam Rogalewicz, Brno University of Technology, Czech Republic
Cristina Seceleanu, MDH, Sweden
Jiří Srba, Aalborg University, Denmark
Andreas Steininger, TU Wien, Austria
Jan Strejček, Masaryk University, Czech Republic
David Šafránek, Masaryk University, Czech Republic
Ivan Šimeček, Czech Technical University in Prague, Czech Republic
Petr Švenda, Masaryk University, Czech Republic
Catia Trubiani, GSSI, Italy
Pavel Zemčík, Brno University of Technology, Czech Republic
Florian Zuleger, TU Wien, Austria
Steering Committee
Tomáš Vojnar, chair, Brno University of Technology, Brno, Czech Republic
Milan Češka, Brno University of Technology, Brno, Czech Republic
Zdeněk Kotásek, Brno University of Technology, Brno, Czech Republic
Mojmír Křetínský, Masaryk University, Brno, Czech Republic
Antonı́n Kučera, Masaryk University, Brno, Czech Republic
Luděk Matyska, Masaryk University, Brno, Czech Republic
Organizing Committee
Radek Kočí, chair, Brno University of Technology, Czech Republic
Zdeněk Letko, Brno University of Technology, Czech Republic
Jaroslav Rozman, Brno University of Technology, Czech Republic
Hana Pluháčková, Brno University of Technology, Czech Republic
Lenka Turoňová, Brno University of Technology, Czech Republic
Additional Reviewers
Kfir Barhum
Hans-Joachim Boeckenhauer
Yu-Fang Chen
Pavel Čeleda
Vojtěch Forejt
Lukáš Holík
Ivan Kolesár
Jan Křetínský
Sacha Krug
Julio Mariño
František Mráz
Mads Chr. Olesen
Jakub Pawlewicz
Martin Plátek
Fernando Rosa-Velardo
Václav Šimek
Marek Trtík
Table of Contents

Preface . . . V
Organisation . . . VII

I   Invited Lectures – Abstracts
II  Contributed Papers – Abstracts
III Regular Papers

Boosted Decision Trees for Behaviour Mining of Concurrent Programs . . . 15
  Renata Avros, Vendula Hrubá, Bohuslav Křena, Zdeněk Letko,
  Hana Pluháčková, Tomáš Vojnar, Zeev Volkovich, and Shmuel Ur
LTL model checking of Parametric Timed Automata . . . 28
  Peter Bezděk, Nikola Beneš, Jiří Barnat, and Ivana Černá
FPGA Accelerated Change-Point Detection Method for 100 Gb/s Networks . . . 40
  Tomáš Čejka, Lukáš Kekely, Pavel Benáček, Rudolf B. Blažek, and Hana Kubátová
Hardware Accelerated Book Handling with Unlimited Depth . . . 52
  Milan Dvořák, Tomáš Závodník, and Jan Kořenek
Composite Data Type Recovery in a Retargetable Decompilation . . . 63
  Dušan Kolář and Peter Matula
Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA . . . 77
  Vlastimil Košař and Jan Kořenek
Computational Completeness Resulting from Scattered Context Grammars Working Under Various Derivation Modes . . . 89
  Alexander Meduna and Ondřej Soukup
Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations . . . 101
  Jana Pazúriková and Luděk Matyska
A Case for a Multifaceted Fairness Model: An Overview of Fairness Methods for Job Queuing and Scheduling . . . 113
  Šimon Tóth

IV  Presentations

Fault Recovery Method with High Availability for Practical Applications . . . 127
  Jaroslav Borecký, Pavel Vít, and Hana Kubátová
Verification of Markov Decision Processes using Learning Algorithms . . . 128
  Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelík, Vojtěch Forejt,
  Jan Křetínský, Marta Kwiatkowska, David Parker, and Mateusz Ujma
CEGAR for Qualitative Analysis of Probabilistic Systems . . . 129
  Krishnendu Chatterjee, Martin Chmelík, and Przemyslaw Daca
From LTL to Deterministic Automata: A Safraless Compositional Approach . . . 130
  Javier Esparza and Jan Křetínský
Faster Existential FO Model Checking on Posets . . . 131
  Jakub Gajarský
Fully Automated Shape Analysis Based on Forest Automata . . . 132
  Lukáš Holík, Ondřej Lengál, Adam Rogalewicz, Jiří Šimáček, and Tomáš Vojnar
Multi-objective Genetic Optimization for Noise-Based Testing of Concurrent Software . . . 133
  Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, and Tomáš Vojnar
On Interpolants and Variable Assignments . . . 134
  Pavel Jančík, Jan Kofroň, Simone Fulvio Rollini, and Natasha Sharygina
Finding Terms in Corpora for Many Languages . . . 135
  Adam Kilgarriff, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel
Hereditary properties of permutations are strongly testable . . . 137
  Tereza Klimošová and Daniel Král'
Paraphrase and Textual Entailment Generation in Czech . . . 138
  Zuzana Nevěřilová
Minimizing Running Costs in Consumption Systems . . . 139
  Petr Novotný
Testing Fault-Tolerance Methodologies in Electro-mechanical Applications . . . 140
  Jakub Podivínský and Zdeněk Kotásek
A Simple and Scalable Static Analysis for Bound Analysis and Amortized Complexity Analysis . . . 141
  Moritz Sinn, Florian Zuleger, and Helmut Veith
Optimal Temporal Logic Control for Deterministic Transition Systems with Probabilistic Penalties . . . 143
  Mária Svoreňová, Ivana Černá, and Calin Belta
Understanding the Importance of Interactions among Job Scheduling Policies . . . 144
  Šimon Tóth and Dalibor Klusáček

Author Index . . . 147
Part I
Invited Lectures – Abstracts
Gianni Antichi, Computer Laboratory, University of Cambridge,
United Kingdom
Hardware accelerated networking systems: practice against theory
Computer networks are the hallmark of the 21st century’s society and underpin
virtually all infrastructures of the modern world. Building, running and maintaining enterprise networks is getting ever more complicated and difficult. Part of
the problem is related to the proliferation of real-time applications (voice, video,
gaming), which demand higher bandwidth and low-latency connections, pushing
network devices to work at higher speeds. In this scenario, hardware acceleration helps speed up time-critical operations. This talk will introduce the most common hardware-accelerated network processing operations, such as IP lookup, packet classification, and network monitoring, taking the widely used NetFPGA platform as a reference. We will present a list of technical challenges that must be addressed to transition from a simple layer-2 switching device, via a packet classifier, to a highly accurate network monitoring system.
Derek Groen, Centre for Computational Science, University College
London
High-performance multiscale computing for modelling
cerebrovascular bloodflow and nanomaterials
Stroke is a leading cause of adult disability and was responsible for 50,000 deaths in the UK in 2012. Brain haemorrhages are responsible for 15% of all strokes in the UK, and 50% of all strokes in children. Within UCL we have developed the HemeLB simulation environment to obtain a better understanding of brain haemorrhages and of blood flow in sparse geometries in general. In this talk
I will introduce HemeLB and summarize the many research efforts made around
it in recent years. I will present a cerebrovascular blood flow simulation which
incorporates input from the wider environment in a cerebrovascular network by
coupling a 1D discontinuous Galerkin model to a 3D lattice-Boltzmann model,
as well as several advances that we have made to improve the performance of our
code. These include vectorization of the code, improved domain decomposition
techniques and some preliminary results on using non-blocking collectives.
I will also present our ongoing work on clay-polymer nanocomposites where
we use a three-level multiscale scheme to produce a chemically-specific model
of clay-polymer nanocomposites. We applied this approach to study collections
of clay mineral tactoids interacting with two synthetic polymers, polyethylene
glycol and polyvinyl alcohol. The controlled behaviour of layered materials in
a polymer matrix is centrally important for many engineering and manufacturing applications. Our approach opens up a route to computing the properties
of complex soft materials based on knowledge of their chemical composition,
molecular structure and processing conditions.
Jozef Ivanecký, Stephan Mehlhase, European Media Laboratory,
Germany
Today's Challenges for Embedded ASR
Automatic Speech Recognition (ASR) is nowadays pervading areas that were unimaginable a few years ago. This progress over the past few years was achieved not because of improvements in the core embedded ASR technology, but mainly because of massive changes in the smartphone world as well as the availability of small, powerful, and affordable Linux-based hardware. These changes answered two important questions: 1. How to make ASR always available? 2. How to install a local, affordable ASR system almost anywhere? In recent years we can also observe a growth of freely available ASR systems with acceptable speed and accuracy. Together with the changes in the mobile world, it is possible to embed remote ASR into applications very quickly and without deep knowledge of speech recognition. What is the future of real embedded ASR systems in this case? The goal of this talk is to present two embedded ASR applications which would not be possible without the above-mentioned changes of recent years, and to point out their advantages in contrast to today's quick solutions. The first application demonstrates how changes in user behaviour allowed the design of a usable voice-enabled house control application accepted by all age groups. The second one focuses mainly on an extremely reliable in-car real-time speech recognition system which can also use remote ASR for some specific tasks.
Daniel Lokshtanov, University of Bergen, Norway
Tree Decompositions and Graph algorithms
A central concept in graph theory is the notion of tree decompositions - these are
decompositions that allow us to split a graph up into “nice” pieces by “small”
cuts. It is possible to solve many algorithmic problems on graphs by decomposing
the graph into “nice” pieces, finding a solution in each of the pieces, and then
gluing these solutions together to form a solution to the entire graph. Examples
of this approach include algorithms for deciding whether a given input graph is
planar, the k-Disjoint paths algorithm of Robertson and Seymour, as well as a
plethora of algorithms on graphs of bounded tree-width. By playing with the formal definition of "nice" one arrives at different kinds of decompositions, with different algorithmic applications. For example, graphs of bounded tree-width
are graphs that may be decomposed into “small” pieces by “small” cuts. The
structure theorem for minor-free graphs of Robertson and Seymour states that
minor-free graphs are exactly the graphs that may be decomposed by “small”
cuts into pieces that “almost” can be drawn on a surface of small genus. In this
talk we will ask the following lofty question: is it possible that every graph has
one, “nicest” tree decomposition which simultaneously decomposes the graph
into parts that are “as nice as possible” for any reasonable definition of nice?
And, if such a decomposition exists, how fast can we find it algorithmically?
Michael Tautschnig, University of London, United Kingdom
Automating Software Analysis at Very Large Scale
Actual software in use today is not known to follow any uniform normal distribution, whether syntactically—in the language of all programs described by the
grammar of a given programming language, or semantically—for example, in the
set of reachable states. Hence claims deduced from any given set of benchmarks
need not extend to real-world software systems.
When building software analysis tools, this affects all aspects of tool construction: starting from language front ends not being able to parse and process
real-world programs, over inappropriate assumptions about (non-)scalability, to
failing to meet actual needs of software developers.
To narrow the gap between real-world software demands and software analysis tool construction, an experiment using the Debian Linux distribution has
been set up. The Debian distribution presently comprises more than 22,000
source software packages. Focussing on C source code, more than 400 million
lines of code are automatically analysed in this experiment, resulting in a number of improvements in analysis tools on the one hand, but also more than 700
public bug reports to date.
Stefan Wörz, University of Heidelberg, Germany
3D Model-Based Segmentation of 3D Biomedical Images
A central task in biomedical image analysis is the segmentation and quantification of 3D image structures. A large variety of segmentation approaches have
been proposed including approaches based on different types of deformable models. A main advantage of deformable models is that they allow incorporating a
priori information about the considered image structures. In this contribution
we give a brief overview of often used deformable models such as active contour
models, statistical shape models, and analytic parametric models. Moreover, we
present in more detail 3D analytic parametric intensity models, which enable
accurate and robust segmentation and quantification of 3D image structures.
Such parametric models have been successfully used in different biomedical applications, for example, for the localization of 3D anatomical point landmarks in
3D MR and CT images, for the quantification of vessels in 3D MRA and CTA
images, as well as for the segmentation of cells and subcellular structures in 3D
microscopy images.
Part II
Contributed Papers – Abstracts
Petr Bauch, Vojtěch Havel, and Jiří Barnat, Masaryk University,
Brno, Czech Republic
LTL Model Checking of LLVM Bitcode with Symbolic Data
The correctness of parallel and reactive programs is often more easily specified using
formulae of temporal logics. Yet verifying that a system satisfies such specifications is more difficult than verifying safety properties: the recurrence of a specific
program state has to be detected. This paper reports on the development of a
generic framework for automatic verification of linear temporal logic specifications for programs in LLVM bitcode. Our method searches explicitly through
all possible interleavings of parallel threads (control non-determinism) but represents symbolically the variable evaluations (data non-determinism), guided by
the specification in order to prove the correctness. To evaluate the framework
we compare our method with state-of-the-art tools on a set of unmodified C
programs.
Stephan Beyer and Markus Chimani, University of Osnabrück,
Germany
Steiner Tree 1.39-Approximation in Practice
We consider the currently strongest Steiner tree approximation algorithm that
has recently been published by Goemans, Olver, Rothvoß and Zenklusen (2012).
It first solves a hypergraphic LP relaxation and then applies matroid theory
to obtain an integral solution. The cost of the resulting Steiner tree is at most
(1.39 + ε)-times the cost of an optimal Steiner tree where ε tends to zero as some
parameter k tends to infinity. However, the degree of the polynomial running
time depends on this constant k, so only small k are tractable in practice.
The algorithm has, to our knowledge, not been implemented and evaluated
in practice before. We investigate different implementation aspects and parameter choices of the algorithm and compare tuned variants to an exact LP-based
algorithm as well as to fast and simple 2-approximations.
Jan Fiedor, Zdeněk Letko, João Lourenço and Tomáš Vojnar, Brno
University of Technology, Czech Republic and Universidade Nova de
Lisboa, Portugal
On Monitoring C/C++ Transactional Memory Programs
Transactional memory (TM) is an increasingly popular technique for synchronising threads in multi-threaded programs. To address both correctness and
performance-related issues of TM programs, one needs to monitor and analyse
their execution. However, monitoring concurrent programs (including TM programs) can have a non-negligible impact on their behaviour, which may hamper the objectives of the intended analysis. In this paper, we propose several approaches for monitoring TM programs and study their impact on the behaviour
of the monitored programs. The considered approaches range from specialised
lightweight monitoring to generic heavyweight monitoring. The implemented
monitoring tools are publicly available for further applications, and the implementation techniques used for lightweight monitoring can be used as an inspiration for developing further specialised lightweight monitors.
Radek Hrbáček, Brno University of Technology, Czech Republic
Bent Functions Synthesis on Intel Xeon Phi Coprocessor
A new approach to synthesize bent Boolean functions by means of Cartesian
Genetic Programming (CGP) has been proposed recently. Bent functions have
important applications in cryptography due to their high nonlinearity. However,
they are very rare and their discovery using conventional brute force methods
is not efficient enough. In this paper, a new parallel implementation is proposed
and the performance is evaluated on the Intel Xeon Phi Coprocessor.
Vojtěch Nikl and Jiří Jaroš, Brno University of Technology, Czech
Republic
Parallelisation of the 3D Fast Fourier Transform Using the Hybrid
OpenMP/MPI Decomposition
The 3D fast Fourier transform (FFT) is the heart of many simulation methods.
Although the efficient parallelisation of the FFT has been deeply studied over the last few decades, many researchers have only focused on either pure message passing (MPI) or shared memory (OpenMP) implementations. Unfortunately, pure MPI approaches cannot exploit the shared memory within a cluster node, and OpenMP cannot scale over multiple nodes.
This paper proposes a 2D hybrid decomposition of the 3D FFT where the
domain is decomposed over the first axis by means of MPI while over the second
axis by means of OpenMP. The performance of the proposed method is thoroughly compared with state-of-the-art libraries (FFTW, PFFT, P3DFFT) on three supercomputer systems with up to 16k cores. The experimental results show that the hybrid implementation offers 10–20% higher performance and better scaling, especially for high core counts.
Juraj Nižnan, Radek Pelánek, and Jiří Řihák, Masaryk University,
Brno, Czech Republic
Mapping Problems to Skills Combining Expert Opinion and Student
Data
Construction of a mapping between educational content and skills is an important part of development of adaptive educational systems. This task is difficult,
requires a domain expert, and any mistakes in the mapping may hinder the
potential of an educational system. In this work we study techniques for improving a problem-skill mapping constructed by a domain expert using student
data, particularly problem solving times. We describe and compare different
techniques for the task – a multidimensional model of problem solving times
and supervised classification techniques. In the evaluation we focus on surveying
situations where the combination of expert opinion with student data is most
useful.
Karel Štěpka and Martin Falk, Masaryk University and Institute of
Biophysics of ASCR, Brno, Czech Republic
Image Analysis of Gene Locus Positions within Chromosome
Territories in Human Lymphocytes
One of the important areas of current cellular research with substantial impacts
on medicine is analyzing the spatial organization of genetic material within the
cell nuclei. Higher-order chromatin structure has been shown to play essential
roles in regulating fundamental cellular processes, like DNA transcription, replication, and repair. In this paper, we present an image analysis method for the
localization of gene loci with regard to chromosomal territories they occupy in
3D confocal microscopy images. We show that the segmentation of the territories
to obtain a precise position of the gene relative to a hard territory boundary may
lead to undesirable bias in the results; instead, we propose an approach based
on the evaluation of the relative chromatin density at the site of the gene loci.
This method yields softer, fuzzier “boundaries”, characterized by progressively
decreasing chromatin density. The method therefore focuses on the extent to
which the signals are located inside the territories, rather than a hard yes/no
classification.
Vladimír Štill, Petr Ročkai, and Jiří Barnat, Masaryk University,
Brno, Czech Republic
Context-Switch-Directed Verification in DIVINE
In model checking of real-life C and C++ programs, both search efficiency
and counterexample readability are very important. In this paper, we suggest
context-switch-directed exploration as a way to find a well-readable counterexample faster. Furthermore, we allow the number of context switches used in state-space exploration to be limited if desired. The new algorithm is implemented in the DIVINE model checker and enables both unbounded and bounded context-switch-directed exploration for models given in LLVM bitcode, which efficiently
allows for verification of multi-threaded C and C++ programs.
David Wehner, ETH Zürich, Switzerland
A New Concept in Advice Complexity of Job Shop Scheduling
In online scheduling problems, we want to assign jobs to machines while optimizing some given objective function. In the class we study in this paper, we
are given a number m of machines and two jobs that both want to use each of
the given machines exactly once in some predefined order. Each job consists of
m tasks and each task needs to be processed on one particular machine. The
objective is to assign the tasks to the machines while minimizing the makespan,
i.e., the processing time of the job that takes longer. In our model, the tasks arrive in consecutive time steps and an algorithm must assign a task to a machine
without having full knowledge of the order in which the remaining tasks arrive.
We study the advice complexity of this problem, which is a tool to measure
the amount of information necessary to achieve a certain output quality. A great
deal of research has been carried out in this field; however, this paper studies
the problem in a new setting. In this setting, the oracle does not know the exact
future anymore but only all possible future scenarios and their probabilities.
This way, the additional information becomes more realistic. We prove that the
problem is more difficult with this oracle than before.
Moreover, in job shop scheduling, we provide a lower bound of 1 + 1/(6√m) on the competitive ratio of any online algorithm with advice and prove an upper bound of 1 + 1/√m on the competitive ratio of an algorithm from Hromkovič et al.
Part III
Regular Papers
Boosted Decision Trees for Behaviour Mining of
Concurrent Programs
Renata Avros², Vendula Hrubá¹, Bohuslav Křena¹, Zdeněk Letko¹,
Hana Pluháčková¹, Tomáš Vojnar¹, Zeev Volkovich², and Shmuel Ur¹
¹ IT4Innovations Centre of Excellence, FIT, Brno University of Technology, Brno, CZ
  {ihruba, krena, iletko, ipluhackova, vojnar}@fit.vutbr.cz, [email protected]
² Ort Braude College of Engineering, Software Engineering Department, Karmiel, IL
  {r avros, vlvolkov}@braude.ac.il
Abstract. Testing of concurrent programs is difficult since the scheduling non-determinism requires one to test a huge number of different
thread interleavings. Moreover, a simple repetition of test executions
will typically examine similar interleavings only. One popular way to deal with this problem is to use the noise injection approach, which
is, however, parameterized with many parameters whose suitable values are difficult to find. In this paper, we propose a novel application of
classification-based data mining for this purpose. Our approach can identify which test and noise parameters are the most influential for a given
program and a given testing goal and which values (or ranges of values)
of these parameters are suitable for meeting this goal. We present experiments that show that our approach can indeed fully automatically
improve noise-based testing of particular programs with a particular testing goal. At the same time, we use it to obtain new general insights into
noise-based testing as well.
1 Introduction
Testing of concurrent programs is known to be difficult due to the many different interleavings of actions executed in different threads to be tested. A single
execution of available tests used in traditional unit and integration testing usually exercises a limited subset of all possible interleavings. Moreover, repeated
executions of the same tests in the same environment usually exercise similar
interleavings [2, 3]. Therefore, means for increasing the number of tested interleavings within repeated runs, such as deterministic testing [2], which controls
threads scheduling and systematically enumerates different interleavings, and
noise injection [3], which injects small delays or context switches into the running threads in order to see different scheduling scenarios, have been proposed
and applied in practice.
In order to measure how well a system under test (SUT) has been exercised
and hence to estimate how good a given test suite is, testers often collect and
analyse coverage metrics. However, one can gain a lot more information from the
test executions. One can, e.g., get information on similarities of the behaviour
witnessed through different tests, on the behaviour witnessed only within tests
that failed, and so on. Such information can be used to optimize the test suite, to
help debugging the program, etc. In order to get such information, data mining
techniques appear to be a promising tool.
In this paper, we propose a novel application of data mining allowing one
to exploit information present in data obtained from a sample of test runs of a
concurrent program to optimize the process of noise-based testing of the given
program. To be more precise, our method employs a data mining method based
on classification by means of decision trees and the AdaBoost algorithm. The
approach is, in particular, intended to find out which parameters of the available
tests and which parameters of the noise injection system are the most influential and which of their values (or ranges of values) are the most promising for
a particular testing goal for the given program.
The information obtained by our approach can certainly be quite useful since
the efficiency of noise-based testing heavily depends on a suitable setting of the
test and noise parameters, and finding such values is not easy [8]. That is why
repeated testing based on randomly chosen noise parameters is often used in
practice. Alternatively, one can try to use search techniques (such as genetic
algorithms) to find suitable test and noise settings [8, 7].
The classifiers obtained by our data mining approach can be easily used to
fully automatically optimize the most commonly used noise-based testing with
a random selection of parameter values. This can be achieved by simply filtering
out randomly generated noise settings that are not considered as promising by
the classifier. Moreover, it can also be used to guide and consequently speed up
the manual or search-based process of finding suitable values of test and noise
parameters (in the latter case, the search techniques would look for a suitable refinement of the knowledge obtained by data mining). Finally, if some of the noise
parameters or generic test parameters (such as the number of threads) appear
as important across multiple test cases and test goals, they can be considered
as important in general, providing a new insight into the process of noise-based
testing.
In order to show that the proposed approach can indeed be useful, we apply it
for optimizing the process of noise-based testing for two particular testing goals
on a set of several benchmark programs. Namely, we consider the testing goals
of reproducing known errors and covering rare interleavings which are likely to
hide so far unknown bugs. Our experimental results confirm that the proposed
approach can discover useful knowledge about the influence and suitable values of test and noise parameters, which we show in two ways: (1) We manually
analyse information hidden in the classifiers, compare it with our long-term experience from the field, and use knowledge found as important across multiple
case studies to derive some new recommendations for noise-based testing (which
are, of course, to be validated in the future on more case studies). (2) We show
that the obtained classifiers can be used—in a fully automated way—to significantly improve efficiency of noise-based testing using a random selection of test
and noise parameters.
Plan of the paper. The rest of the paper is structured as follows. Section 2
briefly introduces the techniques that our paper builds on, namely, noise-based
testing of concurrent programs, data mining based on decision trees, and the
AdaBoost algorithm. Section 3 presents our proposal of using data mining in
noise-based testing of concurrent programs. Section 4 provides results of our
experiments and presents the newly obtained insights of noise-based testing.
Section 5 summarizes the related work. Finally, Section 6 provides conclusions
and a discussion of possible future work.
2 Preliminaries
In our previous works, e.g., [8, 10], we have used noise injection to increase the
number of interleavings witnessed within the executions of a concurrent program
and thus to increase the chance of spotting concurrency errors. Noise injection
is a quite simple technique which disturbs thread scheduling (e.g., by injecting, removing, or modifying delays, forcing context switches, or halting selected
threads) with the aim of driving the execution of a program into less probable
scenarios.
The efficiency of noise injection highly depends on the type of the generated
noise, on the strength of the noise (which are both determined using some noise
seeding heuristics), as well as on the program locations and program executions
into which some noise is injected (which is determined using some noise placement heuristics). Multiple noise seeding and noise placement heuristics have
been proposed and experimentally evaluated [10]. Searching for an optimal configuration of noise seeding and noise placement heuristics in combination with
a selection of available test cases and their parameters has been formalized as
the test and noise configuration search problem (TNCS) in [7, 8].
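As a rough illustration of the idea (the actual tools referenced in [3, 10] target Java and are not reproduced here), the following Python sketch injects noise at an instrumented access to a shared variable; the parameter names mirror the noise frequency and strength parameters described later in Section 4.1 and are purely illustrative.

```python
import random
import threading
import time

def maybe_inject_noise(noise_freq, noise_strength, noise_type):
    """Conceptual noise-seeding step executed at an instrumented location.

    noise_freq: 0..1000, chance (in permille) of injecting noise here.
    noise_strength: 0..100, scales the injected delay.
    noise_type: "yield" offers a scheduling opportunity, "sleep" adds a delay.
    """
    if random.randint(1, 1000) > noise_freq:
        return                              # no noise at this location this time
    if noise_type == "yield":
        time.sleep(0)                       # let the scheduler pick another thread
    elif noise_type == "sleep":
        time.sleep(noise_strength / 1000.0) # delay proportional to the strength

# Example: a shared counter updated by two threads, with noise injected before
# a racy read-modify-write sequence to exercise more interleavings.
counter = 0

def worker():
    global counter
    for _ in range(1000):
        maybe_inject_noise(noise_freq=300, noise_strength=5, noise_type="sleep")
        tmp = counter                       # racy read
        counter = tmp + 1                   # racy write

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final counter:", counter)            # often < 2000 due to lost updates
```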
To assess how well tests examine the behaviour of an SUT, error manifestation ratio and coverage metrics can be used. Coverage metrics successfully used
for testing of sequential programs (like statement coverage) are not sufficient
for testing of concurrent programs as they do not reflect concurrent aspects of
executions. Concurrency coverage metrics [1] are usually tailored to distinguish
particular classes of interleavings and/or to capture synchronization events that
occur within the execution. Some of the metrics target concurrency issues from
a general point of view while some other metrics, e.g., those inspired by particular
dynamic detectors of concurrency errors [9], concentrate on selected concurrency
aspects only (e.g., on behaviours potentially leading to a deadlock or to a data
race). In this work, we, in particular, use the GoldiLockSC∗ coverage metric
which measures how many internal states of the GoldiLock data race detector
with the fast short circuit checks [5] have been reached [9].
The data mining approach proposed in this paper is based on binary classification. Binary classification problems consist in dividing items of a given collection into two groups using a suitable classification rule. Methods for learning
such classifiers include decision trees, Bayesian networks, support vector machines, or neural networks [12]. The use of decision trees is the most popular of
17
R. Avros et al.
those because they are known for quite some time and can be easily understood.
A decision tree can be viewed as a hierarchically structured decision diagram
whose nodes are labelled by Boolean conditions on the items to be classified
and whose leaves represent classification results. The decision process starts in
the root node by evaluating the condition associated with it on the item to be
classified. According to the evaluation of the condition, a corresponding branch
is followed into a child node. This descent, driven by the evaluation of the conditions assigned to the encountered nodes, continues until a leaf node, and hence
a decision, is reached. Decision trees are usually employed as a predictive model
constructed via a decision tree learning procedure which uses a training set of
classified items.
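As a minimal sketch of the classification procedure just described (the node structure and the example attribute name are illustrative, not taken from any concrete implementation):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Node:
    """A decision tree node: either an internal Boolean test or a leaf label."""
    condition: Optional[Callable[[Dict[str, float]], bool]] = None
    true_branch: Optional["Node"] = None
    false_branch: Optional["Node"] = None
    label: Optional[int] = None             # +1 / -1 at leaves

def classify(tree: Node, item: Dict[str, float]) -> int:
    """Descend from the root, following branches given by the conditions."""
    node = tree
    while node.label is None:
        node = node.true_branch if node.condition(item) else node.false_branch
    return node.label

# A height-one tree (a decision stump) testing a single noise parameter:
stump = Node(condition=lambda x: x["noise_freq"] < 275,
             true_branch=Node(label=+1),
             false_branch=Node(label=-1))

print(classify(stump, {"noise_freq": 120}))  # +1
print(classify(stump, {"noise_freq": 640}))  # -1
```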
In the paper, we—in particular—employ the advanced classification technique called Adaptive Boosting (shortly, AdaBoost) [6] which reduces the natural
tendency of decision trees to be unstable (meaning that a minor data oscillation can lead to a large difference in the classification). This technique makes it
possible to correct the functionality of many learning algorithms (so-called weak
learners) by weighting and mixing their outcomes in order to get the output of
the boosted classifier. The method works in iterations (phases). In each iteration, the method aims at producing a new weak classifier in order to improve
the consistency of the previously used ones. In our case, AdaBoost uses decision
trees as the weak learners with the classification result being −1 or +1. In each
phase, the algorithm adds new weighted decision trees obtained by concentrating
on items difficult to classify by the so far learnt classifier and updates weights
of the previously added decision trees to keep the sum of the weights equal to
one. The resulting advanced classifier then consists of a set of weighted decision
trees that are all applied on the item to be classified, their classification results
are weighted by the appropriate weights, summarized, and the sign of the result
provides the final decision.
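For concreteness, a compact NumPy sketch of this boosting scheme with decision stumps as weak learners follows; it uses the standard discrete AdaBoost formulation, so details such as the normalisation of the classifier weights may differ from the toolbox used in Section 4.

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a weighted decision stump: a threshold test on a single feature."""
    best = None
    _, d = X.shape
    for j in range(d):                      # candidate parameter
        for thr in np.unique(X[:, j]):      # candidate threshold
            for sign in (+1, -1):           # orientation of the test
                pred = np.where(X[:, j] < thr, sign, -sign)
                err = float(np.sum(w[pred != y]))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best                             # (weighted error, feature, thr, sign)

def adaboost(X, y, phases=10):
    """Discrete AdaBoost with stumps; y holds labels -1/+1."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                 # item weights, summing to one
    ensemble = []
    for _ in range(phases):
        err, j, thr, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        pred = np.where(X[:, j] < thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)          # emphasise misclassified items
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted sum of the weak classifiers' outputs."""
    X = np.asarray(X, dtype=float)
    score = np.zeros(len(X))
    for alpha, j, thr, sign in ensemble:
        score += alpha * np.where(X[:, j] < thr, sign, -sign)
    return np.where(score >= 0, +1, -1)
```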
3 Classification-based Data Mining in Noise-based Testing
In this section, we first propose our application of AdaBoost in noise-based
testing. Subsequently, we discuss how the information hidden in the classifier may
be analysed to draw some conclusions about which test and noise parameters are
important for particular test cases and test goals or even in general. Finally, we
describe two concrete classification properties that are used in our experiments.
3.1 An Application of AdaBoost in Noise-based Testing
First, in order to apply the proposed approach, one has to define some testing
goal expressible as a binary property that can be evaluated over test results such
that both positive and negative answers are obtained. The requirement of having
both positive and negative results can be a problem in some cases, notably in
the case of discovering rare errors. In such a case, one has to use a property that
18
Boosted Decision Trees for Behaviour Mining of Concurrent Programs
approximates the target property of interest (e.g., by replacing the discovery of
rare errors by discovering rare behaviours in general). Subsequently, once testing
based on settings chosen in this way manages to find some behaviours which were
originally not available (e.g., behaviours leading to a rare error), the process can
be repeated on the newly available test results to concentrate on a repeated
discovery of such behaviours (e.g., for debugging purposes or for the purpose of
finding further similar errors).
Once the property of interest is defined, a number of test runs is to be
performed using a random setting of test and noise parameters in each run.
For each such run, the property of interest is to be evaluated and a couple (x, y)
is to be formed where x is a vector recording the test and noise settings used
and y is the result of evaluating the property of interest. This process has to be
repeated to obtain a set X = {(x1, y1), . . . , (xn, yn)} of such couples to be used
as the input for learning the appropriate classifier.
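A sketch of this data-collection step is shown below; run_test, evaluate_property, and sample_configuration are hypothetical stand-ins for the actual test harness, the evaluation of the chosen property, and the random generator of test and noise settings (cf. Section 4.1).

```python
def collect_training_data(run_test, evaluate_property, sample_configuration,
                          n_runs=10000):
    """Build the set X = {(x1, y1), ..., (xn, yn)} of classified test runs.

    run_test, evaluate_property and sample_configuration are hypothetical
    stand-ins: they execute one noise-parameterised test run, evaluate the
    chosen binary property on its result, and draw one random setting.
    """
    data = []
    for _ in range(n_runs):
        x = sample_configuration()          # random test and noise setting
        result = run_test(x)                # one test execution with noise
        y = +1 if evaluate_property(result) else -1
        data.append((x, y))
    return data
```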
Now, the AdaBoost algorithm can be applied. For that, the common practice
is to split the set X to two sets—the training set and the testing set, use the
training set to get a classifier, and then use the testing set for evaluating the
precision of the obtained classifier. To evaluate the precision, one can use the
notions of accuracy and sensitivity. Accuracy gives the probability of a successful
classification and can be computed as the fraction of the number of correctly
classified items and the total number of items. Sensitivity (also called the
negative predictive value or NPV) expresses the fraction of correctly classified
negative results and can be computed as the number of the items correctly
classified negatively divided by the sum of correctly and incorrectly negatively
classified items (see e.g. [12]). Moreover, in order to increase confidence in the
obtained results, this process of choosing the training and validation set and of
learning and validating the classifier can be repeated several times, allowing one
to judge the average values and standard deviation of accuracy and sensitivity. If
the obtained classifier is not validated successfully, one can repeat the AdaBoost
algorithm with more boosting phases and/or a bigger set X of data.
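The two validation measures can be sketched as follows (assuming labels -1/+1 and a train_fn stand-in that learns a classifier returning such labels):

```python
import random

def accuracy_and_sensitivity(y_true, y_pred):
    """Accuracy and sensitivity (NPV) as defined above, for labels -1/+1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    negatives = [(t, p) for t, p in pairs if p == -1]        # classified negatively
    correct_neg = sum(1 for t, _ in negatives if t == -1)    # correctly so
    sensitivity = correct_neg / len(negatives) if negatives else 0.0
    return accuracy, sensitivity

def evaluate_repeatedly(data, train_fn, repetitions=100, train_frac=0.5):
    """Repeat the split/learn/validate cycle; report mean accuracy and NPV."""
    accs, npvs = [], []
    for _ in range(repetitions):
        random.shuffle(data)
        cut = int(train_frac * len(data))
        train, valid = data[:cut], data[cut:]
        classifier = train_fn(train)                  # e.g. the AdaBoost learner
        y_true = [y for _, y in valid]
        y_pred = [classifier(x) for x, _ in valid]
        acc, npv = accuracy_and_sensitivity(y_true, y_pred)
        accs.append(acc)
        npvs.append(npv)
    return sum(accs) / len(accs), sum(npvs) / len(npvs)
```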
A successfully validated classifier can subsequently be analysed to get some
insight which test and noise parameters are influential for testing the given program and which of their values are promising for meeting the defined testing goal.
Such a knowledge can then in turn be used by testers when thinking of how to
optimize the testing process. We discuss a way how such an analysis can be done
in Section 3.2 and we apply it in Section 4.3. Moreover, the obtained classifier
can also be directly used to improve performance of noise-based testing based
on random selection of parameters by simply filtering out the settings that get
classified as not meeting the considered testing goal. The fact that such an approach does indeed significantly improve the testing process is experimentally
confirmed in Section 4.4.
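A sketch of this filtering scheme, assuming a learnt classifier that maps a setting to +1 (promising) or -1 (not promising) and a hypothetical sample_configuration generator:

```python
def filtered_random_settings(sample_configuration, classifier, n_wanted):
    """Keep only randomly generated settings that the learnt classifier
    predicts to meet the testing goal (class +1); discard the rest."""
    kept = []
    while len(kept) < n_wanted:
        x = sample_configuration()      # hypothetical random-setting generator
        if classifier(x) == +1:         # classifier learnt as described above
            kept.append(x)
    return kept
```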
3.2 Analysing Information Hidden in Classifiers
In order to be able to easily analyse the information hidden in the classifiers generated by AdaBoost, we have decided to restrict the height of the basic decision trees used as weak classifiers to one. Moreover, our preliminary experiments
showed us that increasing the height of the weak classifiers does not lead to
significantly better classification results.
A decision tree of height one consists of a root labelled by a condition concerning the value of a single test or noise parameter and two leaves corresponding
to positive and negative classification. AdaBoost provides us with a set of such
trees, each with an assigned weight. We convert this set of trees into a set of
rules such that we get a single rule for each parameter that appears in at least
one decision tree. The rules consist of a condition and a weight, and they are
obtained as follows. First, decision trees with negative weights are omitted because they correspond to weak classifiers with the weighted error greater than
0.5.³ Next, the remaining decision trees are grouped according to the parameter
about whose value they speak. For each group of the trees, a separate rule is produced such that the conjunction of the decision conditions of the trees from the
group is used as the condition of the rule. The weight of the rule is computed by
summarising the weights of the trees from the concerned group and normalising
the result by dividing it by the sum of the weights of all trees from all groups.
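The rule-extraction procedure can be sketched as follows; the stump weights in the example are illustrative and chosen only so that the normalised rule weights resemble the Animator row of Table 2.

```python
from collections import defaultdict

def stumps_to_rules(stumps):
    """Convert weighted height-one trees into one rule per parameter.

    Each stump is a triple (weight, parameter, condition), e.g.
    (2.3, "x3", "2.5 < x3 < 3.5"); the weights here are illustrative.
    """
    kept = [s for s in stumps if s[0] > 0]        # drop negatively weighted stumps
    groups = defaultdict(list)
    for weight, param, cond in kept:              # group stumps by parameter
        groups[param].append((weight, cond))
    total = sum(weight for weight, _, _ in kept)
    rules = []
    for param, members in groups.items():
        condition = " and ".join(cond for _, cond in members)   # conjunction
        weight = sum(w for w, _ in members) / total             # normalised weight
        rules.append((param, condition, round(weight, 2)))
    return sorted(rules, key=lambda r: -r[2])

# Illustrative stump weights chosen so that the normalised rule weights come out
# close to the Animator row of Table 2 (0.55, 0.26, 0.19):
print(stumps_to_rules([(0.8, "x1", "705 < x1"),
                       (2.3, "x3", "2.5 < x3 < 3.5"),
                       (1.1, "x6", "x6 < 0.5")]))
```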
The obtained set of rules can be easily used to gain some insight into how the
test and noise injection parameters should be set in order to increase efficiency
of the testing process—either for a given program and testing goal or even in
general. In particular, one can look for parameters that appear in rules with the
highest weights (which speak about parameters whose correct setting is the most
important to achieve the given testing goal), for parameters that are important in
all or many test cases (and hence can be considered to be important in general),
as well as for parameters that do not appear in any rules (and hence appear to
be irrelevant).
3.3 Two Concrete Classification Properties
In the experiments described in the next section, we consider two concrete properties according to which we classify test runs. First, we consider the case of
finding TNCS solutions suitable for repeatedly finding known errors. In this
case, the property of interest is simply the error manifestation property that
indicates whether an error manifested during the test execution or not.
Subsequently, we consider the case of finding TNCS solutions suitable for
testing rare behaviours in which so far unknown bugs might reside. In order
to achieve this goal, we use classification according to a rare events property
that indicates whether a test execution covers at least one rare coverage task
of a suitable coverage metric—in our experiments, the GoldiLockSC∗ is used for
this purpose. To distinguish rare coverage tasks, we collect the tasks that were
covered in at least one of the performed test runs (i.e., both from the training
and validation sets), and for each such coverage task, we count the frequency of
its occurrence in all of the considered runs. We define the rare tasks as those
that occurred in less than 1 % of the test executions.
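A sketch of how the rare coverage tasks and the rare events property can be computed from per-run coverage data (the data layout is an assumption, not the paper's actual implementation):

```python
from collections import Counter

def rare_tasks(runs_coverage, threshold=0.01):
    """Coverage tasks covered in less than 1 % of all performed test runs.

    runs_coverage: one set of covered coverage tasks per test run (assumed layout).
    """
    counts = Counter(task for run in runs_coverage for task in run)
    n_runs = len(runs_coverage)
    return {task for task, c in counts.items() if c / n_runs < threshold}

def rare_events_property(run_tasks, rare):
    """A run satisfies the property if it covers at least one rare task."""
    return bool(run_tasks & rare)
```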
³ Note that the AdaBoost methodology suggests that the employed weak classifiers should not be of this kind, but they can appear in practical applications.
4 Experimental Evaluation
In this section, we first describe the test data which we used for an experimental
evaluation of our approach. Then, we describe the precision of the classifiers
inferred from this data. Subsequently, we analyse the knowledge hidden in the
classifiers, compare it with our previously obtained experience, and derive some
new insights about the importance of the different test and noise parameters. Finally, we demonstrate that the use of the proposed data mining approach does
indeed improve (in a fully automated way) the process of noise-based testing
with random setting of the parameters.
4.1 Experimental Data
The results presented below are based on 5 multi-threaded benchmark programs
that contain a known concurrency error. We use data collected during our previous work [7]. Namely, our case studies are the Airlines (0.3 kLOC), Animator
(1.5 kLOC), Crawler (1.2 kLOC), Elevator (0.5 kLOC), and Rover (5.4 kLOC).
For each program, we collected data from 10,000 executions with a random test
and noise injection setting. We collected various data about the test executions,
such as whether an error occurred during the execution (used as our error manifestation property) and various concurrency coverage information, including the
GoldiLockSC∗ coverage used for evaluating the rare events property.
In our experiments, we consider vectors of test and noise parameters having
12 entries, i.e., x = (x1, x2, . . . , x12). Here, x1 ∈ {0, . . . , 1000} represents the
noise frequency which controls how often the noise is injected and ranges from 0
(never) to 1000 (always). The x2 ∈ {0, . . . , 100} parameter controls the amount
of injected noise and ranges from 0 (no noise) to 100 (considerable noise). The
x3 ∈ {0, . . . , 5} parameter selects one of six available basic noise injection heuristics (based on injecting calls of yield(), sleep(), wait(), using busy waiting,
a combination of additional synchronization and yield(), and a mixture of these
techniques). The x4, x5, x7, x8, x9 ∈ {0, 1} parameters enable or disable the advanced injection heuristics haltOneThread, timeoutTampering, nonVariableNoise, advSharedVarNoise1, and advSharedVarNoise2, respectively. The x6 ∈ {0, 1, 2} parameter controls how the sharedVarNoise advanced heuristic behaves (namely, whether it is disabled (0), injects the noise
at accesses to one randomly selected shared variable (1) or at accesses to all
such variables (2)). A more detailed description of the particular noise injection
heuristics can be found in [3, 7, 8, 10].
Furthermore, x10 ∈ {1, . . . , 10} and x11, x12 ∈ {1, . . . , 100} encode parameters of some of the test cases themselves. In particular, Animator and Crawler are not parametrised, and x10, x11, x12 are not used with them. In the Airlines
and Elevator test cases, the x10 parameter controls the number of used threads,
and in the Rover test case, the x10 ∈ {0, . . . , 6} parameter selects one of the
available test scenarios. The Airlines test case is the only one that uses the x11
and x12 parameters, which are in particular used to control how many cycles the
test does.
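To make the encoding concrete, the following sketch draws one random configuration with the ranges listed above; the helper name is hypothetical, and the mapping of the binary flags to heuristics follows the description in this section.

```python
import random

def sample_configuration(test_case):
    """Draw one random vector x of test and noise parameters, using the ranges
    listed above; x10-x12 are only meaningful for the parametrised test cases."""
    x = {
        "x1": random.randint(0, 1000),   # noise frequency
        "x2": random.randint(0, 100),    # amount of injected noise
        "x3": random.randint(0, 5),      # basic noise injection heuristic
        "x4": random.randint(0, 1),      # haltOneThread
        "x5": random.randint(0, 1),      # timeoutTampering
        "x6": random.randint(0, 2),      # sharedVarNoise: off / one variable / all
        "x7": random.randint(0, 1),      # nonVariableNoise
        "x8": random.randint(0, 1),      # advSharedVarNoise1
        "x9": random.randint(0, 1),      # advSharedVarNoise2
    }
    if test_case == "Rover":
        x["x10"] = random.randint(0, 6)        # selected test scenario
    elif test_case in ("Airlines", "Elevator"):
        x["x10"] = random.randint(1, 10)       # number of threads
    if test_case == "Airlines":
        x["x11"] = random.randint(1, 100)      # numbers of test cycles
        x["x12"] = random.randint(1, 100)
    return x
```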
Table 1. Average accuracy and sensitivity of the learnt AdaBoost classifiers.

                        Error manifestation                  Rare behaviours
             Accuracy            Sensitivity        Accuracy            Sensitivity
Case study   Mean      Std       Mean      Std      Mean      Std       Mean      Std
Airlines     0.7695    0.0086    0.6229    0.0321   0.9755    0.0056    0.9964    0.0021
Animator     0.937     0.0054    0.9866    0.0052   0.7815    0.0054    0.9071    0.0217
Crawler      0.9975    0.00076   0.999     0.00077  0.7642    0.0402    0.9741    0.0765
Elevator     0.8335    0.0038    0.9982    0.0016   0.6566    0.0051    0.6131    0.027
Rover        0.9714    0.0031    0.9912    0.0012   0.8737    0.1092    0.9687    0.137
Table 2. Inferred weighted rules for the error manifestation classification property.

Airlines
  Rules:   x1 < 275 | x3 < 0.5 or 3.5 < x3 | x6 < 1.5 | 2.5 < x10 | 73.5 < x12
  Weights: 0.16     | 0.50                 | 0.04     | 0.18      | 0.12
Animator
  Rules:   705 < x1 | 2.5 < x3 < 3.5 | x6 < 0.5
  Weights: 0.19     | 0.55           | 0.26
Crawler
  Rules:   x1 < 215 | 15 < x2 | 1.5 < x3 < 3.5 or 4.5 < x3 | 0.5 < x4 | x5 < 0.5 | x6 < 1.5
  Weights: 0.32     | 0.1     | 0.38                       | 0.05     | 0.08     | 0.07
Elevator
  Rules:   x1 < 5 | x3 < 0.5 or 3.5 < x3 < 4.5 | x7 < 0.5 | 8.5 < x10
  Weights: 0.93   | 0.04                       | 0.01     | 0.02
Rover
  Rules:   515 < x1 | 2.5 < x3 < 3.5 | 0.5 < x4 | x6 < 0.5
  Weights: 0.21     | 0.48           | 0.08     | 0.23
4.2 Precision of the Classifiers
In our experiments, we used the implementation of AdaBoost available in the
GML AdaBoost Matlab Toolbox⁴. We have set it to use decision trees of height
restricted to one and to use 10 boosting phases. The algorithm was applied
100 times on randomly chosen divisions of the test data into the training and
validation groups.
Table 1 summarises the average accuracy and sensitivity of the learnt AdaBoost classifiers. One can clearly see that both the average accuracy and sensitivity are quite high, ranging from 0.61 to 0.99. Moreover, the standard deviation
is very low in all cases. This indicates that we always obtained results that provide meaningful information about our test runs.
4.3 Analysis of the Knowledge Hidden in the Obtained Classifiers
We now employ the approach described in Section 3.2 to interpret the knowledge hidden in the obtained classifiers. Tables 2 and 3 show the inferred rules
⁴ http://graphics.cs.msu.ru/en/science/research/machinelearning/AdaBoosttoolbox
and their weights for the error manifestation property and the rare behaviours
property, respectively. For each test case, the tables contain a row whose upper
part contains the condition of the rule (in the form of interval constraints) and
the lower part contains the appropriate weight from the interval (0, 1).
In order to interpret the obtained rules, we first focus on rules with the
highest weights (corresponding to parameters with the biggest influence). Then
we look at the parameters which are present in rules across the test cases (and
hence seem to be important in general) and parameters that are specific for
particular test cases only. Next, we pinpoint parameters that do not appear in
any of the rules and therefore seem to be of a low relevance in general.
As for the error manifestation property (i.e., Table 2), the most influential
parameters are x3 in four of the test cases and x1 in the Crawler test case. This
indicates that the selection of a suitable noise type (x3) or noise frequency (x1) is the most important decision to be made when testing these programs with the
aim of reproducing the errors present in them. Another important parameter is
x6 controlling the use of the sharedVarNoise heuristic. Moreover, the parameters
x1, x3, and x6 are considered important in all of the rules, which suggests that,
for reproducing the considered kind of errors, they are of a general importance.
In two cases (namely, Crawler and Rover), the advanced haltOneThread
heuristic (x4 ) turns out to be important. In the Crawler and Rover test cases,
this heuristic should be enabled in order to detect an error. This behaviour fits
into our previous results [10] in which we show that, in some cases, this unique
heuristic (the only heuristic which allows one to exercise thread interleavings
which are normally far away from each other) considerably contributes to the
detection of an error. Finally, the presence of the x10 and x12 parameters in the
rules derived for the Airlines test case indicates that the number of threads (x10 )
and the number of cycles executed during the test (x12 ) plays an important role
in the noise-based testing of this particular test case. The x10 parameter (i.e.,
the number of threads) turns out to be important for the Elevator test case too,
indicating that the number of threads is of a more general importance.
Finally, we can see that the x8 , x9 and x11 parameters are not present in any
of the derived rules. This indicates that the advSharedVarNoise noise heuristics
are of a low importance in general, and the x11 parameter specific for Airlines
is not really important for finding errors in this test case.
For the case of classifying according to the rare behaviour property, the
obtained rules are shown in Table 3. We can again find the highest weights in
rules based on the x3 parameter (Animator, Crawler, Rover ) and on the x1
parameter (Airlines). However, in the case of Elevator, the most contributing
parameter is now the number of threads used by the test (x10 ). The rule suggests using certain numbers of threads in order to spot rare behaviours (i.e., it is important to consider not only a high number of threads). The generated sets of
rules often contain the x3 parameter controlling the type of noise (all test cases
except for Airlines) and the x6 parameter which controls the sharedVarNoise
heuristic. These parameters thus appear to be of a general importance in this
case.
Table 3. Inferred weighted rules for the rare behaviours classification property.

Airlines
  Rules:   x1 < 295 or 745 < x1 < 925;  x2 < 35;  0.5 < x5;  61.5 < x12 < 91.5
  Weights: 0.52;  0.06;  0.1;  0.32
Animator
  Rules:   0.5 < x3 < 3.5 or 4.5 < x3;  0.5 < x6 < 1.5
  Weights: 0.8;  0.2
Crawler
  Rules:   0.5 < x3 < 3.5 or 4.5 < x3;  0.5 < x4;  0.5 < x5;  0.5 < x6 < 1.5
  Weights: 0.46;  0.08;  0.2;  0.26
Elevator
  Rules:   0.5 < x3 < 3.5 or 4.5 < x3;  0.5 < x4;  0.5 < x5;  1.5 < x6;  1.5 < x10 < 4.5 or 7.5 < x10
  Weights: 0.22;  0.05;  0.2;  0.06;  0.47
Rover
  Rules:   2.5 < x3 < 3.5 or 4.5 < x3;  x4 < 0.5;  x6 < 0.5;  0.5 < x7
  Weights: 0.46;  0.26;  0.16;  0.12
Next, the parameter x12 does again turn out to be important in the Airlines
test case, and the x10 parameter is important in the Elevator test case. This
indicates that even for testing rare behaviours, it is important to adjust the
number of threads or test cycles to suitable values. Finally, the x8 , x9 , and
x11 parameters do not show up in any of the rules, and hence seem to be of
a low importance in general for finding rare behaviours (which is the same as
for reproduction of known errors).
Overall, the obtained results confirmed some of the facts we discovered during our previous experimentation such as that different goals and different test
cases may require a different setting of noise heuristics [10, 7, 8] and that the
haltOneThread noise injection heuristic (x4 ) provides in some cases a dramatic
increase in the probability of spotting an error [10]. More importantly, the analysis revealed (in an automated way) some new knowledge as well. Mainly, the type
of noise (x3 ) and the setting of the sharedVarNoise heuristic (x6 ) as well as the
frequency of noise (x1 ) are often the most important parameters (although the
importance of x1 seems to be a bit lower). Further, it appears to be important
to suitably adjust the number of threads (x10 ) whenever that is possible.
4.4 Improvement of Noise-based Testing with Random Parameters
Finally, we show that the obtained classifiers can be used to fully automatically
improve the process of noise-based testing with randomly chosen values of parameters. For that, we reuse 7,500 out of the 10,000 test runs recorded with random parameter values for each of the case studies. In particular, we randomly choose 2,500 test runs as a training set for our AdaBoost approach to
produce classifiers. Then, from the rest of the test runs, we randomly choose
5,000 test runs to compare our approach with the random approach.
From these 5,000 test runs, we first select runs that were performed using
settings considered as suitable for the respective testing goals by the classifiers
Table 4. A comparison of the random approach and the newly proposed AdaBoost approach.

                    Error manifestation                 Rare behaviours
Case studies   Rand.   AdaBoost   Pos.    Impr.     Rand.   AdaBoost   Pos.    Impr.
Airlines       56.26   75.43      1,612   1.34      1.94    1.64       2,444   0.85
Animator       14.81   54.05        901   3.65      39.53   57.95      3,258   1.47
Crawler         0.18    0.25      2,806   1.39      22.41   31.26      1,513   1.39
Elevator       16.75   27.66      1,410   1.65      52.77   59.51      1,398   1.13
Rover           6.65   36.25        822   5.45      10.76   23.21      1,620   2.16
that we have obtained. Then, we compute what fractions of all the runs and
what fractions of all the selected runs satisfy the testing goals for the considered
case studies, which shows us the efficiency of the different testing approaches.
In Table 4, the columns Pos. contain the numbers of test runs (out of the
considered 5,000 runs) classified positively by the obtained classifiers for the two
considered test goals. The columns Rand. give the percentage of runs out of
the 5,000 runs performed under purely randomly chosen values of parameters
that met the considered testing goals (i.e., found an error or a rare behaviour,
respectively). The columns AdaBoost give this percentage for the selected runs
(i.e., those whose number is in the columns Pos.). Finally, the columns Impr.
present how many times the efficiency of testing with the selected values of
parameters is better than that of purely random noise-based testing (i.e., it
contains the ratio of the values in the AdaBoost and Rand. columns).
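For instance, for the Rover test case and the error manifestation property, 36.25 % of the runs selected by the classifier manifested the error, compared to 6.65 % of the purely random runs, which gives the reported improvement of 36.25/6.65 ≈ 5.45.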
The improvement columns clearly show that our AdaBoost technique often
brings an improvement (with one exception described below), which ranges from
1.13 times in the case of the rare behaviours property and the Elevator test case
to 5.45 times in the case of the error manifestation property and the Rover
test case. In the case of the Airlines test case and the rare behaviours property,
our technique provided worse results (impr. 0.85). This is mostly caused by the
simplicity of the case study and hence lack of rare behaviours in the test runs.
Therefore, our approach did not have enough samples to construct a successful
classifier. Nevertheless, we can conclude that our classification approach can
really improve the efficiency of testing in the majority of the studied cases.
5 Related Work
Most of the existing works on obtaining new knowledge from multiple test runs
of concurrent programs focus on gathering debugging information that helps to
find the root cause of a failure [4, 11]. In [11], a machine learning algorithm is used
to infer points in the execution such that the error manifestation probability is
increased when noise is injected into them. It is then shown that such places are
often involved in the erroneous behaviour of the program. Another approach [4]
uses a data mining-like technique, more precisely, the feature selection algorithm,
to infer a reduced call graph representation of the SUT, which is then used to
discover anomalies in the behaviour of the SUT within erroneous executions.
There is also rich literature and tool support for data mining test results
without a particular emphasis on concurrent programs. The existing works study
different aspects of testing, including identification of test suite weaknesses [1]
and optimisation of the test suite [13]. In [1], a substring hole analysis is used to
identify sets of untested behaviours using coverage data obtained from testing
of large programs. Contrary to the analysis of what is missing in coverage data
and what should be covered by improving the test suite, other works focus on
what is redundant. In [13], a clustering data mining technique is used to identify
tests which exercise similar behaviours of the program. The obtained results are
then used to prioritise the available tests.
6 Conclusions and Future Work
In the paper, we have proposed a novel application of classification-based data
mining in the area of noise-based testing of concurrent programs. In particular,
we proposed an approach intended to identify which of the many noise parameters and possibly also parameters of the tests themselves are important for a
particular testing goal as well as which values of these parameters are suitable
for meeting this goal. As we have demonstrated on a number of case studies, the
proposed approach can be used to fully automatically improve the noise-based
testing approach of a particular program with a particular testing goal. Moreover, we have also used our approach to derive new insights into the noise-based
testing approach itself.
Apart from validating our findings on more case studies, there is plenty of
space for further research in the area of applications of data mining in testing
of concurrent programs. One can ask many interesting questions and search for
the answers using different techniques, such as outlier detection, clustering,
association rules mining, etc. For example, many of the concurrency coverage
metrics based on dynamic detectors contain a lot of information on the behaviour
of the tested programs, and when mined, this information could be used for
debugging purposes.
Acknowledgement. The work was supported by the bi-national Czech-Israel
project (Kontakt II LH13265 by the Czech Ministry of Education and 3-10371
by Ministry of Science and Technology of Israel), the EU/Czech IT4Innovations
Centre of Excellence project CZ.1.05/1.1.00/02.0070, and the internal BUT
projects FIT-S-12-1 and FIT-S-14-2486 . Z. Letko was funded through the EU/
Czech Interdisciplinary Excellence Research Teams Establishment project
CZ.1.07/2.3.00/30.0005.
References
1. Yoram Adler, Noam Behar, Orna Raz, Onn Shehory, Nadav Steindler, Shmuel Ur,
and Aviad Zlotnick. Code Coverage Analysis in Practice for Large Systems. In
Proc. of ICSE’11, pages 736–745. ACM, 2011.
2. Thomas Ball, Sebastian Burckhardt, Katherine E. Coons, Madanlal Musuvathi,
and Shaz Qadeer. Preemption Sealing for Efficient Concurrency Testing. In Proc.
of TACAS’10, volume 6015 of LNCS, pages 420–434. Springer-Verlag, 2010.
3. Orit Edelstein, Eitan Farchi, Evgeny Goldin, Yarden Nir, Gil Ratsaby, and Shmuel
Ur. Framework for Testing Multi-threaded Java Programs. Concurrency and Computation: Practice and Experience, 15(3-5):485–499. Wiley, 2003.
4. Frank Eichinger, Victor Pankratius, Philipp W. L. Große, and Klemens Böhm.
Localizing Defects in Multithreaded Programs by Mining Dynamic Call Graphs.
In Proc. of TAIC PART’10, volume 6303 of LNCS, pages 56–71. Springer-Verlag,
2010.
5. Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. Goldilocks: A Race and
Transaction-aware Java Runtime. In Proc. of PLDI’07, pages 245–255. ACM,
2007.
6. Yoav Freund and Robert E. Schapire. A Short Introduction to Boosting. In
Proc. of IJCAI’99, pages 1401–1406. Morgan Kaufmann, 1999.
7. Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, and Tomáš Vojnar. Multi-objective Genetic Optimization for Noise-based Testing of Concurrent
Software. In Proc. of SSBSE’14, volume 8636 of LNCS, pages 107–122. Springer-Verlag, 2014.
8. Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Shmuel Ur, and Tomáš Vojnar.
Testing of Concurrent Programs Using Genetic Algorithms. In Proc. of SSBSE’12,
volume 7515 of LNCS, pages 152–167. Springer-Verlag, 2012.
9. Bohuslav Křena, Zdeněk Letko, and Tomáš Vojnar. Coverage Metrics for
Saturation-based and Search-based Testing of Concurrent Software. In Proc. of
RV’11, volume 7186 of LNCS, pages 177–192. Springer-Verlag, 2012.
10. Zdeněk Letko, Tomáš Vojnar, and Bohuslav Křena. Influence of Noise Injection
Heuristics on Concurrency Coverage. In Proc. of MEMICS’11, volume 7119 of
LNCS, pages 123–131. Springer-Verlag, 2012.
11. Rachel Tzoref, Shmuel Ur, and Elad Yom-Tov. Instrumenting Where It Hurts:
An Automatic Concurrent Debugging Technique. In Proc. of ISSTA’07, pages
27–38. ACM, 2007.
12. Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 3rd edition, 2011.
13. Shin Yoo, Mark Harman, Paolo Tonella, and Angelo Susi. Clustering Test Cases
to Achieve Effective and Scalable Prioritisation Incorporating Expert Knowledge.
In Proc. of ISSTA’09, pages 201–212. ACM, 2009.
LTL Model Checking of Parametric Timed
Automata
Peter Bezděk, Nikola Beneš⋆ , Jiří Barnat⋆⋆ , and Ivana Černá⋆⋆
Faculty of Informatics, Masaryk University, Brno, Czech Republic
{xbezdek1,xbenes3,barnat,cerna}@fi.muni.cz
Abstract. The parameter synthesis problem for timed automata is undecidable in general even for very simple reachability properties. In this
paper we introduce restrictions on parameter valuations under which the
parameter synthesis problem is decidable for LTL properties. The proposed problem could be solved using an explicit enumeration of all possible parameter valuations. However, we introduce a symbolic zone-based
method for synthesising bounded integer parameters of parametric timed
automata with an LTL specification. Our method extends the ideas of
the standard automata-based approach to LTL model checking of timed
automata. Our solution employs constrained parametric difference bound
matrices and a suitable notion of extrapolation.
1 Introduction
Model checking [1] is a formal verification technique applied to check for logical
correctness of discrete distributed systems. While it is often used to prove the
unreachability of a bad state (such as an assertion violation in a piece of code),
with a proper specification formalism, such as the Linear Temporal Logic (LTL),
it can also check for many interesting liveness properties of systems, such as
repeated guaranteed response, eventual stability, live-lock, etc.
Timed automata have been introduced in [2] and have emerged as a useful
formalism for modelling time-critical systems as found in many embedded and
cyber-physical systems. The formalism is built on top of the standard finite automata enriched with a set of real-time clocks and allowing the system actions
to be guarded with respect to the clock valuations. In the general case, such
a timed system exhibits infinite-state semantics (the clock domains are continuous). Nevertheless, when the guards are limited to comparing clock values with
integers only, there exists a bisimilar finite state representation of the original
infinite-state real-time system referred to as the region abstraction. A practically
efficient abstraction of the infinite-state space came with the so called zones [3].
The zone-based abstraction is much coarser and the number of zones reachable from the initial state is significantly smaller. This in turn allows for an efficient implementation of verification tools for timed automata, see e.g. UPPAAL [4].
⋆ The author has been supported by the MEYS project No. CZ.1.07/2.3.00/30.0009 Employment of Newly Graduated Doctors of Science for Scientific Excellence.
⋆⋆ The authors have been supported by the MEYS project No. LH11065 Control Synthesis and Formal Verification of Complex Hybrid Systems.
Very often the correctness of a time-critical system relates to a proper timing,
i.e. it does not only depend on the logical result of the computation, but also on
the time at which the results are produced. To that end the designers are not only
in the need of tools to verify correctness once the system is fully designed, but
also in the need of tools that would help them to derive proper time parameters
of individual system actions that would make the system as a whole satisfy
the required specification. After all, this problem of parameter synthesis is often more urgent in practice than the verification as such.
The problem of the existence of a parameter valuation for a reachability
property of a parametric timed automaton has been shown to be undecidable
in [5] for a parametric timed automaton with as few as 3 clocks.
To obtain a decidable problem we need to restrict parameter valuations to
bounded integers. When modelling a real-time system, designers can usually provide practical bounds on time parameters of individual system actions. Therefore, introducing a parameter synthesis method with such a restriction is still
reasonable.
Our goal is to solve the parameter synthesis problem for linear time properties over parametric timed automata where the parameter valuation function is
restricted to bounded range over integer values. As part of our goal, we propose
a solution that avoids the parameter scan approach in order to provide a potentially more efficient method. To that end we introduce a finite abstraction over
parametric difference bound matrices, which allows us to deploy our solution
based on a zone abstraction.
An extension of the model checker Uppaal, capable of synthesising linear
parameter constraints for correctness of parametric timed automata has been
described in [6] together with a subclass of parametric timed automata, for
which the emptiness problem is decidable. In [7], the authors show that the problem
of the existence of bounded integer parameter values such that some TCTL
property is satisfied is PSPACE-complete. They also give symbolic algorithms
for reachability and unavoidability properties.
Contribution We show how to apply the standard automata-based approach
to LTL model checking of Vardi and Wolper [8] in the context of an LTL formula,
a parametric timed automaton and bounds on parameters. In particular, we show
how to construct a Büchi automaton coming from the parametric system under
verification using a zone-based abstraction and an extrapolation. Due to space
constraints, the proofs of theorems from this paper are given in [9].
2 Preliminaries and Problem Statement
In order to state our main problem formally, we need to describe the notion of
a parametric timed automaton. We start by describing some basic notation.
Let P be a finite set of parameters. An affine expression is an expression of
the form z0 + z1 p1 + . . . + zn pn , where p1 , . . . , pn ∈ P and z0 , . . . , zn ∈ Z. We use
E(P ) to denote the set of all affine expressions over P . A parameter valuation
is a function v : P → Z which assigns an integer number to each parameter.
Let lb : P → Z be a lower bound function and ub : P → Z be an upper
bound function. For an affine expression e, we use e[v] to denote the integer
value obtained by replacing each p in e by v(p). We use maxlb,ub (e) to denote
the maximal value obtained by replacing each p with a positive coefficient in e
by ub(p) and replacing each p with a negative coefficient in e by lb(p). We say
that the parameter valuation v respects lb and ub if for each p ∈ P it holds that
lb(p) ≤ v(p) ≤ ub(p). We denote the set of all parameter valuations respecting lb
and ub by V allb,ub (P ). In the following, we only consider parameter valuations
from V allb,ub (P ).
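As an illustrative example (not taken from the paper): for e = 2 − p1 + 3p2 with lb(p1 ) = 0, ub(p1 ) = 5, lb(p2 ) = 1 and ub(p2 ) = 4, we get e[v] = 2 − v(p1 ) + 3v(p2 ) and maxlb,ub (e) = 2 − lb(p1 ) + 3 · ub(p2 ) = 2 − 0 + 12 = 14.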
Let X be a finite set of clocks. We assume the existence of a special zero clock, denoted by x0 , that always has the value 0. A guard is a finite conjunction of
expressions of the form xi − xj ∼ e where xi , xj ∈ X, e ∈ E(P ) and ∼ ∈ {≤, <}.
We use G(X, P ) to denote the set of all guards over a set of clocks X and a set of
parameters P . A plain guard is a guard containing only expressions of the form
xi −xj ∼ e where xi , xj ∈ X, e ∈ E(P ), ∼ ∈ {≤, <}, and xi = x0 or xj = x0 . We
also use G(X, P ) to denote the set of all plain guards over a set of clocks X and
a set of parameters P . A clock valuation is a function η : X → R≥0 assigning nonnegative real numbers to each clock such that η(x0 ) = 0. We denote the set of
all clock valuations by V al(X). Let g ∈ G(X, P ) and v be a parameter valuation
and η be a clock valuation. Then g[v, η] denotes a boolean value obtained from g
by replacing each parameter p with v(p) and each clock x with η(x). A pair (v, η)
satisfies a guard g, denoted by (v, η) |= g, if g[v, η] evaluates to true. A semantics
of a guard g, denoted by JgK, is a set of valuation pairs (v, η) such that (v, η) |= g.
For a given parameter valuation v we write JgKv for the set of clock valuations
{η | (v, η) |= g}.
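As an illustrative example (not taken from the paper): over X = {x0 , x1 , x2 } and P = {p}, the conjunction x1 − x0 < p + 2 ∧ x2 − x1 ≤ 3 is a guard; its first conjunct alone is a plain guard, since it compares x1 with the zero clock, whereas the second conjunct is not plain, because neither x2 nor x1 is x0 .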
We define two operations on clock valuations. Let η be a clock valuation,
d a non-negative real number and R ⊆ X a set of clocks. We use η + d to denote
the clock valuation that adds the delay d to each clock, i.e. (η + d)(x) = η(x) + d
for all x ∈ X \ {x0 }. We further use η[R] to denote the clock valuation that
resets clocks from the set R, i.e. η[R](x) = 0 if x ∈ R, η[R](x) = η(x) otherwise.
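For instance (illustration only), if η(x1 ) = 1.2 and η(x2 ) = 0.4, then (η + 0.5)(x1 ) = 1.7 and (η + 0.5)(x0 ) = 0, while η[{x2 }](x2 ) = 0 and η[{x2 }](x1 ) = 1.2.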
We can now proceed with the definition of a parametric timed automaton
and its semantics.
Definition 2.1 (PTA). A parametric timed automaton (PTA) is a tuple M =
(L, l0 , X, P, ∆, Inv ) where
– L is a finite set of locations,
– l0 ∈ L is an initial location,
– X is a finite set of clocks,
– P is a finite set of parameters,
– ∆ ⊆ L × G(X, P ) × 2X × L is a finite transition relation, and
– Inv : L → G(X, P ) is an invariant function.
We use q −g,R→∆ q′ to denote (q, g, R, q′ ) ∈ ∆. The semantics of a PTA is given as a labelled transition system. A labelled transition system (LTS) over a set of symbols Σ is a triple (S, s0 , →), where S is a set of states, s0 ∈ S is an initial state and → ⊆ S × Σ × S is a transition relation. We use s −a→ s′ to denote (s, a, s′ ) ∈ →.
Definition 2.2 (PTA semantics). Let M = (L, l0 , X, P, ∆, Inv ) be a PTA
and v be a parameter valuation. The semantics of M under v, denoted by JM Kv ,
is an LTS (SM , s0 , →) over the set of symbols {act} ∪ R≥0 , where
– SM = L × Val lb,ub (X) is the set of all states,
– s0 = (l0 , 0), where 0 is a clock valuation with 0(x) = 0 for all x, and
– the transition relation → is specified for all states (l, η), (l′ , η′ ) ∈ SM as follows:
  • (l, η) −d→ (l′ , η′ ) if l = l′ , d ∈ R≥0 , η′ = η + d, and (v, η′ ) |= Inv (l′ ),
  • (l, η) −act→ (l′ , η′ ) if ∃g, R : l −g,R→∆ l′ , (v, η) |= g, η′ = η[R], and (v, η′ ) |= Inv (l′ ).
The transitions of the first kind are called delay transitions, the latter are
called action transitions.
We write s1 −act→d s2 if there exists s′ ∈ SM and d ∈ R≥0 such that s1 −act→ s′ −d→ s2 . A proper run π of JM Kv is an infinite alternating sequence of delay and action transitions that begins with a delay transition, π = (l0 , η0 ) −d0→ (l0 , η0 + d0 ) −act→ (l1 , η1 ) −d1→
· · · . A proper run is called a Zeno run if the sum of all its
delays is finite.
For the rest of the paper, we assume that we only deal with a deadlock-free PTA, i.e. that for each considered parameter valuation v there is no state
without a reachable action transition in JAKv . We deal with Zeno runs later.
Let M be a PTA, L : L → 2Ap be a labelling function that assigns a set
of atomic propositions to each location of M , v be a parameter valuation, and
ϕ be an LTL formula. We say that M under v with L satisfies ϕ, denoted
by (M, v, L) |= ϕ if for all proper runs π of JM Kv , π satisfies ϕ where atomic propositions are determined by L.
Unfortunately, it is known that the parameter synthesis problem for a PTA is
undecidable even for very simple (reachability) properties [5]. Instead of solving
the general problem, we thus focus on a more constrained version. We may now
state our main problem.
Problem 2.3 (The bounded integer parameter synthesis problem). Given a parametric timed automaton M , a labelling function L, an LTL property ϕ, a lower
bound function lb and an upper bound function ub, the problem is to compute
the set of all parameter valuations v such that (M, v, L) |= ϕ and lb(p) ≤ v(p) ≤
ub(p).
Problem 2.3 is trivially decidable using a region abstraction and parameter
scan approach. Unfortunately, the size of the region-based abstraction grows
exponentially with the number of clocks and the largest integer number used.
As a result, the region-based abstraction is difficult to use in practice for an analysis of more than academic toy examples, even though it has its theoretical value.
Unlike the region-based abstraction, a single state in a zone-based abstraction
is no longer restricted to represent only those clock values that are between two
consecutive integers. Therefore, the zone-based abstraction is much coarser and
the number of zones reachable from the initial state is significantly smaller. In
order to avoid the necessity of an explicit enumeration of all parameter valuations
we use the zone-based abstraction together with the symbolic representation of
parameter valuation sets. Our algorithmic framework which solves Problem 2.3
consists of three steps.
As the first step, we extend the standard automata-based LTL model checking of timed automata [8] to parametric timed automata. We employ this approach in the following way. From a PTA M and an LTL formula ϕ we produce
a product parametric timed Büchi automaton (PTBA) A. The accepting runs
of the automaton A correspond to the runs of M violating the formula ϕ (analogously as in the case of timed automata).
As the second step, we employ a symbolic semantics of a PTBA A with
a suitable extrapolation. From the symbolic state space of a PTBA A we finally
produce a Büchi automaton B.
As the last step, we need to detect all parameter valuations such that there
exists an accepting run in Büchi automaton B. This is done using our Cumulative
NDFS algorithm.
Now, we proceed with the definitions of a Büchi automaton, a parametric
timed Büchi automaton and its semantics.
Definition 2.4 (BA). A Büchi automaton (BA) is a tuple B = (Q, q0 , Σ, →, F ),
where
– Q is a finite set of states,
– q0 ∈ Q is an initial state,
– Σ is a finite set of symbols,
– → ⊆ Q × Σ × Q is a set of transitions, and
– F ⊆ Q is the set of accepting states (acceptance condition).
An ω-word w = a0 a1 a2 . . . ∈ Σω is accepting if there is an infinite sequence of states q0 q1 q2 . . . such that qi −ai→ qi+1 for all i ∈ N, and there exist infinitely many i ∈ N such that qi ∈ F .
Definition 2.5 (PTBA). A parametric timed Büchi automaton (PTBA) is
a pair A = (M, F ) where
– M = (L, l0 , X, P, ∆, Inv ) is a PTA, and
– F ⊆ L is a set of accepting locations.
Zeno runs represent non-realistic behaviours and it is desirable to ignore
them in analysis. Therefore, we are interested only in non-Zeno accepting runs
of a PTBA. There is a well-known transformation to the strongly non-Zeno
form [10] of a PTBA, which guarantees that each accepting run is non-Zeno.
For the rest, we assume that we have the strongly non-Zeno form of a PTBA,
as introduced in [10].
Definition 2.6 (PTBA semantics). Let A = (M, F ) be a PTBA and v be
a parameter valuation. The semantics of A under v, denoted by JAKv , is defined
as JM Kv = (SM , s0 , →).
We say a state s = (l, η) ∈ SM is accepting if l ∈ F . A proper run π = s0 −d0→ s′0 −act→ s1 −d1→ s′1 −act→ · · · of JAKv is accepting if there exists an infinite set of indices i such that si is accepting.
3 Symbolic Semantics
A constraint is an inequality of the form e ∼ e′ where e, e′ ∈ E(P ) and ∼ ∈ {>, ≥, ≤, <}. We define c[v] as a boolean value obtained by replacing each p in c by v(p). A valuation v satisfies a constraint c, denoted v |= c, if c[v] evaluates to true. The semantics of a constraint c, denoted JcK, is the set of all valuations that satisfy c. A finite set of constraints C is called a constraint set. A valuation satisfies a constraint set C if it satisfies each c ∈ C. The semantics of a constraint set C is given by JCK = ⋂c∈C JcK. A constraint set C is satisfiable if JCK ≠ ∅. A constraint c covers a constraint set C, denoted C |= c, exactly when JCK ⊆ JcK.
As in [6], we identify the relation symbol ≤ with a boolean value true and
< with a boolean value false. Then, we treat boolean connectives on relation
symbols ≤, < as operations with boolean values. For example, (≤ =⇒ <) = <.
Now, we define a parametric difference bound matrix, a constrained parametric
difference bound matrix, several operations on them, and a PTBA symbolic
semantics. These definitions are introduced in detail in [6].
Definition 3.1. A parametric difference bound matrix (PDBM) over P and X
is a set D which contains for all 0 ≤ i, j ≤ |X| a guard of the form xi −xj ≺ij eij
where xi , xj ∈ X and eij ∈ E(P ) ∪ {∞} and i = j =⇒ eii = 0. We denote
by Dij a guard of the form xi − xj ≺ij eij contained in D. Given a parameter valuation v, the semantics of D is given by JDKv = J⋀i,j Dij Kv . A PDBM D is
satisfiable with respect to v if JDKv is non-empty. If f is a guard of the form
xi − xj ≺ e with i 6= j (a proper guard), then D[f ] denotes the PDBM obtained
from D by replacing Dij with f . We denote by PDBMS (P, X) the set of all
PDBM over parameters P and clocks X.
Definition 3.2. A constrained parametric difference bound matrix (CPDBM)
is a pair (C, D), where C is a constraint set and D is a PDBM and for each
0 ≤ i ≤ |X| it holds that C |= e0i ≥ 0. The semantics of (C, D) is given by
JC, DK = {(v, η) | v ∈ JCK ∧ η ∈ JDKv }. We call (C, D) satisfiable if JC, DK is
non-empty. We denote by CPDBMS the set of all CPDBM. A CPDBM (C, D)
is in the canonical form iff for all i, j, k, C |= eij (≺ik ∧ ≺kj )eik + ekj .
Definition 3.3 (Applying a guard). Suppose g is a simple guard of the form xi − xj ≺ e. Suppose (C, D) is a constrained PDBM in the canonical form and Dij = (eij , ≺ij ). The application of a guard g on (C, D) generally results in a set of constrained PDBMs and is defined as follows:

  (C, D)[g] = {(C, D[g])}                              if C |= ¬eij (≺ij =⇒ ≺)e,
              {(C, D)}                                 if C |= eij (≺ij =⇒ ≺)e,
              {(C ∪ {eij (≺ij =⇒ ≺)e}, D),
               (C ∪ {¬eij (≺ij =⇒ ≺)e}, D[g])}         otherwise,

where D[g] is defined as follows:

  D[g]kl = (e, ≺)   if k = i and l = j,
           Dkl      otherwise.

We can generalise this definition to conjunctions of simple guards as follows:

  D[gi0 ∧ gi1 ∧ · · · ∧ gik ] := D[gi0 ][gi1 ] · · · [gik ].
Definition 3.4 (Resetting a clock). Suppose D is a PDBM in the canonical form. D with a reset clock xr , denoted as D[xr ], represents a PDBM D after resetting the clock xr and is defined as follows:

  D[xr ]ij = D0j   if i ≠ j and i = r,
             Di0   if i ≠ j and j = r,
             Dij   otherwise.

We can generalise this definition to a reset of a set of clocks as follows:

  D[xi0 , xi1 , . . . , xik ] := D[xi0 ][xi1 ] · · · [xik ].
Definition 3.5 (Time successors). Suppose D is a PDBM in the canonical form. The time successor of D, denoted as D↑ , represents a PDBM with all upper bounds on clocks removed and is defined as follows:

  D↑ij = (∞, <)   if i ≠ 0 and j = 0,
         Dij      otherwise.
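For intuition, the reset and time-successor operations mirror the classical operations on ordinary (non-parametric) DBMs. The following C++ sketch shows this simpler non-parametric analogue, with plain integer bounds in place of affine expressions (an editorial illustration only, not the constrained parametric data structure used in the paper):

  #include <cstddef>
  #include <limits>
  #include <vector>

  // One DBM entry: a numeric bound and a strictness flag (true stands for '<').
  struct Bound { long long value; bool strict; };
  const long long INF = std::numeric_limits<long long>::max();

  // A (|X|+1) x (|X|+1) matrix of bounds on clock differences; index 0 is the zero clock.
  using DBM = std::vector<std::vector<Bound>>;

  // Non-parametric analogue of Definition 3.4: reset clock r by copying the
  // bounds of the zero clock into the row and column of r.
  DBM reset(const DBM& D, std::size_t r) {
      DBM R = D;
      for (std::size_t k = 0; k < D.size(); ++k) {
          if (k == r) continue;
          R[r][k] = D[0][k];  // bound on x_r - x_k becomes the bound on x_0 - x_k
          R[k][r] = D[k][0];  // bound on x_k - x_r becomes the bound on x_k - x_0
      }
      return R;
  }

  // Non-parametric analogue of Definition 3.5: remove all upper bounds on clocks,
  // i.e. set the bound on x_i - x_0 to (infinity, <) for every i != 0.
  DBM up(const DBM& D) {
      DBM R = D;
      for (std::size_t i = 1; i < D.size(); ++i)
          R[i][0] = Bound{INF, true};
      return R;
  }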
It follows from the definition that the reset and time successor operations
preserve the canonicity. After an application of a guard the canonical form needs
to be computed.
To compute the canonical form of the given CPDBM we need to derive the
tightest constraint on each clock difference. Deriving the tightest constraint on
a clock difference can be seen as finding the shortest path in the graph interpretation of the CPDBM [6,11]. The canonisation operation is usually implemented
using an extended Floyd–Warshall algorithm, where each relaxation can cause a split action on the constraint set. Therefore, the result of the canonisation is a set containing constrained parametric difference bound matrices in the canonical form.
Definition 3.6 (Canonisation). First, we define a relation −→F W on constrained parametric difference bound matrices as follows, for all 0 ≤ k, i, j ≤ |X| + 1:
– (k, i, j, C1 , D1 ) −→F W (k, i, j + 1, C2 , D2 )
if (C2 , D2 ) ∈ (C1 , D1 )[xi − xj (≺ik ∧ ≺kj )eik + ekj ]
– (k, i, |X| + 1, C1 , D1 ) −→F W (k, i + 1, 0, C2 , D2 )
if (C2 , D2 ) ∈ (C1 , D1 )[xi − xj (≺ik ∧ ≺kj )eik + ekj ]
– (k, |X| + 1, 0, C1 , D1 ) −→F W (k + 1, 0, 0, C2 , D2 )
if (C2 , D2 ) ∈ (C1 , D1 )[xi − xj (≺ik ∧ ≺kj )eik + ekj ]
The relation −→F W can be seen as a representation of the computation steps
of the extended nondeterministic Floyd-Warshall algorithm.
Now, suppose (C, D) is a CPDBM. The canonical form of (C, D), denoted
as (C, D)c , represents a set of CPDBMs with a tightest constraint on each clock
difference in D and is defined as follows.
(C, D)c = {(C 0 , D0 ) | (0, 0, 0, C, D) −→∗F W (|X| + 1, 0, 0, C 0 , D0 )}
Definition 3.7 (PTBA symbolic semantics). Let A = ((L, l0 , X, P, ∆, Inv ),
F ) be a PTBA. Let lb and ub be a lower bound function and an upper bound
function on parameters. The symbolic semantics of A with respect to lb and ub
is a transition system (SA , Sinit , =⇒), denoted as JAKlb,ub , where
– SA = L × {JC, DK | (C, D) ∈ CP DBM S} is the set of all symbolic states,
– the set of initial states Sinit is defined as {(l0 , JC, DK) | (C, D) ∈ (∅, E ↑ )[Inv (l0 )]},
where
• E is a PDBM with E i,j = (0, ≤) for each i, j, and
• for each p ∈ P , the constraints p ≥ lb(p) and p ≤ ub(p) are in C.
– There is a transition (l, JC, DK) =⇒ (l′ , JCc′ , Dc′ K) if
  • l −g,R→∆ l′ and
• (C 00 , D00 ) ∈ (C, D)[g] and
• (Cc00 , Dc00 ) ∈ (C 00 , D00 )c and
• (C 0 , D0 ) ∈ (Cc00 , Dc00 [R]↑ )[Inv(l0 )] and
• (Cc0 , Dc0 ) ∈ (C 0 , D0 )c .
A symbolic state is represented by a tuple (l, JC, DK) where l is a location,
(C, D) is a CPDBM. We say a state S = (l, JC, DK) ∈ SA is accepting if l ∈ F .
We say π = S0 =⇒ S1 =⇒ . . . is a run of JAK if S0 ∈ Sinit and for each i
Si ∈ SA and Si−1 =⇒ Si . A run respects a parameter valuation v if for each
state Si = (li , JCi , Di K) it holds that v ∈ JCi K. A run π is accepting if there exists
an infinite set of indices i such that Si is accepting.
For the rest of the paper we fix lb, ub and we use JAK to denote JAKlb,ub .
The transition system JAK may be infinite. In order to obtain a finite transition
system we need to apply a finite abstraction over JAK.
Definition 3.8 (Time-abstracting simulation). Given an LTS (S, s0 , →),
a time-abstracting simulation R over S is a binary relation satisfying following
conditions:
– s1 R s2 and s1 −act→ s′1 implies the existence of s2 −act→ s′2 such that s′1 R s′2 , and
– s1 R s2 and d1 ∈ R≥0 and s1 −d1→ s′1 implies the existence of d2 ∈ R≥0 and s2 −d2→ s′2 such that s′1 R s′2 .
We define the largest simulation relation over S (4S ) in the following way:
s 4S s0 if there exists a time-abstracting simulation R and (s, s0 ) ∈ R. When S
is clear from the context we shall only use 4 instead of 4S in the following.
In the following definition, for a parameter valuation v, a concrete state
s1 = (l1 , η) from JAKv , and a symbolic state S2 = (l2 , JC, DK) from JAK we write
s1 ∈v S2 if l1 = l2 , v ∈ C, and η ∈ JDKv .
Definition 3.9 (PTBA abstract symbolic semantics). Let A = (M, F ) be
a PTBA. An abstraction over JAK = (SA , Sinit , =⇒) is a mapping α : SA → 2SA
such that the following conditions hold:
– (l0 , JC 0 , D0 K) ∈ α((l, JC, DK)) implies l = l0 ∧ JC 0 K ⊆ JCK ∧ JC 0 , DK ⊆ JC 0 , D0 K,
– for each v ∈ JCK there exists S1 , S2 such that S2 = (l, JC 0 , D0 K) ∈ α(S1 ) and
for each s ∈v S2 there exists a state s0 ∈v S1 satisfying s 4 s0 .
An abstraction α is called finite if its image is finite. An abstraction α over JAK
induces a new transition system denoted as JAKα = (QA , Qinit , =⇒α ) where
– QA = {S | S ∈ α(S 0 ) and S 0 ∈ SA },
– Qinit = {S | S ∈ α(S 0 ) and S 0 ∈ Sinit }, and
– Q =⇒α Q0 if there is S ∈ SA such that Q0 ∈ α(S) and Q =⇒ S.
An accepting state, a run and an accepting run are defined analogously to the JAK case. If α is finite, JAKα can be viewed as a Büchi automaton.
Now, we define a parametric extension of the well known k-extrapolation [12].
Definition 3.10. Let A be a PTBA, (l, JC, DK) be a symbolic state of JAK and Dij = xi − xj ≺ij eij for each 0 ≤ i, j ≤ |X|. We define the kp-extrapolation αkp in the following way: (l, JC′ , D′ K) ∈ αkp ((l, JC, DK)) if C′ = C ∧ ⋀0≤i,j≤|X| c′ij and for each 0 ≤ i, j ≤ |X|:

– it holds that D′ij = xi − xj ≺ij eij and c′ij = eij ≤ M (xi ), or
– it holds that D′ij = xi − xj < ∞ and c′ij = eij > M (xi ), or
– it holds that D′ij = xi − xj ≺ij eij and c′ij = eij ≥ −M (xj ), or
– it holds that D′ij = xi − xj < −M (xj ) and c′ij = eij < −M (xj ),
where M (x) is the maximum value in {maxlb,ub (e) | e is compared with x in A}.
Lemma 3.11. Let A be a PTBA. The kp-extrapolation is a finite abstraction
over JAK = (SA , Sinit , =⇒).
Proof. First, we prove that the kp-extrapolation is an abstraction. It is easy to
see that the kp-extrapolation satisfies the first condition (l0 , JC 0 , D0 K) ∈ α((l,
JC, DK)) implies l = l0 ∧ JC 0 K ⊆ JCK ∧ JC 0 , DK ⊆ JC 0 , D0 K. The validity of the
second condition follows from the following observation. For each v ∈ JCK and
each η 0 ∈ JD0 Kv there exists η ∈ JDKv such that for each clock x and each guard g
the following implication holds: η 0 (x) |= g =⇒ η(x) |= g.
Now, we need to show that the kp-extrapolation is finite. From the definition
we have the fact that the number of locations is finite and the number of sets
of bounded parameter valuations is finite. We need to show that there are only
finitely many sets JC, DK when the kp-extrapolation is applied. This follows
from the fact that the kp-extrapolation allows values either from the finite range
⟨−M (xi ), M (xi )⟩ or the value ∞. ⊓⊔
Theorem 3.12. Let A be a PTBA and α be a finite abstraction. For each parameter valuation v the following holds: there exists an accepting run of JAKv if
and only if there exists an accepting run respecting v of JAKα .
4 Parameter Synthesis Algorithm
We recall that our main objective is to find all parameter valuations for which
the parametric timed automaton satisfies its specification. In the previous sections we have described the standard automata-based method employed under
a parametric setup which produces a Büchi automaton. For the rest of this
section we denote for each state s = (l, JC, DK) of the Büchi automaton on
the input the set of valuations JCK as s.JCK. We say that a sequence of states
s1 =⇒ s2 =⇒ . . . =⇒ sn =⇒ s1 is a cycle under the parameter valuation v if
each state si in the sequence satisfies v ∈ si .JCK. A cycle is called accepting if
there exists 0 ≤ i ≤ n such that si is accepting.
Contrary to the standard LTL model checking, it is not enough to check
the emptiness of the produced Büchi automaton. Our objective is to check the
emptiness of the produced Büchi automaton for each considered parameter valuation. We introduce the Cumulative NDFS algorithm as an extension of the
well-known NDFS algorithm. Our modification is based on the set Found which accumulates all detected parameter valuations under which an accepting cycle was found. Contrary to the NDFS algorithm, whenever Cumulative NDFS detects an accepting cycle, the corresponding parameter valuations are saved to the set Found and the computation continues with a search for another accepting cycle. Note that whenever we reach a state s′ with s′.JCK ⊆ Found, we have already found an accepting cycle under all valuations from s′.JCK and there is no need to continue with the search from s′. Therefore, we are able to speed up the computation whenever we reach such a state.
Now, we mention the crucial property of monotonicity: the set of parameter valuations s.JCK cannot grow along any run of the input automaton.
Lemma 4.1 states this observation, which follows from the definition of successors
in JAKα and the definition of operations on CPDBMs. The clear consequence of
Lemma 4.1 is the fact that each state s on a cycle has the same set s.JCK.
Algorithm CumulativeNDFS (G)
  Found ← Stack ← Outer ← Inner ← ∅
  OuterDFS (sinit )
  return Accepted ← Found

Procedure OuterDFS (s)
  Stack ← Stack ∪ {s}
  Outer ← Outer ∪ {s}
  foreach s′ such that s → s′ do
    if s′ ∉ Outer ∧ s′ ∉ Stack ∧ s′ .JCK ⊈ Found then
      OuterDFS (s′ )
  if s ∈ Accepting ∧ s.JCK ⊈ Found then
    InnerDFS (s)
  Stack ← Stack \ {s}
  return

Procedure InnerDFS (s)
  Inner ← Inner ∪ {s}
  foreach s′ such that s → s′ do
    if s′ ∈ Stack then
      “Cycle detected”
      Found ← Found ∪ s′ .JCK
      return
    if s′ ∉ Inner ∧ s′ .JCK ⊈ Found then
      InnerDFS (s′ )
  return

Algorithm 1: Cumulative NDFS
Lemma 4.1. Let A be a PTBA, α be an abstraction and s be a state in JAKα .
For every state s0 reachable from s it holds that s0 .JCK ⊆ s.JCK.
Theorem 4.2. Let A be a PTBA and α an abstraction over JAK. A parameter
valuation v is contained in the output of the CumulativeNDFS(JAKα ) if and only
if there exists an accepting run respecting v in JAKα .
We recall that our objective was to synthesise the set of all parameter valuations such that the given parametric timed automaton satisfies the given LTL
property. In order to compute this set we employed a zone-based semantics,
an extrapolation technique and the Cumulative NDFS algorithm. We have shown
the way to compute all parameter valuations for which the given LTL formula
is not satisfied. Now, as the last step in the solution to Problem 2.3, we need to
complement the set Accepted. Thus, the solution to Problem 2.3 is the complement of the set Accepted, more precisely the set V allb,ub (X, P ) \ Accepted. To
conclude this section, we state that Theorem 4.2 together with Theorem 3.12
imply the correctness of our solution to Problem 2.3.
5 Conclusion and Future Work
We have presented a logical and algorithmic framework for the bounded integer
parameter synthesis of parametric timed automata with an LTL specification.
The proposed framework allows the avoidance of the explicit enumeration of all
possible parameter valuations.
In this paper we have used the parametric extension of a difference bound matrix called a constrained parametric difference bound matrix. To be able to employ a zone-based method successfully we introduced a finite abstraction called
the kp-extrapolation. At the final stage of the parameter synthesis process, the
cycle detection itself is performed by the introduced Cumulative NDFS algorithm which is an extension of the well-known NDFS algorithm.
As for future work, we plan to introduce different finite abstractions and compare their influence on the state space size. Another area that can be investigated is
the employment of different linear specification logics, e.g. Clock-Aware LTL [13]
which enables the use of clock-valuation constraints as atomic propositions.
References
1. Clarke, E., Grumberg, O., Peled, D.: Model Checking. MIT press (1999)
2. Alur, R., Dill, D.L.: A Theory of Timed Automata. Theor. Comput. Sci. 126(2)
(1994) 183–235
3. Daws, C., Tripakis, S.: Model checking of real-time reachability properties using abstractions. In: Tools and Algorithms for the Construction and Analysis of
Systems. Springer (1998) 313–329
4. Behrmann, G., David, A., Larsen, K.G., Möller, O., Pettersson, P., Yi, W.: Uppaal
- present and future. In: Proc. of 40th IEEE Conference on Decision and Control,
IEEE Computer Society Press (2001)
5. Alur, R., Henzinger, T.A., Vardi, M.Y.: Parametric real-time reasoning. In: Proceedings of the twenty-fifth annual ACM symposium on Theory of computing,
ACM (1993) 592–601
6. Hune, T., Romijn, J., Stoelinga, M., Vaandrager, F.: Linear parametric model
checking of timed automata. The Journal of Logic and Algebraic Programming 52
(2002) 183–220
7. Jovanovic, A., Lime, D., Roux, O.H.: Synthesis of Bounded Integer Parameters for
Parametric Timed Reachability Games. In: Automated Technology for Verification
and Analysis (ATVA 2013). Volume 8172 of LNCS., Springer (2013) 87–101
8. Vardi, M., Wolper, P.: An automata-theoretic approach to automatic program verification (preliminary report). In: Proceedings, Symposium on Logic in Computer
Science (LICS’86), IEEE Computer Society (1986) 332–344
9. Bezděk, P., Beneš, N., Barnat, J., Černá, I.: LTL Model Checking of Parametric
Timed Automata. CoRR abs/1409.3696 (2014)
10. Tripakis, S., Yovine, S., Bouajjani, A.: Checking timed Büchi automata emptiness
efficiently. Formal Methods in System Design 26(3) (2005) 267–292
11. Dill, D.L.: Timing assumptions and verification of finite-state concurrent systems.
In: Automatic verification methods for finite state systems, Springer (1990) 197–
212
12. Bouyer, P.: Forward analysis of updatable timed automata. Formal Methods in
System Design 24(3) (2004) 281–320
13. Bezděk, P., Beneš, N., Havel, V., Barnat, J., Černá, I.: On Clock-Aware LTL properties of Timed Automata. In: International Colloquium on Theoretical Aspects
of Computing (ICTAC). Volume 8687 of LNCS., Springer (2014)
FPGA Accelerated Change-Point Detection
Method for 100 Gb/s Networks
Tomáš Čejka1 , Lukáš Kekely1 , Pavel Benáček2 , Rudolf B. Blažek2 , and Hana Kubátová2
1 CESNET a. l. e.
Zikova 4, Prague, CZ
cejkat,[email protected]
2 CTU in Prague – FIT
Thakurova 9, Prague, CZ
benacekp,rblazek,[email protected]
Abstract. The aim of this paper is a hardware realization of a statistical anomaly detection method as a part of a high-speed monitoring probe for computer networks. The sequential Non-Parametric Cumulative Sum
(NP-CUSUM) procedure is the detection method of our choice and we
use an FPGA based accelerator card as the target platform. For rapid
detection algorithm development, a high-level synthesis (HLS) approach
is applied. Furthermore, we combine HLS with the usage of Software
Defined Monitoring (SDM) framework on the monitoring probe, which
enables easy deployment of various hardware-accelerated monitoring applications into high-speed networks. Our implementation of NP-CUSUM
algorithm serves as hardware plug-in for SDM and realizes the detection
of network attacks and anomalies directly in FPGA. Additionally, the
parallel nature of the FPGA technology allows us to realize multiple different detections simultaneously without any losses in throughput. Our
experimental results show the feasibility of HLS and SDM combination
for effective realization of traffic analysis and anomaly detection in networks with speeds up to 100 Gb/s.
1 Introduction
Computer networks are getting larger and faster, and hence the volume of data
captured by network monitoring systems increases. Therefore, there is a need to
analyze more data for detection of network attacks and traffic anomalies. This
paper deals with real-time detection of attacks suitable for high-speed computer
networks thanks to the direct deployment of detection methods in hardware
monitoring probe.
Today, monitoring systems usually consist of several probes that capture and
preprocess huge amounts of network traffic at wire speed, and one or more collector servers that collect and store network traffic information from these probes.
Analysis of network data is traditionally also realized at the collectors. In this
paper, we propose a different approach, where anomaly detection is shifted directly into the monitoring probes. The aim of this approach is to enable real-time
analysis even in very large networks with speeds up to 100 Gb/s per Ethernet
port and to reduce the latency of anomaly detections.
It is virtually impossible to process all network data from the 100 Gb/s link in
software using only commodity hardware. The main limitations lie in the insufficient bandwidth of communication paths between the network interface card and the
software components [1] and in limited performance of the processors. Therefore,
hardware acceleration must be used for high-speed networks in order to avoid
transferring and processing of all the data in the software.
In this paper, we utilize a special network interface card equipped with an FPGA chip for hardware acceleration of network traffic processing as a basis for our
high-speed probe. The FPGA on the card allows us to realize more advanced
data processing features (e.g. anomaly detection methods that use packet level
statistics) directly on the card, thus reducing the data load for the software. To
demonstrate this approach, we concentrate on a real-time sequential Change-Point Detection (CPD) method that is designed to minimize the average detection delay (ADD) for a prescribed false alarm rate (FAR) [2,3].
As the basis for the FPGA firmware, we use Software Defined Monitoring
(SDM). SDM is a novel monitoring approach proposed in [4], that can be used
as a framework for hardware acceleration of various monitoring methods. SDM
combines hardware and software modules into a tightly bound co-design that is
able to address challenges of monitoring from data link to application layer of the
ISO/OSI model in modern network environments at the speeds up to 100 Gb/s.
The main contribution of this paper is the evaluation of statistical real-time detection methods implemented in hardware. The detection methods are extensions of a hardware-accelerated monitoring probe designed for 40 Gb/s and 100 Gb/s Ethernet lines. The resulting device is able to analyze unsampled high-speed network traffic without loss.
The rest of this paper is organized in the following way. An introduction to the implemented sequential non-parametric change-point detection method (NP-CUSUM) can be found in Sec. 2. The used SDM concept is briefly described in Sec. 3. Sec. 4 describes the created hardware implementation of the detection method.
Evaluation of the developed system and the achieved results are presented in
Sec. 5. Related work and main differences between existing projects and our
implementation are presented in Sec. 6. Sec. 7 summarizes the results presented
in this paper and outlines our future work.
2 Change-Point Detection
Network attacks, intrusions, or anomalies appear usually at unpredictable points
in time. The start of an attack is mostly observable as a change of some statistical properties of the network traffic or its specific part. Therefore, methods
based on sequential Change-Point Detection theory are suitable for intrusion
detection. CPD methods detect the point in time where the distribution of some
perpetually observed variables changes. In network security settings, these variables correspond to some relevant, directly observed or calculated network traffic
characteristics. The main problem of such an approach is the lack of precise knowledge about the statistical distributions of these traffic characteristics. Ideally, the distributions should be known both before and after the distribution change
that corresponds to the anomaly or attack. Therefore, we use a non-parametric
CPD method NP-CUSUM that was developed in [2,3] and that does not require
precise knowledge about these statistical distributions.
NP-CUSUM is inspired by Page’s CUSUM algorithm that is proven to be
optimal for detection of a change in the mean (expectation) when the distributions of the observed random variables are known before and after the change
[5]. The typical optimality criterion in CPD is to minimize the average detection delay (ADD) among all algorithms whose average false alert rate (FAR)
is below a prescribed low level. Page’s CUSUM procedure, which is based on
the log-likelihood ratio, can for i.i.d. (independent and identically distributed)
random variables Xn be rewritten [5] as:
Un = max{0, Un−1 + log(p1 (Xn )/p0 (Xn ))},   U0 = 0,      (1)
where p0 and p1 are the densities of Xn before and after the change, respectively.
The formulation in (1) is the inspiration for the NP-CUSUM procedure [2,3]. The procedure is applicable to non-i.i.d. data with unknown distributions (i.e. the method is non-parametric). First, Page’s CUSUM procedure was generalized as Sn = max{0, Sn−1 + f (Xn )} with some function f . Changes in the mean value of Xn can be detected using the sequential statistic:
Sn = max{0, Sn−1 + Xn − µ̂ − εθ̂},   S0 = 0,      (2)
where µ̂ is an estimate of the mean of Xn before the attack, θ̂ is an estimate of
the mean after the attack started, and ε is a tuning parameter for optimization.
It has been shown in [3] that with optimal value of ε the NP-CUSUM procedure
(2) is asymptotically optimal as FAR decreases. That is, for small prescribed
rate of false alarms, other procedures will have longer detection delays. In fact,
the delays can theoretically be exponentially worse [3].
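To make the recursion (2) concrete, the following C++ sketch implements it in software form (an illustration only, not the authors' implementation; the detection threshold h and the way µ̂, θ̂ and ε are chosen are assumptions):

  #include <algorithm>

  // Sequential NP-CUSUM statistic from Eq. (2): S_n = max{0, S_{n-1} + X_n - mu - eps*theta}.
  struct NpCusum {
      double mu;      // estimate of the mean of X_n before the attack (mu hat)
      double theta;   // estimate of the mean of X_n after the attack started (theta hat)
      double eps;     // tuning parameter epsilon
      double h;       // alarm threshold (illustrative; chosen with respect to the desired FAR)
      double S = 0.0; // current value of the statistic, S_0 = 0

      // Process one observation X_n; report whether the statistic exceeds the threshold.
      bool update(double x) {
          S = std::max(0.0, S + x - mu - eps * theta);
          return S > h;
      }
  };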
As the input of the NP-CUSUM algorithm, we can use various features Xn of
the observed network traffic. To basically evaluate our hardware implementation
of the method, we have chosen for Xn the ratio of SYN and FIN packets of
the Transmission Control Protocol (TCP) in a short time window [6]. During
“normal” operation of the network, each connection is opened using two SYN
packets, and closed using two FIN packets (one in each direction). Therefore,
we expect the ratio of SYN and FIN packets to be on average close to 1 or at
least constant. Sudden and consistent change of the ratio is suspicious and can
be caused by some sort of attacks (e.g. SYN or FIN packet flood) [6].
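A possible way of feeding the SYN/FIN ratio into this statistic once per time window could then look as follows (again only a sketch; the parameter values and the guard against division by zero are illustrative assumptions):

  // Uses the NpCusum sketch from above; X_n is the SYN/FIN ratio of one time window.
  NpCusum syn_fin_detector{ /*mu*/ 1.0, /*theta*/ 3.0, /*eps*/ 0.5, /*h*/ 10.0 };

  bool close_window(unsigned syn_count, unsigned fin_count) {
      double ratio = static_cast<double>(syn_count) /
                     static_cast<double>(fin_count == 0 ? 1 : fin_count);
      return syn_fin_detector.update(ratio);  // true suggests a SYN/FIN anomaly in this window
  }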
To demonstrate the scalability and power of our hardware implementation
using SDM, we raise the number of observed statistics and add some more NP-CUSUM blocks in parallel. The added statistics utilize information about ICMP
and RST TCP packets. All measured values are used in form of ratios in order
to avoid the dependency on trends and traffic volumes that could increase the
number of false alerts. Finally, thanks to parallelism, observation of multiple
statistics simultaneously does not negatively affect the processing throughput.
3 Software Defined Monitoring System
Software Defined Monitoring (presented in [4,7]) forms a basis for our hardware
implementation of detection methods in a monitoring probe. In this section we
briefly describe the main architecture of the SDM system and the changes needed
to accommodate the implementation of the NP-CUSUM monitoring system.
An SDM system consists of two main parts: firmware for the FPGA on the hardware accelerator and software for general-purpose processors. The hardware and software
components are connected via a PCI-Express bus. Both parts are tightly coupled
together to allow precise software control of hardware processing. The software
part of the SDM system consists of monitoring applications and a controller. The
monitoring applications can perform advanced monitoring tasks (such as analysis
of application protocols) or also export information (alerts) to the collector. The
controller manages the hardware module by dynamically removing and inserting
processing rules into its memory (see Fig. 1). The instructions contained in the
rules tell the hardware what actions to perform for each input packet with some
characteristics. These rules are defined by the monitoring applications, which insert them into the hardware via the controller.
Due to aforementioned facts, the monitoring application can not only use
data coming from the hardware, but it can also manage the details of hardware
processing of network traffic as well. The offloading of traffic processing into
the hardware saves both, the bandwidth of communication interface (PCIe) and
the CPU processing time. The hardware module can pass information to the
software in the form of packet metadata from a single packet, or as aggregated
records computed from multiple subsequent packets with common features (such
as NetFlow [8] aggregation). Whole received packets or their parts can be also
sent to the software for further (deeper) analysis. Graphical representation of
the SDM concept is shown in Fig. 1.
Processing of an incoming network packet in the SDM hardware starts with
the extraction of its protocol headers. The extracted data are used to search for an adequate rule in memory that specifies the desired processing, possibly supplemented
by the address of a record. The selected rule and metadata of the given packet are
then passed to the SDM Update block, which is the heart of the SDM concept.
This block contains a routing table that is used to
forward the incoming processing request to the appropriate update (instruction)
blocks for execution. Each of these instruction blocks can perform a specific
update operation (realize a specific aggregation type) on the record. Each update operation is delimited by two memory operations: reading the stored record
values and writing back the updated values. Also, new types of updates (aggregations) can be specified simply by implementing a new instruction block
and plugging it into the existing Update block infrastructure. A special type of
processing action is an export of the processed packet data, metadata, or stored
values from a selected record into the software, optionally followed by clearing
of that record. Records can be exported when a special condition is met or
periodically.

Fig. 1. Software Defined Monitoring (SDM) abstract architecture
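To make the read-update-write contract of an update (instruction) block concrete, a simplified software analogy is sketched below; the interface, names and record size are illustrative assumptions rather than the actual SDM firmware interface.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Simplified software analogy of an SDM update (instruction) block:
    // every operation reads the stored record, updates it, and may request
    // an export of the record to the software. Names are illustrative.
    using Record = std::array<uint8_t, 36>;   // one 288 b memory block (36 B, see Sec. 4)

    struct InstructionBlock {
        virtual ~InstructionBlock() = default;
        // Returns true when the (updated) record should be exported.
        virtual bool update(Record& record, const uint8_t* packet_metadata,
                            std::size_t metadata_length) = 0;
    };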
4 Implementation

4.1 The CPD hardware block
Our hardware implementation of the CPD method is realized as a hardware plug-in
for the SDM system. More precisely, it is available as a new instruction block
for the SDM Update module described in the previous section. The SDM
design supports access of instruction blocks to arbitrary data records stored in
memory. However, the available data size of a record is limited by the memory
block size that can be read or written in each clock cycle, which is 288 b. Using
data records bigger than 288 b would cause an unwanted latency increase and
lower the throughput of the whole monitoring hardware.
One CPD instruction block uses the available space in memory to store the previous
historical value, 2 parameters of the NP-CUSUM algorithm, and 1 threshold
value used for alerting purposes. The memory also contains counters of the
observed features, such as the number of SYN or FIN packets, and the
packet counter that starts the ratio and NP-CUSUM computation. The data
stored in memory is accessible from software and therefore all of the thresholds
and parameters can be changed on the fly.
The source code of the instruction block allows us to specify the data type size
of all values stored in memory. The choice of data type sizes determines the number
of hardware blocks that can work in parallel in the same clock cycle with the
same memory block. However, decreasing the data type size lowers the value
precision and data ranges. The NP-CUSUM parameters, the previous historical
value and the threshold are represented as 16 b decimal numbers. The counters
are set to 8 b. For one block that analyzes the SYN/FIN ratio, the implementation
works with 88 b of memory per record in total. A configuration with 4 NP-CUSUM blocks uses 5 counters (SYN, FIN, RST, ICMP, packet counter) and
4 sets of fixed-point values. In total, 4 NP-CUSUM blocks would use 296 b of
memory. Therefore, the size of the decimal number data type was shortened to 15 b
and the total used memory decreased to 280 b, which fits within the available 288 b.
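The memory budget behind this choice works out as follows (four stored values per NP-CUSUM block plus the shared counters):

    1\ \text{block}:\; 4 \times 16\,\text{b} + 3 \times 8\,\text{b} = 88\,\text{b}, \qquad
    4\ \text{blocks}:\; 4 \times 4 \times 16\,\text{b} + 5 \times 8\,\text{b} = 296\,\text{b} > 288\,\text{b}, \qquad
    4 \times 4 \times 15\,\text{b} + 5 \times 8\,\text{b} = 280\,\text{b} \le 288\,\text{b}.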
We use a high-level synthesis (HLS) approach [9] to implement the CPD
method from Sec. 2 for the FPGA as an instruction block inside the SDM system. The structure of the implemented block is shown in Fig. 2. The main
advantage of using the HLS approach is a faster implementation of new hardware-accelerated monitoring and detection methods with a minimal loss of efficiency in
comparison to traditional coding of FPGA firmware using Hardware Description
Languages (HDL) such as VHDL or Verilog. Following the requirements for the
SDM instruction block interfaces and general behavior, we have developed the
CPD hardware block in the C++ language.
The implementation of the CPD hardware block brings several issues to solve.
The most important one is the choice of the representation of decimal numbers. We
tried two standard approaches: fixed-point and floating-point representation.
The main advantage of the floating-point approach is the ability to represent a
greater range of values. On the other hand, the hardware realization of floating-point arithmetic is very complicated and considerably slower. Therefore,
fixed-point arithmetic is favored for the better performance and lower
resource usage of the instruction block.
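For illustration, a minimal HLS-style C++ sketch of one NP-CUSUM step over a fixed-point value is shown below; it assumes the Vivado HLS ap_fixed type, and the 8/8 integer/fraction split as well as the function and parameter names are illustrative, not the exact source of our block.

    #include <ap_fixed.h>   // Vivado HLS arbitrary-precision fixed-point types

    // 16 b fixed-point "decimal" number; the integer/fraction split is assumed.
    typedef ap_fixed<16, 8> fix16_t;

    // One NP-CUSUM update over the current ratio (e.g. SYN/FIN) of a record.
    // prev is the stored historical value, a and c are the two tuning parameters.
    fix16_t np_cusum_step(fix16_t prev, fix16_t ratio, fix16_t a, fix16_t c) {
    #pragma HLS INLINE
        fix16_t next = prev + ratio - a - c;
        return (next > 0) ? next : fix16_t(0);   // accumulate positive drift only
    }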
From the HLS point of view, the most important parameter for our design
goals is the achievable Initiation Interval (II). This parameter represents the
number of clock cycles needed before a new request can be initiated in the instruction
block. Ideally, we require the II to be equal to one so that a new request can
be accepted in each clock cycle and the instruction block is able to achieve
full throughput. During our experiments, we have discovered the following effect of
the decimal number representation on the II: the floating-point version of
the instruction block has an II of 11 clock cycles, whereas the fixed-point version
has an II of 1.
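In Vivado HLS, such an II target is typically requested with a pipeline pragma on the top-level function; the minimal example below only illustrates the pragma usage and does not reproduce our actual block.

    #include <ap_fixed.h>
    typedef ap_fixed<16, 8> fix16_t;   // same illustrative fixed-point type as above

    // Top-level function of a pipelined instruction block; the pragma asks the
    // scheduler for an initiation interval of one clock cycle.
    void cpd_block_top(const fix16_t record_in[4], fix16_t record_out[4]) {
    #pragma HLS PIPELINE II=1
        for (int i = 0; i < 4; ++i) {
    #pragma HLS UNROLL
            record_out[i] = record_in[i];   // placeholder body; the real block
        }                                   // updates the NP-CUSUM state here
    }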
Another very important performance-related parameter of our implementation is the latency. It is required to be as small as possible, because records in the memory need
to be locked in order to achieve atomic processing, and a high latency can therefore
lead to delays between repeated processing requests to the same record. In the end,
our experimental timing and performance results indicate that the created
implementation is able to handle network traffic of a 100 Gb/s
Ethernet line. More detailed results regarding our synthesis and FPGA requirements are discussed in Sec. 5.
Apart from the creation of the CPD instruction block, another important part of the
implementation is the connection of the new instruction block to the existing SDM
Update block. Thanks to the by-design extensibility of the SDM Update block, this
task is simple and straightforward. All that needs to be done is to wrap the
translated HLS implementation of the new block in a VHDL envelope that is
responsible for adapting the behavior of all predefined interface signals. The
wrapping process is depicted in Fig. 2. The gray blocks are parts of the SDM
designated for connecting new instruction blocks. The SDM can thus be
viewed as a framework that makes it possible to create new
hardware modules for rapid network monitoring acceleration.

Fig. 2. Implementation of the CPD Instruction
To finish the implementation of the Change-Point Detection method in the
SDM system, a software monitoring application needs to be created. The application communicates with an SDM Controller daemon to manage the detection
details in the hardware module (see Fig. 1) and also receives detected alerts. The
main task of the monitoring application is to control the detection process and
present its results to human operators.
5 Evaluation
The correct functionality of the created implementation of the CPD block was verified using a reference software application. The reference application is written
in plain C and is not meant to be highly optimized for HLS. Its
main purpose is only to validate the functionality of the hardware implementation. In addition, the software application was extended and serves as the base for
the measuring and detection application [10] that can be used in slower networks
or for the estimation of configurable parameters for the CPD block.
We have implemented the hardware-based prototype of the NP-CUSUM detection method as an instruction block for the SDM Update block in an SDM
monitoring probe. The prototype is developed for a network interface card
with a 100 Gb/s Ethernet port and a Virtex-7 H580T FPGA, which is the main
core for the implemented detection functionality.
A detailed list of all FPGA resources needed for the implementation of one
CPD instruction block, which observes one feature, is shown in Tab. 1. The
table also contains results for other configurations of the CPD block that contain
more computational blocks with 1, 2, or 4 instances of the NP-CUSUM algorithm
and observe more features in parallel. The total number of available resources
on the used chip is 725 600 flip-flops (FF) and 362 800 look-up tables (LUT). The
number of LUTs and FFs utilized by the CPD instruction block itself therefore
accounts for less than 1 % of the available FPGA resources.
Table 1. FPGA resources used for the CPD instruction block in different configurations.

                 1 block           2 blocks          4 blocks
Name             FF      LUTs      FF      LUTs      FF      LUTs
Expression       0       458       0       496       0       479
Instance         280     252       560     504       560     504
Multiplexer      -       1842      -       1868      -       2130
Register         2253    -         2377    -         2593    -
ShiftMemory      0       806       0       816       0       814
Total            2533    3358      2937    3684      3178    3982
Performance results for the CPD instruction blocks are shown in Tab. 2 and
Tab. 3, where Tab. 3 shows detailed information about the fixed-point implementation. The Initiation Interval is required to be equal to one in order to
support processing of 100 Gb/s network traffic at full wire speed (see Sec. 3). The
only implementation that does not satisfy this requirement is the floating-point
one. Vivado HLS version 2013.2 was used for the high-level C to VHDL synthesis.
Xilinx ISE version 14.7 with enabled synthesis optimization was used for the VHDL
to FPGA netlist synthesis. Enabling optimizations such as register duplication leads
to a higher clock frequency of the final implementation and also to
higher resource consumption. The tables illustrate that after the optimization, all
performance requirements from Sec. 3 have been met by the fixed-point
implementation.
Table 2. Comparison of timing results for the synthesized CPD instruction blocks.

Parameter              Reached        Reached           Required
                       Fixed-point    Floating-point
Clock period           4.08 ns        16.48 ns          5 ns
Frequency              245 MHz        60.679 MHz        200 MHz
Latency                12             11
Initiation Interval    1              12                1
Bus Width              512 b          512 b             512 b
Achieved Throughput    125 Gb/s       2.5 Gb/s          100 Gb/s
Table 3. Performance results for the CPD instruction blocks in different configurations.

Parameter              Reached      Reached      Reached      Required
                       1 block      2 blocks     4 blocks
Clock period           4.08 ns      4.20 ns      4.20 ns      5 ns
Frequency              245 MHz      238 MHz      238 MHz      200 MHz
Latency                12           12           12
Initiation Interval    1            1            1            1
Bus Width              512 b        512 b        512 b        512 b
Achieved Throughput    125 Gb/s     121 Gb/s     121 Gb/s     100 Gb/s
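The achieved throughput values follow directly from the bus width, the achieved frequency and the initiation interval; for example, for the one-block fixed-point configuration and for the required 200 MHz:

    \text{throughput} = \frac{\text{bus width} \times f}{II}, \qquad
    \frac{512\,\text{b} \times 245\,\text{MHz}}{1} \approx 125\,\text{Gb/s}, \qquad
    \frac{512\,\text{b} \times 200\,\text{MHz}}{1} \approx 102\,\text{Gb/s} \ge 100\,\text{Gb/s}.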
Finally, Tab. 4 shows the total number of FPGA resources required for the
whole synthesized SDM system with one CPD hardware plug-in. The table shows
that about 87 % of the Virtex-7 H580T resources are still available. Therefore,
it is feasible to include several CPD hardware plug-ins in the SDM system for
parallel detection of various anomalies without a significant latency increase or
throughput loss.
Table 4. FPGA resources of the SDM system with one CPD hardware plug-in (FPGA xc7vh580thcg1155-2).

Resource Name    Used Resources [-]    Utilization Percentage
LUTs             47731                 13 %
Registers        21089                 2 %
BRAMs            107                   11 %
6 Related Work
We present a brief overview of related work with regard to the differences from
our work. This section can be divided into two main domains: the first is related
to hardware accelerated detectors and the second to the detection methods themselves. From the hardware point of view, there are two
interesting projects somewhat similar to ours – Gorilla and Snabb Switch.
The Gorilla project [11] is the closest comparable solution that we found.
Gorilla is a methodology for generating FPGA-based solutions especially well-suited for data-parallel applications. The main goal of Gorilla is the same as
our goal with the SDM Update block – to make the hardware design process easier and
faster. Our solution is, however, specially designed for the stateful processing of
network packet data. Furthermore, SDM is able to work with layers L2–L7 of the
ISO/OSI model. In addition, the resource consumption of Gorilla is higher than
that of our solution.
The Snabb Switch project [12] shows a different approach to network packet
processing. This approach uses modified drivers for a faster transfer of network
packets from the network interface card to the computer's memory. The transferred data
are then processed by network applications. There is also a Snabb
Lab with an accessible platform for measurements. This platform consists of a
Supermicro motherboard with dual Xeon E5 processors and 20×10 GbE (Intel 82599ES)
network cards. This configuration allows processing network traffic at a speed of
200 Gb/s. Massive usage of this platform is complicated due to the large number
of network cards. Our solution is able to process network traffic at a speed of
100 Gb/s on one Ethernet line (2 ports allow achieving 200 Gb/s). Our work
is focused on full hardware acceleration of network traffic processing using
only one 100 Gb/s Ethernet port.
From the detection method point of view, there are various existing approaches to anomaly detection from many authors. Detection of SYN flood attacks has been studied and well described in many papers. However, this issue
is still relevant because of the increase of network traffic volumes. Detection based on NP-CUSUM is used by Wang et al. in [13], where the authors
present their observations about SYN-FIN pairs in network traffic under normal
conditions: (1) there is a strong positive correlation between the SYN and RST
packets; (2) the difference between the number of SYN and FIN packets is close
to the number of RST packets. The authors provide an experimental evaluation of
flood detection using NP-CUSUM; however, they mention a possible disadvantage of aggregated packet counting, which can be spoofed by an attacker emitting
mixed packet types.
Siris et al. in [14] compare a straightforward adaptive threshold algorithm,
which can give satisfactory performance for high-intensity attacks, with an algorithm based on the cumulative sum (CUSUM). The adaptive threshold algorithm uses
the difference from a moving average value computed, e.g., by the EWMA algorithm. An
alarm is signalled when the measured value is higher than the moving average in the last k
consecutive intervals. The CUSUM variant of the detection algorithm is influenced
by the seasonality and trends of network traffic (weekly and daily variations, trends
and time correlations). The authors propose to use a prediction method to remove the non-stationary behavior before applying the CUSUM algorithm. However,
because of time-consuming calculations with minor gains compared to simpler
approaches, they used a simpler approach based on applying CUSUM
to the difference between the measured value and the result of the Exponentially Weighted Moving Average (EWMA) [15] algorithm.
Smoothing of the data signal is important for minimizing the number of false
alarms that can be caused by high peaks in the data. Therefore, the data are usually
preprocessed to suppress short-time deviations and to detect long-term anomalies. There
are various approaches to smoothing the signal; a possible way is to exploit
a prediction method such as the moving average, EWMA, Holt-Winters [16],
or Box-Jenkins (ARIMA) [17] methods. However, the dependency of an algorithm
on historical and current measured values can be dangerous and can lead to
overlooking an attack. The issue of self-learning and self-adaptive approaches is
being studied in our current and future work; however, it is out of the scope of
this paper.
Salem et al. presented currently used methods of network anomaly
detection in [18]. The paper evaluates the usage of an extended NP-CUSUM called
multi-chart NP-CUSUM, proposed by Tartakovsky et al. in [19], in combination
with the Count-Min Sketch and the Multi-Layer Reversible Sketch (sketching methods are
proposed, e.g., in [20]) for data aggregation and anomaly detection.
This paper is focused on the hardware implementation of the detection
method, whereas other authors usually more or less rely on software processing
of aggregated data. Our solution allows the detection method to run in real time,
independently of a possibly overloaded software part of the system.
7 Conclusions
In this paper, we present the implementation and evaluation of the CPD algorithm
(NP-CUSUM) as a hardware plug-in for the Software Defined Monitoring system.
We achieve easy and rapid development of detection hardware blocks for the
FPGA thanks to the usage of high-level synthesis. Also, the creation of a monitoring
probe utilizing the newly implemented detection method is very simple and straightforward thanks to the utilization of SDM as the platform for high-speed packet
processing. Moreover, we show a frequency and FPGA resource evaluation of the
hardware implementation for the Virtex-7 H580T FPGA, which is large enough
and fast enough to accommodate complex network processing.
The results presented in this paper show that our implementation of NP-CUSUM
is capable of processing network traffic at speeds up to 100 Gb/s. The firmware
of the whole monitoring probe consumes only 13 % of the available resources of
the target FPGA and thus leaves space for several additional CPD (NP-CUSUM)
hardware plug-ins that can be used for parallel detection of multiple kinds of
network anomalies concurrently. In addition, other existing detection methods
can potentially be implemented in a similar way – as hardware SDM
plug-ins for the detection of abrupt changes of network traffic characteristics. The
limiting factor for deploying detection hardware plug-ins into a monitoring probe
is the consumption of FPGA resources. Generally, detection methods with low
data storage requirements can be fully implemented as hardware plug-ins.
Moreover, SDM allows the creation of a hardware-software co-design where only the
most critical parts of a more complex detection algorithm are accelerated.
This partially hardware-accelerated approach can reduce the FPGA resource
requirements of advanced detection methods with a moderate performance loss.
Acknowledgment
This work is partially supported by the “CESNET Large Infrastructure” project
no. LM2010005 funded by the Ministry of Education, Youth and Sports of the
Czech Republic and the project TA03010561 funded by the Technology Agency
of the Czech Republic.
References
1. Santiago del Rio, P.M., Rossi, D., Gringoli, F., Nava, L., Salgarelli, L., Aracil, J.:
Wire-speed statistical classification of network traffic on commodity hardware. In:
Proceedings of the 2012 ACM Conference on Internet Measurement Conference.
IMC ’12, New York, NY, USA, ACM (2012) 65–72
2. Blažek, R.B., Kim, H., Rozovskii, B., Tartakovsky, A.: A novel approach to detection of “denial–of–service” attacks via adaptive sequential and batch–sequential
change–point detection methods. In: Proc. 2nd IEEE Workshop on Systems, Man,
and Cybernetics, West Point, NY. (2001)
3. Tartakovsky, A.G., Rozovskii, B.L., Blažek, R., Kim, H.: A novel approach to
detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal
Processing 54(9) (2006) 3372–3382
4. Kekely, L., Puš, V., Kořenek, J.: Software defined monitoring of application protocols. In: INFOCOM 2014. The 33rd Annual IEEE International Conference on
Computer Communications. (2014) 1725–1733
5. Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2) (1954) 100–115
6. Wang, H., Zhang, D., Shin, K.: Detecting syn flooding attacks. In: INFOCOM
2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Volume 3. (2002) 1530–1539
7. Puš, V.: Monitoring of application protocols in 40/100gb networks. In: Campus
Network Monitoring and Security Workshop, Prague, CZ, CESNET (2014)
8. Claise, B.: Cisco Systems NetFlow Services Export Version 9. RFC 3954 (2004)
9. Feist, T.: Vivado design suite. White Paper (2012)
10. Čejka, T.: Fast TCP Flood Detector. http://ddd.fit.cvut.cz/prj/FTFD (2014)
11. Lavasani, M., Dennison, L., Chiou, D.: Compiling high throughput network processors. In: Proceedings of the ACM/SIGDA international symposium on Field
Programmable Gate Arrays. FPGA ’12, New York, NY, USA, ACM (2012) 87–96
12. Gorrie, L.: Snabb switch. http://www.snabb.co (2014)
13. Wang, H., Zhang, D., Shin, K.: Detecting syn flooding attacks. In: INFOCOM
2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Volume 3. (2002) 1530–1539
14. Siris, V.A., Papagalou, F.: Application of anomaly detection algorithms for detecting SYN flooding attacks. Computer communications 29(9) (2006) 1433–1442
15. Ye, N., Borror, C., Zhang, Y.: Ewma techniques for computer intrusion detection
through anomalous changes in event intensity. Quality and Reliability Engineering
International 18(6) (2002) 443–451
16. Brutlag, J.D.: Aberrant behavior detection in time series for network monitoring.
In: LISA. (2000) 139–146
17. Box, G., Jenkins, G., Reinsel, G.: Time Series Analysis: Forecasting and Control.
Wiley Series in Probability and Statistics. Wiley (2013)
18. Salem, O., Vaton, S., Gravey, A.: A scalable, efficient and informative approach
for anomaly-based intrusion detection systems: theory and practice. International
Journal of Network Management 20(5) (2010) 271–293 00019.
19. Tartakovsky, A.G., Rozovskii, B.L., Blažek, R.B., Kim, H.: Detection of intrusions
in information systems by sequential change-point methods. Statistical Methodology 3(3) (2006) 252–293
20. Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection:
methods, evaluation, and applications. In: Proceedings of the 3rd ACM SIGCOMM
conference on Internet measurement, ACM (2003) 234–247
Hardware Accelerated Book Handling with
Unlimited Depth
Milan Dvorak, Tomas Zavodnik, and Jan Korenek
Brno University of Technology, Brno, Czech Republic,
[email protected], [email protected],
[email protected]
Abstract. The strong competition between market participants on electronic exchanges calls for a continuing reduction of the latency of trading systems. Recent efforts are focused on hardware acceleration using FPGA
technology and on running trading strategies directly in hardware to eliminate the high latency of the system bus and software processing. For any trading
system, book handling is an important time-critical operation, which has
not been accelerated using the FPGA technology yet. Therefore, we propose a new hybrid hardware-software architecture that processes messages from the exchange and creates the book with the best buy and sell
prices. Based on the analysis of transactions on the exchange, we propose to store only the most frequently accessed price levels in hardware
and keep the rest in the host memory managed by software. This enables
handling half of the whole stock universe (4 000 instruments) in a single
FPGA. An update of price levels in hardware takes only 240 ns, which is
two orders of magnitude faster than recent software implementations.
The throughput of the hardware unit is 75 million messages per second,
which is 140 times more than current peak rates of the market data.
1 Introduction
Electronic trading dominates today's financial markets. Market participants
communicate with the exchange by sending messages via a computer network.
Techniques of algorithmic and high frequency trading (HFT) are widely adopted.
Traders no longer focus on specific trades, but rather on tweaking the parameters of
the algorithm which is responsible for the trading itself.
HFT traders utilize the latest network technologies to gain an advantage over the rest of the market. There is strong competition even between the individual traders, who compete with each other to achieve the lowest
latency of their systems. This is a key factor for making their profit. Therefore, a
significant effort is being put into accelerating the systems for electronic trading
by both academic and commercial institutions.
The first efforts in the acceleration of trading systems were focused on the latency of data
transfers from the network interface to the processor using a dedicated acceleration
card [1] [2]. A further reduction of the latency was achieved by accelerating the
decoding of messages from the exchange [3] [4]. The latest efforts aim to realize
the whole system in the FPGA chip [5]. This eliminates the latency of system
data bus transfers and ensures the lowest latency possible. Still, many parts of
trading systems have not been accelerated using the FPGA yet. For instance,
Lockwood [5] does not address book handling, which is crucial for processing
the data feed from the exchange. An architecture for handling an aggregated book
with limited depth is proposed in [6]; however, many exchanges (including the
stock markets) use a so-called book with unlimited depth (see Section 2),
which has not yet been accelerated using the FPGA technology.
This paper presents a hybrid hardware-software architecture that enables
handling of the book with unlimited depth. Only the best N price levels are
stored in hardware. These levels can be updated with a latency of only 240 ns, which
is two orders of magnitude faster than recent software implementations. Software
manages the complete book with all price levels. A synchronization protocol is
used to ensure the data consistency in hardware. Further, we discuss a trade-off
between the number of price levels stored in hardware, the risk of underflow and
the number of messages transferred over the system bus. The architecture was
synthesized for the Virtex-7 technology running at 150 MHz. Using two QDR SRAM
modules with a total capacity of 144 Mbits, it is possible to store the book for up
to 4 thousand financial instruments, which corresponds to a half of the whole
NASDAQ (National Association of Securities Dealers Automated Quotations)
stock exchange. Only two cards are needed to handle the whole exchange.
This paper is divided into six sections. After this introduction follows the
problem statement, in which we discuss how financial exchanges work and
what the order book is. The third section presents an analysis of the memory requirements of book handling with unlimited depth. The hardware architecture for
the acceleration of this problem is described in the fourth section. After that we
show the experimental results. The last section contains a conclusion.
2 Problem Statement
A financial exchange is an institution which provides services for trading various
financial instruments, for example stocks, derivatives or commodities. The current
price of the traded instruments is usually determined by a mutual auction between
the supply (sell) and demand (buy) sides.
Market participants send trading orders to the exchange. These orders define
what instrument, at which price and in what quantity they want to trade. An example
of such an order can be: buy 50 shares of Apple stock for 91 dollars, or sell 30 shares
of Microsoft stock for 42 dollars. For every new order, the exchange tries to find
corresponding buy and sell orders and execute a transaction. If no corresponding
order is found, the new order is stored in the so-called book.
The book contains all open trading orders for financial instruments. The exchange needs to inform all users about the current state of the market. Usually,
the exchange simply forwards the information about individual trading orders
to the users. Thus, if a trader places a new order which is not immediately executed, the exchange assigns a unique identifier to the order and sends an ADD
message to all users. This message represents the addition of a new order to the book.
It usually consists of the order identifier, instrument identifier, required
price, quantity and an indication whether it is a buy or sell order.
If the trader decides to change his/her current order, the exchange generates a
MODIFY message. This message usually consists of the order identifier, the changed
price and the changed quantity. It does not contain the instrument identifier, previous price and quantity, because this information was included in the previous
ADD message.
The last message type is a DELETE message. It is created if the user cancels his/her
order or if the order is matched and executed. DELETE messages contain only the
order identifier, because the other information is known from the previous
ADD and MODIFY messages.
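To make the three message types concrete, the following minimal C++ sketch summarises their typical fields; the structures and field widths are illustrative (based on the sizes discussed later in this section) and do not reproduce the wire format of any real exchange feed.

    #include <cstdint>

    // Illustrative message layouts; a real exchange feed encodes these differently.
    struct AddMessage {       // new order inserted into the book
        uint64_t order_id;
        uint16_t instrument_id;   // 15 bits used
        uint32_t price;
        uint32_t quantity;
        bool     is_buy;          // buy/sell flag
    };

    struct ModifyMessage {    // change of an existing order
        uint64_t order_id;
        uint32_t new_price;
        uint32_t new_quantity;
    };

    struct DeleteMessage {    // order cancelled or fully executed
        uint64_t order_id;
    };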
Information about individual orders in the book is not essential for the
traders. The trading algorithms usually use the values of the best prices at which
the relevant instruments are traded. Therefore, a system for processing messages
from the exchange needs to convert the information about individual orders into
aggregated information about the best prices. The main principle of
this processing is to join the orders with the same price, accumulate their quantities and sort the resulting price levels according to the price. This is how the
aggregated book is created [6]. The number of price levels is called the book
depth. This number is unlimited, because the prices are set by users. Therefore,
the book is called a book with unlimited depth.
An example of an order book is shown in Table 1. Only two instruments (stocks)
are shown in the table. A real book can have thousands of instruments. Each
instrument has its unique ID (AAPL and MSMT in our case). The table shows the
lists of buy and sell orders for both instruments. Each order has a unique order
ID (not shown in the table), a price and a quantity. Orders are sorted according to
the price. We can see that multiple orders can have the same price. If we join
these orders and sum up their quantities, we get the aggregated price level. For
instance, the first level on the buy side of AAPL consists of three orders with price
91.05 and total quantity 15 (8+2+5).
Since some values are omitted in the MODIFY and DELETE messages, it is
necessary to store for each order a record with the information from the ADD
message to be able to update the price levels accordingly. For each order, we need
to store its identifier (64 bits), price (32 bits), quantity (32 bits), instrument
identifier (15 bits) and a buy/sell flag (1 bit). It means that every order needs
144 bits of memory.
The aggregated information for each price level contains the price (32 bits), the accumulated quantity (32 bits) and the count of accumulated orders (16 bits), which is 80
bits in total.
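A minimal sketch of the two stored records implied by these sizes might look as follows; the C types are rounded up to whole bytes, while the comments give the bit widths from the text.

    #include <cstdint>

    // Stored per-order record, 144 bits in the target format.
    struct OrderRecord {
        uint64_t order_id;        // 64 b
        uint32_t price;           // 32 b
        uint32_t quantity;        // 32 b
        uint16_t instrument_id;   // 15 b
        bool     is_buy;          //  1 b
    };

    // Aggregated price level, 80 bits in the target format.
    struct PriceLevel {
        uint32_t price;           // 32 b
        uint32_t quantity;        // 32 b, accumulated over all orders at this price
        uint16_t order_count;     // 16 b
    };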
The total memory requirements for book handling depend on the number of orders placed by users during the day and the number of price levels. A major stock exchange which needs a book with unlimited depth is the American stock exchange
NASDAQ. Therefore, we provide a detailed analysis of real data from the NASDAQ
exchange.
Order Book

AAPL (ID for Apple)                          MSMT (ID for Microsoft)
Buy Orders           Sell Orders             Buy Orders           Sell Orders
Price    Quantity    Price    Quantity       Price    Quantity    Price    Quantity
91.05    8           91.10    21             42.30    16          42.40    28
91.05    2           91.15    10             42.30    11          42.43    14
91.05    5           91.15    15             42.28    31          42.43    30
91.00    12          91.20    85             42.28    10          42.43    5
90.95    32          ...      ...            ...      ...         42.45    44
...      ...                                                      ...      ...

Table 1. Order Book example
3 Analysis
The goal of the analysis is to determine the memory requirements of the book handling
operation. To provide a precise analysis, we use real data from the NASDAQ exchange, which was captured on 3rd October, 2013. The exchange manages almost
8 000 stocks (instruments). The maximum number of orders in the book during the
day is over 1.5 million. With 144 bits per order, at least 206 Mbits are required
to store all instruments in the book. This amount of data cannot be stored in
the on-chip memory. The largest FPGAs have only 66 Mbits of on-chip memory.
All orders can be stored only in an external memory.
Almost 350 000 price levels on each of the buy and sell sides were created from these
orders, which is 700 000 price levels in total. That is 88 price levels per
instrument on average; however, the maximum book depth is 3 000. With 80 bits
for each price level, we need 54 Mbits to store the whole book. This amount of
data uses most of the on-chip memory capacity even of the largest available
FPGA chips. No memory would be left for the trading strategy and other parts
of the trading system.
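The figures above follow directly from the per-record sizes (using 1 Mbit = 2^20 bits):

    1.5 \times 10^{6}\ \text{orders} \times 144\ \text{b} = 216 \times 10^{6}\ \text{b} \approx 206\ \text{Mbits}, \qquad
    700\,000\ \text{levels} \times 80\ \text{b} = 56 \times 10^{6}\ \text{b} \approx 54\ \text{Mbits}.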
The analysis of the total memory requirements and the length of the list of price
levels implies that it is not possible to handle the complete book with unlimited
depth in the FPGA. We can store and update only a few of the best price levels in the
FPGA and keep the remaining levels in the host memory, where they can be simply
managed by software with a higher latency. This principle is supported by the
typical behavior of traders, who usually use only a few of the best price levels to make
their trading decisions. To evaluate the feasibility of this idea, we performed an
analysis of accesses to the list of price levels.
We created a histogram of the price level indexes accessed by each
of the ADD, MODIFY and DELETE operations during one day. The characteristics
of the accesses were similar for all operations and also for the buy and sell sides.
The accumulated histogram for all operations and both the buy and sell sides is shown
in Fig. 1.
Fig. 1. Histogram of access distribution among the price levels (y-axis: accesses percentage; x-axis: price levels range, binned 1–8, 9–16, 17–24, 25–32, 33–40, 41–3000)
The histogram shows that there is a strong locality of accesses to the price
level list. More than 94 % of the updates hit the top 24 levels and more than 97 %
of the updates are covered by 32 price levels. Only 1.5 % of the updates refer
to levels 41 to 3 000. It has to be noted, however, that the histogram does not
provide information about the movement of price levels in time. Individual price
levels are added and deleted during the trading day. Therefore, it is possible that
the current top price level in the list was at a much lower position only a couple
of microseconds earlier.
The locality of price level accesses therefore supports the idea of storing only the
best price levels in the hardware. Due to the dynamic nature of the data structure, it is necessary to deal with a possible underflow when records from lower
positions of the list are moved to the top of the list.
4 Architecture
Based on the analysis of transactions on the exchange in the previous section, we
propose to store only the most frequently accessed price levels in hardware and
keep the rest in the host memory managed by software. The FPGA operates as
a hardware cache. It provides the best price levels to the trading strategy as fast
as possible. Relevant messages from the exchange are used to update these price
levels with low latency. Software processes all messages from the exchange and
stores the complete book. Thus, the software is able to detect a possible underflow
in hardware caused by the deletion of some price levels. A special message is
then generated and sent via the system bus to provide the missing information to
the FPGA. The hardware architecture of the book handling consists of three steps
illustrated in Fig. 2.
Fig. 2. High level architecture of book handling with unlimited depth (Messages → Instrument Table → Order Table → Price Level Table → Top Price Levels)
The first step is the conversion of the instrument identifier to an internal address.
The internal address is used in the Price Level Table to store data for each instrument. The set of instrument identifiers is known in advance and does not change
during the processing. Therefore, the instrument identifiers can be represented
by a static dictionary and implemented by the architecture published in [6]. The dictionary provides the instrument address as a result. The address is then passed to
the next step of processing together with the message from the exchange.
The second step is the management of the Order Table, which stores all orders from
the exchange. It is a dynamic table because individual orders can be created and
deleted during the day. A fast look-up among a large number of orders is required. To
achieve low latency and high throughput, some hashing scheme has to be used.
We propose to use Cuckoo hashing [7], which provides a fast look-up together with
efficient memory utilization [8] [9] [10].
The Order Table component processes the ADD, MODIFY and DELETE messages from
the exchange. Depending on the message type, an order is added to the table,
modified, or deleted from the table. As discussed in Section 2, we need to convert
the information about individual orders to the price levels at which the instruments are
traded on the exchange. Each order message generates an update for the price
level table. ADD messages always increase the quantity on the corresponding price level.
The size of this increase depends on the quantity in the new order. Similarly, DELETE
messages decrease the quantity on the corresponding price level. A MODIFY message can
cause either an increase or a decrease of the quantity. The result depends on how the
order was modified.
The last step of the proposed architecture is the update of the price level table. It
processes the updates from the order table and stores the price level data. The
table can store up to N price levels for every instrument, where N is configurable
and has a direct impact on the performance of the architecture. The performance is
discussed in detail in Section 5. When an update from the order table is received,
all price levels of the instrument are read from the memory. The address of the
record was computed in the instrument table. The one-bit buy/sell flag is added
to the address in this component, because the price levels are stored separately
for the buy and sell sides.
Updates from the order table can cause one of the following operations on
the price level table (a simple software model of these operations is sketched after the list):
– Modification of a price level, when the updated price level is already in the
table. The order quantity is added to or subtracted from the existing level.
– Insertion of a price level, when the price level to be increased does not exist.
This requires shifting the lower price levels down.
– Deletion of a price level, when the updated price level is decreased to zero
quantity. This requires shifting the lower price levels up.
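A minimal sequential software model of these three operations on one side of the book is sketched below; the hardware performs the comparisons and shifts in all processing elements in parallel, and the container, names and types used here are purely illustrative.

    #include <cstddef>
    #include <vector>

    struct Level { unsigned price; long quantity; unsigned orders; };

    // Software model of the update on the stored levels of the buy side
    // (a higher price is a better level). In hardware only the best N levels
    // are kept; here the vector is allowed to grow for simplicity.
    // delta > 0 comes from ADD, delta < 0 from DELETE or decreasing MODIFY updates.
    void update_buy_levels(std::vector<Level>& levels, unsigned price, long delta) {
        for (std::size_t i = 0; i < levels.size(); ++i) {
            if (levels[i].price == price) {                      // modification
                levels[i].quantity += delta;
                if (levels[i].quantity <= 0)                     // deletion:
                    levels.erase(levels.begin() + i);            //   lower levels shift up
                return;
            }
            if (price > levels[i].price && delta > 0) {          // insertion:
                levels.insert(levels.begin() + i,                //   lower levels shift down
                              Level{price, delta, 1});
                return;
            }
        }
        if (delta > 0)                                           // worse than all stored levels
            levels.push_back(Level{price, delta, 1});
    }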
Update operations are implemented in parallel by processing elements (PE)
at each price level. In the following text, we denote the price levels PL_i and the
corresponding processing elements PE_i for 1 ≤ i ≤ N. Processing element PE_i
has 4 data inputs, PL_{i-1}, PL_i, PL_{i+1} and the new price level PL_new, which
is calculated from the input message. Each element also has a control input OP,
which denotes the type of update operation, and a control output cmp_i, which is
the result of the comparison between the current (PL_i) and the new (PL_new) price
level.
The detailed design of a processing element is shown in Fig. 3. The CMP block compares the input price level PL_i with the new price level PL_new and creates the
signal cmp_i. The MODIFY block implements the increase or decrease of the price level
quantity if the new and the current prices are equal; otherwise, it just forwards
the new price level. The comparison result and the type of update OP are also
used in the SEL LOGIC block to determine the select signal of the output multiplexer
MX. The type of update determines the shift direction, and the comparison result determines
whether the price level is below the inserted/deleted level and thus needs to be
shifted. The multiplexer then simply selects one of its inputs to implement the
required update operation.
The interconnection of the processing elements and the memory is shown in Fig.
4. Each element fetches the corresponding price level from the memory and sends it
to both neighbors (inputs PL_{i-1} and PL_{i+1}). The new price level input is shared
by all processing elements and compared with the corresponding price level PL_i. All
comparison results cmp_i are processed by the control logic block to determine the type of operation OP (modification, insertion or deletion). The processing
elements use the operation type to select the output price level, which is then
written back to the memory. The top price levels are also passed to the trading
algorithm (not shown in the figure).
5 Experimental results
The proposed architecture has been implemented in VHDL and tested on the FPGA
acceleration card COMBO-80G. The card has a fast PCIe interface, eight 10GE
ports and is equipped with a Virtex-7 XC7VX690T chip and two QDR-II+ SRAM
memory modules with a total capacity of 144 Mbits.
Fig. 3. Architecture of processing element
Fig. 4. Architecture of price level update block
The VHDL implementation was synthesized by the Xilinx Vivado tool version
2013.4 with 165.5 MHz as the maximum achievable frequency. Nevertheless, all
tests on the exchange data were performed at 150 MHz. The architecture is
pipelined and a multi-cycle constraint is used for the parallel update of all
price levels. It means that each update takes 2 clock cycles and the throughput
of the unit is 75 million messages per second, which is 140 times more than the
peak rate of the used data set. The latency is only 4 clock cycles, because the
architecture needs 1 cycle for the memory read, 2 cycles for the update and 1 cycle for
the write back. Thus, the latency of the price level update is 27 ns.
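At 150 MHz (a clock period of about 6.7 ns), these figures follow directly:

    \text{throughput} = \frac{150\,\text{MHz}}{2\ \text{cycles}} = 75 \times 10^{6}\ \text{messages/s}, \qquad
    \text{latency} = 4 \times \frac{1}{150\,\text{MHz}} \approx 27\,\text{ns}.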
The Order Table component has the biggest impact on the overall latency
(see Fig. 2). Each operation in the Order Table requires a read access to the
QDR memory, which takes 180 ns. The overall latency of the whole full order
book architecture is 240 ns.
The card is equipped with the QDR-II+ SRAM memory, which has only a limited
capacity (144 Mbits). Therefore, it was not possible to store orders for all traded
instruments. To process the whole NASDAQ exchange, we needed to use two
cards, each handling half of the instruments (4 000).
We also measured the requirements on the FPGA on-chip memory. The total amount of consumed memory is affected by two parameters: the number of instruments and the number of stored price levels N. The effect of these two parameters
on the resource consumption is shown in Table 2.
Number of       4096 instruments                             8192 instruments
price levels    Register    LUT           BRAM               Register    LUT           BRAM
8               740 (0 %)   5551 (1 %)    242 (16 %)         783 (0 %)   5600 (1 %)    483 (32 %)
16              844 (0 %)   8441 (1 %)    482 (32 %)         862 (0 %)   10646 (2 %)   963 (65 %)
24              680 (0 %)   11951 (2 %)   722 (49 %)         680 (0 %)   13393 (2 %)   1443 (98 %)
32              806 (0 %)   15310 (3 %)   962 (65 %)         911 (0 %)   15411 (3 %)   1923 (130 %)

Table 2. Comparison of resource consumption for different number of symbols and price levels
It can be seen that the number of used registers and LUTs is very low even for
8192 instruments and 32 price levels. The on-chip memory consumption increases
linearly with both the number of instruments and the number of price levels. Up to
32 price levels can be stored for 4096 instruments, but only 16 price levels are
possible for 8192 instruments. The memory required for 8192 instruments and 32
price levels exceeds the capacity of the chip, hence the 130 % value in the table.
We also performed an evaluation of the synchronization process between hardware and software on the same captured exchange data as in Section
2. We analysed the number of messages between the acceleration card and the software.
Further, we observed the deepest underflow in the hardware (minimal number of
valid price levels) for different numbers of price levels in hardware and how often
the underflow reached a threshold value. We set the threshold value to 5 because
this is the usual number of price levels in an aggregated book.
The results of the evaluation are shown in Table 3. The number of synchronization messages generated by software decreases with the number of price levels.
This can be explained by the fact that a higher number of price levels corresponds
to a higher number of instruments which can be fully stored in hardware and
do not need synchronisation messages. The risk of underflow also decreases with
a higher number of price levels. Having only 8 price levels is not enough because
underflows occur very often. Underflows reaching the threshold level still
happen even for N = 16; however, the minimum number of valid levels is 4. There
are no underflows in the case of 24 and 32 levels, and more than 50 % of the price levels
in hardware are valid all the time.
Our analysis suggests that a higher number of price levels in hardware can help
to reduce the risk of underflow and also decrease the utilization of the system
bus. The key factor is therefore the amount of memory on the chip. It is possible
to increase the number of price levels by reducing the number of symbols
and vice versa.
No. of levels    Messages from SW    Lowest valid level    Threshold reached
8                887270              0                     42 487
16               327581              4                     88
24               152218              13                    0
32               63251               21                    0

Table 3. Analysis of underflow risk and system bus utilization for different number of price levels
6 Conclusions
We introduced a hybrid hardware-software architecture to accelerate book handling with unlimited depth for low-latency trading systems. The proposed architecture processes messages from the exchange and creates the book with the best
buy and sell prices. Based on the analysis of transactions on the exchange, we
propose to store only the most frequently accessed price levels in hardware and
keep the rest in the host memory managed by software. The software is able
to detect a possible underflow in hardware and provide the missing information
to the hardware via the system bus. Moreover, we analysed the impact of the number of price
levels in hardware on the system bus utilization and the risk of underflow. To the best of our knowledge, this is the first published FPGA architecture
for book handling with unlimited depth. The proposed architecture has a latency of
only 240 ns, which is two orders of magnitude faster than recent software implementations. The throughput of the hardware architecture is 75 million messages
per second, which is 140 times more than current peak rates in available market
data.
Acknowledgment
This work was supported by the IT4Innovations Centre of Excellence
CZ.1.05/1.1.00/02.0070 and the BUT project FIT-S-14-2297.
References
1. G. W. Morris, D. B. Thomas, and W. Luk, ”Fpga accelerated low-latency market
data feed processing”. In Symposium on High-Performance Interconnects, 2009,
vol. 0, pp. 83–89.
2. H. Subramoni, F. Petrini, V. Agarwal, and et al., ”Streaming, low-latency communication in on-line trading systems”. In 2010 IEEE International Symposium on
Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010, pp.
1–8.
3. C. Leber, B. Geib, and H. Litz, ”High frequency trading acceleration using fpgas”.
In 2011 International Conference on Field Programmable Logic and Applications
(FPL), 2011, pp. 317–322.
4. R. Pottathuparambil, J. Coyne, J. Allred, W. Lynch, and V. Natoli, ”Low-latency
fpga based financial data feed handler”. In 2011 IEEE 19th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM), 2011,
pp. 93–96.
5. J. W. Lockwood, and et al., ”A Low-Latency Library in FPGA Hardware for
High-Frequency Trading (HFT)”. In IEEE 20th Annual Symposium on HighPerformance Interconnects, 2012, pp. 9–16.
6. M. Dvorak, and J. Korenek, ”Low Latency Book Handling in FPGA for High
Frequency Trading”. In 2014 IEEE 17th International Symposium on Design and
Diagnostics of Electronic Circuits & Systems (DDECS), 2014, pp. 175-178.
7. R. Pagh, and F. F. Rodler, ”Cuckoo hashing”, Journal of Algorithms, vol. 51, no.
2, pp. 122–144, May 2004.
8. L. Kekely, M. Zadnik, J. Matousek, and J. Korenek, ”Fast Lookup for Dynamic
Packet Filtering in FPGA”. In 2014 IEEE 17th International Symposium on Design
and Diagnostics of Electronic Circuits & Systems (DDECS), 2014.
9. T. Tran, and S. Kittitornkun, ”Fpga-based cuckoo hashing for pattern matching
in nids/nips,” in MNGNS, ser. LNCS, 2007.
10. A. Kirsch, M. Mitzenmacher, and U. Wieder, ”More robust hashing: Cuckoo hashing with a stash,” in ESA, ser. LNCS. Springer, 2008.
Composite Data Type Recovery in a
Retargetable Decompilation
Dušan Kolář and Peter Matula
DIFS FIT BUT Brno, Božetěchova 1/2, 612 66 Brno, Czech Republic
{kolar, imatula}@fit.vutbr.cz
Abstract. Decompilation is a reverse engineering technique performing
a transformation of a platform-dependent binary file into a High Level
Language (HLL) representation. Despite its complexity, several decompilers have been developed in recent years. They are not yet advanced
enough to serve as standalone tools, but combined with traditional
disassemblers, they allow a much faster understanding of the analysed machine
code. To achieve the necessary quality, many advanced analyses must be
performed. One of the toughest, but most rewarding, is the data type reconstruction analysis. It aims to assign each object a high-level type,
preferably the same as in the original source code. This paper presents
the composite data type analysis used by the retargetable decompiler developed within the Lissom project at FIT BUT. We design a whole new
address expression (AE) creation algorithm, which is both retargetable
and suitable for operating on code in the Static Single Assignment (SSA)
form. Moreover, we devise new AE aggregation rules that increase the
quality of the recovered data types.
1 Introduction
In recent years, there has been a growing threat of new malware attacking a wide
range of intelligent devices other than personal computers. Nowadays, our smartphones, routers, televisions, or gaming consoles are no longer safe from computer
criminals. Washing machines, refrigerators, or microwave ovens are likely to follow in the near future. Since these devices usually have some specific purpose,
they often use dedicated processor architectures or operating systems. Because of this, it has become increasingly more difficult to analyse potentially
dangerous binaries. The solution may be a new tool, a retargetable decompiler,
capable of translating platform-dependent executables into some common high
level language. The main tasks of such a program are the reconstruction of high-level
control flow, functions, objects and data types. Data type reconstruction can
be further divided into two sub-tasks: simple and composite data type recovery.
In this article, we present an advanced composite type recovery algorithm implemented by the Lissom project's retargetable decompiler. Our approach has
several advantages over the state-of-the-art competition.
This paper is organised as follows. In Section 2, we discuss related work
that has been done on the subject. Section 3 briefly presents the Lissom project's
decompiler. The basic scheme of our data type recovery system is described in
Section 4. The main subject of this paper, the composite data type recovery analysis,
is explained in detail in Section 5. Section 6 presents experiments with our approach and
finally, Section 7 draws a conclusion and outlines our future work.
2 Related Work
Mycroft [1] presents a unification-based simple and composite data type reconstruction algorithm that lays down the basic principles used by most of its
successors. Emmerik [2] uses a similar technique optimised for the SSA form.
It minimises memory requirements by storing only the types of object definitions.
All occurrences of an object and all derived objects only refer to the original definition. Type Inference on Binary Executables [3] is one of the latest papers on
the subject. It presents a state-of-the-art approach capable of high-level data type
reconstruction using the unification of type terms extended to support subtypes.
It emphasises convergence, precision and conservativeness, and is able to exploit the results of both static and dynamic analysis. Of the existing decompilers,
the Hex-Rays [4] plugin for the IDA Pro disassembler [5] is the most used, and
arguably the most advanced, reverse engineering tool today. It has both array and
structure reconstruction capabilities. It is, however, questionable whether it performs
any state-of-the-art type aggregation.
Our work is closest to the composite data type recovery in the SmartDec
decompiler [6, 7], which is briefly described in the remainder of this section.
Since SmartDec produces output in the C programming language, the composite
type recovery aims at structure and array detection. Unions are not taken into
account. The algorithm is based on a memory access analysis that is divided
into two main steps.
(1) Memory access addresses are expressed according to Equation 1. Base
b is either a global memory address, or the name of an object holding such an address
during the execution. Offset o is the distance of the accessed element from the base.
The rest is a multiplicative component (MC) typical for array iterations. A single
element represents one multiplication, where C_i is a constant value multiplying
the iterator x_i. Multiple elements indicate multidimensional accesses. The same
concept can be expressed in the form of an Address Expression (AE) in Equation 2.
The multiplication pairs are arranged in a list (denoted as [ ]) in descending order
by the value of C_i. A non-relevant MC can be represented by the symbol m.
[7] computes such AEs for every machine code object using a fix-point data-flow analysis. Most results are thrown away and only those used in memory
access operations are further processed.
b + o + \sum_{i=0}^{n} C_i x_i    (1)

AE = (b, o, [(C_1, x_1), \ldots, (C_n, x_n)])    (2)
(2) At the beginning, each AE gets its own label ae. Then, the algorithm tries
to apply aggregation rules to construct equivalence classes. Each class represents
one composite object. The first rule, in Equation 3, merges two AEs if their bases
are the same. The second one, in Equation 4, performs an array aggregation:
both bases are the same and the offset difference is lower than the maximal
constant C. This indicates an iteration performed over a complex type (i.e. an
array of structures).
\frac{ae_1 = (b_1, o_1, m_1), \quad ae_2 = (b_2, o_2, m_2), \quad b_1 \equiv b_2}{ae_1 \equiv ae_2}    (3)

\frac{ae_1 = (b, o_1, [(C, x_{11}), \ldots]), \quad ae_2 = (b, o_2, [(C, x_{21}), \ldots]), \quad |o_1 - o_2| < C}{ArrayAggregation(ae_1, ae_2)}    (4)
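Restated in code, the address expression of Equation 2 and the two aggregation rules might be sketched as follows; this is a simplified illustration, not the data structures actually used by SmartDec or by our decompiler.

    #include <cstdlib>
    #include <string>
    #include <utility>
    #include <vector>

    // Address expression (Equation 2): base, offset and the multiplicative
    // component as a list of (constant, iterator) pairs sorted by the constant.
    struct AddressExpression {
        std::string base;                                  // symbolic or global base b
        long        offset;                                // o
        std::vector<std::pair<long, std::string>> mc;      // [(C1, x1), ..., (Cn, xn)]
    };

    // Rule (3): two AEs describe the same composite object if their bases match.
    bool same_object(const AddressExpression& a, const AddressExpression& b) {
        return a.base == b.base;
    }

    // Rule (4): array aggregation when both AEs iterate with the same largest
    // constant C and their offsets differ by less than C.
    bool array_aggregation(const AddressExpression& a, const AddressExpression& b) {
        return same_object(a, b) && !a.mc.empty() && !b.mc.empty() &&
               a.mc.front().first == b.mc.front().first &&
               std::labs(a.offset - b.offset) < a.mc.front().first;
    }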
3 Lissom Project's Retargetable Decompiler
The Lissom project’s [8] retargetable decompiler [9] (available online at [10]) is
independent of any target architecture, file format, operating system, or compiler. It is using ADL processor models and extensive preprocessing to translate
the machine-code into a HLL representation. Currently, the decompiler supports MIPS, ARM, x86, PIC32, and PowerPC architectures using Windows PE
or Unix ELF object file formats (OFF).
The decompilation process (Figure 1) starts with a preprocessing phase [11].
It detects an input’s OFF, compiler, and a potential packer. If the file was
indeed packed, it applies known unpack routines. Next, a plugin-based converter
translates platform-dependent OFF file to the internal Common-Object-FileFormat (COFF). Finally, the generator automatically creates the instruction
decoder from the platform’s ADL model.
The decompiler's core performs the actual reverse compilation. It consists of
three main parts built on top of the LLVM Compiler System [12]. (1) The front-end
takes an input COFF file and, using the generated instruction decoder, creates
an LLVM intermediate representation (IR). Each machine code instruction is
translated into a sequence of LLVM IR operations characterising its behaviour in
a platform-independent way. The abstraction level of such a representation is further lifted by a sequence of static analyses, which recover local/global variables,
functions, parameters/returns, data types, etc. (2) The middle-end takes the LLVM
IR code and applies the LLVM built-in optimizations along with our own passes.
Our routines include loop optimisation, constant propagation or control-flow
simplification. (3) The back-end converts the optimised IR to the back-end IR (BIR).
Then, it runs a few more analyses such as HLL control-flow identification or object
name assignment. Finally, the output code is generated. Currently, we support
C and a Python-like language, plus assembly code, a call graph and a control-flow
graph.
4 Data Type Recovery Infrastructure
The type inference analysis (Figure 2) is part of the decompiler’s front-end. It
operates on the LLVM IR code, in which functions, global variables and local
variables have already been detected. Its goal is to associate every object (i.e.
register, global variable, local variable, function parameter, or return) with a
data type, preferably the same as in the original source code. Furthermore, it
has to modify the program's code to reflect changes made to objects' types.
Even though LLVM IR is in the SSA form, the form's condition holds only for the
temporary variables used by LLVM micro-operations. Other objects (i.e. global/local
variables and registers) are manipulated by load/store operations,
violating SSA's single-assignment rule. For this reason, the type recovery
uses the results of the reaching-definition analysis, which provides definition-use
(DU) and use-definition (UD) chains shown in Equation 5. All object identifiers
in this paper are in fact uses or definitions, i.e. pairs of an object ID and its position
in the program. To minimise memory requirements, the type recovery also uses a
sparse object representation inspired by Emmerik's [2] SSA-optimised algorithm.
Our previous article [13], dealing with the data-flow inference analysis, described
the core of the whole type reconstruction system. Using type propagation equations
and type propagation rules derived from instruction semantics, it was able to
reconstruct objects' simple data types. In [11], we showed
how to utilise sources of precise type information, such as debugging data, known
library function calls, or dynamic profiling information, to increase the quality of
simple data types, or even recover composite data objects. This paper presents
a composite data type analysis algorithm capable of recovering composite
objects (arrays, structures) without the use of precise type information, which may
not be available for every executable. The technique is based on the analysis of memory
access operations and their aggregation into possible composite objects.
    Defs(use) = {def1, . . . , defn},   Uses(def) = {use1, . . . , usen}        (5)

Fig. 1: The concept of the Lissom project's retargetable decompiler.
5  Composite Data Type Analysis
An object used as an address in a memory access operation is tagged as a pointer
by the simple data type inference. Such an object is then passed to the composite
recovery analysis, which tries to build an address expression for the address
calculation and aggregate it with other similar address expressions to form a composite
object with a complex data type. Like the procedure described in Section 2,
it does so in two steps.
1. Address expression (AE, Equation 1) creation.
2. Address expression aggregation—composite object/type construction.
5.1  Running Example
The following sections illustrate their contents on the example shown in
Figure 3 and the associated Equations 6 through 18 (the equations are explained during
the illustration).
Because the composite objects structure and a2d are declared as global
variables, address bases are used throughout the example AEs. If the objects were
local variables or allocated pointers, as is ls on line 11, it would not be possible
to statically determine their locations (addresses). In such cases, symbolic
bases are used, as depicted in Equation 12. However, the presented procedure
is generally the same.
Fig. 2: Data type recovery scheme.
     1  struct s2 { int a2[10]; int e2; };                       // size = 44 B
     2  struct s1 { int e1;
     3              struct s2 a1[10]; };                         // size = 444 B
     4  struct s1 structure;                                     // starts at address 1000
     5  int a2d[10][30];                                         // starts at address 1444
     6
     7  a2d[i][j] = X;                                           // for: i = 0..9 ; j = 0..19
     8  structure.a1[i].e2 = X;                                  // for: i = 0..9
     9  structure.a1[4].e2 = X;                                  // random array access
    10
    11  struct s1 *ls = (struct s1 *) malloc(sizeof(struct s1));
Fig. 3: Running example program used to illustrate described principles.
    ((120 ∗ (0 + r1)) + 1444) JOIN (r2 + 4)                                              (6)

    (JOIN (+ (∗ (120) (+ (0) (r1))) (1444)) (+ (r2) (4)))                                (7)

    ((120 ∗ r1) + 1444) + (r2 ∗ 4)                                                       (8)

    (+ (+ (∗ (120) (r1)) (1444)) (∗ (r2) (4)))                                           (9)

    (1444, 0, [ (120, r1), (4, r2) ])                                                    (10)

    (1444, 0, [M1, M2]),  M1 = [(120, r1), (4, r2)],  M2 = [(120, r3), (4, r4)]          (11)

    ae1 = (ls, 0, [ ])   ae2 = (ls, 4, [(44, r1), (4, r2)])   ae3 = (ls, 44, [(44, r1)]) (12)

    ae1 = (1000, 0, [ ])                  (13)   ⇒  (1000, [0, 0], [ ])                   (16)
    ae2 = (1000, 4, [(44, r1), (4, r2)])  (14)   ⇒  (1000, [4, 0], [(44, r1), (4, r2)])   (17)
    ae3 = (1000, 44, [(44, r1)])          (15)   ⇒  (1000, [4, 40], [(44, r1)])           (18)

5.2  Address Expression Creation
The AE construction used by the SmartDec decompiler, introduced
in Section 2, does not suit our needs. It employs a fix-point analysis to compute
every object's AE and further processes only those used at memory access operations.
Because of the huge number of objects implied by the SSA form of our LLVM
IR, such an approach would be highly ineffective. For this reason, we designed a
novel approach using the symbolic interpretation of address calculations. Instead
of running a full fix-point analysis on all IR objects, it evaluates (builds computation
trees for) only those objects which are involved in memory access operations.
Such trees are then transformed to equivalent AEs and further aggregated into
composite objects. The main idea is to process only the necessary objects, not
all IR objects.
Symbolic Interpretation by Binary Trees When the first pass encounters a
memory access operation, it tags object addr containing an address as a pointer
and passes it to the symbolic interpreter. In the following text, (X) represents
the unary node where X is either a value or a symbolic object name. A binary
node for the operation X ◦ Y is represented as (◦ (X) (Y)), where (X) and (Y)
are unary or binary nodes.
At the beginning, the tree consists of a single symbolic node (addr). Then,
the algorithm recursively modifies nodes according to their objects' definitions,
until the binary tree expresses the whole computation of addr's value.
The operation is either addition, multiplication, or the so-called JOIN. Other
instructions (e.g. shifts) are expressed using only these operations. If this is not
possible, the symbolic interpretation fails. The JOIN operation is used to merge
two object definitions for a single use. It is typical for array iterations, where
the first operand (definition) represents an initial iterator value, and the second
one is a value added in each iteration to get to the array's next element.
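As a rough illustration of this representation (the class and its fields are our own invention, not the decompiler's data structures), the computation trees can be modelled in Python as follows; the example builds the tree of Equation 7 for the access on line 7:

    class Node:
        """Unary node (value/symbol) or binary node (op applied to two children)."""
        def __init__(self, value=None, op=None, left=None, right=None):
            self.value = value      # int constant or symbolic name, e.g. "r1"
            self.op = op            # "+", "*", or "JOIN" for binary nodes
            self.left = left
            self.right = right

        def is_leaf(self):
            return self.op is None

    # Address of a2d[i][j] on line 7: (120*(0 + r1) + 1444) JOIN (r2 + 4)
    tree = Node(op="JOIN",
                left=Node(op="+",
                          left=Node(op="*", left=Node(120),
                                    right=Node(op="+", left=Node(0), right=Node("r1"))),
                          right=Node(1444)),
                right=Node(op="+", left=Node("r2"), right=Node(4)))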
Running example: Address computation for an array access on line 7 may
look like Equation 6 (infix notation). Equation 7 expresses the same calculation
in binary tree notation. Note that the original iterators i and j were replaced
by registers r1 , r2 .
Binary Tree Simplification To make the address expression creation easier, the
algorithm repeatedly applies various kinds of binary tree simplification
rules, such as simple arithmetic evaluation or multiplication distribution. The most
important procedure is the JOIN operation removal. Since the node
(JOIN (X) (+ (Y) (Z)))
denotes an initial iterator value X incremented Y times by the constant
increment Z, it can be transformed to (+ (X) (∗ (Y) (Z))).
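Building on the Node class sketched above, the JOIN-removal rule can be written as a single recursive rewrite; this is a simplified illustration and ignores the other simplification rules (arithmetic evaluation, distribution):

    def remove_join(n):
        """Rewrite (JOIN X (+ Y Z)) into (+ X (* Y Z)); recurse over the whole tree."""
        if n is None or n.is_leaf():
            return n
        n.left, n.right = remove_join(n.left), remove_join(n.right)
        if n.op == "JOIN" and n.right is not None and n.right.op == "+":
            y, z = n.right.left, n.right.right
            return Node(op="+", left=n.left, right=Node(op="*", left=y, right=z))
        return n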
Running example: Simplified infix and binary tree notations of the example
used in previous subsection are depicted in Equation 8 and 9.
Binary Tree to AE Conversion The final step is to transform a binary
tree into an AE. It is done by associating each node with its AE and running
a bottom-up propagation. At first, the bottom nodes are initialised with simple
AEs equivalent to their values. Then, parent nodes merge their children's
AEs into a single AE, until the root node contains an address expression
representing the whole binary tree.
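Continuing the same sketch, a bottom-up fold can turn the simplified tree into the constant part and the multiplication list of an AE; splitting the constant into a global base and an offset (e.g. recognising 1444 as a data-section address) is deliberately omitted here:

    def to_ae(n):
        """Fold a simplified tree into (constant part, [(C, x), ...])."""
        if n.is_leaf():
            if isinstance(n.value, int):
                return n.value, []               # pure constant
            return 0, [(1, n.value)]             # bare symbolic object x == 1*x
        c1, m1 = to_ae(n.left)
        c2, m2 = to_ae(n.right)
        if n.op == "+":
            return c1 + c2, m1 + m2
        if n.op == "*":                          # one side constant, the other symbolic
            const, sym = (c1, m2) if not m1 else (c2, m1)
            return 0, [(const * c, x) for (c, x) in sym]
        raise ValueError("JOIN should have been removed before conversion")

    const, mc = to_ae(remove_join(tree))
    print(const, sorted(mc, reverse=True))       # 1444 [(120, 'r1'), (4, 'r2')]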
Running example: The resulting address expression created for line 7 array
access operation is shown in Equation 10. Value 1444 is AE’s global memory
address base.
5.3  Address Expression Aggregation
AE’s associated with the root nodes are further processed in the second phase
of the composite type reconstruction. The goal is to group AE’s together to
represent a single complex object. Conversion of such objects to actual data
types is straightforward and it is not presented in this paper. The aggregation
rules used by the Lissom decompiler follows. Those adapted from the existing
techniques refer to Equations in Section 2.
AE Equivalence Aggregation Two AEs are equal (access the same element)
if their bases and offsets are equal. The multiplicative component (MC) only
determines which iterators are used in array accesses. We redefine the original AE
definition (Equation 2) to contain a list of multiplicative components according
to Equation 19. This allows aggregating two equal AEs into one, preserving
both MCs.
Running example: An access with an AE identical to Equation 10 (except using
another set of iterators, r3 and r4) can be aggregated with Equation 10, forming
the AE in Equation 11.
AE = (b, o, [M1 , . . . , Mn ]) ,
Mi∈{1,...,n} = [ (Ci1 , xi1 ), . . . , (Cim , xim ) ] (19)
Base Equivalence Aggregation A composite object (CO) in Equation 20 aggregates
AEs with a common base. Its maximum size is upperBnd. AEs are
contained in AETypeMap, an ordered list of pairs which associates an AE with its
simple data type. In the following text, mathematical structures may be accessed
in the C programming language manner (e.g. obj.upperBnd). The list ordering
is based on the predicate (ae1 < ae2 ⇔ ae1.o < ae2.o). COs with address
bases are stored (in ascending order by base) in a single global container
GlobComposite. COs with symbolic bases are kept in containers associated with the
currently processed function. When the algorithm gets an AE from a binary tree root
node, it finds the corresponding CO based on the base, or creates a new one. The AE
is inserted at the correct position in AETypeMap (aggregated with an equal AE if
needed), and the associated type is included in the simple data type propagation.
The described system implements the original aggregation Equation 3, and performs an
additional element ordering into the correct composite object layout.
Running example: The address expressions in Equations 13 through 15 form
an object with the address base 1000. It represents the structure structure.
CO = (base, upperBnd , AETypeMap)
AETypeMap = [ (AE 1 , type 1 ), . . . , (AE n , type n ) ]
(20)
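A hedged sketch of this container and of the insertion step (names, the dictionary-based storage, and the lack of explicit offset ordering are our simplifications); inserting at an already used offset performs the equal-AE aggregation of Equation 19:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class CompositeObject:
        base: object
        upper_bound: Optional[int] = None
        ae_type_map: dict = field(default_factory=dict)   # offset -> ([MCs], simple type)

        def insert(self, offset, mc, simple_type):
            # AE equivalence aggregation (Eq. 19): same base and offset -> keep both MCs
            if offset in self.ae_type_map:
                mcs, _ = self.ae_type_map[offset]
                if mc:
                    mcs.append(mc)
            else:
                self.ae_type_map[offset] = ([mc] if mc else [], simple_type)

    glob_composite = {}                                   # address base -> CompositeObject

    def get_or_create(base):
        return glob_composite.setdefault(base, CompositeObject(base))

    co = get_or_create(1000)                              # object for `structure`
    co.insert(0, [], "int")                               # Eq. 13
    co.insert(4, [(44, "r1"), (4, "r2")], "int")          # Eq. 14
    co.insert(44, [(44, "r1")], "int")                    # Eq. 15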
High Quality Types Aggregation The composite type recovery is necessary
even if precise data types are inferred from an additional type source (e.g. debug
info, a known library function call). An object's type itself is useless if it is unknown
where and how it is accessed. This information is provided by the composite
objects’ address expressions, which in return initialise their types from the precise
source.
Set Upper Object Bound Since global objects are stored in ascending order,
their upper bound is determined by the distance to the closest subsequent object.
Note that a similar principle is used in all papers presented in Section 2.
Running example: Object for structure is at address 1000, object for a2d
is at 1444. structure’s upper bound is (1444 − 1000) = 444.
Composite Object Aggregation All elements of a local composite object
are always accessed using the same symbolic base object, or its copy. Therefore,
all elements' AEs are correctly aggregated according to the subsection Base
Equivalence Aggregation. However, an element of a global composite object may be
accessed directly by its real address instead of the CO's base address and the element's
offset. Such an access creates a separate composite object and the address-based
aggregation is not possible. This situation can sometimes be solved by an aggregation
based on the upper bound of the memory region occupied by any ae1 from
the object obj1. If there exists an object obj2 whose first AE belongs to this region,
then obj2 is merged with obj1. The bound is determined by ae1's base address,
offset, multiplication constant and the maximum iterator value. The last one is
computed by a value range analysis on the loop's induction variable. The utilisation of
a value range analysis is a novel technique introduced by this paper.
Running example: The AE created for the access on line 9 is ae2 = (1220, 0, [ ]).
However, there is ae3 from Equation 15. Based on line 8, the value range analysis
determines that the maximum iterator value equals 9 and that ae3 in fact contains
ae2. The old AE is discarded and (1000, 220, [ ]) is added to the structure
object.
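A simplified sketch of this bound-based merge; the exact formula for the covered region is our assumption (base + offset + C·(max_iter + 1)), derived from the description above:

    def region_upper_bound(base, offset, c, max_iter):
        """Last byte (exclusive) that ae = (base, offset, [(c, x)]) can touch
        when the iterator x ranges from 0 up to max_iter."""
        return base + offset + c * (max_iter + 1)

    def covers(ae_base, ae_offset, c, max_iter, other_addr):
        start = ae_base + ae_offset
        return start <= other_addr < region_upper_bound(ae_base, ae_offset, c, max_iter)

    # ae3 = (1000, 44, [(44, r1)]) with r1 in 0..9 covers address 1220, so the
    # separate object created for the direct access on line 9 is folded back in.
    print(covers(1000, 44, 44, 9, 1220))     # True
    print(1220 - 1000)                       # 220 -> offset relative to the object at 1000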
Random Array Access Aggregation So far, we assumed that every array
access uses iteration. However, there also might be a number of random access
operations. Ideally, iterator and random accesses should be joined together, since
they are accessing the same composite object member. Following the logic of the
previous aggregation, it is possible to use value range analysis once again to
detect that the random access is in fact part of some previous array address
expression.
Running example: Picking ae3 = (1000, 44, [(44, r1)]) from Equation 15, the
address expression (1000, 220, [ ]) created in the previous subsection is further
aggregated into ae3 = (1000, 44, [ [(44, r1)], [(44, 4)] ]). Now it is clear that the
original AE is in fact a random array access into the fourth element of the structure.a1
member.
Multidimensional Array Aggregation The most complex step is the generalisation of the original Equation 4 to multidimensional arrays, shown in Equation 21. So
far, offsets were relative to the object base, i.e. the first address expression. In this
aggregation, each AE's offset o is replaced by the offset list OL = [o, 0, . . . , 0]. Its
size is n and it contains o as the first element. Another list CL = [C1, . . . , Cn],
which contains all the unique multiplication constants from the AE's multiplicative
component in descending order, is also created. Application of Equation 21 on ae1
(associated lists: CL1, OL1) fills the subsequent ae2's offset list
OL2 with values relative to ae1. If two offsets o2i, o1i in OL1 and OL2 at the
same position i are equal, then ae1 and ae2 are part of the same structure at
level i. However, they are not the same element, so they must differ at some
other position greater than i.
    ∀ i ∈ {1, . . . , n} :   o1i ∈ OL1, o2i ∈ OL2, C1i ∈ CL1, o2i − o1i < C1i   ⟹   o2(i+1) = o2i − o1i,  o2i = o1i        (21)
Running example: Equations 16 through 18 depict ae1, ae2 and ae3 after the
multidimensional array aggregation. Offset values o are replaced by OL lists.
The first offset in both ae2 and ae3 is the same, which indicates that these
two elements are part of the same nested structure located at offset 4 in the
composite object at address 1000. The second offset distinguishes them within this
structure. The first iterator, associated with the constant 44, is used to index the
nested array of structures named a1. The second iterator in ae2 indicates that
a2 is in fact an array of 4-byte elements.
Array Bounds At the end, when none of the above aggregations can be applied,
the algorithm infers array bounds in several different ways; rules 1–3 are also
illustrated in the short sketch after the following list. If multiple bounds
are computed for a single array, the maximum value is preferred: it
is safer to over-approximate, since under-approximation may cause out-of-range
array accesses.
1. If AE’s multiplication constant list CL = [C1 , . . . , Cn ] have at least two
elements, and AE is a multidimensional array (not a nested structure), the
nested arrays’ bounds are:
∀ i ∈ {2, . . . , n} : bound associated with Ci is equal to Ci−1 /Ci .
Running example: a2d’s second bound is (C1 /C2 ) = (120/4) = 30.
2. The first dimension C1 of ae1 is inferred from the subsequent ae2 that is not
part of the same nested structure as ae1 . If such AE exists, then the bound
associated with C1 is (ae2 .o − ae1 .o)/C1 .
Running example: structure.a1[X].a2 bound is (44 − 4)/4 = 10.
3. If an array expression ae1 is the last in global object obj1 , then the object’s
upper bound gives an upper array size estimate: (obj1 .upperBnd −ae1 .o)/C1 .
Running example: structure.a1 bound is (444 − 4)/44 = 10.
4. Initialised global arrays have values stored in data sections. The algorithm
iterates over memory slots from the object’s start, and based on the element
type it determines whether the current slot still belongs to the array. This is feasible
only for data types with distinguishable values such as pointers or strings,
otherwise it would not be possible to identify array’s end.
5. Local arrays are generally initialised by one of these two methods. (1) Short
arrays have their elements filled one by one using direct assignment instructions. Such sequence can be detected and used to infer array’s size. (2) Larger
arrays are initialised by the memory copy routine. It is possible to compute
the size of the copied data and the number of array elements it represents.
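The following compact sketch restates bound rules 1–3 from the list above on the running example's numbers (integer division stands in for the exact divisions):

    def nested_bounds(cl):
        """Rule 1: for constants CL = [C1, ..., Cn] of a multidimensional array,
        the bound associated with Ci (i >= 2) is C(i-1) / Ci."""
        return [cl[i - 1] // cl[i] for i in range(1, len(cl))]

    def first_dim_bound(ae1_offset, ae2_offset, c1):
        """Rule 2: infer the first dimension from the next AE outside the nest."""
        return (ae2_offset - ae1_offset) // c1

    def last_array_bound(obj_upper_bound, ae_offset, c1):
        """Rule 3: a trailing array is bounded by the object's upper bound."""
        return (obj_upper_bound - ae_offset) // c1

    print(nested_bounds([120, 4]))          # [30] -> a2d's second bound
    print(first_dim_bound(4, 44, 4))        # 10   -> structure.a1[X].a2
    print(last_array_bound(444, 4, 44))     # 10   -> structure.a1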
6  Experimental Results
In this section, we present an accuracy evaluation of the presented composite
data type recovery algorithm. We test our solution on a set of 10 programs
which contain reasonably complicated complex data types. All the programs
were originally written in the C programming language, and we decompile them
to C as well. Our results are compared with the outputs of the most widely used reverse
compiler today, the Hex-Rays decompiler (version 1.9, IDA disassembler version
6.6). Each program was compiled for three different architectures (MIPS, x86,
ARM), two optimisation levels (-O0 and -O2), and the ELF object file format. The
compilers used and their versions are: psp-gcc version 4.3.5 for MIPS; gcc version
4.7.2 for x86; and gnuarm-gcc version 4.4.1 for ARM. We compiled our test
suite with debugging information enabled, but used it only for the automatic
type comparison. Detailed results are shown in Table 1. Figure 4 summarises
the overall success rate for each combination of architecture and optimisation
level. Since Hex-Rays supports only x86 and ARM decompilation, the MIPS samples
were reversed only by our retargetable decompiler. The composite object count takes
into account any local or global object defined as an array or structure, and most
of the pointers (those that were processed by the composite type analysis). The
results were verified automatically using our similarity comparison tool.
We can see that the quality of our type recovery analysis is comparable with
the state-of-the-art Hex-Rays decompiler. However, there is a notable accuracy
decline from MIPS through x86 to ARM, and between the -O0 and -O2 optimisation
levels. It is caused by the increased complexity of the analysed LLVM IR code.
The ARM and x86 processor models are more complicated than the MIPS model,
which causes the decoder to generate harder-to-analyse LLVM IR code. The
same effect is caused by more aggressive optimisation levels. The solution is to
further improve our symbolic interpreter, since the failure to build a correct
binary tree is the most common cause of failed composite data type recovery.
7  Conclusion
This paper presents a composite data type recovery technique based on (1)
symbolic interpretation and (2) aggregation of address expressions of memory
access addresses. It presents a novel approach for the first task, and significantly
expands existing methods of aggregation for the second task. The design is suitable
for the analysis of LLVM IR code in the SSA form, which contains a huge
number of temporary objects.
Table 1: Composite data type recovery experiment’s results. Tests cat.c and wc.c are
sources of the well-known UNIX utilities. All the other tests are from the MiBench
embedded benchmark suite [14]. LD stands for the Lissom project’s retargetable decompiler, HR for the Hex-Rays decompiler.
    Test name          C source  Original    MIPS      x86                 ARM
                       line      composite   O0   O2   O0       O2         O0       O2
                       count     obj. count  LD   LD   LD  HR   LD   HR    LD  HR   LD   HR
    aes.c                  830       15      13   12   12  11   11   11    11  11    9   10
    cat.c                  256        3       3    3    3   3    3    3     3   3    2    2
    crc_32.c                84        2       2    2    2   1    2    1     2   2    1    1
    dijkstra_large.c       175        8       6    5    6   6    5    6     5   6    5    6
    md5sum.c               619       24      19   17   17  16   14   16    15  17   14   16
    pbmsrch_large.c       2757        8       7    7    6   7    5    6     7   8    6    7
    pgsort.c               492       15      13   12   13  13   11   11    13  12   12   12
    sha2.c                 260       14      12   11   10  11    8   10     9  11    8   10
    stringsearch_2.c      2993       18      15   14   15  13   13   12    13  14   12   13
    wc.c                   270        4       3    3    3   3    2    3     2   3    1    3
    Σ (total)                       111      93   86   87  84   74   79    80  87   70   80
    Success rate [%]                          84   77   78  76   67   71    72  78   62   72
[Figure 4 plot: success rate (%) per architecture and optimisation level (MIPS-O0, MIPS-O2, x86-O0, x86-O2, ARM-O0, ARM-O2) for the Lissom Retargetable Decompiler and the Hex-Rays Decompiler.]
Fig. 4: Composite data type recovery success rates comparison between the Lissom
project’s retargetable decompiler and the Hex-Rays decompiler.
Lissom project’s retargetable decompiler and tested on the set of programs compiled for multiple processor architectures. Our solution is capable to reconstruct
composite data types in the following cases:
– Alternately nested (arrays and structures are alternating) global objects in
data sections, whose addresses can be statically computed.
– Alternately nested dynamically allocated objects on the heap.
– Local arrays (and structures/arrays nested in them) on the stack, if they are
accessed in iteration.
It is, however, not able to reconstruct the following situations:
– Directly nested structures, since they were inlined by the compiler and their
elements use the same base as the parent structure's members.
– Local structures on the stack, since the whole stack behaves like a single
structure: all of its elements are accessed using the frame pointer as the
common base.
Recovery of such constructs is possible only if additional information, such as
debug data or known library function signatures, is utilised.
In the future, we plan to further improve our success rates, especially on the x86
and ARM architectures and at higher optimisation levels. With improved symbolic
interpretation, we should be able to reach the MIPS accuracy rates
across all of the supported platforms. We also plan to implement reconstruction
of high-level composite data types of the C++ object-oriented language (e.g.
classes and related mechanisms).
Acknowledgements
This work was supported by the BUT FIT grant FIT-S-14-2299, and by the European Regional Development Fund in the IT4Innovations Centre of Excellence
project (CZ.1.05/1.1.00/02.0070).
References
1. Mycroft, A.: Type-based decompilation. In: Programming Languages and Systems,
8th European Symposium on Programming, Amsterdam, The Netherlands, 22-28
March, 1999, Proceedings. Volume 1576 of Lecture Notes in Computer Science.,
Springer (1999) 208–223
2. Emmerik, M.V.: Static single assignment for decompilation. (2007)
3. Lee, J., Avgerinos, T., Brumley, D.: Tie: Principled reverse engineering of types
in binary programs. In: NDSS, The Internet Society (2011)
4. Hex-Rays Decompiler. www.hex-rays.com/products/decompiler/ (2013)
5. IDA Disassembler. www.hex-rays.com/products/ida/ (2012)
6. Dolgova, E.N., Chernov, A.V.: Automatic reconstruction of data types in the
decompilation problem. Program. Comput. Softw. 35(2) (2009) 105–119
7. Troshina, K., Derevenets, Y., Chernov, A.: Reconstruction of composite types for
decompilation. In: Proceedings of the 2010 10th IEEE Working Conference on
Source Code Analysis and Manipulation. SCAM ’10 (2010) 179–188
8. Lissom. http://www.fit.vutbr.cz/research/groups/lissom/ (2013)
9. Ďurfina, L., Křoustek, J., Zemek, P., Kolář, D., Hruška, T., Masařı́k, K., Meduna,
A.: Design of a retargetable decompiler for a static platform-independent malware
analysis. In: 5th International Conference on Information Security and Assurance
(ISA’11). Volume 200 of Communications in Computer and Information Science.,
Berlin, Heidelberg, DE, Springer-Verlag (2011) 72–86
10. Retargetable Decompiler. http://decompiler.fit.vutbr.cz/ (2014)
11. Křoustek, J., Matula, P., Kolář, D., Zavoral, M.: Advanced preprocessing of binary
executable files and its usage in retargetable decompilation. International Journal
on Advances in Software 2014(1) (2014) 1–11
12. The LLVM Compiler Infrastructure. http://llvm.org/ (2013)
13. Matula, P., Kolář, D.: Reconstruction of simple data types in decompilation. In:
4th International Masaryk Conference for Ph.D. Students and Young Researchers
(MMK’13). (2013)
14. MiBench version 1.0. http://www.eecs.umich.edu/mibench/ (2012)
Multi-Stride NFA-Split Architecture for Regular
Expression Matching Using FPGA
Vlastimil Košař and Jan Kořenek
IT4Innovations Centre of Excellence
Faculty of Information Technology
Brno University of Technology
Božetěchova 2, Brno, Czech Republic
{ikosar, korenek}@fit.vutbr.cz
Abstract. Regular expression matching is a time-critical operation for
any network security system. The NFA-Split is an efficient hardware architecture
to match a large set of regular expressions at multigigabit
speed with efficient FPGA logic utilization. Unfortunately, the matching speed is
limited by processing only a single byte in one clock cycle.
Therefore, we propose a new multi-stride NFA-Split architecture, which
increases the achievable throughput by processing multiple bytes per clock
cycle. Moreover, we investigate the efficiency of mapping the DU to the FPGA
logic and propose new optimizations of mapping the NFA-Split architecture
to the FPGA. These optimizations are able to reduce up to 71.85 % of
FPGA LUTs and up to 94.18 % of BlockRAMs.
1  Introduction
Intrusion Detection Systems (IDS) [1–3] use Regular Expressions (RE) to describe worms, viruses and network attacks. Usually, thousands of REs have to
be matched in the network traffic. Current processors don’t provide enough processing power for wire-speed RE matching at multigigabit speed [4]. Therefore,
many hardware architectures have been designed to accelerate this time critical
operation [5–7]. Usually, hardware architectures are able to achieve high speed
only for small sets of REs due to the limited FPGA resources or capacity of
available memory. Hardware architectures based on Deterministic Finite Automata (DFA) [4, 8, 5] are limited by the size and speed of the memory, because
the determinisation of automaton significantly increases the number of states
and size of the transition table. Architectures based on Nondeterministic Finite
Automata (NFA) [9, 6, 10] are limited by the size and capacity of FPGA chips
since the transition table is mapped directly into the FPGA logic.
With the growing amount of attacks, worms and viruses, security systems
have to match more and more REs. It means that the amount of required FPGA
logic increases not only due to the increasing speed of network links, but also
due to the growth of RE sets. Therefore, it is important to reduce the amount of
consumed FPGA resources to support more REs. A lot of work has been done
in this direction. FPGA resources have been decreased by a shared character
decoder [6], infix and suffix sharing [7], by better representation of the counting
constraint [10] and by the NFA reduction techniques [11, 12]. High reduction has
been achieved by NFA-Split architecture [13, 14], which splits NFA to deterministic and nondeterministic parts in order to optimize mapping to the FPGA.
The NFA-Split architecture reduces FPGA logic at the cost of on-chip memory
(BlockRAMs). As some kinds of REs can increase the size of transition table and
require a lot of on-chip memory, we have recently introduced optimization [15],
which uses a k-inner alphabet to reduce on-chip memory requirements.
NFA-Split is designed to process only one byte of input stream in every
clock cycle. The matching speed can be increased by increasing the operating
frequency, but for the FPGA the frequency is limited to hundreds of megahertz.
Consequently, the NFA-Split architecture cannot scale the throughput to tens
of gigabits. To achieve higher matching speed, it is necessary to improve the
architecture to support multi-stride automaton and to accept multiple bytes per
clock cycle. Therefore, we propose an NFA-Split architecture for multi-stride
automata, which requires significantly less FPGA resources in comparison to
other multi-stride architectures. For the largest Snort backdoor module, the
proposed architecture was able to reduce the amount of FPGA lookup tables
(LUTs) by 58 %. Moreover, we investigate the efficiency of mapping the DU to
the FPGA logic and propose new optimizations of mapping deterministic and
nondeterministic parts of a NFA to FPGA. Both optimizations are able to reduce
up to 71.85 % of FPGA LUTs and up to 94.18 % of FPGA BlockRAMs.
The paper consists of six sections. A brief summary of the related work is
given after the introduction. Then the NFA-Split architecture for multi-stride
automata is introduced in the third section. Optimizations of the NFA-Split
architecture are described in Section 4, experimental results are presented
in Section 5, and conclusions in Section 6.
2  Related Work
One of the first methods of mapping the NFA to the FPGA was published by
Sidhu and Prasanna [9]. A dedicated character decoder was assigned to each
transition. Clark and Schimmel improved the architecture by shared decoders of
input characters and sharing of prefixes [16]. Lin et al. created an architecture
for sharing infixes and suffixes, but did not specify a particular algorithm to
find them [7]. Sourdis et al. published [10] an architecture that allows sharing
of character classes, static subpatterns and introduced components for efficient
mapping of constrained repetitions to the FPGA.
Current efficient solutions for regular expression matching on common processors
(CPUs), graphics processing units (GPUs) and application-specific integrated
circuits (ASICs) are based on NFAs. An NFA based architecture for
ASICs was recently introduced in [17]. It is capable of processing input data at
1 Gbps. A solution for GPUs capable of processing rule-sets of arbitrary complexity
and size is based on an NFA [18]. However, this architecture has unpredictable
performance (950 Mbps – 3.5 Gbps for an 8-stride NFA). An NFA based solution for
CPUs was introduced in [19]. It provides considerable best-case performance on
high-end CPUs (2–9.3 Gbps on two Intel Xeon X5680 CPUs with a total of 12
cores running at 3.33 GHz).
Algorithms based on DFA seek various ways to limit the impact of state
explosion of the memory needed to store the transition table. Delay DFA introduced in [20] extended the DFA by default transitions. The default transitions limited a redundancy caused by similarity of output transitions of different
states. Content Addressed Delayed Input DFA [5] improved the throughput of
the previous methodology by content addressing. The concept of Delay DFA is
further refined in [8]. Extended Finite Automaton [21] extends the DFA by a
finite set of variables and instructions for their manipulations.
Hybrid methods combine DFA and NFA parts to use the best of their respective properties. Becchi introduced hybrid architecture [22] that splits the
automaton to a head-DFA and tail-NFAs. The head-DFA contains frequently
used states, while the tail-NFA contains the others. NFA-Split architecture [13,
14] is designed for FPGA technology. It utilizes properties of REs in IDS systems
and significantly reduces FPGA resources in comparison to other NFA based architectures. As the NFA created from REs has usually only a small subset of
states that can be active at the same time, the architecture splits the NFA into
several DFA parts and one NFA part. The DFA parts contain only states that
cannot be active at once. Therefore, these parts can be efficiently implemented as
a standard DFA in a Deterministic Unit (DU) with binary encoded states. States
in the NFA part are mapped to Nondeterministic Unit (NU), where every state
is represented by a dedicated logic (register and next state logic). Therefore, new
state value can be computed in parallel in every clock cycle.
We have improved the NFA-Split architecture by a k-inner alphabet in [15],
which decreases the on-chip memory requirements for matching REs with character
classes. Character classes can be specified in REs to define a set of characters.
In the automaton, a transition on a character class has to be represented by
a set of transitions on individual characters. This can significantly increase the
number of transitions and thus the size of the memory to store the automaton.
The k-inner alphabet allows representing a character class by only one internal
symbol and only one transition. Thus, less memory is needed to store the
automaton.
3  NFA-Split Architecture for Multi-Stride Automata
Even though the NFA-Split architecture is highly optimized, requires only reasonable
memory and provides high matching speed, it still does not support
multi-stride automata to match multiple characters in a single clock cycle. Consequently,
it cannot scale its matching speed well. Therefore, we propose an
extension of the NFA-Split architecture to support multi-stride automata. Moreover,
we provide optimizations of mapping the DFA and NFA parts to further
reduce the FPGA logic in order to map a larger set of REs to the FPGA.
We propose the necessary modifications of the NFA-Split architecture to
support multi-stride automata (SNFA-Split) and to make the matching speed
scalable to tens of gigabits. The method of creating the multi-stride automaton
is based on performing all consecutive transitions from one state to all states
reachable in the number of steps equal to the desired stride. The symbols along these
consecutive transitions are merged into one multi-stride symbol. Multi-stride
automata usually have more transitions due to the larger number of symbols [23,
11].
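A naive Python sketch of this stride-k construction (our own illustration; it treats symbols as opaque values, ignores character classes and the handling of inputs whose length is not a multiple of k):

    def multi_stride(delta, states, k):
        """Build the k-stride transition function: one step follows k original steps
        and the k traversed symbols form one multi-stride symbol (a k-tuple)."""
        new_delta = {}
        for p in states:
            frontier = {((), p)}                       # (symbols so far, reached state)
            for _ in range(k):
                frontier = {(syms + (a,), r)
                            for (syms, q) in frontier
                            for (q2, a), targets in delta.items() if q2 == q
                            for r in targets}
            for syms, r in frontier:
                new_delta.setdefault((p, syms), set()).add(r)
        return new_delta

    # Toy NFA: 0 -a-> {0, 1}, 1 -b-> {2}
    delta = {(0, 'a'): {0, 1}, (1, 'b'): {2}}
    print(multi_stride(delta, [0, 1, 2], 2))
    # e.g. {(0, ('a', 'a')): {0, 1}, (0, ('a', 'b')): {2}}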
Algorithm 1: Compute all pairs of simultaneously active states. The algorithm uses the intersection operation ∩k.
Input: NFA M = (Q, Σ, δ, s, F)
Output: Set of pairs of simultaneously active states concurrent ⊆ {(p, q) | p, q ∈ Q, p ≠ q}
     1  normalize(q1, q2) = (q1 < q2) ? (q1, q2) : (q2, q1);
     2  concurrent = {(s, s)};
     3  workplace = {(s, s)};
     4  while ∃(q1, q2) ∈ workplace do
     5      workplace = workplace \ {(q1, q2)};
     6      foreach q3 ∈ δ(q1, a) do
     7          foreach q4 ∈ δ(q2, b) do
     8              if a ∩k b ≠ (∅, ∅, ..., ∅) then
     9                  if ((q5, q6) = normalize(q3, q4)) ∉ concurrent then
    10                      concurrent = concurrent ∪ {(q5, q6)};
    11                      workplace = workplace ∪ {(q5, q6)};
    12  return concurrent \ {(p, p) | p ∈ Q}
To accept multiple characters at once, we have to change the construction of
the NFA-Split architecture. First, it is necessary to modify the algorithm from [15],
which is able to identify the simultaneously active states in the NFA. The algorithm
tests whether two symbols are equal. To perform this operation, character classes
have to be expanded to individual characters. Then the number of transitions
can be increased up to 2^n times, where n is the data width of input characters
(usually n = 8). The situation is even worse for the multi-stride automaton: the
number of transitions can be increased up to 2^(kn) times, where n is the data width
of one input character and k is the number of input characters accepted at once.
To avoid this transition growth, we can preserve character classes in symbols
and replace the exact comparison by the intersection operation ∩k. The inputs of the
operation ∩k are two multi-stride symbols defined as k-tuples (A1, A2, ..., Ak) and
(B1, B2, ..., Bk), where the items Ai, Bi are subsets of the input alphabet, Ai, Bi ⊆ Σ.
The subsets Ai, Bi can represent an individual character or a character class. The
result of the operation is the k-tuple C = (C1, C2, ..., Ck), which is defined by
Eq. 1.
(C1 , C2 , ..., Ck ) = (A1 ∩ B1 , A2 ∩ B2 , ..., Ak ∩ Bk )
(1)
The k-tuple (C1, C2, ..., Ck) contains the sets of input symbols that are included
in both input k-tuples. If any item Ci is equal to ∅ (the empty set), then the input
symbols A and B cannot be expanded to the same k-tuple of characters. This
means that no pair of expanded symbols from the k-tuples A and B is equal. We
denote this situation as ∩k = (∅, ∅, ..., ∅) in Algorithm 1, which is a modified
algorithm to detect the simultaneously active states in a multi-stride automaton.
Then the identification of deterministic and nondeterministic parts in the NFA can be
the same as in the original NFA-Split architecture.
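A small sketch of the ∩k test with character classes kept as Python sets; two multi-stride symbols of the same stride can expand to a common k-tuple of characters exactly when no component of their intersection is empty:

    def intersect_k(A, B):
        """Component-wise intersection of two k-tuples of character classes (Eq. 1)."""
        return tuple(a & b for a, b in zip(A, B))

    def may_overlap(A, B):
        """True unless some component of A ∩k B is the empty set."""
        return all(intersect_k(A, B))

    A = ({'a', 'b'}, {'0', '1'})
    B = ({'b', 'c'}, {'1', '2'})
    C = ({'x'}, {'0'})
    print(intersect_k(A, B))      # ({'b'}, {'1'})
    print(may_overlap(A, B))      # True
    print(may_overlap(A, C))      # False -> the two symbols can never match the same input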
The DU architecture must be modified to support multi-stride automata.
In this paper, we consider the architecture of the DU introduced in [15], which
utilizes k-inner alphabets in order to reduce memory requirements. This architecture
can be easily extended to support multi-stride automata. As can be seen
in Fig. 1, the architecture remains the same except for the first component, which
transforms input symbols to k inner alphabets. The component is marked by a dotted
line and is able to join input symbols. For the automaton A = (Q, Σ, δ, q0, F),
two symbols a, b ∈ Σ can be joined only if ∀q ∈ Q : δ(q, a) = δ(q, b). The proposed
architecture uses BlockRAMs [23] as tables that provide an efficient transformation
of input symbols to the k-inner alphabet.
Fig. 1. Overview of DU architecture for SNFA-Split with k-inner alphabets and n input
symbols accepted at once. Dotted line is used to mark the new component to support
multi-stride automata.
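The joining condition can be illustrated by grouping symbols with identical transition columns (a toy sketch, not the BlockRAM-based implementation):

    from collections import defaultdict

    def inner_alphabet(states, symbols, delta):
        """Join symbols with identical transition columns: a and b can share one
        inner symbol iff delta(q, a) == delta(q, b) for every state q."""
        classes = defaultdict(list)
        for a in symbols:
            column = tuple(frozenset(delta.get((q, a), ())) for q in states)
            classes[column].append(a)
        # map every original symbol to the index of its class (the inner symbol)
        return {a: i for i, group in enumerate(classes.values()) for a in group}

    # Toy automaton where 'a' and 'b' behave identically from every state.
    delta = {(0, 'a'): {1}, (0, 'b'): {1}, (0, 'c'): {2}, (1, 'c'): {2}}
    print(inner_alphabet([0, 1, 2], ['a', 'b', 'c'], delta))   # {'a': 0, 'b': 0, 'c': 1}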
The nondeterministic part of the NFA-Split architecture for multi-stride automata is based on shared decoder architecture for multi-stride NFA as has been
introduced in [6].
4  Optimizations of the NFA-Split Architecture
In the previous paper [15], we presented a reduction of the memory and time
complexity of the NFA-Split architecture. In this section, we consider optimizations
of the DUs and NUs in terms of the efficiency of FPGA resource utilization. First,
we analyze the efficiency of state representation in the DU and NU. Then we propose
an optimization of mapping states to the DU and several optimizations of
mapping the NU to the FPGA.
4.1  Optimization of Deterministic Parts of NFA
In this paper, we investigate the efficiency of mapping the DU to FPGA logic.
The main factor influencing the resource utilization of the DU is the size of the input
encoding and output decoding logic. The size of the logic depends on the number
of transitions to/from the DU. To analyze the number of transitions to/from the
DU, we define continuous parts of the automaton as sets of states such that:
1. All states are represented by the DU.
2. All states are reachable from some input state of the DU.
3. All states together with input/output transitions form a continuous graph
of transitions.
As NFA-Split doesn’t use continuous parts to derive mapping of states to DU
and NU, we have to define a procedure how to identify continuous parts. First,
we have to select an input state of the DU and then traverse along the transitions
until a final state or a transition to NU is reached. If some state has more than
one input transition, we have to traverse backwards until input transition to
the DU is reached. Similarly, if some state has more than one output transition,
then we have to traverse forward through all output transition until final state
or transition to NU is reached. This procedure is finished if no new state can be
added. The input states in already recognized continuous parts are not used for
detection of next continuous parts.
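An illustrative sketch of collecting one continuous part by the forward/backward traversal just described (simplified: it always traverses in both directions and stops at states outside the DU):

    def continuous_part(start, du_states, succ, pred):
        """Collect DU states connected to `start` by traversing forward and backward,
        stopping at states that are not represented by the DU."""
        part, work = set(), [start]
        while work:
            q = work.pop()
            if q in part or q not in du_states:
                continue
            part.add(q)
            work.extend(succ.get(q, ()))      # forward along output transitions
            work.extend(pred.get(q, ()))      # backward along input transitions
        return part

    succ = {1: [2], 2: [3], 3: [], 4: [2]}     # state 4 is an NU state feeding the DU
    pred = {2: [1, 4], 3: [2], 1: [], 4: []}
    print(continuous_part(1, du_states={1, 2, 3}, succ=succ, pred=pred))   # {1, 2, 3}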
We have analyzed the size of continuous parts for the Snort backdoor rule set.
The result is the histogram in Fig. 2. The x-axis represents the size of continuous
parts and the y-axis represents the number of parts of that size. It can be seen
that many continuous parts are very small (usually 1 to 3 states). The resource
utilization of the input encoders and output decoders associated with those very
small continuous parts can be larger than the amount of logic resources needed
for the implementation of those parts in the NU. This holds also for a continuous part
of any size with a large number of inputs and outputs. Therefore, we define Eq. 2
as a simple condition for when it is better to include a continuous part pi
in the DU.
    costinputs(pi) + costoutputs(pi) < costNU(pi)        (2)
This means that the cost of input/output encoding has to be lower than the
cost of implementation in the NU. The cost function costinputs (pi ) computes
the number of LUTs necessary to implement the one-hot to binary encoding for input
transitions. The cost function costoutputs(pi) computes the number of LUTs necessary
to implement the binary to one-hot encoding for output transitions, and the cost
function costNU(pi) computes the number of LUTs needed to map the continuous
part pi into the NU.

Fig. 2. Histogram of continuous parts in DU for the Snort backdoor rule set. Distribution
according to size is provided for all parts in DU (Original), parts in DU after optimization (Optimized) and parts removed from DU because of inefficiency (Eliminated).
Application of Eq. 2 to the DU has a direct impact on the efficiency of the
DU. Therefore, continuous parts violating Eq. 2 are better kept in the NFA
and represented by the NU. The sizes of eliminated and optimized continuous parts are
shown in the histograms in Fig. 2. Eliminated parts are removed from the DU to
the NU; optimized continuous parts remain in the DU.
Characteristics of continuous parts of the DU for various sets of REs before and
after the DU optimization, as well as characteristics of the eliminated parts, are shown
in Table 1. Column Original contains the characteristics before the optimization,
column Optimized contains the characteristics after the optimization, and column
Eliminated contains the characteristics of continuous parts removed by the optimization.
Three characteristics were measured: the number of continuous parts (Parts), the
average size of a continuous part (AS), and the average ratio between inputs/outputs and
the size of a continuous part (AIOS). The sets of REs come from the Snort IDS [1]
modules and from the L7 decoder [24]. The optimized DU has a larger average size of
continuous part and a smaller average ratio between inputs/outputs and size of
continuous part.
Table 1. Characteristics of continuous parts of DU for various sets of REs before and
after the DU optimization.
    RE set           Original              Optimized             Eliminated
                     Parts   AS    AIOS    Parts   AS    AIOS    Parts   AS    AIOS
    L7 selected         53  10.66  0.28       37  14.51  0.21       16   1.75   1.57
    L7 all             369   9.09  0.85      192  13.05  0.11      177   4.79   3.04
    backdoor           284   9.96  0.32      208  12.37  0.13       76   3.39   2.30
    web-php             21  10.67  0.30       19  11.68  0.14        2   1.00  18.00
    ftp                 42   3.55  1.26       19   4.89  0.39       23   2.43   2.70
    netbios             40   7.72  0.28       13  20.46  0.08       27   1.59   1.53
    voip                61  14.23  0.29       37  21.68  0.11       24   2.75   2.53
    web-cgi             16  35.25  0.09        9  61.67  0.03        7   1.29   3.89
4.2  Efficient Encoding of the Nondeterministic Part
The encoding of the nondeterministic part into the NU utilizes the shared decoder
architecture [25]. As we have shown in the previous section, some specific
continuous parts of the DU can be moved into the NU to improve the efficiency of
mapping. Therefore, it is possible to utilize the properties of the eliminated continuous
parts and use a more efficient encoding of the NU. We propose to use the
At-most two-hot encoding (AMTH) introduced in [26], because it implements
small 3-state parts efficiently with two LUTs and two flip-flops (FFs).
The mapping of constrained repetitions in the FPGA architecture also requires
optimization. In the current NFA-Split architecture the constrained repetitions
can be encoded either in the DUs or in the NU, depending on the type
of the repetition and the NFA structure. The encoding of a Perl compatible
RE (PCRE) into the DU is inefficient, because the counting constraint is represented
by many states and transitions and the size of the automaton increases
significantly. For example, the PCRE /^[abc]{100}/ in the DU needs about 300 rows
of the transition table. The NU architecture with a shared decoder is also not
efficient: it consumes 100 FFs and 100 LUTs for the same example. The usage
of AMTH improves the efficiency (67 FFs and 67 LUTs). However, special
subcomponents for constrained repetitions introduced in [10] are more efficient
in logic utilization. Therefore, we propose to represent constrained repetitions by
this dedicated component in the NU.
5  Evaluation
We performed the evaluation of the proposed SNFA-Split architecture and optimizations
on a selected set of Snort IDS [1] modules and a set of REs from the L7
decoder [24]. All used sets of REs come from the Netbench framework [27]. The
Netbench framework has been used to implement the proposed architecture together
with the optimizations and to make a comparison with other FPGA based
multi-stride architectures.
The results are for multi-stride automata with two and four input characters accepted
at once.
Statistics
Rules
L7 selected
backdoor
web-cgi
misc
ftp
L7 selected
backdoor
web-cgi
misc
ftp
Stride
Clark
SNFA-Split
Inner
REs Symbols LUT FF LUT FF BRAM Alphabets
[-]
[-]
[-] [-]
[-] [-]
[-]
[-]
29
2 2744 673 1884 182
8
4
154
2 7178 4383 3004 815
20
6
10
2 3456 1332 2688 738
9
2
17
2 3200 1294 2551 944
10
4
35
2 3774 1944 3104 1595
10
4
29
4 4776 678 3631 192
24
8
154
4 13509 4881 6137 1007
44
11
10
4 5608 1360 4598 738
12
4
17
4 5322 1331 4219 951
20
6
35
4 6060 2081 4166 1601
20
6
First, we have evaluated the FPGA logic utilization of the SNFA-Split architecture.
The results for two and four input characters accepted at once are compared in
Table 2 to the multi-stride architecture with a shared decoder of input characters.
The amount of utilized LUTs, FFs and 18 Kb BlockRAMs was estimated for the
Xilinx Virtex-5 architecture. However, the SNFA-Split architecture is suitable
for any FPGA. Column Statistics presents the number of REs in a particular set
of REs. Column Clark shows the estimated utilization for the multi-stride architecture
with a shared decoder of input characters. Column SNFA-Split indicates
the estimated utilization for the SNFA-Split architecture. It can be seen that
the SNFA-Split architecture is able to reduce the amount of utilized LUTs by
58 % for the largest backdoor module. The table also indicates how many inner
alphabets were used, because the utilization of FPGA resources depends on the
number of inner alphabets.
We have also evaluated both proposed optimizations of NFA-Split architecture. Table 3 shows results of DU and NU optimizations. The amount of utilized
FPGA resources is estimated. Column Statistics presents the number of REs in
particular set of REs. Column Original shows the estimated utilization for the
original NFA-Split architecture. Column Reduction indicates the reduction of
FPGA resources by the proposed optimizations. It can be seen that the optimizations were able to achieve significant reduction of BlockRAMs and FPGA
logic: 71.85 % LUTs for the nntp module and 94.18 % BlockRAMs for the voip
module. This reduction is caused primarily by the relocation of constrained repetitions
from the DU into the optimized NU. The dedicated subcomponents for
the constrained repetitions are more efficient. It can be seen in the results that
the reduction mainly depends on the presence of counting constraints (e.g., L7
does not contain any, while voip does and the big ones were placed in the DU)
and structure of the automaton. The last row of the Table 3 presents average
reduction of utilized resources for 22 sets of REs from both Snort IDS and L7
project.
Table 3. Reduction of FPGA logic utilization by the optimized DU and NU in the
NFA-Split architecture.
    Rules          REs    Original                Reduction
                          LUT     FF   BlockRAM   LUT [%]  FF [%]  BlockRAM [%]
    L7 selected     29    1003    182      4        1.10    -2.74      50
    L7 all         143    8035   2945      8       15.08     2.11      25
    backdoor       154    1696    727     10        7.05     1.15      20
    dos              3     803    119      2       -4.36    12.61       0
    ftp             35    2284   1590      2       51.16    89.62       0
    misc            17    1651    941      2       37.72    88.42       0
    nntp            12    3133   2483      2       71.85    96.69       0
    web-cgi         10    1651    736      4       39.83    88.59      50
    voip            38    1936    834     34       35.18    77.46      94.18
    22 RE sets     548   33421  12726     94       21.93    56.67      42.55
The four-stride SNFA-Split architecture running at 150 MHz has a worst-case (malicious
network traffic) throughput of 4.8 Gbps. It outperforms the GPU based
solution presented in [18]. Even the single-stride architecture, with a throughput of
1.2 Gbps, outperforms the GPU solution for the rule-set L7 all. The efficient CPU
solution [19] outperforms the four-stride SNFA-Split architecture when running on
high-end CPUs. However, the results in [19] are measured for the best-case situation
(regular network traffic).
6  Conclusion
The paper has introduced the NFA-Split architecture optimization for
multi-stride automata. The proposed architecture is able to process multiple
bytes in one clock cycle. Therefore, the RE matching speed can be increased despite
the frequency limits of current FPGAs. As can be seen in the results section, the
proposed multi-stride architecture utilizes up to 58 % fewer LUTs than
multi-stride FPGA architectures with a shared decoder. Consequently, additional
REs can be supported.
Moreover, we have proposed several optimizations of the NFA-Split architecture
in order to further reduce FPGA resources. The first optimization is focused on
the overhead of encoding logic in DU. The optimization moves states from DU
to NU, if the cost of encoding logic is higher than the cost of logic in the NU.
The second proposed optimization is focused on NU mapping to the FPGA.
At-most two-hot encoding and specific subcomponents for constrained repetitions are used to represent states and transitions relocated from DU to NU.
Both optimizations are able to reduce up to 71.85 % of LUTs and up to 94.18 %
of BlockRAMs.
As future work, we want to investigate efficient pattern matching on
100-Gigabit Ethernet.
Acknowledgment
This work was supported by the IT4Innovations Centre of Excellence
CZ.1.05/1.1.00/02.0070 and the BUT project FIT-S-14-2297.
References
1. Snort: Project WWW Page. http://www.snort.org/ (2014)
2. The Bro Network Security Monitor: Project WWW Page. http://www.bro.org/
(2014)
3. Koziol, J.: Intrusion Detection with Snort. Sams, Indianapolis, IN, USA (2003)
4. Becchi, M., Crowley, P.: Efficient Regular Expression Evaluation: Theory to Practice. In: ANCS ’08: Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ACM (2008) 50–59
5. Kumar, S., Turner, J., Williams, J.: Advanced Algorithms for Fast and Scalable Deep Packet Inspection. In: ANCS ’06: Proceedings of the 2006 ACM/IEEE
Symposium on Architecture for Networking and Communications Systems, ACM
(2006) 81–92
6. Clark, C.R., Schimmel, D.E.: Scalable Pattern Matching for High Speed Networks. In: FCCM ’04: Proceedings of the 12th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, IEEE Computer Society (2004) 249–
257
7. Lin, C.H., Huang, C.T., Jiang, C.P., Chang, S.C.: Optimization of Pattern Matching Circuits for Regular Expression on FPGA. IEEE Trans. Very Large Scale
Integr. Syst. 15(12) (2007) 1303–1310
8. Becchi, M., Crowley, P.: A-DFA: A Time- and Space-Efficient DFA Compression
Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization 10(1) (2013) 4:1–4:26
9. Sidhu, R., Prasanna, V.K.: Fast Regular Expression Matching Using FPGAs.
In: FCCM ’01: Proceedings of the 9th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, IEEE Computer Society (2001) 227–
238
10. Sourdis, I., Bispo, J., Cardoso, J.M.P., Vassiliadis, S.: Regular Expression Matching
in Reconfigurable Hardware. Journal of Signal Processing Systems 51(1) (2008)
99–121
11. Becchi, M., Crowley, P.: Efficient Regular Expression Evaluation: Theory to Practice. In: Proceedings of the 4th ACM/IEEE Symposium on Architectures for
Networking and Communications Systems. ANCS ’08, New York, NY, USA, ACM
(2008) 50–59
12. Košař, V., Žádnı́k, M., Kořenek, J.: NFA Reduction for Regular Expressions
Matching Using FPGA. In: Proceedings of the 2013 International Conference on
Field Programmable Technology, IEEE Computer Society (2013) 338–341
13. Kořenek, J., Košař, V.: Efficient Mapping of Nondeterministic Automata to FPGA
for Fast Regular Expression Matching. In: Proceedings of the 13th IEEE International Symposium on Design and Diagnostics of Electronic Circuits and Systems
DDECS 2010, IEEE Computer Society (2010) 6
14. Kořenek, J., Košař, V.: NFA Split Architecture for Fast Regular Expression Matching. In: Proceedings of the 6th ACM/IEEE Symposium on Architectures for
Networking and Communications Systems, Association for Computing Machinery
(2010) 2
15. Košař, V., Kořenek, J.: On NFA-Split Architecture Optimizations. In: 2014 IEEE
17th International Symposium on Design and Diagnostics of Electronic Circuits
and Systems (DDECS), IEEE Computer Society (2014) 274–277
16. Clark, C., Schimmel, D.: Efficient Reconfigurable Logic Circuits for Matching
Complex Network Intrusion Detection Patterns. In: Field Programmable Logic
and Application, 13th International Conference, Lisbon, Portugal (2003) 956–959
17. Dlugosch, P., Brown, D., Glendenning, P., Leventhal, M., Noyes, H.: An Efficient
and Scalable Semiconductor Architecture for Parallel Automata Processing. IEEE
Transactions on Parallel and Distributed Systems PP(99) (2014)
18. Cascarano, N., Rolando, P., Risso, F., Sisto, R.: iNFAnt: NFA Pattern Matching
on GPGPU Devices. SIGCOMM Comput. Commun. Rev. 40(5) (2010) 20–26
19. Valgenti, V.C., Chhugani, J., Sun, Y., Satish, N., Kim, M.S., Kim, C., Dubey, P.:
GPP-Grep: High-speed Regular Expression Processing Engine on General Purpose
Processors. In: Proceedings of the 15th International Conference on Research in
Attacks, Intrusions, and Defenses. RAID’12, Berlin, Heidelberg, Springer-Verlag
(2012) 334–353
20. Kumar, S., Dharmapurikar, S., Yu, F., Crowley, P., Turner, J.: Algorithms to
Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection. In:
SIGCOMM ’06: Proceedings of the 2006 Conference on Applications, Technologies,
Architectures, and Protocols for Computer Communications, ACM (2006) 339–350
21. Smith, R., Estan, C., Jha, S., Kong, S.: Deflating the Big Bang: Fast and Scalable
Deep Packet Inspection With Extended Finite Automata. SIGCOMM Comput.
Commun. Rev. 38(4) (2008) 207–218
22. Becchi, M., Crowley, P.: A Hybrid Finite Automaton for Practical Deep Packet
Inspection. In: Proceedings of the 2007 ACM CoNEXT Conference. CoNEXT ’07,
New York, NY, USA, ACM (2007)
23. Brodie, B.C., Taylor, D.E., Cytron, R.K.: A Scalable Architecture For HighThroughput Regular Expression Pattern Matching. SIGARCH Computer Architecture News 34(2) (2006) 191–202
24. L7 Filter: Project WWW Page. http://l7-filter.sourceforge.net/ (2014)
25. Kořenek, J.: Fast Regular Expression Matching Using FPGA. Information Sciences
and Technologies Bulletin of the ACM Slovakia 2(2) (2010) 103–111
26. Yun, S., Lee, K.: Optimization of Regular Expression Pattern Matching Circuit
Using At-Most Two-Hot Encoding on FPGA. International Conference on Field
Programmable Logic and Applications 0 (2010) 40–43
27. Pus, V., Tobola, J., Kosar, V., Kastil, J., Korenek, J.: Netbench: Framework for
Evaluation of Packet Processing Algorithms. Symposium On Architecture For
Networking And Communications Systems (2011) 95–96
Computational Completeness Resulting from
Scattered Context Grammars Working Under
Various Derivation Modes
Alexander Meduna and Ondřej Soukup
Brno University of Technology, Faculty of Information Technology Centre of
Excellence, Božetěchova 1/2, 612 66 Brno, Czech Republic
[email protected], [email protected]
Abstract. This paper introduces and studies a whole variety of derivation modes in scattered context grammars. These grammars are conceptualized just like classical scattered context grammars except that
during the application of their rules, after erasing n nonterminals, they can insert new substrings possibly at different positions than the original occurrences of the erased nonterminals.
The paper concentrates its attention on investigating the generative
power of scattered context grammars working under these derivation
modes. It demonstrates that all of them are computationally complete–
that is, they characterize the family of recursively enumerable languages.
Keywords: scattered context grammars; alternative derivation modes; generative power; computational completeness.
1 Introduction
The present section informally sketches scattered context grammars working
under various new derivation modes and explains the reason why they are introduced. This section also describes how the paper is organized.
At present, processing information in a discontinuous way represents a common
computational phenomenon. Indeed, consider a process p that deals with information i. Typically, during a single computational step, p (1) reads n pieces
of information, x1 through xn , in i, (2) erases them, (3) generates n new pieces of information, y1 through yn , and (4) inserts them into i, possibly at different positions than the original occurrences of x1 through xn , which were erased. To
explore computation like this systematically and rigorously, computer science
obviously needs formal models that reflect it in an adequate way.
Traditionally, formal language theory has always provided computer science with
language-defining models to explore various information processors mathematically, so it should do so for the purpose sketched above as well. However, the
classical versions of these models, such as grammars, work on words so they
erase and insert subwords at the same position, hence they can hardly serve as
appropriate models of this kind. Therefore, a proper formalization of processors
that work in the way described above needs an adaptation of some classical
well-known grammars so they reflect the above-described computation more adequately. At the same time, any adaptation of this kind should conceptually
maintain the original structure of these models as much as possible so computer science can quite naturally base its investigation upon these newly adapted
grammatical models by analogy with the standard approach based upon their
classical versions. Simply put, while keeping their structural conceptualization
unchanged, these grammatical models should work on words in newly introduced
ways, which more properly reflect the above-mentioned modern computation.
The present paper discusses this topic in terms of scattered context grammars, which definitely represent important language-generating grammatical
models of computation. Indeed, the paper introduces a whole variety of derivation modes in scattered context grammars so they reflect the above-sketched
computation in a more adequate way than the standard derivation mode.
Recall that the notion of a scattered context grammar G represents a languagegenerating rewriting system based upon an alphabet of symbols and a finite set
of rules. The alphabet of symbols is divided into two disjoint subalphabets—the
alphabet of terminal symbols and the alphabet of nonterminal symbols. In G, a
rule r is of the form
(A1 , A2 , . . . , An ) → (x1 , x2 , . . . , xn ),
for some positive integer n. On the left-hand side of r, the As are nonterminals.
On the right-hand side, the xs are strings. G can apply r to any string u of the
form
u = u0 A1 u1 . . . un−1 An un
where us are any strings. Notice that A1 through An are scattered throughout
u, but they occur in the order prescribed by the left-hand side of r. In essence,
G applies r to u so
(1) it deletes A1 , A2 , . . . , An in u, after which
(2) it inserts x1 , x2 , . . . , xn into the string resulting from the deletion (1).
By this application, G makes a derivation step from u to a string v of the form
v = v0 x1 v1 . . . vn−1 xn vn
Notice that x1 , x2 , . . . , xn are inserted in the order prescribed by the right-hand
side of r. However, they are inserted in a scattered way—that is, in between the
inserted xs, some substrings vs occur.
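As a toy illustration of such a rule application (a Python sketch of ours; the function and the single-letter nonterminals are made up, and it always picks the leftmost occurrences, whereas a real application is nondeterministic), consider:

def apply_classical(rule, u):
    """Apply a scattered context rule (A1,...,An) -> (x1,...,xn) to the string u
    in the classical way: each xi is inserted exactly where Ai was erased."""
    lhs, rhs = rule
    result, pos = [], 0
    for A, x in zip(lhs, rhs):
        i = u.find(A, pos)       # next occurrence of Ai (leftmost choice for simplicity)
        if i == -1:
            return None          # A1, ..., An do not occur in u in this order
        result.append(u[pos:i])  # keep the substring between the previous Ai and this one
        result.append(x)         # insert xi at the position of the erased Ai
        pos = i + len(A)
    result.append(u[pos:])
    return "".join(result)

# Example: the rule (A, B) -> (ab, c) applied to "xAyBz" yields "xabycz".
print(apply_classical((("A", "B"), ("ab", "c")), "xAyBz"))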
This paper partially introduces the results of a larger study, which is currently in progress and will hopefully be published soon. In that study, 9 derivation modes of scattered context grammars are defined; however, due to space constraints, only a few selected modes are presented here. For consistency, their
numbering is preserved. The chosen modes are mutually dual or complementary
to the others.
(1) Mode 1 requires that ui = vi for all i = 0, . . . , n in the above described
derivation step.
(3) Mode 3 obtains v from u so it changes u by performing (3a) through (3c),
described next:
(a) A1 , A2 , . . . , An are deleted;
(b) x1 and xn are inserted into u0 and un , respectively;
(c) x2 through xn−1 are inserted in between the newly inserted x1 and xn .
(5) In mode 5, v is obtained from u by (5a) through (5e), given next:
(a) A1 , A2 , . . . , An are deleted;
(b) a central ui is nondeterministically chosen, for some 0 ≤ i ≤ n;
(c) x1 and xn are inserted into u0 and un , respectively;
(d) xj is inserted between uj−2 and uj−1 , for all 1 < j ≤ i;
(e) xk is inserted between uk and uk+1 , for all i + 1 ≤ k < n.
(7) Mode 7 obtains v from u performing the steps stated below:
(a) A1 , A2 , . . . , An are deleted;
(b) a central ui is nondeterministically chosen, for some 0 ≤ i ≤ n;
(c) xj is inserted between uj−2 and uj−1 , for all 1 < j ≤ i;
(d) xk is inserted between uk and uk+1 , for all i + 1 ≤ k < n.
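To see the difference on a small instance (an illustrative example of ours, not taken from the larger study), consider the rule (A, B) → (x, y) and the string u = abAcBde, so that u0 = ab, u1 = c and u2 = de. Mode 1 permits only

abAcBde ⇒ abxcyde,

whereas mode 3 additionally permits, for instance,

abAcBde ⇒ axbcdye,

because x may be inserted anywhere inside u0 (here after a), y anywhere inside u2 (here before e), and the remaining symbols of u0 u1 u2 stay between them.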
This paper is organized as follows. Section 2 gives all the necessary notation
and terminology to follow the rest of the paper. Then, Section 3 formally introduces all the new derivation modes in scattered context grammars. After that,
Section 4 demonstrates that scattered context grammars working under any of
the newly introduced derivation modes are computationally complete–that is,
they characterize the family of recursively enumerable languages.
2 Preliminaries
We assume that the reader is familiar with formal language theory (see [1, 2]).
For a set W , card(W ) denotes its cardinality. Let V be an alphabet (finite
nonempty set). V ∗ is the set of all strings over V. Algebraically, V ∗ represents
the free monoid generated by V under the operation of concatenation. The unit
of V ∗ is denoted by ε. Set V + = V ∗ − {ε}. Algebraically, V + is thus the free
semigroup generated by V under the operation of concatenation. For w ∈ V ∗ ,
|w| and reversal(w) denote the length of w and the reversal of w, respectively.
For L ⊆ V ∗ , reversal(L) = {reversal(w) | w ∈ L}. The alphabet of w, denoted by
alph(w), is the set of symbols appearing in w. For v ∈ Σ and w ∈ Σ ∗ , occur(v, w)
equals the number of occurrences of v in w.
Let % be a relation over V ∗ . The transitive closure and the transitive and reflexive closure of % are denoted by %+ and %∗ , respectively. Unless explicitly stated otherwise, we write x % y instead of (x, y) ∈ %.
The families of regular languages, context-free languages, and recursively
enumerable languages are denoted by REG, CF, and RE, respectively. Recall
that scattered context grammars characterize RE (see [3]).
3 Definitions
In this section, we define scattered context grammars and the following new
derivation modes in scattered context grammars. Then, we illustrate them by
examples.
Definition 1. A scattered context grammar (an SCG for short) is a quadruple
G = (V, T, P, S)
where
• V is an alphabet;
• T ⊂V;
• set N = V − T ;
• P ⊆ ⋃_{m=1}^{∞} (N1 × N2 × · · · × Nm × V1∗ × V2∗ × · · · × Vm∗ ) is finite, where each Nj = N , Vj = V , 1 ≤ j ≤ m;
• S ∈ N.
V , T and N are called the total alphabet, the terminal alphabet and the nonterminal alphabet, respectively. P is called the set of productions. Instead of
(A1 , A2 , . . . , An , x1 , x2 , . . . , xn ) ∈ P
where Ai ∈ N , xi ∈ V ∗ , for 1 ≤ i ≤ n, for some n ≥ 1, we write
(A1 , A2 , . . . , An ) → (x1 , x2 , . . . , xn )
S is the start symbol.
⊓⊔
Definition 2. Let G = (V , T , P , S) be an SCG, and let % be a relation over
V ∗ . Set
L(G, %) = {x | x ∈ T ∗ , S %∗ x}
L(G, %) is said to be the language that G generates by %. Set
SC(%) = {L(G, %) | G is an SCG}
SC(%) is said to be the language family that SCGs generate by %.
⊓⊔
Definition 3. Let G = (V , T , P , S) be an SCG. Next, we define the following
direct derivation relations 1⇒ through 9⇒ over V ∗ .
First, let (A) → (x) ∈ P and u = w1 Aw2 ∈ V ∗ . Then,
w1 Aw2 i⇒ w1 xw2 , i ∈ {1, 2, . . . , 9}
Second, let (A1 , A2 , . . . , An ) → (x1 , x2 , . . . , xn ) ∈ P and u = u0 A1 u1 . . . An un ,
z, z 0 , ui , vi , wi ∈ V ∗ , for all 0 ≤ i ≤ n and 1 ≤ j ≤ n − 1, for some n ≥ 2, and
u0 u1 . . . un = v0 v1 . . . vn . Then,
92
Computational Completeness Resulting from Scattered Context Grammars
(1) u0 A1 u1 A2 u2 . . . An un 1⇒ u0 x1 u1 x2 u2 . . . xn un ;
(3) u0 A1 u1 A2 u2 . . . An un 3⇒ v0 x1 v1 x2 v2 . . . xn vn , where u0 = v0 z, un = z 0 vn ;
(5) u0 u1 A1 u2 A2 . . . ui−1 Ai ui Ai+1 ui+1 . . . An un z 5⇒
u0 x1 u1 x2 u2 . . . xi ui−1 ui ui+1 xi+1 . . . un xn z;
(7) u0 A1 u2 A2 . . . ui−1 Ai ui Ai+1 ui+1 . . . An un 7⇒
u0 x2 u2 . . . xi ui−1 ui ui+1 xi+1 . . . un ;
⊓⊔
To illustrate the above-introduced notation, let G = (V , T , P , S) be an
SCG; then, L(G, 5⇒) = {x | x ∈ T ∗ , S 5⇒∗ x} and SC(5⇒) = {L(G, 5⇒) | G is
an SCG}. To give another example, SC(1⇒) denotes the family of all scattered
context languages.
4 Generative Power
In this section, for each defined derivation mode we investigate the generative
power of SCGs using this mode.
Lemma 1. Let L ⊆ Σ ∗ be any recursively enumerable language. Then, L can
be represented as L = h(L1 ∩ L2 ), where h : T ∗ → Σ ∗ is a morphism and L1
and L2 are two context-free languages.
For a proof, see [4].
4.1 Mode 1
We prove that SCGs with mode 1 derivations characterize the family of recursively enumerable languages.
Theorem 1. [3] SC(1⇒) = RE.
Since SC(1⇒) ⊆ RE follows directly from the Church-Turing thesis, we only
have to prove the opposite inclusion.
Proof. Construction. Recall Lemma 1. By the closure properties of context-free
languages, there are context-free grammars G1 and G2 that generate L1 and
reversal(L2 ), respectively. More precisely, let Gi = (Vi , T, Pi , Si ) for i = 1, 2. Let
T = {a1 , . . . , an } and 0, 1, $, S ∉ (V1 ∪ V2 ∪ Σ) be the new symbols. Without
any loss of generality, assume that V1 ∩ V2 = ∅. Define the new morphisms
(1) c : ai ↦ 1 0^i 1;
(2) C1 : (V1 ∪ T )∗ → (V1 ∪ Σ ∪ {0, 1})∗ ,
    A ↦ A, A ∈ V1 ,
    a ↦ f (a), a ∈ T ;
(3) C2 : (V2 ∪ T )∗ → (V2 ∪ {0, 1})∗ ,
    A ↦ A, A ∈ V2 ,
    a ↦ c(a), a ∈ T ;
(4) f : ai ↦ h(ai )c(ai );
(5) t : (Σ ∪ {0, 1, $})∗ → Σ ∗ ,
    a ↦ a, a ∈ Σ,
    A ↦ ε, A ∉ Σ;
(6) t0 : (Σ ∪ {0, 1, $})∗ → {0, 1}∗ ,
    a ↦ a, a ∈ {0, 1},
    A ↦ ε, A ∉ {0, 1}.
Finally, let G = (V, Σ, P, S) be an SCG, with V = V1 ∪ V2 ∪ {S, 0, 1, $} and P containing the rules
(1) (S) → ($S1 1111S2 $);
(2) (A) → (Ci (w)), for all A → w ∈ Pi , where i = 1, 2;
(3) ($, a, a, $) → (ε, $, $, ε), for a = 0, 1;
(4) ($) → (ε).
Claim 1. L(G, 1⇒) = L.
Proof. Basic idea. First the starting rule from (1) is applied. The starting nonterminals S1 and S2 are inserted into the current sentential form. Then, by using
the rules from (2) G simulates derivations in both G1 and G2 and generates the
sentential form w = $w1 1111w2 $.
Suppose S 1⇒∗ w, where alph(w) ∩ (V1 ∪ V2 ) = ∅. If t0 (w1 ) = reversal(w2 ),
then t(w1 ) = h(v), where v ∈ L1 ∩ L2 and h(v) ∈ L. In other words, w represents a successful derivation of both G1 and G2 in which both grammars have generated the same sentence v; therefore, G must generate the sentence h(v).
The rules from (3) serve to check whether the simulated grammars have generated identical words. The binary codings of the generated words are erased while checking the equality. In each step, the leftmost and the rightmost symbols are erased; otherwise, some symbol is skipped. If the codings do not match, some 0 or 1 cannot be erased and no terminal string can be generated.
Finally, the symbols $ are erased with the rule from (4). If G1 and G2 generated the same sentence and both codings were successfully erased, then G has generated the terminal sentence h(v).
⊓⊔
For a rigorous proof, see [3]. Since L is an arbitrary recursively enumerable
language, by Claim 1 the proof of Theorem 1 is completed.
⊓⊔
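To convey the intuition behind the coding and the check performed by the rules from (3), the following small Python sketch (ours; it ignores the h(ai ) parts interleaved by C1 and simulates no grammar) encodes a sentence with c(ai ) = 1 0^i 1 and tests whether the two halves match in reverse:

def c(i):
    # binary coding of the i-th terminal: c(a_i) = 1 0^i 1
    return "1" + "0" * i + "1"

def code(word, alphabet):
    # concatenation of the codings of the symbols of `word`
    return "".join(c(alphabet.index(a) + 1) for a in word)

alphabet = ["a", "b", "c"]
v = "abca"                    # a sentence generated by both G1 and G2
w1 = code(v, alphabet)        # coding accumulated while simulating G1
w2 = code(v, alphabet)[::-1]  # G2 generates reversal(L2), so its coding appears reversed

# The rules from (3) erase the outermost matching 0/1 pair of w1 and w2 step by step;
# they can erase everything exactly when w1 equals reversal(w2).
print(w1 == w2[::-1])         # True -> a terminal string can be derived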
4.2 Mode 3
In this section, we prove that the family of languages generated by SCGs with mode
3 derivations coincides with the family of recursively enumerable languages.
Theorem 2. SC(3⇒) = RE.
Since SC(3⇒) ⊆ RE follows directly from the Church-Turing thesis, we only
have to prove the opposite inclusion.
Proof. Let G = (V, Σ, P, S) be the SCG constructed in the proof of Theorem 1.
Next, we modify G to a new SCG G0 such that L(G, 1⇒) = L(G0 , 1⇒). Finally,
we prove L(G0 , 3⇒) = L(G0 , 1⇒).
Construction. Let G0 = (V, Σ, P 0 , S) be an SCG with P 0 containing
(1) (S) → (S1 11$$11S2 );
(2) (A) → (Ci (w)) for A → w ∈ Pi , where i = 1, 2;
(3) (a, $, $, a) → ($, ε, ε, $), for a = 0, 1;
(4) ($) → (ε).
We establish the proof of Theorem 2 by the following two claims.
Claim 2. L(G0 , 1⇒) = L(G, 1⇒).
Proof. G0 is closely related to G; only the rules from (1) and (3) are slightly modified. As a result, the correspondence of the sentences generated by the simulated grammars G1 and G2 is not checked in the direction from the outermost to the central symbols, but from the central to the outermost symbols. Again, if the current two symbols do not match, they cannot both be erased and the derivation is blocked.
⊓⊔
Claim 3. L(G0 , 3⇒) = L(G0 , 1⇒).
Proof. Without any loss of generality, we can suppose that the rules from (1) and (2) are used only before the first usage of the rule from (3). The context-free rules
work unchanged with mode 3 derivations. Then, for every derivation
S 1⇒∗ w = w1 11$$11w2
generated only by the rules from (1) and (2), where alph(w) ∩ (V1 ∪ V2 ) = ∅,
there is the identical derivation
S 3⇒∗ w
and vice versa. Since
w 1⇒∗ w0 , w0 ∈ Σ ∗
if and only if t0 (w1 ) = reversal(w2 ), we can complete the proof of the previous
claim by the following one.
Claim 4. Let the sentential form w be generated only by the rules from (1) and
(2). Without any loss of generality, suppose alph(w) ∩ (V1 ∪ V2 ) = ∅. Consider
S 3⇒∗ w = w1 11$$11w2
Then, w 3⇒∗ w0 , where w0 ∈ Σ ∗ , if and only if t0 (w1 ) = reversal(w2 ).
For better readability, in the next proof we omit all symbols of w1 that belong to Σ; we consider only the nonterminal symbols, which are to be erased.
Basic idea. The rules from (3) are the only ones with 0s and 1s on their left-hand sides. These symbols are erased simultaneously to the left and to the right of the $s, checking the equality in reverse. While proceeding from the center to the edges, if any symbol is skipped and remains between the $s, there is no way to erase it, and no terminal string can be generated.
Now consider mode 3 derivations. Even though the symbols are erased one after another from the center to the left and to the right, the derivation mode can potentially shift the left $ further to the left and the right $ further to the right, skipping some symbols. Also in this case, the symbols left between the $s cannot be erased anymore.
Proof. If. Recall
w = 1 0^{m1} 1 1 0^{m2} 1 . . . 1 0^{mo} 1 11$$11 1 0^{mo} 1 . . . 1 0^{m2} 1 1 0^{m1} 1
Suppose the check works properly not skipping any symbol. Then
w 3⇒∗ w0 = $$
and after applying the rule from (4) twice, the derivation finishes.
⊓⊔
Proof. Only if. If w1 ≠ reversal(w2 ), then even if the check works properly,
w 3⇒∗ w0 = w10 x$$x0 w20
where x, x0 ∈ {0, 1}, x ≠ x0 . Continuing the check by applying the rules from (3) will necessarily skip x or x0 . Consequently, no terminal string can be generated.
We showed that G0 can generate a terminal string from the sentential form w if and only if t0 (w1 ) = reversal(w2 ), and the claim holds.
⊓⊔
Since S 1⇒∗ w, w ∈ Σ ∗ , if and only if S 3⇒∗ w, Claim 3 holds.
⊓⊔
We proved L(G, 1⇒) = L, L(G0 , 1⇒) = L(G, 1⇒) and L(G0 , 3⇒) = L(G0 , 1⇒),
therefore L(G0 , 3⇒) = L holds. Since L is an arbitrary recursively enumerable
language, the proof of Theorem 2 is completed.
⊓⊔
4.3 Mode 5
This section investigates mode 5 derivations. It proves that the family of languages generated by SCGs with mode 5 derivations coincides with the family of recursively enumerable languages.
Theorem 3. SC(5⇒) = RE.
Since SC(5⇒) ⊆ RE follows directly from the Church-Turing thesis, we only
have to prove the opposite inclusion.
Proof. Let G = (V, Σ, P, S) be the SCG constructed in the proof of Theorem 1.
Next, we modify G to a new SCG G0 so L(G, 1⇒) = L(G0 , 5⇒).
Construction. Introduce four new symbols D, E, F and ◦, and set N = {D, E, F, ◦}. Let G0 = (V 0 , Σ, P 0 , S) be an SCG, with V 0 = V ∪ N and P 0 containing the rules
(1) (S) → ($S1 1111S2 $ ◦ E ◦ F );
(2) (A) → (Ci (w)) for A → w ∈ Pi , where i = 1, 2;
(3) (F ) → (F F );
(4) ($, a, a, $, E, F ) → (ε, ε, $, $, ε, D), for a = 0, 1;
(5) (◦, D, ◦) → (ε, ◦E◦, ε);
(6) ($) → (ε), (E) → (ε), (◦) → (ε).
Claim 5. L(G, 1⇒) = L(G0 , 5⇒).
Proof. Context-free rules are not influenced by the derivation mode. The rule from (3) must generate precisely as many F s as the number of applications of the rule from (4). The context-sensitive rules of G0 correspond to the context-sensitive rules of G, except for the special rule from (5). We show that the construction of G0 forces the context-sensitive rules to work exactly in the same way as the rules of G do.
Every application of the rule from (4) must be followed by an application of the rule from (5), which rewrites D back to E and requires the symbol D between two ◦s. This ensures that the previous usage of a context-sensitive rule selected the center to the right of the rightmost affected nonterminal, so that all right-hand side strings changed their positions with the ones further to the left. The leftmost right-hand side string is then shifted nondeterministically to the left, but it is always ε. The derivation mode has no influence on the rule from (5).
From the construction of G0 , it works exactly in the same way as G does. ⊓⊔
L(G, 1⇒) = L(G0 , 5⇒) and L(G, 1⇒) = L, therefore L(G0 , 5⇒) = L. Since
L is an arbitrary recursively enumerable language, the proof of Theorem 3 is
completed.
⊓⊔
4.4 Mode 7
This section investigates mode 7 derivations and proves that SCGs working under this derivation mode are Turing-complete.
Theorem 4. SC(7⇒) = RE.
Since SC(7⇒) ⊆ RE follows directly from the Church-Turing thesis, we only
have to prove the opposite inclusion.
Proof. Let G = (V, Σ, P, S) be the SCG constructed in the proof of Theorem 1.
Next, we modify G to a new SCG G0 so L(G, 1⇒) = L(G0 , 7⇒).
Construction. Introduce four new symbols E, F , G and |, and set N = {E, F, G, |}. Let G0 = (V 0 , Σ, P 0 , S) be an SCG, with V 0 = V ∪ N and P 0 containing the rules
(1) (S) → (F GS1 11$|$11S2 );
(2) (A) → (Ci (w)) for A → w ∈ Pi , where i = 1, 2;
(3) (F ) → (F F );
(4) (a, $, $, a) → (ε, E, E, ε), for a = 0, 1;
(5) (F, G, E, |, E) → (G, $, |, $, ε);
(6) ($) → (ε), (G) → (ε), (|) → (ε).
Claim 6. L(G, 1⇒) = L(G0 , 7⇒).
Proof. The behaviour of context-free rules remains unchanged under mode 7
derivations. Since the rules of G0 simulating the derivations of G1 , G2 , respectively, are identical to the ones of G simulating both grammars, for every derivation of G
S 1⇒∗ $w1 1111w2 $ = w
where w was generated only using the rules from (1) and (2) and alph(w) ∩ (V1 ∪
V2 ) = ∅, there is
S 7⇒∗ F Gw1 11$|$11w2 = w0
in G0 , generated by the corresponding rules from (1) and (2), and vice versa.
Without any loss of generality, we can consider such a sentence form in every
successful derivation. Additionally, in G
w 1⇒∗ v, v ∈ Σ ∗
if and only if t0 (w1 ) = reversal(w2 ). Note that then v = t(w). Therefore, we have to prove
w0 7⇒∗ v 0 , v 0 ∈ Σ ∗
if and only if t0 (w1 ) = reversal(w2 ). Then obviously v 0 = v and we can complete
the proof by the following claim.
Claim 7. In G0 , consider
S 7⇒∗ w = F i Gw1 $|$w2 , for some i ≥ 1,
where w was generated only using the rules from (1) through (3) and alph(w) ∩ (V1 ∪ V2 ) = ∅. Then
w 7⇒∗ w0 ,
where w0 ∈ Σ ∗ , if and only if t0 (w1 ) = reversal(w2 ).
The new rule from (3) may multiply the number of F s on the left arbitrarily. The F s are then erased using the rule from (5). Thus, without any loss of generality, suppose that i equals the number of future usages of the rule from (5).
For better readability, in the next proof we omit all symbols of w1 that belong to Σ; we consider only the nonterminal symbols, which are to be erased.
Proof. If. Suppose w1 = reversal(w2 ); then w 7⇒∗ ε. We prove this by induction on the length of w1 and w2 , where |w1 | = |w2 | = k. Then, obviously, i = k. By the construction of G0 , the least k equals 2, but we prove the claim for all k ≥ 0.
Basis. Let k = 0. Then w = G$|$. By the rules from (6), G$|$ 7⇒∗ ε, and the basis holds.
Induction Hypothesis. Suppose there exists k ≥ 0 such that the claim holds for all
m, where
w = F m Gw1 $|$w2 , |w1 | = |w2 | = m, 0 ≤ m ≤ k
Induction Step. Consider G0 generates w, where
w = F k+1 Gw1 $|$w2 , |w1 | = |w2 | = k + 1
Since w1 = reversal(w2 ) and |w1 | = |w2 | = k + 1, w1 = w10 a, w2 = aw20 .
The symbols a can be erased by application of the rules from (4) and (5)
under several conditions. First, when the rule from (4) is applied, the center
for interchanging right hand side strings must be chosen between the two $s,
otherwise both Es appear on the same side of the symbol | and the rule from
(5) is not applicable. Next, no 0 or 1 may be skipped, while proceeding in the
direction from center to the edges. Finally, when the rule from (5) is applied,
the center must be chosen to the left of F , otherwise G is erased and the future
application of this rule is excluded.
F k+1 Gw10 a$|$aw20 7⇒ F k+1 Gw10 E|Ew20 7⇒ F k Gw10 $|$w20 = w0
By induction hypothesis w0 7⇒∗ ε, which completes the proof.
Only if. Suppose w1 ≠ reversal(w2 ); then there is no w0 , where w 7⇒∗ w0 and w0 = ε.
Since w1 ≠ reversal(w2 ), w1 = uav, w2 = va0 u0 and a ≠ a0 . Suppose both vs are correctly erased and no symbol is skipped, producing the sentential form
F i Gua$|$a0 u0
Next, the rule from (4) can be applied to erase the innermost 0s or 1s. However, since a ≠ a0 , even if the center is chosen properly between the two $s, there is a 0 or 1 left between the inserted Es that cannot be erased, which completes the proof.
We showed that G0 can generate a terminal string from the sentential form w if and only if t0 (w1 ) = reversal(w2 ), and the claim holds.
⊓⊔
We proved that S 1⇒∗ w, w ∈ Σ ∗ , in G, if and only if S 7⇒∗ w in G0 ; hence L(G, 1⇒) = L(G0 , 7⇒) and the claim holds.
⊓⊔
Since L(G, 1⇒) = L(G0 , 7⇒), L(G, 1⇒) = L and L is an arbitrary recursively
enumerable language, the proof of Theorem 4 is completed.
⊓⊔
5 Conclusion
The modern trend in information processing is parallel access to typically distributed data; however, in formal language theory, automata and grammars traditionally process information in a continuous and often sequential way. Such models are not entirely suitable for the study of modern approaches to data processing, where data are frequently read from and written to different parts of the memory space simultaneously, sometimes physically separated.
For modelling parallel data processing, scattered context grammars seem to be an appropriate choice; they have already been investigated in a long series of studies, which brought a number of important results, especially their computational completeness. Still, even scattered context grammars are not a perfect model of modern data processing, and a whole variety of suitable modifications can be established.
However, the aim of this study was not to modify the model itself, only the way it generates the terminal string. The main motivation was to break the usual approach to rewriting and divide the process into deletion and insertion, which do not necessarily take place at the same part of the sentential form. The mutual relation of these two now separated actions is then defined by the constraints resulting from the definition of the derivation mode used. Despite the additional nondeterminism brought into the computational process, it has been proven that the generative power of the model is not reduced and that it is still as powerful as Turing machines.
Acknowledgments
This work was supported by the following grants: BUT FIT FIT-S-14-2299,
MŠMT CZ1.1.00/02.0070, and TAČR TE01010415.
References
1. Rozenberg, G., Salomaa, A., eds.: Handbook of Formal Languages, Vol. 1:
Word, Language, Grammar. Springer, New York (1997)
2. Salomaa, A.: Formal Languages. Academic Press, London (1973)
3. Fernau, H., Meduna, A.: A simultaneous reduction of several measures of descriptional complexity in scattered context grammars. Information Processing
Letters 86(5) (2003) 235–240
4. Harrison, M.A.: Introduction to Formal Language Theory. 1st edn. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1978)
5. Meduna, A., Zemek, P.: Regulated Grammars and Their Transformations.
Faculty of Information Technology, Brno University of Technology (2010)
6. Meduna, A., Techet, J.: Scattered Context Grammars and their Applications.
WIT Press (2010) ISBN: 978-1-84564-426-0.
Convergence of Parareal Algorithm Applied on
Molecular Dynamics Simulations
Jana Pazúriková1 and Luděk Matyska1,2
1 Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
2 Institute of Computer Science, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
{pazurikova,ludek}@ics.muni.cz
Abstract. Parallel and distributed computations based on the spatial
decomposition of the problem are beginning to fail to saturate large
supercomputers with their limited strong scalability. An application of
the high performance computing, a molecular dynamics simulation, shows
this limit especially in experiments with longer simulation times. When
the parallelism in space does not suffice, the parallelism in time could
cut the wallclock time at the expense of additional computational resources. The parareal algorithm that decomposes the temporal domain
has been extensively researched and already applied to molecular dynamics simulations, however with rather modest results. We propose a novel
modification that uses the coarse function with a simpler physics model,
not a longer timestep as in state-of-the-art methods. The evaluation with
a prototype implementation indicates that our method provides a rather
long time window for the parallel-in-time computation, a reasonable convergence and stable properties of the simulated system.
1 Introduction
High performance computing largely relies on the parallel and distributed computation. The decomposition divides the problem almost always along the spatial domain, i.e. into subspaces or subsets. Therefore, the scalability of computation depends on the problem size. When solving the fixed-size problem on
an increasing amount of computational resources, one can observe the limit in
the strong scaling3 : after a certain number of resources, adding more does not
shorten the wallclock time of a computation. This limit makes it impossible to cut the time to result down at the expense of the computational power that is becoming more and more available.
3 Strong scaling is a function that maps the increasing number of computational resources that solve a problem of fixed size to the wallclock time of a computation; it shows how the wallclock time (usually) decreases as the resources grow while the problem remains the same. Weak scaling is a function that maps the increasing number of computational resources that solve a problem of increasing size to the wallclock time of a computation; it shows how the wallclock time ideally remains the same as both the resources and the problem grow.
The key is to increase the level of parallelism. Many simulations capture changes in space over time. Parallel-in-time
methods simultaneously calculate the results in several time points [1–3]. One
of the common methods, the parareal method [2], first approximates the results
with a coarse function and then iteratively corrects them with a fine function in
parallel while enforcing the continuity.
Molecular dynamics simulations [4, 5] require high performance computing approaches due to the large number of computationally expensive steps. These in silico experiments offer a high-resolution view of many scientifically relevant natural processes such as protein folding [6], drug and nanomaterial interactions [7, 8] or the occurrence of particular phenomena [9]. Their implications
reach to chemistry, biology, pharmacy, medicine, even advanced materials. The
relevance of the simulated process often increases with the longer timescale of
the simulation (or both the size of the system and the longer timescale).
Parallel implementations of MD codes all rely on the spatial decomposition [10] although a few attempts of the temporal decomposition have been
made [11–14]. With the parareal method applied, the simulation speeds up rather poorly. The speedup of the parareal method depends on two conditions [15]. First, the coarse function has to be significantly cheaper than the fine function. In almost all published experiments, authors have chosen a method with a longer integration timestep, further referred to as the longer-timestep method, which has a quite small cost ratio. We propose to base the coarse function on a simpler physics model, which, according to our assessments, should provide
a much higher speedup. As for the second condition, the number of iterations
required for the convergence has to be significantly lower than the number of
time points. We evaluated the convergence of our modification before the implementation of the parallel version and evaluation of the speedup to determine
if it presents an approach worth further researching. We present the results of
the convergence evaluation in this paper.
Four more sections follow. First, we shortly introduce molecular dynamics and describe the parareal method. Second, we present our application of
the parareal method on molecular dynamics. Third, we evaluate the convergence
of our method and, finally, we discuss the limitations of the current implementation and suggestions for future work.
2 Background
2.1 Molecular Dynamics Simulations
Molecular dynamics, a tool of computational chemistry, computes movements of
particles due to their interactions over time [4, 5]. In the context of this work,
we consider the model of molecular mechanics that approximates atoms as electrically charged points of mass and their interactions as empirically determined
functions. The model represents an electrostatic N-body problem. The simulation takes input data—types of atoms, the topology, charges qi , positions ri and
velocities vi —and then iteratively repeats the following steps:
1. calculate the potential Uall with the empirical functions and evaluate the force Fi exerted on an atom i:
   Fi = −∂Uall /∂ri    (1)
2. move the particles:
   Fi = −∂Uall /∂ri = mi ai = mi (dvi /dt) = mi (d2 ri /dt2 )    (2)
3. update the time, optionally generate an output. [5]
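As a rough illustration of the three steps above (a minimal Python/NumPy sketch of ours with a toy Coulomb-like pair force; the constants, names and the missing bonded terms are our own simplifications, not the force fields or packages used in practice):

import numpy as np

def forces(r, q, k=1.0):
    # toy pairwise electrostatic forces; no bonded terms, no cutoff, no periodic boundaries
    f = np.zeros_like(r)
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            d = r[i] - r[j]
            dist = np.linalg.norm(d)
            fij = k * q[i] * q[j] * d / dist**3   # magnitude k q_i q_j / d^2, directed along d
            f[i] += fij
            f[j] -= fij
    return f

def md_step(r, v, q, m, dt):
    # one velocity Verlet step: evaluate forces, move the particles, update velocities
    f = forces(r, q)
    r_new = r + v * dt + 0.5 * (f / m[:, None]) * dt**2
    f_new = forces(r_new, q)
    v_new = v + 0.5 * (f + f_new) / m[:, None] * dt
    return r_new, v_new

# two oppositely charged particles, a handful of steps
r = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
v = np.zeros_like(r)
q = np.array([1.0, -1.0])
m = np.array([1.0, 1.0])
for step in range(5):
    r, v = md_step(r, v, q, m, dt=0.01)
print(r)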
The output of the simulation includes the trajectories of particles, forces
and the energy of the system. Other properties of the system can be processed
from the output data by applying statistical mechanics and other functions from
physics and chemistry.
Simulations of molecular dynamics have rather high computational needs due to two reasons: the large number of steps and the high cost of each step. The typical timestep of the integration scheme of equation (2) is 2 fs (2·10^−15 s); in comparison, the timescale of the villin headpiece folding process reaches up to 500 µs (500·10^−6 s). Production simulations range from pico- to microseconds, from
thousands to billions of steps. The bottleneck of each step lies in the calculation
of the potential, especially the long-range, electrostatic potential [16].
Standard MD packages [17–20] decompose the spatial domain for their parallel run [10]; all show almost perfect weak scaling. The highest limit in the strong scaling with only parallel-in-space computation has been reached in two experiments. Andoh et al. [21] have conducted a simulation with 10^7 atoms on over half a million cores; one step was evaluated in 5 ms. Richards et al. [22] have achieved 85% efficiency in a simulation with ∼10^9 particles on up to 294000 cores, one step evaluated in 550 ms.
2.2 Parallel-in-Time Computation
The obvious sequentiality of time has been overcome by several methods that
reveal the possibility of parallelism along the temporal domain.
The projects Copernicus [11] and Folding@Home [23] build the Markov model
of process’s different metastable states that gradually explores the whole state
space by many simultaneous short simulations. They use a highly distributed
framework and achieve near-linear strong scaling and fine problem granularity,
e.g. in the simulation of ∼ 104 -atom system on ∼ 5000 AMD cores. Protein
folding presents an appropriate use case for this form of the coarse-grained time
parallelism as it consists of many metastable states connected with short transitions.
Yu et al. [12] apply data from prior related simulations to guide the system and predict its changes; however, this approach heavily depends on an almost perfect success rate of the prediction algorithm. The experiment simulated a nanotube
reacting to an external force.
Since the 1960s, mathematicians have developed various methods to calculate parallel-in-time at a fine granularity: they compute the results of a time-dependent differential equation in different time points simultaneously [3]. The first such method, by Nievergelt in 1964 [1], later became known as the multiple shooting method. The time-parallel approach to iterative methods for solving partial differential equations with implicit integration schemes [24] was followed by the application of multigrid methods for acceleration [25–27]. In recent years, the parareal method [2] has been gaining popularity.
2.3 Parareal Algorithm
Lions, Maday and Turinici devised the parareal in time method [2] in 2001,
since then it has been extensively researched [3, 15, 28, 29] and applied to diverse
simulations [30–33]. Most notably, Speck et al. [30] have developed a modification
of the parareal algorithm that made it possible to run a gravitational N-body
simulation with notable strong scaling.
In the traditional, sequential-in-time computation, the accurate yet usually
expensive function F determines the results λt+1 in time t + 1 by known results
λt in time t < T , as Figure 1 shows.
Fig. 1. Sequential-in-time computation: starting from λ1 = v, each λt+1 is obtained from λt by the fine function F .
The parareal method requires a second method: the coarse, yet computationally cheap function G. It can be based on a longer timestep, a coarser spatial decomposition or a simpler model. The parareal scheme shifts the sequential nature of time from the expensive function F to the cheap G, and the approximations made by the coarse function are iteratively improved with the fine function in parallel. It can be viewed as a form of the predictor-corrector scheme [34, 35].
The function G roughly assesses the initial approximation of the results. The
difference between the results from the precise calculation and from the coarse
calculation on the same data presents the error that is included into the calculation in the next iteration. The continuity of the corrected results is again
enforced sequentially by G.
The parareal method running for T time points and K iterations builds in
parallel the sequence λkt that rapidly converges to λt as k increases and each λ
is calculated as:
λ^{k+1}_{n+1} = G(λ^{k+1}_n ) + F(λ^k_n ) − G(λ^k_n ) = G(λ^{k+1}_n ) + ∆^k_n    (3)
Figure 2 depicts the sequential calculation of cheap G (horizontal, successive arrows) and the parallel calculation of expensive F (vertical arrows without
data dependencies). The speedup of the parareal method relies on the significant difference between the computational complexity of F and G and the fast
convergence so that K ≪ T . The convergence depends on the chosen functions,
the correction term and the problem properties. After λ^2_{10} from Figure 2 is calculated, the computational window shifts to the right on the time axis and λ^2_{10}
presents the new initial condition.
Fig. 2. Computational flow of the parareal algorithm: the cheap G propagates each iteration sequentially along the time axis, the expensive F is applied in parallel to the λ^k_n of the previous iteration, and the corrections ∆^k_n are added as G advances.
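As a minimal sketch of the scheme in equation (3) (our toy Python example propagating a scalar ODE, not the MD prototype described below; fine and coarse are stand-ins for F and G):

def fine(lam, dt):
    # accurate propagator F: many small explicit Euler steps of dy/dt = -y
    for _ in range(100):
        lam = lam + (dt / 100) * (-lam)
    return lam

def coarse(lam, dt):
    # cheap propagator G: a single explicit Euler step
    return lam + dt * (-lam)

def parareal(lam0, dt, T, K):
    # initial prediction: sequential sweep with the coarse propagator
    lam = [lam0]
    for n in range(T):
        lam.append(coarse(lam[n], dt))
    for k in range(K):
        # the expensive fine solves are mutually independent -> this is the parallel part
        delta = [fine(lam[n], dt) - coarse(lam[n], dt) for n in range(T)]
        new = [lam0]
        for n in range(T):
            # correction: lambda_{n+1}^{k+1} = G(lambda_n^{k+1}) + F(lambda_n^k) - G(lambda_n^k)
            new.append(coarse(new[n], dt) + delta[n])
        lam = new
    return lam

print(parareal(1.0, 0.1, T=10, K=3)[-1])   # close to exp(-1) ~ 0.368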
3 Parareal Scheme in Molecular Dynamics Simulations
The parareal algorithm has been applied to the molecular dynamics simulations
in several experiments [13, 14, 36, 37]. The coarse function based on a longer
timestep in the integration scheme failed to provide reasonable convergence and
speedup. As Bulin in [14] states, more appropriate are coarse functions based
on a simpler model. We have set the fine function as one step of MD integrated
with Verlet scheme and the long-range potential evaluated with the multilevel
summation method (MSM) [38]. We have examined coarse functions built upon
several concepts in terms of the theoretical speedup and the ability to run in
parallel [39].
The first concept, a further simplification of the model, reduces the cost by abstracting from the physics of the problem. Discrete or coarse-grained MD would produce completely different trajectories than the MSM; it would be challenging to deal with the large correction term, which would probably lead to an instability. Cheap and inaccurate methods for the evaluation of the demanding long-range electrostatics, such as the cutoff method or the Wolf summation method, are worth researching.
The second concept, different parameters of the method for the evaluation of long-range interactions (such as fewer iterations in Fourier methods or a coarser grid in MSM), offers only a small theoretical speedup, as such methods are still quite accurate and thus not cheap.
And finally, different parameters of the integration scheme (such as a longer timestep or a different scheme) do not promise any good results. The longer-timestep method has been evaluated many times without success. As for a different integration scheme, MD simulations usually apply the Verlet or leapfrog integration schemes, which have proven their suitability. The simpler Euler method would introduce a large error, and the Runge-Kutta method would rapidly increase the cost [19].
The theoretical speedup of the parareal method depends on the cost ratio
between the fine and the coarse function QF /G and the ratio between the number
of time points T and the number of iterations K:
speedup = QF/G / (1 + (K/T ) QF/G )    (4)
We set F as one step of MD with the Verlet integration scheme and MSM for long-range interactions (a = 12 Å, h = 2 Å, h∗ = 1 Å, m = 2, p = 3 as described in [39]); G as one step of MD with the Verlet integration scheme and the cutoff method for long-range interactions (rcutoff = 12 Å). For our choice of F and G, QF/G reaches 60, so we suppose we can achieve a speedup of an order of magnitude. With the coarse function based on a longer timestep, the cost ratio equals the ratio between the timestep in G and the timestep in F. However, a timestep longer than a few femtoseconds quickly leads to the simulation's blowup, preventing the ratio from getting higher than 5.
Apart from the choice of the fine and the coarse function, the definition of the correction term ∆^k_n = F(λ^k_n ) − G(λ^k_n ) in the context of MD simulations also determines the behavior of the parareal method. We have set ∆^k_n as an absolute
difference between two results’ atomic positions and velocities.
We have implemented a prototype of the parareal method and the correction term in C. Both fine and coarse functions are evaluated by LAMMPS [20] in
one-step NPT simulation (with constant number of atoms, pressure and temperature). We have verified the functionality by running an experiment with T time
points and K = T iterations. By definition of the parareal method, the results
in the last time point of the last iteration should be the same as in the sequential-in-time experiment. Apart from a rounding error, we have obtained the same
results.
4 Convergence and Stability Evaluation
We evaluated the convergence of the parareal method applied on a molecular dynamics simulation through our prototype implementation. The experiment followed this procedure. First, we ran a sequential-in-time simulation of
a 32000-atom solvated protein rhodopsin [40]. CHARMM force field [41] determined the parameters of all atoms and functions for potential evaluation.
The close-range potentials included all standard bonded interactions and van
der Waals interactions by Lennard-Jones potential. The electrostatic interactions were approximated by MSM with a maximum relative force error of 10^−4 . The
simulation ran for T timesteps. Second, we ran our parareal simulation for T
timesteps and K iterations on the same input. The function F corresponded exactly to one step of sequential-in-time simulation. The function G differed from
F in the evaluation of electrostatic potential: it used smoothed cutoff potential
defined by CHARMM.
In the convergence evaluation, we examined three aspects: (i) the longest
possible computational window, (ii) the number of iterations needed for reasonable convergence and (iii) the difference between the trajectories. The computational window represents the number of time points we can calculate parallel-in-time. In too long windows, the thermostat and the barostat may not be able to keep the system in a viable state. As a common production simulation consists of more than a few computational windows, we need the results of the sequential-in-time (λT , as in Figure 1) and the parallel-in-time (λ^k_T for several k, as in Figure
2) simulation after T time steps to be almost the same. We suppose that for
the reasonable convergence of the whole production simulation, the root mean
square distance (RMSD) should be less than 0.1 Å as the atomic positions of
two such close results virtually do not differ. Frames of obtained trajectories are
compared also by RMSD.
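As an aside, the RMSD between two sets of atomic positions can be computed, for instance, as follows (a plain NumPy sketch of ours that skips the structural alignment usually performed first; the arrays are made up):

import numpy as np

def rmsd(a, b):
    # root mean square distance between two (N, 3) arrays of atomic positions
    diff = a - b
    return np.sqrt((diff ** 2).sum() / len(a))

sequential = np.random.rand(32000, 3)                    # positions from the sequential run
parallel = sequential + 0.01 * np.random.rand(32000, 3)  # slightly perturbed positions
print(rmsd(sequential, parallel))                        # small value -> trajectories agree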
In the stability evaluation, we examined two aspects: the temperature and
the pressure of the system. Any steep changes may suggest that the correction
term is causing instabilities.
Our modification of the parareal method can have a computational window as long as 40 steps, i.e. 80 fs. We found that with T time points computing in parallel, we need roughly T /4 or T /3 iterations for close results. Figure 3 shows how the RMSD between λT and λ^k_T decreases as the number of iterations, k, increases. The first k that gets the RMSD below 0.1 Å is marked by a full square. The distance of λ^0_t , the initial G approximation, increases with
the number of timesteps in the computational window. The RMSD decreases
with the increasing number of iterations in an asymptotically linear manner.
In Figure 4, we can see the trajectories of the sequential-in-time simulation
and the parareal simulation with T = 40 after different iterations. The error’s
uptake after the first correction takes place from early timepoints, not just in
the end. The results of the 15th iteration almost perfectly copy those of the sequential simulation not only in the latest time point, but along the whole way.
The temperature of the system in the last time point over iterations also gradually decreases, as Figure 5 shows for the simulation with T = 40. It converges
to 300 K set by NPT environment around the 15th iteration, the same as the results converge to the accurate ones. The pressure of the system also gradually
decreases, although the initial uptake is much higher than for the temperature.
Fig. 3. The results of the parallel-in-time computation converge to the results of the sequential-in-time computation (RMSD [Å] against the number of iterations, for T = 10, 20, 30, 40).
Fig. 4. The difference between the trajectories of the sequential-in-time computation (bottom full line) and the trajectories of several iterations (k = 0, 6, 12, 15), T = 40 (RMSD [Å] against timesteps).
Fig. 5. The temperature of the system [K] in the parallel-in-time simulation in the 40th timestep, over iterations.
Fig. 6. The pressure of the system [atm] in the parallel-in-time simulation in the 40th timestep, over iterations.
5 Discussion and Future Work
In this paper, we proposed a modification to the parareal method in the context
of molecular dynamics simulations. We found that it leads to a rather satisfactory convergence and stability and it offers a quite long computational window. Therefore, it is worth exploring the main advantage of the simpler-physics
over the longer-timestep methods: the speedup. If the calculation proceeds in
a pipeline manner, i.e. the computation of F starts immediately after its λ has
been computed, the upper bound of theoretical speedup for T = 40, K = 15 is
59.6.
The modest convergence may be hastened if the correction term can be extrapolated. We will further analyze ∆ and experiment with other possibilities: the difference only between selected atoms, the difference in forces or a further correction
of the temperature and the pressure. The latter one may prolong the computational window as the simulation explodes after 45 steps due to the sudden
pressure and temperature changes.
The speedup of the parareal method is limited by the cost ratio between
the fine and the coarse function. The efficiency is limited by 1/K, although Speck et al. [30] devised a modified parareal method called PFASST that amortizes the cost of the correction steps, thus increasing the efficiency. Therefore, to evaluate the speedup of our modification, we want to incorporate it into Speck's
method.
The evaluation of the convergence of the parareal method in MD simulations
is the first step in what could lead to improving the strong scaling and cutting
the wallclock time of molecular dynamics experiments.
References
1. J Nievergelt. Parallel methods for integrating ordinary differential equations.
Communications of the ACM, 7(12):731–733, 1964.
2. JL Lions, Y Maday, and G Turinici. Résolution d’EDP par un schéma en temps
pararéel . Comptes Rendus de l’Académie des Sciences - Series I - Mathematics,
332(7):661–668, April 2001.
3. MJ Gander and S Vandewalle. Analysis of the Parareal Time-Parallel
Time-Integration Method. SIAM Journal on Scientific Computing,
29(2):556–578, January 2007.
4. E Lewars. Computational Chemistry: Introduction to the Theory and
Applications of Molecular and Quantum Mechanics. Springer, 2nd edition, 2010.
5. F Jensen. Introduction to computational chemistry. John Wiley & Sons Ltd,
Great Britain, 2nd edition, 2007.
6. C Lee and S Ham. Characterizing amyloid-beta protein misfolding from
molecular dynamics simulations with explicit water. Journal of Computational
Chemistry, 32(2):349–355, January 2011.
7. L Boechi, CAF de Oliveira, I Da Fonseca, K Kizjakina, P Sobrado, JJ Tanner,
and JA McCammon. Substrate-dependent dynamics of UDP-galactopyranose
mutase: Implications for drug design. Protein Science, 22(11):1490–1501,
November 2013.
8. D Lau and R Lam. Atomistic Prediction of Nanomaterials: Introduction to
Molecular Dynamics Simulation and a Case Study of Graphene Wettability.
IEEE Nanotechnology Magazine, 6(1):8–13, March 2012.
9. G Zhao, JR Perilla, EL Yufenyuy, X Meng, B Chen, J Ning, J Ahn,
AM Gronenborn, K Schulten, C Aiken, and P Zhang. Mature HIV-1 capsid
structure by cryo-electron microscopy and all-atom molecular dynamics. Nature,
497(7451):643–6, May 2013.
10. KJ Bowers, RO Dror, and DE Shaw. Overview of neutral territory methods for
the parallel evaluation of pairwise particle interactions. Journal of Physics:
Conference Series, 16:300, 2005.
11. S Pronk, P Larsson, I Pouya, GR Bowman, IS Haque, K Beauchamp, B Hess,
VS Pande, PM Kasson, and E Lindahl. Copernicus: a new paradigm for parallel
adaptive molecular dynamics. In Proceedings of Supercomputing, SC ’11, pages
60:1–60:10, New York, NY, USA, 2011. ACM.
12. Y Yu, A Srinivasan, and N Chandra. Scalable Time-Parallelization of Molecular
Dynamics Simulations in Nano Mechanics. In Conference on Parallel Processing,
pages 119–126. IEEE, 2006.
13. L Baffico, S Bernard, Y Maday, G Turinici, and G Zérah. Parallel-in-time
molecular-dynamics simulations. Physical Review E, 66:057701:1–057701:4, 2002.
14. J Bulin. Large-scale time parallelization for molecular dynamics problems.
Technical report, Royal Institute Of Technology, Stockholm, Stockholm, 2013.
15. Y Maday. The parareal in time algorithm. Technical Report R08030, Université
Pierre et Marie Curie, pages 1–24, 2008.
16. P Koehl. Electrostatics calculations: latest methodological advances. Current
opinion in structural biology, 16(2):142–151, April 2006.
17. B Hess, C Kutzner, D van der Spoel, and E Lindahl. GROMACS 4: Algorithms
for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. Journal
of Chemical Theory and Computation, 4(3):435–447, 2008.
18. DA Case, TE Cheatham, T Darden, H Gohlke, R Luo, KM Merz, A Onufriev,
C Simmerling, B Wang, and RJ Woods. The Amber biomolecular simulation
programs. Journal of computational chemistry, 26(16):1668–1688, December
2005.
19. JC Phillips, R Braun, W Wang, J Gumbart, E Tajkhorshid, E Villa, C Chipot,
RD Skeel, L Kalé, and K Schulten. Scalable molecular dynamics with NAMD.
Journal of Computational Chemistry, 26(16):1781–1802, December 2005.
20. SJ Plimpton. Fast Parallel Algorithms for Short Range Molecular Dynamics.
Journal of Computational Physics, 117:1–19, 1995.
21. Y Andoh, N Yoshii, K Fujimoto, K Mizutani, H Kojima, A Yamada, S Okazaki,
K Kawaguchi, H Nagao, K Iwahashi, F Mizutani, K Minami, S Ichikawa,
H Komatsu, S Ishizuki, Y Takeda, and M Fukushima. MODYLAS: A Highly
Parallelized General-Purpose Molecular Dynamics Simulation Program for
Large-Scale Systems with Long-Range Forces Calculated by Fast Multipole
Method (FMM) and Highly Scalable Fine-Grained New Parallel Processing
Algorithms. Journal of Chemical Theory and Computation, 9(7):3201–3209, July
2013.
22. DF Richards, JN Glosli, B Chan, MR Dorr, EW Draeger, JL Fattebert,
WD Krauss, T Spelce, FH Streitz, MP Surh, and JA Gunnels. Beyond
homogeneous decomposition: scaling long-range forces on Massively Parallel
Systems. In Proceedings of the Conference on High Performance Computing
Networking, Storage and Analysis, SC ’09, pages 60:1–60:12, New York, NY,
USA, 2009. ACM.
23. SM Larson, CD Snow, M Shirts, and VS Pande. Folding @ Home and Genome @
Home : Using distributed computing to tackle previously intractable problems in
computational biology. Technical report, 2002.
24. A Deshpande, S Malhotra, CC Douglas, and MH Schultz. A rigorous analysis of
time domain parallelism. Parallel Algorithms and Applications, 6(1):53–62, 1995.
25. G Horton. The time-parallel Multigrid Method. Communications in Applied
Numerical Methods, 8:585–595, 1992.
26. S Vandewalle and E van de Velde. Space-time concurrent multigrid waveform
relaxation. Annals of Numerical Mathematics, 1(1-4):335–346, 1994.
27. G Horton and S Vandewalle. A space-time multigrid method for parabolic partial
differential equations. SIAM Journal on Scientific Computing, 16(4):848–864,
1995.
28. Y Maday and G Turinici. The Parareal in Time Iterative Solver: a Further
Direction to Parallel Implementation. In Domain Decomposition Methods in
Science and Engineering, pages 441–448, 2005.
29. E Aubanel. Scheduling of tasks in the parareal algorithm. Parallel Computing,
37(3):172–182, March 2011.
30. R Speck, D Ruprecht, R Krause, M Emmett, M Minion, M Winkel, and
P Gibbon. A massively space-time parallel N-body solver. In Proceedings of
Supercomputing, pages 92:1–92:11, 2012.
31. D Samaddar, DE Newman, and R Sánchez. Parallelization in time of numerical
simulations of fully-developed plasma turbulence using the parareal algorithm.
Journal of Computational Physics, 229(18):6558–6573, September 2010.
32. AE Randles. Modeling Cardiovascular Hemodynamics Using the Lattice
Boltzmann Method on Massively Parallel Supercomputers. PhD thesis, Harvard
University, 2013.
33. A Baudron, J Lautard, Y Maday, and O Mula. The parareal in time algorithm
applied to the kinetic neutron diffusion equation. In International Conference on
Domain Decomposition Methods, 2013.
34. WL Miranker and W Liniger. Parallel methods for the numerical integration of
ordinary differential equations. Mathematics of Computation, 21(99):303–320,
1967.
35. CW Gear. The automatic integration of ordinary differential equations.
Communications of the ACM, 14(3):176–179, March 1971.
36. A Srinivasan and N Chandra. Latency tolerance through parallelization of time
in scientific applications. Parallel Computing, 31(7):777–796, July 2005.
37. A Nakano, P Vashishta, and RK Kalia. Parallel multiple-time-step molecular
dynamics with three-body interaction. Computer Physics Communications,
77:303–312, 1993.
38. DJ Hardy. Multilevel summation for the fast evaluation of forces for the
simulation of biomolecules. PhD thesis, University of Illinois at
Urbana-Champaign, 2006.
39. J Pazúriková. Large-Scale Molecular Dynamics Simulations for Highly Parallel
Infrastructures. Technical report, 2014. http://arxiv.org/abs/1402.7216.
40. LAMMPS. Rhodopsin Benchmark.
http://lammps.sandia.gov/bench.html#rhodo.
41. WD Cornell, P Cieplak, CI Bayly, IR Gould, KM Merz, DM Ferguson,
DC Spellmeyer, T Fox, JW Caldwell, and PA Kollman. A Second Generation
Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules.
Journal of the American Chemical Society, 117(19):5179–5197, May 1995.
A Case for a Multifaceted Fairness Model:
An Overview of Fairness Methods for Job
Queuing and Scheduling
Šimon Tóth
Faculty of Informatics, Masaryk University
Botanická 68a, Brno, Czech Republic
[email protected]
Abstract. Job scheduling for HPC and Grid-like systems, while being
a heavily studied subject, suffers from a particular disconnect between
theoretical approaches and practical applications. Most production systems still rely on a small set of rather conservative scheduling policies.
One of the areas that tries to bridge the world of scientific research and
practical application is the study of fairness. Fairness in a system has
strong implications on customer satisfaction, with psychological studies
suggesting that perceived fairness is generally even more important than
the quality of service. This paper provides an overview of different approaches for handling fairness in a job scheduling/queuing system. We
start with analytic approaches that rely on statistical modeling and try
to provide strong categorization and ordering of various scheduling policies according to their fairness. Following that we provide an overview
of recent advancements that rely on simulations and use high resolution
analysis to extract fairness information from realistic job traces. As a
conclusion to this article, we propose a new direction for research. We
propose a novel multifaceted fairness approach, i.e., a combination of different fairness models inside a single system, that could better capture
the heterogeneous fairness-related requirements of different users in the
system. It could serve as a solution to the shortcomings of some of the
methods presented in this paper.
1 Introduction
Job scheduling is a very active research field that has progressed significantly in the last 20 years [15]. One aspect that did not change significantly over the years is a particular disconnect between theory and practical applications, as has been repeatedly noted in published works [14, 17, 37].
One research topic that tries to bridge the world of theory and the practical
applications is the study of fairness. In the context of job scheduling/queuing we
are concerned with two main types of fairness: seniority fairness, which takes into account the position of the customer/user in the queue, and proportional fairness, which takes into account the length/size of the users’ requests [8]. To demonstrate this distinction, let us consider a model situation from a physical queuing system: “Mr. Short arrives at the supermarket counter with only a single item in his shopping cart. Directly in front of him in the queue he finds Mrs. Long with a completely full shopping cart.”
From one point of view, it would make sense to allow Mr. Short to overtake Mrs. Long in the queue, as he only has a single shopping item and will therefore delay Mrs. Long for a very short time. Conversely, processing Mrs. Long’s shopping cart would delay Mr. Short significantly. This decision is based on proportional fairness, as we judge the situation in proportion to the length/size of the requests.
The second approach takes into account that Mrs. Long has already been waiting in the queue when Mr. Short arrived. As we do not have additional information on how long Mrs. Long has already been waiting, it makes more sense to maintain the order of the queue and not allow Mr. Short to overtake Mrs. Long. This approach falls into the category of seniority fairness, as we try to maintain the seniority (order) of the users in the system.
As these two types of fairness are strictly contradicting (seniority and proportionality cannot be maintained at the same time, unless all users arrive in the order dictated by proportional fairness), determining which
type of fairness to include in a scheduling system is a very important step. One
possible approach is to experimentally measure the psychological impacts of
waiting in queuing systems. These studies show that customers have a strong bias
towards the perceived fairness in a system [31, 32], even to the point of preferring
a queue configuration which provides worse performance characteristics, i.e., they
are willing to wait longer if they perceive that they are being treated fairly with respect to other users in the system. These studies also show that users have a particular distaste for multi-queue configurations where the queues are not processed in a round-robin fashion. This is certainly distressing, as multi-queue configurations are present in many production systems.
In this paper we provide an overview of methods for the analysis, measurement and classification of fairness in queuing systems. In particular, we concentrate on methods that are applicable to production systems. We base this distinction on our first-hand experience [23] with the implementation of these methods in the Czech National Grid – MetaCentrum [29]. We conclude this paper with a proposal for a new research direction: a novel multifaceted approach to fairness management that combines different fairness models to accommodate the different requirements of various users in the system.
2 Analytical Approaches to Fairness
Analytical approaches to fairness analysis rely on heavily sanitized models of
the systems, such as M/M/1 [34] or M/GI/1 [46, 43]. Both models represent
a single queue system with job arrivals modeled using a Poisson process [20].
In the case of M/M/1 the service times of jobs have an exponential distribution; in the case of M/GI/1 the service times of jobs have a general (unknown) distribution.
Under such simplification, analytical approaches using statistical analysis can
provide strong categorization and/or ordering of scheduling policies according
to the defined fairness model.
2.1 Proportional Fairness
Proportional fairness relates to the fairness of the system with respect to a particular parameter of a job. Most analytical approaches are concerned with the steady-state response time (length) of a job.

Wierman [46, 43–45] proposes a criterion based on the slowdown S(x) = T(x)/x [16, 9], where x is the size of a job and T(x) is its steady-state response time (length). This criterion classifies scheduling policies into fair and unfair based on whether the expected slowdown for the class of jobs of size x is proportional to the load of the system ρ under the classified policy: E[S(x)]_P = 1/(1 − ρ). Only if the expected slowdown E[S(x)]_P is proportional to the system load ρ for all job size classes x is the scheduling policy considered fair. This rules out any scheduling policies that are either non-preemptive or do not make decisions based on the job size.
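To make the criterion concrete, the sketch below is a minimal illustration of ours (not code from the paper): it estimates the expected slowdown per job-size class from simulated (size, response time) samples measured under load ρ and checks whether each class stays close to 1/(1 − ρ); the sample format, size classes and tolerance are assumptions made only for this example.

from collections import defaultdict

def classify_proportional_fairness(samples, rho, tolerance=0.1):
    """Classify a policy as proportionally fair if E[S(x)] is close to
    1/(1 - rho) for every observed job-size class x.

    samples   -- iterable of (size, response_time) pairs measured under load rho
    rho       -- system load (0 < rho < 1)
    tolerance -- allowed relative deviation from the ideal value (assumption)
    """
    ideal = 1.0 / (1.0 - rho)
    slowdowns = defaultdict(list)
    for size, response in samples:
        slowdowns[size].append(response / size)   # S(x) = T(x) / x

    per_class = {x: sum(v) / len(v) for x, v in slowdowns.items()}
    fair = all(abs(s - ideal) / ideal <= tolerance for s in per_class.values())
    return fair, per_class

# Example: jobs of size 1 and 4 under load rho = 0.5 (ideal slowdown = 2).
fair, detail = classify_proportional_fairness(
    [(1, 2.1), (1, 1.9), (4, 8.2), (4, 7.9)], rho=0.5)
print(fair, detail)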
2.2 Temporal Fairness
Temporal fairness relates to the seniority of the jobs in the system, that is, jobs
that arrived earlier should be satisfied before jobs that arrived later.
Wierman [44, 45] proposes a politeness criterion defined as the fraction of a job's response time during which the seniority of the job was respected. Clearly, for the First Come First Served (FCFS) policy Pol(x)_FCFS = 1, as FCFS always respects the seniority of jobs (see Fig. 1). A policy is determined to be polite if E[Pol(x)]_P = 1 − ρ.
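As an informal illustration of the politeness idea (our simplified reading, not the formal definition from [44, 45]), the sketch below works on a discrete-time, single-resource schedule and computes, for each job, the fraction of its response time during which no later-arriving job was in service; the slot-based schedule representation is an assumption of the example.

def politeness(arrival, completion, schedule):
    """Approximate Pol(i) for each job on one resource in discrete time.

    arrival    -- dict: job -> arrival slot
    completion -- dict: job -> completion slot (exclusive)
    schedule   -- list where schedule[t] is the job served in slot t, or None
    """
    pol = {}
    for job, a in arrival.items():
        d = completion[job]
        respected = sum(
            1 for t in range(a, d)
            if schedule[t] is None or arrival[schedule[t]] <= a
        )
        pol[job] = respected / (d - a)   # fraction of response time
    return pol

# FCFS example: no later-arriving job ever runs before an earlier one, so Pol = 1.
arr = {"Job1": 0, "Job2": 0}
comp = {"Job1": 1, "Job2": 2}
print(politeness(arr, comp, ["Job1", "Job2"]))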
[Figure: Gantt-style schedule of Job1–Job9 on resources Resc1 and Resc2]
Fig. 1. An example of a polite schedule (following the FCFS policy), constructed using jobs from Table 1.

Job    Arrival  Runtime  Owner
Job1   1        1        Green (dashed)
Job2   1        1        Green (dashed)
Job3   1        1        Purple (solid)
Job4   1        2        Orange (dotted)
Job5   3        1        Purple (solid)
Job6   3        2        Orange (dotted)
Job7   5        1        Green (dashed)
Job8   5        1        Green (dashed)
Job9   5        1        Purple (solid)
Table 1. Job information from schedule example (see Fig. 1).
2.3 Combined Approaches
The Resource-Allocation Queuing Fairness Measure (RAQFM) [34, 7, 33] is based on the notion that all users in the system are equal and therefore should receive an equal share of resources at each point in time. Based on this notion, the measure defines the resulting discrimination of a particular job i as D_i = ∫_{a_i}^{d_i} (s_i(t) − 1/N(t)) dt. That is, the discrimination of a particular job is defined as the difference between the service received s_i(t) and the service desired 1/N(t) (where N(t) represents the number of concurrent jobs at that point in time), integrated over time (from arrival to departure). For an example of a user-aware schedule constructed using RAQFM see Fig. 2.
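A minimal discrete-time sketch of the RAQFM discrimination, under simplifying assumptions of ours (unit-length time slots, a single resource, and a service rate s_i(t) of either 0 or 1 read from a slot-based schedule):

def raqfm_discrimination(arrival, departure, schedule):
    """D_i = sum over slots in [a_i, d_i) of (s_i(t) - 1/N(t)).

    arrival, departure -- dicts: job -> slot of arrival / departure (exclusive)
    schedule           -- list; schedule[t] is the job served in slot t, or None
    """
    def n_active(t):
        return sum(1 for j in arrival if arrival[j] <= t < departure[j])

    disc = {}
    for job in arrival:
        total = 0.0
        for t in range(arrival[job], departure[job]):
            service = 1.0 if schedule[t] == job else 0.0
            total += service - 1.0 / n_active(t)
        disc[job] = total
    return disc

# Two jobs arriving together, served one after another on a single resource.
arr = {"JobA": 0, "JobB": 0}
dep = {"JobA": 1, "JobB": 2}
print(raqfm_discrimination(arr, dep, ["JobA", "JobB"]))   # {'JobA': 0.5, 'JobB': -0.5}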
[Figure: schedule on resources Resc1 and Resc2 with per-job discrimination values 1/4, 1/4, −1/12, −1/3, −1/3, 1/6, 1/6, −5/6, 1/6]
Fig. 2. An example of a user-aware schedule constructed using RAQFM. Values inside the jobs are the calculated discrimination amounts. This particular schedule provides reasonably low variance across users: green −1/6, orange 1/12, purple −1/2 (see Section 4.1).
The Discrimination Frequency Measure (DF) [36] is based on the two previously mentioned principles of fairness. The seniority principle is captured by one formula: n_i = |{j : a_j ≥ a_i ∧ d_j ≤ d_i}|, that is, the number of jobs that arrived no earlier than job i but departed no later than job i. The proportionality principle is captured by another formula: m_i = |{j : d_i ≥ d_j > a_i ∧ s'_j(a_i) ≥ s_i}|, that is, the number of jobs that at the arrival of job i have at least as much remaining service requirement as i and depart no later than job i. The discrimination frequency of a particular job is then DF_i = m_i + n_i (see Fig. 3).
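The following sketch computes n_i, m_i and DF_i for a non-preemptive single-resource schedule; the non-preemptive assumption (so that the remaining service of a job at time t can be derived from its start and departure times) and the exclusion of the job itself from both counts are simplifications made only for this illustration.

def discrimination_frequency(jobs):
    """jobs: dict job -> (arrival, start, departure, size) for a
    non-preemptive schedule; size = departure - start is the service demand."""
    def remaining(j, t):
        a, s, d, size = jobs[j]
        if t <= s:
            return size           # not started yet
        return max(d - t, 0)      # running or already finished

    df = {}
    for i, (a_i, _, d_i, size_i) in jobs.items():
        n_i = sum(1 for j, (a_j, _, d_j, _) in jobs.items()
                  if j != i and a_j >= a_i and d_j <= d_i)
        m_i = sum(1 for j, (_, _, d_j, _) in jobs.items()
                  if j != i and a_i < d_j <= d_i and remaining(j, a_i) >= size_i)
        df[i] = n_i + m_i
    return df

# Job2 arrives after Job1 but is served first, so Job1 is discriminated.
jobs = {"Job1": (0, 3, 5, 2), "Job2": (1, 1, 3, 2)}
print(discrimination_frequency(jobs))   # {'Job1': 2, 'Job2': 0}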
[Figure: schedule of Job1–Job9 on resources Resc1 and Resc2]
Fig. 3. An example of a schedule generated using the discrimination frequency measure. It is equivalent to a polite schedule, as this schedule does not contain any discriminated jobs.

The Slowdown Queuing Fairness Measure (SQF) [6] is based on the slowdown metric [16, 9], assuming that a system is only fair when all jobs in the system receive the same slowdown: T_i = c·x_i, that is, the response time is equal to the size of the job multiplied by a constant, for all jobs i in the system. The individual discrimination of a job is then the deviation from this ideal state, D_i = T_i − c·x_i. SQF is then expressed as SQF = Σ_{i=1}^{N} (c·x_i − T_i)² (see Fig. 4).
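A direct transcription of the SQF formula into code; the job data in the example are made up for illustration:

def sqf(jobs, c):
    """jobs: list of (size, response_time) pairs; c: target slowdown constant.

    Returns the per-job discrimination D_i = T_i - c*x_i and the aggregated
    SQF = sum_i (c*x_i - T_i)^2.
    """
    discrimination = [t - c * x for x, t in jobs]
    return discrimination, sum((c * x - t) ** 2 for x, t in jobs)

# Jobs with sizes 1 and 2; with c = 1.5 the second job is slightly favoured.
print(sqf([(1, 2.0), (2, 2.5)], c=1.5))   # ([0.5, -0.5], 0.5)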
[Figure: schedule on resources Resc1 and Resc2 with per-job slowdown values 1, 1, 2, 1, 1.5, 1.5, 2, 1, 2]
Fig. 4. An example of a schedule generated using SQF measure. Values represent the computed slowdown for each job. For c = 1.5, this schedule has SQF = 1.75.
2.4 Applicability of Analytical Approaches
Analytical approaches provide relatively simple criteria and measures that have very well-defined behavior. For systems with fairness requirements fitting one of these measures, they offer an easy implementation option with predictable results. However, as these metrics are based on heavily simplified models of queuing systems, they lack several important features. For one, these metrics are user-agnostic, that is, they assume equality of users and jobs and do not consider the possibility of a single user being represented by multiple jobs in the system. These metrics are also resource-agnostic, as they do not consider the possibility of different jobs requesting different amounts of resources, e.g., CPU cores.
3 Trace Based Approaches
Job traces represent the record of jobs that were processed by a production system, generally containing information about job arrivals, start times, completion
times, amounts of resources requested, etc. These traces can be used for the evaluation of scheduling policies directly in the context of a production system [13].
Job traces can be easily shared [11], which allows researchers to determine how
their new scheduling policies perform with respect to various real workloads.
3.1 Trace Analysis
Trace analysis provides an alternative approach to experimental psychological studies. Instead of complicated and expensive experiments, trace analysis relies on the extraction of session information from the provided traces [48]. A session is an uninterrupted interaction between the user and the resource management system, where the user actively waits for his/her jobs to finish and then submits more jobs. In this context the user's satisfaction correlates with the response time of jobs instead of with slowdown [39], which is otherwise the generally preferred metric.
3.2 Trace Modeling
Utilizing a pre-recorded workload, however, has its own shortcomings: due to the static nature of the workload, it cannot capture any dynamic behavior in the system. Trace modeling offers a solution to this problem with additional benefits, such as the ability to change the load of the system on demand [19, 28].

From the perspective of fairness evaluation, it is very important to create a model that realistically matches the behavior of users in the system. This covers everything from simple daily, weekly and yearly cycles, where users naturally submit fewer jobs during the nighttime, weekends and holidays [27, 40, 12], to the simulation of user sessions in the system [47]. Including these features in a workload model significantly improves the evaluation precision [38].
3.3 Applicability of Trace Approaches
Trace based analysis of user sessions provides a very different perspective on user behavior, as it essentially divides users into two categories, “interactive” and “non-interactive”, with users transitioning out of the interactive category as they reach their tolerance for waiting. In such an environment it makes sense to prefer users that are still in interactive mode, as they are willing to submit more jobs into the system.

Trace modeling, on the other hand, has a deceptively simple premise. Its purpose is to improve the quality and flexibility of job traces. While the premise itself is quite simple, the road to this goal is quite complicated, as it includes both the detection and modeling of workload cycles (daily, weekly, yearly) and session boundaries.
4 Simulation Based Approaches
Simulation based approaches try to improve the measurement precision by simulating a real system. This can be facilitated by specialized grid simulators [10,
22] or even by special simulation modes of production schedulers [1, 2].
In this context it is important to mention the commonly implemented fairness measure: fairshare. Fairshare is an ordering policy that dynamically changes the order of jobs based on the historical resource usage of their owners (see Fig. 5 and Fig. 6). A fairshare-based ordering policy is supported by many production resource management systems, such as PBS [30], TORQUE [3], Moab [2], Maui [1], Quincy [18] or the Hadoop Fair and Capacity Schedulers [5, 4].
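A toy sketch of a fairshare-style ordering (our own simplified illustration, not the actual implementation of any of the systems cited above): waiting jobs are ordered by the recorded historical usage of their owners, and usage is charged either when a job starts or when it completes, mirroring the two variants shown in Fig. 5 and Fig. 6.

class Fairshare:
    """Toy fairshare ordering: the user with the lowest recorded usage goes first."""

    def __init__(self):
        self.usage = {}   # user -> accumulated CPU-time

    def order(self, waiting_jobs):
        """waiting_jobs: list of (job_id, user, walltime); returns the jobs
        sorted so that jobs of the least-served user come first."""
        return sorted(waiting_jobs, key=lambda j: self.usage.get(j[1], 0.0))

    def account(self, user, walltime):
        """Charge the usage; call at job start or at job completion,
        depending on the chosen fairshare variant."""
        self.usage[user] = self.usage.get(user, 0.0) + walltime

fs = Fairshare()
fs.account("green", 2.0)                       # green already consumed 2 units
queue = [("Job6", "green", 1), ("Job7", "orange", 1)]
print(fs.order(queue))                         # orange's job is preferred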
[Figure: schedule of Job1–Job9 on resources Resc1 and Resc2]
Fig. 5. An example of a schedule generated using fairshare variant where jobs are accounted once completed.
[Figure: schedule of Job1–Job9 on resources Resc1 and Resc2]
Fig. 6. An example of a schedule generated using fairshare variant where jobs are accounted once started.
4.1 Statistical Approaches
One possibility for analyzing fairness in a system is to choose a desired performance metric (wait time, response time, slowdown) and then analyze the statistical properties of this variable across users. In particular, we can look at the variance (E[x²] − E²[x])/(n − 1), the standard deviation √variance, the variance index CV = STD/E[x], or the fairness index 1/(1 + CV²) proposed by Vasupongayya [42].
The previously mentioned fairshare, as an ordering policy, is not directly usable for fairness measurement; however, there are fairshare-inspired metrics. For example, the variance-based normalized user wait time NUWT_o = TUWT_o / TUSA_o, where TUWT_o is the total summed wait time of a particular user and TUSA_o is the total summed resource usage of a particular user. Fairness of a system can then be expressed as F = Σ_o (UWT − NUWT_o)², where UWT is the average normalized wait time across all users [21] (see Fig. 7).
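A direct transcription of the NUWT-based fairness value; the per-user wait times and usages in the example are invented for illustration:

def nuwt_fairness(per_user):
    """per_user: dict user -> (total_wait_time, total_resource_usage).

    Returns the normalized user wait times NUWT_o and the aggregated
    fairness value F = sum_o (UWT - NUWT_o)^2, where UWT is the mean NUWT_o.
    """
    nuwt = {user: wait / usage for user, (wait, usage) in per_user.items()}
    uwt = sum(nuwt.values()) / len(nuwt)
    f = sum((uwt - value) ** 2 for value in nuwt.values())
    return nuwt, f

# Two users: green waited 30 s per 10 units of usage, orange only 10 s.
print(nuwt_fairness({"green": (30.0, 10.0), "orange": (10.0, 10.0)}))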
[Figure: schedule on resources Resc1 and Resc2 with per-job values 0, 0, 1, 1/2, 1, ∞, 1, 1/2, 1/3]
Fig. 7. An example of a schedule generated using the NUWT metric. Values represent the NUWT_o for the owner of the job at the job's start.
A slightly different approach is offered by the Fair Start Time (FST) metric [35, 26]. FST measures the unfairness caused by later arriving jobs on the start time of a currently waiting job by constructing a schedule under the assumption that no later jobs arrive. FST then represents the difference between the computed start time and the actual start time of the job.
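A sketch of the FST idea under strong simplifying assumptions of ours (a single machine and plain FCFS execution): the fair start time of a job is obtained by replaying the schedule with all later arrivals removed. Under plain FCFS later arrivals can never delay an earlier job, so the example yields zero; a policy that lets later jobs jump ahead (e.g., backfilling) would be plugged in instead of the FCFS replay to observe non-zero values.

def fcfs_start_times(jobs):
    """jobs: list of (job_id, arrival, runtime); single machine, FCFS order."""
    start, clock = {}, 0
    for job_id, arrival, runtime in sorted(jobs, key=lambda j: j[1]):
        clock = max(clock, arrival)
        start[job_id] = clock
        clock += runtime
    return start

def fair_start_time_deviation(jobs, job_id):
    """FST deviation of job_id: actual start minus the start it would have had
    if no job arriving after it existed."""
    arrival = {j: a for j, a, _ in jobs}
    actual = fcfs_start_times(jobs)[job_id]
    without_later = [j for j in jobs if j[1] <= arrival[job_id]]
    fair = fcfs_start_times(without_later)[job_id]
    return actual - fair

jobs = [("Job1", 0, 2), ("Job2", 1, 1)]
print(fair_start_time_deviation(jobs, "Job1"))   # 0 under plain FCFS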
4.2 High Resolution Approaches
All the methods we have presented until now are generally designed to provide a single comparable value. While this may be sufficient, recent advancements show that a significant amount of information is lost when such an approach is chosen. High resolution analysis [25] has recently shown that this information can be very important for policy decision making, and while the results of high resolution analysis are certainly harder to analyze, quantitative and qualitative comparisons are still possible [24].

Even more importantly, the high resolution approach enables completely novel measures that would lose their meaning when represented as a single number. One of these measures is the Expected End Time (EET) measure [41]. This measure builds on the notion that each user entering the system has an expectation of the amount of resources he/she will receive at any time. By modeling this expectation, this measure is capable of computing the Expected End Time for each of the jobs submitted into the system by that user. EET can then be compared with the actual end time of the particular job, and cases where the EET was not achieved can be plotted using a heatmap; for an example of such a heatmap, see Fig. 8.
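The exact EET model is defined in [41]; purely for illustration, the sketch below uses a crude expectation of ours (every user is entitled to an equal share of the machine from the moment of submission) and counts, per user and per time bucket, the jobs whose actual end time exceeded that expectation; this is the kind of matrix the heatmap in Fig. 8 visualises.

from collections import defaultdict

def eet_violation_matrix(jobs, total_cpus, active_users, bucket=3600):
    """jobs: list of (user, submit, cpu_seconds, actual_end).

    Assumes (for illustration only) that every user expects an equal share of
    total_cpus, i.e. EET = submit + cpu_seconds / (total_cpus / active_users).
    Returns {user: {time_bucket: number_of_violated_EETs}}.
    """
    share = total_cpus / active_users
    violations = defaultdict(lambda: defaultdict(int))
    for user, submit, cpu_seconds, actual_end in jobs:
        eet = submit + cpu_seconds / share
        if actual_end > eet:
            violations[user][int(actual_end // bucket)] += 1
    return {u: dict(b) for u, b in violations.items()}

# One 8-CPU-hour job of user "green" on a 4-CPU machine shared by 2 users.
jobs = [("green", 0, 8 * 3600, 6 * 3600)]
print(eet_violation_matrix(jobs, total_cpus=4, active_users=2))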
4.3 Applicability of Simulation Approaches
Simulation approaches lie at the exact opposite end of the spectrum from analytical approaches. Where analytical approaches provide metrics that are simple to analyze but hard to apply to more complex systems, simulation approaches provide very good behavior matching for the evaluated system. Their results can, however, be very hard to analyze [41]. Even when statistical and high resolution approaches are employed, the post-analysis of these results can be highly nontrivial [25, 24]. For example, which approach is better, the one providing a better average but higher variance, or the one with a worse average but lower variance?
Fig. 8. Heatmaps showing the number of violated EETs; the shade represents the number of violated EETs, the x-axis represents time, and the y-axis represents users. The bottom part of the graph contains CPU and memory utilization histograms.
5 Conclusion
In this paper we have presented an overview of methods for classifying, measuring and facilitating the measurement of fairness. Analytical approaches offer the best solution for as-is deployment, as they offer very clearly defined semantics and simple evaluation. They are, however, also based on heavily simplified models of systems, which limits their scope. Trace based approaches provide an interesting alternative to psychological analysis of fairness, and trace modeling provides an important foundation for high-precision simulations by closely modeling the behavior of users in the system. Simulation based approaches represent the most complex approach to fairness analysis; they can be used to precisely evaluate fairness in even the most complicated and dynamic systems. This complexity, however, comes with its own problems, as the results of simulations can be very hard to analyze.
5.1 A Case for Multifaceted Fairness
All the methods presented in this paper have one common shortcoming – they are all designed either to provide, or to facilitate the discovery of, an “ultimate” fairness model. While this simplifies the selection of a fitting fairness model for a production system, it does not address the issue of dynamically evolving requirements these systems usually face. Each change in the users’ workloads may fall into a category that is not well handled by the selected fairness model. In such a case, the user can either be heavily penalized for his/her new workload, or, even worse, the new workload may cause quality deterioration for all other users in the system.
For this reason, we propose a new research avenue. Instead of concentrating on a single fairness model, we should research the possibility of combining a set of fairness models inside a single scheduler. This would allow the users to either select a fairness model that matches their expectations, or even allow the scheduler to determine this categorization automatically from the user's workload style. In such a model, when a user's workload changes, he/she would simply be reassigned into the proper fairness group. Identical behavior would be possible for users newly entering the system. If a completely new (currently unsupported) use case were encountered, one would only have to design a fairness group matching this particular use case and then integrate it into the framework. This would greatly simplify the current process of completely redesigning the entire fairness model.
We invite the reader to seek out our future publications that will be exploring
this idea further.
Acknowledgments. We highly appreciate the support of the Grant Agency of
the Czech Republic under the grant No. P202/12/0306.
References
1. Adaptive Computing Enterprises, Inc. Maui Scheduler Administrator’s Guide, version 3.2, January 2014. http://docs.adaptivecomputing.com.
2. Adaptive Computing Enterprises, Inc. Moab workload manager administrator’s
guide, version 7.2.6, January 2014. http://docs.adaptivecomputing.com.
3. Adaptive Computing Enterprises, Inc. TORQUE Admininstrator Guide, version
4.2.6, January 2014. http://docs.adaptivecomputing.com.
4. Apache.org. Hadoop Capacity Scheduler, January 2014. http://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html.
5. Apache.org. Hadoop Fair Scheduler, January 2014. http://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html.
6. B. Avi-Itzhak, E. Brosh, and H. Levy. Sqf: A slowdown queueing fairness measure.
Performance Evaluation, 64(9):1121–1136, 2007.
7. B. Avi-Itzhak, H. Levy, and D. Raz. A resource allocation queueing fairness measure: properties and bounds. Queueing Syst., 56(2):65–71, 2007.
8. B. Avi-Itzhak, H. Levy, and D. Raz. Quantifying fairness in queuing systems.
Probability in the Engineering and Informational Sciences, 22(04):495–517, 2008.
9. P. Brucker and P. Brucker. Scheduling algorithms, volume 3. Springer, 2007.
10. R. Buyya and M. Murshed. Gridsim: a toolkit for the modeling and simulation of
distributed resource management and scheduling for grid computing. Concurrency
and Computation: Practice and Experience, 14(13-15):1175–1220, 2002.
11. D. Feitelson. Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload.
12. D. Feitelson and E. Shmueli. A case for conservative workload modeling: Parallel
job scheduling with daily cycles of activity. In Modeling, Analysis Simulation of
Computer and Telecommunication Systems, 2009. MASCOTS ’09. IEEE International Symposium on, pages 1–8, Sept 2009.
13. D. G. Feitelson. Packing schemes for gang scheduling. In Job Scheduling Strategies
for Parallel Processing, pages 89–110. Springer, 1996.
14. D. G. Feitelson and L. Rudolph. Parallel job scheduling: Issues and approaches.
In Job Scheduling Strategies for Parallel Processing, pages 1–18. Springer, 1995.
15. D. G. Feitelson, L. Rudolph, and U. Schwiegelshohn. Parallel job scheduling—a
status report. In Job Scheduling Strategies for Parallel Processing, pages 1–16.
Springer, 2005.
16. D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory
and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel
Processing, pages 1–34. Springer, 1997.
17. E. Frachtenberg and D. G. Feitelson. Pitfalls in parallel job scheduling evaluation.
In Job Scheduling Strategies for Parallel Processing, pages 257–282. Springer, 2005.
18. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg.
Quincy: Fair scheduling for distributed computing clusters. In ACM SIGOPS
22nd Symposium on Operating Systems Principles, pages 261–276, 2009.
19. J. Jann, P. Pattnaik, H. Franke, F. Wang, J. Skovira, and J. Riordan. Modeling
of workload in mpps. In Job Scheduling Strategies for Parallel Processing, pages
95–116. Springer, 1997.
20. D. G. Kendall. Stochastic processes occurring in the theory of queues and their
analysis by the method of the imbedded markov chain. The Annals of Mathematical
Statistics, pages 338–354, 1953.
21. D. Klusáček and H. Rudová. Performance and fairness for users in parallel job
scheduling. In Job Scheduling Strategies for Parallel Processing, pages 235–252.
Springer, 2013.
22. D. Klusáček and H. Rudová. Alea 2: job scheduling simulator. In Proceedings
of the 3rd International ICST Conference on Simulation Tools and Techniques,
page 61. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2010.
23. D. Klusáček and Š. Tóth. On interactions among scheduling policies: Finding efficient queue setup using high-resolution simulations. In F. Silva, I. Dutra, and V. S.
Costa, editors, Euro-Par 2014, volume 8632 of LNCS, pages 138–149. Springer,
2014.
24. D. Krakov and D. G. Feitelson. Comparing performance heatmaps. In Job Scheduling Strategies for Parallel Processing. Citeseer, 2013.
25. D. Krakov and D. G. Feitelson. High-resolution analysis of parallel job workloads.
In Job Scheduling Strategies for Parallel Processing, pages 178–195. Springer, 2013.
26. V. J. Leung, G. Sabin, and P. Sadayappan. Parallel job scheduling policies to
improve fairness: a case study. Technical Report SAND2008-1310, Sandia National
Laboratories, 2008.
27. V. Lo and J. Mache. Job scheduling for prime time vs. non-prime time. In Cluster
Computing, 2002. Proceedings. 2002 IEEE International Conference on, pages 488–
493. IEEE, 2002.
28. U. Lublin and D. G. Feitelson. The workload on parallel supercomputers: modeling
the characteristics of rigid jobs. Journal of Parallel and Distributed Computing,
63(11):1105–1122, 2003.
29. MetaCentrum, January 2014. http://www.metacentrum.cz/.
30. PBS Works. PBS Professional 12.1, Administrator’s Guide, January 2014. http://www.pbsworks.com.
31. A. Rafaeli, G. Barron, and K. Haber. The effects of queue structure on attitudes.
Journal of Service Research, 5(2):125–139, 2002.
32. A. Rafaeli, E. Kedmi, D. Vashdi, and G. Barron. Queues and fairness: A multiple
study experimental investigation. Manuscript under review, 2005.
33. D. Raz, B. Avi-Itzhak, and H. Levy. Classes, priorities and fairness in queueing
systems. RUTCOR, Rutgers University, Tech. Rep. RRR-21-2004, 2004.
34. D. Raz, H. Levy, and B. Avi-Itzhak. A resource-allocation queueing fairness measure. In ACM SIGMETRICS Performance Evaluation Review, volume 32, pages
130–141. ACM, 2004.
35. G. Sabin, G. Kochhar, and P. Sadayappan. Job fairness in non-preemptive job
scheduling. In Parallel Processing, 2004. ICPP 2004. International Conference on,
pages 186–194. IEEE, 2004.
36. W. Sandmann. A discrimination frequency based queueing fairness measure with
regard to job seniority and service requirement. In Next Generation Internet Networks, 2005, pages 106–113. IEEE, 2005.
37. U. Schwiegelshohn. How to design a job scheduling algorithm. In Job Scheduling
Strategies for Parallel Processing, 2014.
38. E. Shmueli and D. Feitelson. Using site-level modeling to evaluate the performance
of parallel system schedulers. In Modeling, Analysis, and Simulation of Computer
and Telecommunication Systems, 2006. MASCOTS 2006. 14th IEEE International
Symposium on, pages 167–178, Sept 2006.
39. E. Shmueli and D. G. Feitelson. Uncovering the effect of system performance on
user behavior from traces of parallel systems. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2007. MASCOTS’07. 15th
International Symposium on, pages 274–280. IEEE, 2007.
40. E. Shmueli and D. G. Feitelson. On simulation and design of parallel-systems
schedulers: are we doing the right thing? Parallel and Distributed Systems, IEEE
Transactions on, 20(7):983–996, 2009.
41. Š. Tóth and D. Klusáček. User-aware metrics for measuring quality of parallel job
schedules. In Job Scheduling Strategies for Parallel Processing, 2014.
42. S. Vasupongayya and S.-H. Chiang. On job fairness in non-preemptive parallel job
scheduling. In IASTED PDCS, pages 100–105, 2005.
43. A. Wierman. Fairness and classifications. ACM SIGMETRICS Performance Evaluation Review, 34(4):4–12, 2007.
44. A. Wierman. Scheduling for today’s computer systems: Bridging theory and practice. PhD thesis, Carnegie Mellon University, 2007.
45. A. Wierman. Fairness and scheduling in single server queues. Surveys in Operations
Research and Management Science, 16(1):39–48, 2011.
46. A. Wierman and M. Harchol-Balter. Classifying scheduling policies with respect to
unfairness in an M/GI/1. In ACM SIGMETRICS Performance Evaluation Review,
volume 31, pages 238–249. ACM, 2003.
47. N. Zakay and D. G. Feitelson. Preserving user behavior characteristics in trace-based simulation of parallel job scheduling. 2014.
48. J. Zilber, O. Amit, and D. Talby. What is worth learning from parallel workloads?:
a user and session based analysis. In Proceedings of the 19th annual international
conference on Supercomputing, pages 377–386. ACM, 2005.
Part IV
Presentations
Fault Recovery Method with High Availability
for Practical Applications∗
Jaroslav Borecký, Pavel Vı́t, and Hana Kubátová
Department of Digital Design, Faculty of Information Technology
Czech Technical University in Prague, Technická 9, Prague, Czech Republic
{borecjar, pavel.vit, hana.kubatova}@fit.cvut.cz
Our research is focused on mission critical applications using SRAM-based Field Programmable Gate Arrays (FPGAs). The main goal is to reach higher availability and dependability and low power consumption using unreliable components (FPGAs), while respecting the highest safety levels required by strict Czech standards [1]. Our methodology is designed for fast applications and rapid prototyping of modular systems, which are useful for fast development thanks to their regular structure.

The methodology combines Concurrent Error Detection (CED) techniques, FPGA dynamic reconfigurations and our previously designed Modified Duplex System (MDS) architecture. The methodology tries to minimize area overhead. It is aimed at practical applications of modular systems, which are composed of blocks. We applied and tested it on the safety railway station system. The proposed method is based on static and partial dynamic reconfiguration of totally self-checking blocks, which allows a full recovery from a Single Event Upset (SEU).

The method is based on two independent FPGA boards with the same design, which decreases the development time. Each FPGA is divided into two main parts: a reconfiguration area (RA) and a static area (SA). The whole system is placed in reconfigurable partitions (RP) of the RA. The SA checks failure signals and immediately repairs soft errors in RPs. This reduces recovery time, because it uses partial reconfiguration often, while the whole FPGA is reconfigured only in critical situations. Every block is designed as TSC (totally self-checking), and the static area also satisfies the TSC property.

This paper was presented at DSD 2014. The main advantage is the usage of partial reconfiguration, which allows faster detection and correction of faults. Reconfiguring only one RP is faster than reconfiguring the whole FPGA. This increases availability and security with minimal area overhead. A smaller overhead leads to a smaller FPGA and lower power consumption. In comparison with TMR, it uses less area and is faster, cheaper, and has a shorter development time.
References
1. ČSN EN 50126, Czech Technical Norm, http://nahledy.normy.biz/nahled.php?i=59709, 2011.
∗ This research has been partially supported by the project SGS14/105/OHK3/1T/18.
Verification of Markov Decision Processes
using Learning Algorithms∗
Tomáš Brázdil1 , Krishnendu Chatterjee2 , Martin Chmelı́k2 , Vojtěch Forejt3 ,
Jan Křetı́nský2 , Marta Kwiatkowska3 , David Parker4 , and Mateusz Ujma3
1 Masaryk University, Brno, Czech Republic   2 IST Austria
3 University of Oxford, UK   4 University of Birmingham, UK
We present a general framework for applying machine-learning algorithms
to the verification of Markov decision processes (MDPs). The primary goal of
these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability,
which is a core property for verification, and is illustrated through two distinct
instantiations. The first assumes that full knowledge of the MDP is available,
and performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the
case where we may only sample the MDP, and yields probabilistic guarantees,
again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. The latter is the first extension of statistical
model checking for unbounded properties in MDPs. In contrast with other related techniques, our approach is not restricted to time-bounded (finite-horizon)
or discounted properties, nor does it assume any particular properties of the
MDP. We also show how our methods extend to LTL objectives. We present
experimental results showing the performance of our framework on several examples.
The paper has been accepted to ATVA 2014.
∗ This research was funded in part by the European Research Council (ERC) under
grant agreement 267989 (QUAREM), 246967 (VERIWARE) and 279307 (Graph
Games), by the EU FP7 project HIERATIC, by the Austrian Science Fund (FWF)
projects S11402-N23 (RiSE), S11407-N23 (RiSE) and P23499-N23, by the Czech
Science Foundation grant No P202/12/P612, by the People Programme (Marie Curie
Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013)
under REA grant agreement no. 291734, by EPSRC project EP/K038575/1 and by
the Microsoft faculty fellows award.
CEGAR for Qualitative Analysis of
Probabilistic Systems∗
Krishnendu Chatterjee, Martin Chmelı́k, and Przemyslaw Daca
IST Austria
We consider Markov decision processes (MDPs) which are a standard model
for probabilistic systems. We focus on qualitative properties for MDPs that can
express that desired behaviors of the system arise almost-surely (with probability 1) or with positive probability. We introduce a new simulation relation to
capture the refinement relation of MDPs with respect to qualitative properties,
and present discrete graph theoretic algorithms with quadratic complexity to
compute the simulation relation. We present an automated technique for assume-guarantee style reasoning for compositional analysis of MDPs with qualitative
properties by giving a counterexample guided abstraction-refinement approach
to compute our new simulation relation. We have implemented our algorithms
and show that the compositional analysis leads to significant improvements.
Compositional analysis and CEGAR. One of the key challenges in analysis
of probabilistic systems (as in the case of non-probabilistic systems) is the state
explosion problem, as the size of concurrent systems grows exponentially in
the number of components. One key technique to combat the state explosion
problem is the assume-guarantee style composition reasoning, where the analysis
problem is decomposed into components and the results for components are used
to reason about the whole system, instead of verifying the whole system directly.
This simple, yet elegant asymmetric rule is very effective in practice, especially with a counterexample guided abstraction-refinement (CEGAR) loop.
with a counterexample guided abstraction-refinement (CEGAR) loop.
Our contributions. In this work we focus on the compositional reasoning of
probabilistic systems with respect to qualitative properties, and our main contribution is a CEGAR approach for qualitative analysis of probabilistic systems.
We consider the fragment of pCTL∗ that is relevant for qualitative analysis, and
refer to this fragment as QCTL∗ . The details of our contributions are as follows:
1. To establish the logical relation induced by QCTL∗ we consider the logic
ATL∗ for two-player games and the two-player game interpretation of an
MDP where the probabilistic choices are resolved by an adversary. In case
of non-probabilistic systems and games there are two classical notions for
refinement, namely, simulation and alternating-simulation. We first show
that the logical relation induced by QCTL∗ is finer than the intersection
of simulation and alternating simulation. We then introduce a new notion
of simulation, namely, combined simulation, and show that it captures the
logical relation induced by QCTL∗ .
2. We show that our new notion of simulation, which captures the logic relation of QCTL∗ , can be computed using discrete graph theoretic algorithms
in quadratic time. We present a CEGAR approach for the computation of
combined simulation.
∗ The paper was accepted to CAV 2014.
From LTL to Deterministic Automata:
A Safraless Compositional Approach
Javier Esparza and Jan Křetı́nský
Institut für Informatik, Technische Universität München, Germany
IST Austria
Linear temporal logic (LTL) is the most popular specification language for
linear-time properties. In the automata-theoretic approach to LTL verification,
formulae are translated into ω-automata, and the product of these automata
with the system is analyzed. Therefore, generating small ω-automata is crucial
for the efficiency of the approach. In quantitative probabilistic verification, LTL
formulae need to be translated into deterministic ω-automata. Until recently,
this required proceeding in two steps: first translate the formula into a nondeterministic Büchi automaton (NBA), and then apply Safra’s determinization
or its variants. This is also the approach adopted in PRISM, a leading probabilistic
model checker.
Since automata produced in this way are often very large, we presented an
algorithm that directly constructs a generalized DRA (GDRA) for the fragment
of LTL containing only the temporal operators F and G [2]. The GDRA can
be either (1) de-generalized into a standard DRA, or (2) used directly in the
probabilistic verification process [1]. In both cases we get much smaller automata
for many formulae. For instance, the standard approach translates a conjunction
of three fairness constraints into an automaton with over a million states, while
the algorithm of [2] yields a GDRA with a single state, and a DRA with 462
states.
In this paper we present a novel approach able to handle full LTL, and even
the alternation-free linear-time µ-calculus. The approach is compositional: the
automaton is the parallel composition of a master automaton and an array of
slave automata, one for each G-subformula of the original formula. Intuitively,
the master monitors the formula that remains to be fulfilled and takes care
of checking safety and reachability properties. A slave for each subformula of
the form Gψ checks whether Gψ eventually holds, i.e., whether FGψ holds.
Experimental results show improvement in the sizes of the resulting automata
compared to existing methods.
The paper was accepted and presented at CAV 2014.
References
1. Krishnendu Chatterjee, Andreas Gaiser, and Jan Křetı́nský. Automata with generalized Rabin pairs for probabilistic model checking and LTL synthesis. In CAV,
pages 559–575, 2013.
2. Jan Křetı́nský and Javier Esparza. Deterministic automata for the (F,G)-fragment
of LTL. In CAV, pages 7–22, 2012.
Faster Existential FO Model Checking on Posets
Jakub Gajarský
Faculty of Informatics, Masaryk University
[email protected]
The model checking problem, i.e. the problem to decide whether a given
logical sentence is true in a given structure, is one of the fundamental problems
in theoretical computer science. For the familiar first order logic, there is a
well-established line of study of the model checking problem on combinatorial
structures, culminating in the recent result of Grohe, Kreutzer and Siebertz
(STOC 2014) for the class of nowhere-dense graphs.
In contrast, not much is known once we focus on finite algebraic structures.
Recently, Bova, Ganian and Szeider (LICS 2014) investigated the complexity of
the model checking problem for FO and partially ordered sets. They show that
the model checking problem for the existential fragment of FO can be solved in
time f(|φ|) · n^g(w), where n is the size of a poset and w its width, i.e., the size
of its largest antichain. In the parlance of parameterized complexity, this means
that the problem is FPT (fixed-parameter tractable) in the size of the formula,
but only XP in the width of the poset. The proof is a bit involved, and goes
by first showing that the model checking problem (for the existential fragment
of FO) is equivalent to the embedding problem for posets, and then reducing
the embedding problem to a suitable family of instances of the homomorphism
problem of certain semilattice structures.
In this talk we improve upon (and simplify) the result of Bova et al. by showing that the model-checking problem is FPT in both the size of the formula and
the width of the poset. We give two different, fixed-parameter algorithms solving
the embedding problem. The first algorithm is a natural, and easy to understand,
polynomial-time reduction to a CSP instance closed under min polymorphisms,
giving us an O(n^4) dependence of the running time on the size of the poset. The
second algorithm has even better, quadratic time complexity and works by reducing the embedding problem to a restricted variant of the multicoloured clique
problem, which is then efficiently solved.
To complement the previous fixed-parameter tractability results, we also investigate possible kernelization of the embedding problem for posets. We show
that the embedding problem (and therefore the existential FO model checking
problem) does not have a polynomial kernel, unless coNP ⊆ NP/poly, which is
thought to be unlikely. This means the embedding problem cannot be efficiently
reduced to an equivalent instance of size polynomial in the parameter.
Presented work is a joint collaboration with Petr Hliněný, Jan Obdržálek and
Sebastian Ordyniak accepted to ISAAC 2014.
Fully Automated Shape Analysis
Based on Forest Automata
Lukáš Holı́k, Ondřej Lengál,
Adam Rogalewicz, Jiřı́ Šimáček, and Tomáš Vojnar
FIT, Brno University of Technology, IT4Innovations Centre of Excellence,
Czech Republic
Forest automata (FAs) have recently been proposed as a tool for shape analysis
of complex heap structures. FAs encode sets of tree decompositions of heap
graphs in the form of tuples of tree automata. In order to allow for representing
complex heap graphs, the notion of FAs allowed one to provide user-defined FAs
(called boxes) that encode repetitive graph patterns of shape graphs to be used as
alphabet symbols of other, higher-level FAs. In the presented work, we describe
a newly developed technique of automatically learning the FAs to be used as
boxes that avoids the need of providing them manually. Further, we propose
a significant improvement of the automata abstraction used in the analysis. The
result is an efficient, fully-automated analysis that can handle even such complex data structures as skip lists, with performance comparable to state-of-the-art fully-automated tools based on separation logic, which, however, specialise
in dealing with linked lists only.
This presentation is based on a paper with the same name that appeared in
the proceedings of CAV 2013.
Acknowledgement. This work was supported by the Czech Science Foundation
(projects P103/10/0306, 13-37876P), the Czech Ministry of Education, Youth,
and Sports (project MSM 0021630528), the BUT FIT project FIT-S-12-1, and
the EU/Czech IT4Innovations Centre of Excellence project CZ.1.05/1.1.00/
02.0070.
Multi-objective Genetic Optimization for
Noise-Based Testing of Concurrent Software
Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, and Tomáš
Vojnar
IT4Innovations Centre of Excellence, FIT, Brno University of Technology, Czech Rep.,
{ihruba, krena, iletko, ipluhackova, vojnar}@fit.vutbr.cz
Testing of multi-threaded programs is demanding work due to the many
possible thread interleavings one should examine. The noise injection technique
helps to increase the number of thread interleavings examined during repeated
test executions provided that a suitable setting of noise injection heuristics is
used. The problem of finding such a setting, i.e., the so called test and noise
configuration search problem (TNCS problem), is not easy to solve according
to ”Testing of Concurrent Programs Using Genetic Algorithms.” (Hrubá, V.,
Křena, B., Letko, Z., and Vojnar, T., SSBSE’12). In this paper, we show how
to apply a multi-objective genetic algorithm (MOGA) to the TNCS problem. In
particular, we focus on generation of TNCS solutions that are suitable for regression testing where tests are executed repeatedly. Consequently, we are searching
for TNCS candidate solutions that cover a high number of distinct interleavings
(especially those which are rare) and, at the same time, provide stable results.
To achieve this goal, we study suitable metrics and ways how to suppress effects
of non-deterministic thread scheduling on the proposed MOGA-based approach.
We also discuss a choice of a MOGA and its parameters suitable for our setting.
Finally, we show on a set of benchmark programs that our approach provides
better results when compared to the commonly used random approach as well
as to the previously proposed use of a single-objective genetic approach.
The presentation is based on the paper ”Multi-objective Genetic Optimization for Noise-Based Testing of Concurrent Software” (Hrubá, V., Křena, B.,
Letko, Z., Pluháčková, H., and Vojnar, T., SSBSE’14).
Acknowledgement. We thank Shmuel Ur and Zeev Volkovich for many valuable
comments on the work presented in this paper. The work was supported by
the Czech Ministry of Education under the Kontakt II project LH13265, the
EU/Czech IT4Innovations Centre of Excellence project CZ.1.05/1.1.00/02.0070,
and the internal BUT projects FIT-S-11-1 and FIT-S-12-1. Zdeněk Letko was
funded through the EU/Czech Interdisciplinary Excellence Research Teams Establishment project (CZ.1.07/2.3.00/30.0005).
On Interpolants and Variable Assignments∗
Pavel Jancik2 , Jan Kofroň2 , Simone Fulvio Rollini1 , and Natasha Sharygina1
1 University of Lugano, Switzerland, {name.surname}@usi.ch
2 D3S, Faculty of Mathematics and Physics, Charles University, Czech Rep., {name.surname}@d3s.mff.cuni.cz
Craig interpolants are widely used in program verification as a means of
abstraction. For propositional logic the interpolants can be computed by the well-established McMillan’s and symmetric Pudlák’s interpolation systems, which are
generalized by the Labeled Interpolation Systems (LISs) (D’Silva, 2010).
In the area of Abstract Reachability Trees, resp. Abstract Reachability Graphs (ARGs), interpolants play an important role. In an ARG, each graph node has a label assigned (representing an over-approximation of the program states reachable at the node). The node labels are typically derived from (node) interpolants. A safe,
complete, and well-labeled ARG can be used to show correctness of the corresponding program. In order to obtain a well-labeled ARG, the computed node
interpolants have to be inductive.
To compute node interpolants the ARG has to be first converted into a formula, which is then passed to a solver. If the formula is satisfiable, the program
is not safe and the error trace can be derived from the variable assignment. Otherwise node interpolants can be derived from the refutation proof; to this end the
input formula is split into two parts – A and B. Even though it is possible to use a standard interpolation system, this suffers from various drawbacks; the interpolant over-approximates all states on the boundary between A and B, which can include many ARG nodes. Furthermore, the shared variables occurring in the interpolant may not be in scope at a given node; thus post-processing steps (involving, e.g., quantifier elimination) are needed to derive a node label from
the interpolant.
To face the aforementioned issues, (i) we introduce the concept of Partial
Variable Assignment Interpolants (PVAIs) as a generalization of Craig interpolants. A variable assignment focuses the computed interpolant via restricting
the set of clauses taken into account during interpolation. In the scope of ARGs,
a variable assignment is used to exclude some paths from the set being considered
by the (node) interpolant, thus specializing the interpolant to the relevant ones,
i.e., only to those going via the corresponding node. Due to this specialization
it is possible to guarantee that unwanted out-of-scope variables (coming from
ignored paths) do not appear in the interpolant. Furthermore, (ii) we present a
way to compute PVAIs for propositional logic based on an extension of the LISs.
The extension uses variable assignment to omit irrelevant parts of the resolution proofs (thus reducing the interpolant size) as well as to modify the locality
constraints to omit the out-of-scope variables. Last, (iii) we show that the ex-
∗ This work is partially supported by the Grant Agency of the Czech Republic project 14-11384S, and Charles University Foundation grant 203-10/253297.
Finding Terms in Corpora for Many Languages
Adam Kilgarriff† , Miloš Jakubı́ček‡† , Vojtěch Kovář‡† ,
Pavel Rychlý‡† , and Vı́t Suchomel‡†
‡ NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
† Lexical Computing Ltd., Brighton, United Kingdom
[email protected],{xjakub,xkovar3,pary,xsuchom2}@fi.muni.cz
Term candidates for a domain, in a language, can be found by taking a corpus for the domain, and a reference corpus for the language, identifying the
grammatical shape of a term in the language, tokenising, lemmatising and POS-tagging both corpora, identifying and counting the items in each corpus which match the grammatical shape, and, for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus. Then, the items
with the highest frequency in the domain corpus in comparison to the reference
corpus will be the top term candidates. In this abstract we describe how we
addressed the stages above.
We make the simplifying assumption that terms are noun phrases (in their
canonical form, without leading articles: the term is base station, not the base
stations.) Then the task is to write a noun phrase grammar for the language.
Within the Sketch Engine we already have machinery for shallow parsing, based
on a ’Sketch Grammar’ of regular expressions over part-of-speech tags, written
in Corpus Query Language. Our implementation is mature, stable and fast,
processing million-word corpora in seconds and billion-word corpora in a few
hours. The machinery has most often been used to find <grammatical-relation,
word1, word2> triples for lexicography and related research. It was modified to
find, and count, the items having the appropriate shape for a term.
The challenge of identifying the best candidate terms for the domain, given
their frequency in the domain corpus and the reference corpus, is a variant on
the challenge of finding the keywords in a corpus. A good method is simply to
take the ratio of the normalised frequency of the term in the domain corpus to its
normalised frequency in a reference corpus. Candidate terms are then presented
to the user in a sorted list, with the best candidates – those with the highest
domain:reference ratio – at the top. Each item in the list is clickable: the user
can click to see a concordance for the term, in either the domain or the reference
corpus.
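A toy sketch of the ranking step (our illustration, not the Sketch Engine implementation; the per-million normalisation and the smoothing constant used to avoid division by zero are our own assumptions):

def rank_term_candidates(domain_counts, domain_size, ref_counts, ref_size, k=1.0):
    """Rank term candidates by the ratio of their normalised frequency in the
    domain corpus to that in the reference corpus.

    domain_counts, ref_counts -- dict: term -> raw frequency
    domain_size, ref_size     -- corpus sizes in tokens
    k                         -- smoothing constant (our assumption)
    """
    def per_million(count, size):
        return 1_000_000.0 * count / size

    scores = {
        term: (per_million(count, domain_size) + k)
              / (per_million(ref_counts.get(term, 0), ref_size) + k)
        for term, count in domain_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank_term_candidates(
    {"base station": 50, "time": 400}, 100_000,
    {"base station": 2, "time": 50_000}, 10_000_000))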
The current challenge we face is to get the correct canonical form of the term. In English one (almost) always wants to present each word in the term candidate in its canonical, dictionary form. But in gender-sensitive languages one does not. A gender-respecting lemma turns out to be necessary in such cases.
Another challenge is to keep the same processing chains for all corpora, regardless of their size. The reference corpus is processed in batch mode, and we hope
not to upgrade it more than once a year. The domain corpus is processed at
runtime. For term-finding, we have had to look carefully at the tools, separating
each out into an independent module, so that we can be sure of applying the
same versions throughout.
We have undertaken a first evaluation using the GENIA corpus, in which
all terms have been manually identified. Keyword and term extraction was performed to obtain the top 2000 keywords and top 1000 multi-word terms. Terms
manually annotated in GENIA as well as terms extracted by our tool were normalized before comparison (lower case, spaces and hyphens removed) and then
GENIA terms were looked up in the extraction results. 61 of the top 100 GENIA
terms were found by the system. The terms not found were not English words:
most were acronyms, e.g. EGR1, STAT-6.
We have built a system for finding terms in a domain corpus in Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, Spanish. We will
extend the coverage of languages and improve the system according to further
feedback from users in 2014.
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project
LM2010013.
This work was presented in: KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL. Finding Terms in Corpora for Many Languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: The Association for Computational Linguistics, 2014, pp. 53–56. ISBN 978-1-937284-75-6.
Hereditary properties of permutations are
strongly testable
Tereza Klimošová and Daniel Král’
Institute of Mathematics, University of Warwick, Coventry CV4 7AL, UK.
E-mail: [email protected], [email protected].
Property testing is a topic with growing importance with many connections
to various areas of mathematics and computer science. A property tester is an
algorithm that decides whether a large input object has the considered property
by querying only a small sample of it. Since the tester is presented with a part
of the input structure, it is necessary to allow an error based on the robustness
of the tested property of the input.
The most investigated area of property testing is testing graph properties.
One of the most significant results in this area is that of Alon and Shapira
asserting that every hereditary graph property, i.e., a property preserved by
taking induced subgraphs, is testable with respect to the edit distance. Hoppen,
Kohayakawa, Moreira and Sampaio obtained a similar result for permutations, showing that every hereditary property of permutations is weakly testable, i.e.,
testable with respect to the rectangular distance, and they asked whether the
same is true for a finer measure than the rectangular distance, the Kendall’s tau
distance. We resolve this problem in the positive way.
The Kendall’s tau distance is considered to correspond to the edit distance of
graphs. For two permutations π, σ on N elements, it is defined as the minimum
number of swaps of consecutive elements transforming π to σ, divided by N(N − 1)/2.
The Kendall’s tau distance of a permutation π from a permutation property P
is the minimum distance of π from an element of P.
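For concreteness, a small sketch computing this distance for permutations given as sequences of the values 1..N, using a straightforward O(N²) count of discordant pairs (our illustration, not the algorithm from the paper):

def kendall_tau_distance(pi, sigma):
    """Normalised Kendall's tau distance between two permutations of the same
    N elements, i.e. the fraction of element pairs ordered differently
    (equivalently, the minimum number of adjacent swaps divided by N(N-1)/2)."""
    n = len(pi)
    pos_pi = {value: index for index, value in enumerate(pi)}
    pos_sigma = {value: index for index, value in enumerate(sigma)}
    values = list(pi)
    discordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (pos_pi[values[i]] - pos_pi[values[j]])
           * (pos_sigma[values[i]] - pos_sigma[values[j]]) < 0
    )
    return discordant / (n * (n - 1) / 2)

print(kendall_tau_distance([1, 2, 3, 4], [1, 2, 4, 3]))   # one adjacent swap -> 1/6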
A property P is hereditary if it is closed under taking subpermutations, i.e.,
if π ∈ P, then any subpermutation of π is in P.
We say that a property P is strongly testable through subpermutations if there
exists a tester A that is presented with a random subpermutation of the input
permutation of size bounded by a function of ε independent of the input permutation and such that if the input permutation has the property P, then A
accepts with probability at least 1 − ε, and if the input permutation is ε-far from
P with respect to the Kendall’s tau distance, then A rejects with probability at
least 1 − ε.
We have proven that every hereditary permutation property is testable through
subpermutations with respect to the Kendall’s tau distance.
Unlike the algorithm of Hoppen et al., which is based on regularity decompositions of permutations, our algorithm is based on a direct combinatorial argument, which yields better dependence on the parameters of the problem.
The result is a joint work with Dan Král’ and was presented at SODA’14
(T. Klimošová and D. Král’: Hereditary properties of permutations are strongly
testable, in: Proc. SODA’14, SIAM, Philadelphia, PA, 2014 1164–1173).
Paraphrase and Textual Entailment Generation
in Czech
Zuzana Nevěřilová
Natural Language Processing Centre, Faculty of Informatics,
Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
The presentation covers automatic paraphrase and textual entailment generation. We focus on the Czech language, but most of the concepts and ideas are
language independent.
A paraphrase, i.e. a sentence with the same meaning, conveys a certain piece
of information with new words and new syntactic structures. Textual entailment,
i.e. an inference that humans will judge most likely true, can employ real-world
knowledge in order to make some implicit information explicit. Paraphrases can
also be seen as mutual entailments, i.e. if a sentence s1 entails a sentence s2 and
vice versa then s1 and s2 are paraphrases.
Paraphrase and textual entailment generation can support natural language
processing (NLP) tasks that simulate text understanding, e.g. text summarization (express an idea using different words), plagiarism detection (express somebody’s ideas using different words), or question answering (the answer can be
retrieved from a text if the question is reformulated using different words). In
addition, paraphrase generation is similar to the task of machine translation
except that only one language is processed.
We present a new system that generates paraphrases and textual entailments
from a given text in the Czech language. First, the process is rule-based, i.e. the
system analyzes the input text, produces its inner representation, transforms
it according to particular transformation rules, and generates new sentences.
The domain of the input text is not restricted, therefore the generation process
demands huge language resources. Second, the generated sentences are ranked
according to a statistical model and only the best ones are output. The models
are based on corpus data and on previous annotations.
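The generate-then-rank pipeline described above can be summarized by the following sketch; all names (analyzer, rules, score) are hypothetical placeholders and do not correspond to the actual system's interfaces.

    def generate_candidates(sentence, analyzer, rules):
        # Rule-based step: analyze the input, build its inner representation,
        # apply the transformation rules that match, and generate new sentences.
        representation = analyzer.parse(sentence)
        return [rule.transform(representation).render()
                for rule in rules if rule.matches(representation)]

    def best_outputs(sentence, analyzer, rules, score, keep=5):
        # Statistical step: rank the generated sentences with a corpus-based
        # model and output only the best ones.
        candidates = generate_candidates(sentence, analyzer, rules)
        return sorted(candidates, key=score, reverse=True)[:keep]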
The evaluation of whether a paraphrase or textual entailment is correct
is left to humans. For this purpose we designed an annotation game based on
a conversation between a detective (the human player) and his assistant (the
system). The result of such annotation is a collection of annotated pairs text–
hypothesis.
Currently, the system and the game are intended to collect data in the Czech
language. However, the idea can be applied to other languages. So far, we have
collected 3,321 text–hypothesis pairs. Of these pairs, 1,563 (47.06 %) were judged
correct, 1,238 (37.28 %) were judged incorrect entailments, and 520 (15.66 %)
were judged nonsensical or unknown.
The results were presented at CICLing 2014 and at the 17th International Conference on Text, Speech and Dialogue.
138
Minimizing Running Costs in Consumption
Systems
Petr Novotný
Faculty of Informatics, Masaryk University, Brno, Czech Republic
A standard approach to optimizing long-run running costs of discrete systems
is based on minimizing the mean-payoff, i.e., the long-run average amount of resources (“energy”) consumed per transition. More precisely, a system is modelled
as a finite directed graph C, where the set of states S corresponds to configurations, and transitions model the discrete computational steps. Each transition
is labeled by a non-negative integer specifying the amount of energy consumed
by a given transition. Then, to every run ϱ in C one can assign the associated
mean-payoff, which is the limit of average energy consumption per transition
computed for longer and longer prefixes of ϱ. A basic algorithmic task is to find
a suitable controller for a given system which minimizes the mean-payoff. Recently, the problem has been generalized by requiring that the controller should
also achieve a given linear time property ϕ, i.e., the run produced by a controller
should satisfy ϕ while minimizing the mean-payoff (Chatterjee et al., 2005). This
is motivated by the fact that the system is usually required to achieve some
functionality, and not just “run” with minimal average costs.
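For a positional (memoryless) controller in such a finite graph, the produced run is eventually periodic and its mean-payoff is simply the average cost of the cycle it ends up in; the following sketch (illustrative only, not the algorithm of the paper) computes this value.

    def mean_payoff(strategy, cost, start):
        # strategy: state -> next state (a positional controller),
        # cost: (state, next_state) -> non-negative energy consumption.
        visited = {}          # state -> position in the run
        run = []
        state = start
        while state not in visited:
            visited[state] = len(run)
            run.append(state)
            state = strategy[state]
        cycle = run[visited[state]:] + [state]   # the loop the run enters
        cycle_cost = sum(cost[(u, v)] for u, v in zip(cycle, cycle[1:]))
        return cycle_cost / (len(cycle) - 1)

    # Example: states a -> b -> a with costs 2 and 4, mean-payoff (2 + 4) / 2 = 3.
    print(mean_payoff({"a": "b", "b": "a"},
                      {("a", "b"): 2, ("b", "a"): 4}, "a"))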
However, the above approach inherently assumes that all transitions are always enabled, i.e., the amount of energy consumed by a transition is always
available. This is not always realistic. For example, an autonomous robotic device has a battery of finite capacity that has to be recharged periodically, and
the total amount of energy consumed between two successive charging cycles is
bounded by the capacity. Hence, a controller minimizing the mean-payoff must
obey this capacity restriction.
In this paper we study the controller synthesis problem for consumption
systems with a finite battery capacity, where the task of the controller is to
minimize the mean-payoff while satisfying the capacity restriction and preserving
the functionality of the system encoded by a given linear-time property. We
show that an optimal controller always exists, and it may either need only finite
memory or require infinite memory (it is decidable in polynomial time which of
the two cases holds). Further, we show how to compute an effective description
of an optimal controller in polynomial time. Finally, we consider the limit values
achievable by larger and larger battery capacity, show that these values are
computable in polynomial time, and we also analyse the corresponding rate of
convergence. To the best of our knowledge, these are the first results about
optimizing the long-run running costs in systems with bounded energy stores.
The presentation is based on the paper “Minimizing Running Costs in Consumption Systems” by T. Brázdil, D. Klaška, A. Kučera and P. Novotný, which
was accepted for publication in the proceedings of CAV 2014.
139
Testing Fault-Tolerance Methodologies in
Electro-mechanical Applications ⋆
Jakub Podivinsky and Zdenek Kotasek
Faculty of Information Technology, Brno University of Technology, Czech Republic
{ipodivinsky,kotasek}@fit.vutbr.cz
The aim of the presentation is to introduce a new platform under development for estimating the fault-tolerance quality of electro-mechanical applications
based on FPGAs. In several areas, such as aerospace and space applications or
automotive safety-critical applications, fault tolerant electro-mechanical (EM)
systems are highly desirable. In these systems, the mechanical part is controlled
by its electronic controller. Currently, a trend is to add even more electronics
into EM systems. We have identified two areas that we would like to focus on
in our research of fault-tolerant systems: The first one is that methodologies are
validated and demonstrated only on simple electronic circuits implemented in
FPGAs. However, in real systems different types of blocks must be protected
against faults at the same time and must communicate with each other. Therefore, a general evaluation platform for testing, analysis and comparison of standalone or cooperating fault-tolerance methodologies is needed. As for the second area of the research and the main contribution of our work, we feel that it
must be possible to check the reactions of the mechanical part of the system if
the functionality of its electronic controller is corrupted by faults. In the presentation, a working example of such an EM application, evaluated using our
platform, will be demonstrated: a mechanical robot and its electronic controller
in an FPGA. Different building blocks of the electronic robot controller allow us to
model different effects of faults on the whole mission of the robot (searching a
path in a maze). In the experiments, the mechanical robot is simulated in the
simulation environment where the effects of faults injected into its controller can
be seen. In this way, it is possible to differentiate between a fault that causes
a failure of the system and a fault that only decreases its performance.
Further extensions of the platform focus on its interconnection
with a functional verification environment working directly in the FPGA, which allows
the checking of the correctness of the system to be automated and sped up after
the injection of faults.
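The distinction between a fault that causes a failure and a fault that only degrades performance can be illustrated by the following sketch; the result records and their fields are hypothetical and only stand in for the data produced by the simulation environment.

    from collections import namedtuple

    RunResult = namedtuple("RunResult", "maze_solved steps")

    def classify_fault_impact(baseline, faulty):
        # Compare a fault-injection run against a fault-free baseline run.
        if not faulty.maze_solved:
            return "system failure"
        if faulty.steps > baseline.steps:
            return "performance degradation"
        return "no observable effect"

    # The faulty controller still solves the maze, but needs more steps.
    print(classify_fault_impact(RunResult(True, 120), RunResult(True, 180)))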
The original work was accepted at MEDIAN 2014 [1] and DSD 2014 [2].
References
1. Podivinsky, J., Simkova, M., Kotasek, Z.: Complex Control System for Testing Fault-Tolerance Methodologies. In: Proceedings of the Third Workshop on MEDIAN.
COST (2014).
2. Podivinsky, J., Cekan, O., Simkova, M., Kotasek, Z.: The Evaluation Platform for
Testing Fault-Tolerance Methodologies in Electro-mechanical Applications. In: 17th
Euromicro Conference on Digital Systems Design. Verona (2014).
⋆
This work was supported by the following projects: the national COST project LD12036,
the IT4Innovations Centre of Excellence project (ED1.1.00/02.0070), the EU COST Action IC1103
MEDIAN, and the BUT project FIT-S-14-2297.
140
A Simple and Scalable Static Analysis for Bound
Analysis and Amortized Complexity Analysis
Moritz Sinn, Florian Zuleger, and Helmut Veith
TU Vienna
Automatic methods for computing bounds on the resource consumption of
programs are an active area of research [11, 8, 3, 9, 13, 4, 1, 10, 2]. We present the
first scalable bound analysis for imperative programs that achieves amortized
complexity analysis. Our techniques can be applied for deriving upper bounds
on how often loops can be iterated as well as on how often a single or several
control locations can be visited in terms of the program input.
The majority of earlier work has focused on mathematically intriguing frameworks for bound analysis. These analyses commonly employ
general purpose reasoners such as abstract interpreters, software model checkers
or computer algebra tools and therefore rely on elaborate heuristics to work in
practice. Our work takes an orthogonal approach that complements previous research. We propose a bound analysis based on a simple abstract program model,
namely lossy vector addition systems with states. We present a static analysis
with four well-defined analysis phases that are executed one after another: program abstraction, control-flow abstraction, generation of a lexicographic ranking
function and bound computation.
A main contribution of our work is a thorough experimental evaluation. We
compare our approach against recent bound analysis tools [3, 1, 2, 5], and show
that our approach is faster and at the same time achieves better results. Additionally, we demonstrate the scalability of our approach by a comparison against
our earlier tool [13], which to the best of our knowledge represents the only tool
evaluated on a large publicly available benchmark of C programs. We show that
our new approach achieves better results while increasing the performance by
an order of magnitude. Moreover, using this benchmark we discuss how our tool
achieves amortized complexity analysis on real-world code.
Our key technical contribution is a new insight into how lexicographic ranking
functions can be used for bound analysis. Earlier approaches such as [3] simply
count the number of elements in the image of the lexicographic ranking function in order to determine an upper bound on the possible program steps. The
same idea implicitly underlies the bound analyses [6, 8, 7, 9, 13, 2, 5]. However,
this reasoning misses arithmetic dependencies between the components of the
lexicographic ranking function. In contrast, our analysis calculates how much a
lexicographic ranking function component is increased when another component
is decreased. This enables amortized analysis.
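A typical example of the kind of program that benefits from this reasoning (an illustration, not a benchmark taken from the paper) is a nested loop whose inner counter is only increased by the outer loop: counting the image of each ranking-function component separately suggests a quadratic bound, while tracking how much the inner component is increased when the outer one decreases yields a linear bound.

    import random

    def amortized_loop(n):
        i, j, inner_iterations = n, 0, 0
        while i > 0:
            i -= 1
            j += 1
            # Each decrement of j below is "paid for" by an increment above,
            # so the inner loop runs at most n times over the whole execution,
            # even though it is nested inside a loop that also runs n times.
            while j > 0 and random.random() < 0.5:   # non-deterministic exit
                j -= 1
                inner_iterations += 1
        return inner_iterations   # always <= n

    assert amortized_loop(1000) <= 1000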
The talk presents work [12] published at CAV 2014.
141
References
1. Elvira Albert, Puri Arenas, Samir Genaim, German Puebla, and Damiano Zanardini. Cost analysis of object-oriented bytecode programs. Theor. Comput. Sci.,
413(1):142–159, 2012.
2. Elvira Albert, Samir Genaim, and Abu Naser Masud. On the inference of resource
usage upper and lower bounds. ACM Trans. Comput. Log., 14(3):22, 2013.
3. Christophe Alias, Alain Darte, Paul Feautrier, and Laure Gonnord. Multidimensional rankings, program termination, and complexity bounds of flowchart
programs. In SAS, pages 117–133, 2010.
4. Diego Esteban Alonso-Blas and Samir Genaim. On the limits of the classical
approach to cost analysis. In SAS, pages 405–421, 2012.
5. Marc Brockschmidt, Fabian Emmes, Stephan Falke, Carsten Fuhs, and Juergen
Giesl. Alternating runtime and size complexity analysis of integer programs. In
TACAS, 2014. To appear.
6. Bhargav S. Gulavani and Sumit Gulwani. A numerical abstract domain based on
expression abstraction and max operator with application in timing analysis. In
CAV, pages 370–384, 2008.
7. Sumit Gulwani, Sagar Jain, and Eric Koskinen. Control-flow refinement and
progress invariants for bound analysis. In PLDI, pages 375–385, 2009.
8. Sumit Gulwani, Krishna K. Mehra, and Trishul M. Chilimbi. Speed: precise and
efficient static estimation of program computational complexity. In POPL, pages
127–139, 2009.
9. Sumit Gulwani and Florian Zuleger. The reachability-bound problem. In PLDI,
pages 292–304, 2010.
10. Jan Hoffmann, Klaus Aehlig, and Martin Hofmann. Multivariate amortized resource analysis. ACM Trans. Program. Lang. Syst., 34(3):14, 2012.
11. Martin Hofmann and Steffen Jost. Static prediction of heap space usage for first-order functional programs. In POPL, pages 185–197, 2003.
12. Moritz Sinn, Florian Zuleger, and Helmut Veith. A simple and scalable static
analysis for bound analysis and amortized complexity analysis. In Armin Biere
and Roderick Bloem, editors, CAV, volume 8559 of Lecture Notes in Computer
Science, pages 745–761. Springer, 2014.
13. Florian Zuleger, Sumit Gulwani, Moritz Sinn, and Helmut Veith. Bound analysis
of imperative programs with the size-change abstraction. In SAS, pages 280–297,
2011.
142
Optimal Temporal Logic Control for
Deterministic Transition Systems with
Probabilistic Penalties ⋆
Mária Svoreňová1 , Ivana Černá1 , and Calin Belta2
1 Faculty of Informatics, Masaryk University, Brno 60200, Czech Republic
[email protected], [email protected]
2 Dep. of Mechanical Engineering, Boston University, Boston, MA 02215, USA
[email protected]
While optimal control theory is a mature discipline, control of systems from
a temporal logic specification has gained considerable attention in the control literature only recently. The combination of the two areas, where the goal is to
optimize the behavior of a system subject to correctness constraints, is a largely
open area with a potentially high impact in applications. In this work, we employ
formal methods such as automata-based model checking and games to solve an
optimal temporal logic control problem motivated by robotic applications.
As an example, consider a mobile robot involved in a complex mission under
tight fuel and time constraints. We assume that such a system is modeled as a
weighted deterministic transition system required to satisfy a Linear Temporal
Logic (LTL) formula over its labels. Every state of the system is associated with
a time-varying, locally sensed penalty modeled as a Markov chain (MC) that
can be used to encode environmental phenomena with known statistics, such as
energy or time demands for the mobile robot that change according to traffic
load. Motivated by persistent surveillance robotic missions, our goal in this work
is to minimize the expected average cumulative penalty incurred between consecutive satisfactions of a desired property, while at the same time satisfying an
additional temporal logic constraint. We provide two solutions to this problem.
First, we derive a provably correct optimal strategy within the class of strategies
that do not exploit values of penalties sensed in real time, only the a priori known
transition probabilities of the penalties’ MCs. Second, by taking advantage of
locally sensing the penalties, we construct heuristic strategies that lead to lower
collected penalty, while still ensuring satisfaction of the LTL constraint.
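The optimization criterion can be illustrated on a single finite run as follows; the data layout (a list of visited states, their sensed penalties, and the set of surveillance states) is hypothetical and only makes the average cumulative penalty between consecutive satisfactions concrete.

    def average_penalty_between_visits(states, penalties, surveillance):
        # Penalties sensed at the surveillance states themselves are not
        # counted here (an illustrative choice).
        segment_totals, current, counting = [], 0, False
        for state, penalty in zip(states, penalties):
            if state in surveillance:
                if counting:
                    segment_totals.append(current)
                current, counting = 0, True
            elif counting:
                current += penalty
        if not segment_totals:
            return float("inf")   # the property was satisfied at most once
        return sum(segment_totals) / len(segment_totals)

    # Two consecutive surveillance visits with penalties 3 and 1 in between: 4.
    print(average_penalty_between_visits(
        ["s", "a", "b", "s"], [0, 3, 1, 2], {"s"}))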
The abstract is based on the following published work and its extension:
Svoreňová, M., Černá, I., Belta, C.: Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Constraints.
American Control Conference (ACC), 2013, pp. 4399–4404.
Svoreňová, M., Černá, I., Belta, C.: Optimal Temporal Logic Control
for Deterministic Transition Systems with Probabilistic Penalties. IEEE
Transactions on Automatic Control, accepted.
⋆
The work was partially supported at MU by grants GAP202/11/0312, LH11065, and
at BU by ONR grants MURI N00014-09-1051, MURI N00014-10-10952 and by NSF
grant CNS-1035588.
143
Understanding the Importance of Interactions
among Job Scheduling Policies ⋆
Šimon Tóth1 and Dalibor Klusáček2
1 Faculty of Informatics, Masaryk University, Brno, Czech Republic
2 CESNET a.l.e., Prague, Czech Republic
[email protected], [email protected]
Many studies in the past two decades focused on the problem of efficient
job scheduling in large computational systems. While many new scheduling algorithms have been proposed, mainstream resource management systems and
schedulers are still using only a limited set of scheduling policies. For example,
the core of the system is generally based on the simple First Come First Served
(FCFS) approach, while backfilling (a trivial optimization of FCFS to increase
utilization) is typically the most advanced option available. Since backfilling was
proposed as early as 1995, there is evidently some misunderstanding between the research community and system administrators concerning “what is
really important”.
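For readers unfamiliar with the terminology, the following sketch contrasts pure FCFS with a naive form of backfilling on a single scheduling pass; it is an illustration only (production variants such as EASY backfilling additionally guarantee that the job at the head of the queue is not delayed), and the Job record is a hypothetical stand-in for a real job description.

    from collections import namedtuple

    Job = namedtuple("Job", "name cpus")

    def schedule_pass(queue, free_cpus, backfilling=True):
        # Start jobs in submission order; pure FCFS stops at the first job that
        # does not fit, while backfilling lets smaller jobs jump ahead of it,
        # which is what increases utilization.
        started, waiting = [], []
        for job in queue:
            if job.cpus <= free_cpus and (backfilling or not waiting):
                free_cpus -= job.cpus
                started.append(job)
            else:
                waiting.append(job)
        return started, waiting, free_cpus

    # With 16 free CPUs, FCFS starts nothing, backfilling starts the small job.
    queue = [Job("big", 64), Job("small", 8)]
    print(schedule_pass(queue, 16, backfilling=False)[0])
    print(schedule_pass(queue, 16, backfilling=True)[0])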
In this work [3] — recently presented at the Euro-Par conference — we show
that the problem of operating a production scheduler is far more complex than
just choosing a proper scheduling algorithm. Using our experience from the Czech
National Grid Infrastructure MetaCentrum we explain several additional challenges that appear when searching for a functional solution. These problems are
related to the fact that real systems must meet far more complicated requirements than those that are typically considered in classical research papers [1].
In fact, production systems need to balance various policies that are set in
place to satisfy both resource providers and users. Among them — according to
our findings — the most important policies are often those that define how queues
are ordered and prioritized and what their corresponding limits are. Queue limits
define, e.g., the number of CPUs that can be used at a given moment by a given
class of jobs. The major problem in this area is that although many works
address these separate policies, e.g., fairshare for fair resource allocation, complex
interactions between policies are not properly discussed in the literature.
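A per-queue limit of the kind mentioned above can be checked in a few lines of code; the data layout is again hypothetical and only illustrates how such a limit enters a scheduling decision.

    def respects_queue_limit(job_cpus, queue_name, running_jobs, cpu_limits):
        # running_jobs: (queue_name, cpus) pairs of jobs already running,
        # cpu_limits: maximum number of CPUs a queue's jobs may occupy at once.
        used = sum(cpus for queue, cpus in running_jobs if queue == queue_name)
        return used + job_cpus <= cpu_limits[queue_name]

    # 96 CPUs already used by the "long" queue, limit 128: a 32-CPU job fits.
    print(respects_queue_limit(32, "long", [("long", 96), ("short", 8)],
                               {"long": 128, "short": 512}))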
In our work [3] we describe how to approach these interactions when developing site-specific policies. Notably, we describe how (priority) queues interact
with scheduling algorithms, fairshare, and anti-starvation mechanisms.
Importantly, we have considered a real-life problem where we were searching
for a new scheduling configuration for MetaCentrum. To achieve that, we have
used detailed, high-resolution simulations in the advanced job scheduling simulator Alea [2]. One of the most important findings was that a minor
⋆
This work was kindly supported by the LM2010005 project funded by the Ministry of Education, Youth, and Sports of the Czech Republic and by the grant
No. P202/12/0306 funded by the Grant Agency of the Czech Republic.
144
“conservative” modification of an existing system configuration, which was initially considered safe, may produce unforeseen chain reactions in the system, leading to much poorer performance. The newly developed configuration
for MetaCentrum, which has significantly increased its overall performance, is
a rather complex modification of the previous setup. The whole queue configuration has been modified, introducing new queues with new limits. At the same
time, fairness was emphasized, which required modifications of the queue ordering scheme. Finally, the overall throughput has been increased by optimizing
the existing job anti-starvation mechanism.
The newly developed configuration has been applied in production use within MetaCentrum's TORQUE resource manager since January 2014 without any major
problems. In fact, the number of utilized CPU hours has increased by 23% compared to the same period of time before the new solution was deployed. Also, the
number of processed jobs has increased by 87%. Importantly, even with the higher
throughput, job wait times remained decent; in fact, they decreased significantly, as shown in Fig. 1 (more jobs now wait for a shorter time than previously).
Fig. 1. Comparison of job waiting times in the second half of 2013 (old configuration) and the first half of 2014 (new configuration).
References
1. Eitan Frachtenberg and Dror G. Feitelson. Pitfalls in parallel job scheduling
evaluation. In Dror G. Feitelson, Eitan Frachtenberg, Larry Rudolph, and Uwe
Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, volume
3834 of LNCS, pages 257–282. Springer Verlag, 2005.
2. Dalibor Klusáček and Hana Rudová. Alea 2 – job scheduling simulator. In 3rd
International ICST Conference on Simulation Tools and Techniques. ICST, 2010.
3. Dalibor Klusáček and Šimon Tóth. On interactions among scheduling policies:
Finding efficient queue setup using high-resolution simulations. In Fernando Silva,
Inês Dutra, and Vı́tor Santos Costa, editors, Euro-Par 2014, volume 8632 of LNCS,
pages 138–149. Springer, 2014.
145
Author Index
Avros, R., 15
Barnat, J., 28
Belta, C., 143
Benáček, P., 40
Beneš, N., 28
Bezděk, P., 28
Blažek, R. B., 40
Borecký, J., 127
Brázdil, T., 128
Čejka, T., 40
Černá, I., 28, 143
Chatterjee, K., 128, 129
Chmelı́k, M., 128, 129
Daca, P., 129
Dvořák, M., 52
Esparza, J., 130
Forejt, V., 128
Gajarský, J., 131
Holı́k, L., 132
Hrubá, V., 15, 133
Jakubı́ček, M., 135
Jančı́k, P., 134
Kekely, L., 40
Kilgarriff, A., 135
Klimošová, T., 137
Klusáček, D., 144
Kofroň, J., 134
Kolář, D., 63
Kořenek, J., 52, 77
Košař, V., 77
Kotásek, Z., 140
Kovář, V., 135
Král’, D., 137
Křena, B., 15, 133
Křetı́nský, J., 128, 130
Kubátová, H., 40, 127
Kwiatkowska, M., 128
Lengál, O., 132
Letko, Z., 133
Matula, P., 63
Matyska, L., 101
Meduna, A., 89
Nevěřilová, Z., 138
Novotný, P., 139
Parker, D., 128
Pazúriková, J., 101
Pluháčková, H., 15, 133
Podivı́nský, J., 140
Rogalewicz, A., 132
Rollini, S. F., 134
Rychlý, P., 135
Sharygina, N., 134
Šimáček, J., 132
Sinn, M., 141
Soukup, O., 89
Suchomel, V., 135
Svoreňová, M., 143
Tóth, Š., 113, 144
Ujma, M., 128
Ur, S., 15
Veith, H., 141
Vı́t, P., 127
Vojnar, T., 15, 132, 133
Volkovich, Z., 15
Závodnı́k, T., 52
Zuleger, F., 141
Organisers
Faculty of Informatics
Masaryk University
Botanická 68a
Brno, Czech Republic
http://www.fi.muni.cz
Faculty of Information Technology
Brno University of Technology
Božetěchova 2
Brno, Czech Republic
http://www.fit.vutbr.cz
Workshop Sponsors
Petr Hliněný, Zdeněk Dvořák,
Jiřı́ Jaroš, Jan Kofroň, Jan Kořenek,
Petr Matula, Karel Pala (Eds.)
MEMICS 2014
Ninth Doctoral Workshop on Mathematical
and Engineering Methods in Computer Science
Printing and publishing: NOVPRESS s.r.o., nám. Republiky 725/15, 614 00 Brno
Edition: first
Year of publication: 2014
Typesetting: camera ready by paper authors and PC members, data conversion
and design by Jaroslav Rozman
Cover design: Tomáš Staudek
ISBN 978-80-214-5022-6
