MEMICS 2014

Petr Hliněný, Zdeněk Dvořák, Jiří Jaroš, Jan Kofroň, Jan Kořenek, Petr Matula and Karel Pala (Eds.)

MEMICS 2014
Ninth Doctoral Workshop on Mathematical and Engineering Methods in Computer Science
Telč, Czech Republic, October 17–19, 2014

Editors

Petr Hliněný, Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic
Zdeněk Dvořák, Faculty of Mathematics and Physics, Charles University, Ke Karlovu 3, Praha, Czech Republic
Jiří Jaroš, Faculty of Information Technology, Brno University of Technology, Božetěchova 2, Brno, Czech Republic
Jan Kofroň, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, Praha, Czech Republic
Jan Kořenek, Faculty of Information Technology, Brno University of Technology, Božetěchova 2, Brno, Czech Republic
Petr Matula, Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic
Karel Pala, Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic

Subject classification: Information technology
ISBN 978-80-214-5022-6
Company and university graphics are properties of their respective owners and are published as provided.

Preface

This volume contains the local proceedings of the 9th Doctoral Workshop on Mathematical and Engineering Methods in Computer Science (MEMICS 2014) held in Telč, Czech Republic, during October 17–19, 2014.

The aim of the MEMICS workshop series is to provide an opportunity for PhD students to present and discuss their work in an international environment. The scope of MEMICS is broad and covers many fields of computer science and engineering. In the year 2014, submissions were invited especially in the following (though not exclusive) areas:

– Algorithms, logic, and games,
– High performance computing,
– Computer aided analysis, verification, and testing,
– Hardware design and diagnostics,
– Computer graphics and image processing, and
– Artificial intelligence and natural language processing.

There were 28 submissions from 10 countries. Each submission was thoroughly evaluated by at least four Programme Committee members who also provided extensive feedback to the authors. Out of these submissions, 9 papers were selected for publication in LNCS post-proceedings, and 9 other papers for publication in these local proceedings.

In addition to regular papers, MEMICS workshops also invite PhD students to submit a presentation of their recent research results that have already undergone a rigorous peer review process and have been presented at a high quality international conference or published in a recognized journal. A total of 16 presentations out of 22 submissions from 6 countries were included into the MEMICS 2014 programme. Short abstracts of accepted presentations also appear in these local proceedings.

All of the contributed papers were presented by PhD students who received immediate feedback from their peers and the participating senior researchers. All students were encouraged to actively take part in the discussions, express their opinions, exchange ideas and compare methods, traditions and approaches between groups and institutions whose representatives were participating in the workshop.

The highlights of the MEMICS 2014 programme included six keynote lectures delivered by internationally recognized researchers. The full papers of these keynote lectures were also included for publication in the LNCS post-proceedings.
The speakers were:

– Gianni Antichi from University of Cambridge, who gave a talk on An Open-Source Hardware Approach for High Performance Low-Cost QoS Monitoring of VoIP Traffic,
– Derek Groen from University College London, who gave a talk on High-performance multiscale computing for modelling cerebrovascular bloodflow and nanomaterials,
– Jozef Ivanecký from European Media Laboratory, who gave a talk on Today's Challenges for Embedded ASR,
– Daniel Lokshtanov from University of Bergen, who gave a talk on Tree Decompositions and Graph algorithms,
– Michael Tautschnig from Queen Mary University of London, who gave a talk on Automating Software Analysis at Very Large Scale, and
– Stefan Wörz from University of Heidelberg, who gave a talk on 3D Model-Based Segmentation of 3D Biomedical Images.

The MEMICS tradition of best paper awards continued also in the year 2014. The best regular papers were selected during the workshop, taking into account their scientific and technical contribution together with the quality of presentation. The awards consisted of a diploma accompanied by a financial prize of roughly 400 Euro. The money was donated by Red Hat Czech Republic and by Y Soft, two of the MEMICS 2014 Industrial Sponsors.

The successful organization of MEMICS 2014 would not have been possible without generous help and support from the organizing institutions: Brno University of Technology and Masaryk University in Brno. We thank the Programme Committee members and the external reviewers for their careful and constructive work. We thank the Organizing Committee members who helped create a unique and relaxed atmosphere which distinguishes MEMICS from other computer science meetings. We also gratefully acknowledge the support of the EasyChair system and the great cooperation with the Lecture Notes in Computer Science team of Springer Verlag.

Brno, October 2014

Petr Hliněný, General chair of MEMICS 14
Zdeněk Dvořák, Jiří Jaroš, Jan Kofroň, Jan Kořenek, Petr Matula and Karel Pala, PC track chairs of MEMICS 14

Organisation

General Chair
Petr Hliněný, Masaryk University, Brno, Czech Republic

Programme Committee Co-Chairs
Zdeněk Dvořák, Charles University, Czech Republic
Jiří Jaroš, Brno University of Technology, Czech Republic
Jan Kofroň, Charles University, Czech Republic
Jan Kořenek, Brno University of Technology, Czech Republic
Petr Matula, Masaryk University, Czech Republic
Karel Pala, Masaryk University, Czech Republic

Programme Committee
Gianni Antichi, University of Cambridge, UK
Tomáš Brázdil, Masaryk University, Czech Republic
Markus Chimani, Osnabrück University, Germany
Jan Černocký, Brno University of Technology, Czech Republic
Eva Dokladalova, ESIEE Paris, France
Jiří Filipovič, Masaryk University, Czech Republic
Robert Ganian, Vienna University of Technology, Austria
Dieter Gollmann, TU Hamburg, Germany
Derek Groen, University College London, UK
Juraj Hromkovič, ETH Zürich, Switzerland
Ondřej Jakl, VŠB-TU Ostrava, Czech Republic
Hidde de Jong, INRIA, France
Zdeněk Kotásek, Brno University of Technology, Czech Republic
Lukasz Kowalik, University of Warsaw, Poland
Hana Kubátová, Czech Technical University in Prague, Czech Republic
Michal Laclavík, Slovak Academy of Sciences, Bratislava, Slovakia
Markéta Lopatková, Charles University in Prague, Czech Republic
Julius Parulek, University of Bergen, Norway
Maciej Piasecki, Wroclaw University of Technology, Poland
Geraint Price, Royal Holloway, University of London, UK
Viktor Puš, CESNET, Czech Republic
Ricardo J.
Rodrı́guez, Technical University of Madrid, Spain Adam Rogalewicz, Brno University of Technology, Czech Republic Cristina Seceleanu, MDH, Sweden Jiřı́ Srba, Aalborg University, Denmark VII Andreas Steininger, TU Wien, Austria Jan Strejček, Masaryk University, Czech Republic David Šafránek, Masaryk University, Czech Republic Ivan Šimeček, Czech Technical University in Prague, Czech Republic Petr Švenda, Masaryk University, Czech Republic Catia Trubiani, GSSI, Italy Pavel Zemčı́k, Brno University of Technology, Czech Republic Florian Zuleger, TU Wien, Austria Steering Committee Tomáš Vojnar, chair, Brno University of Technology, Brno, Czech Republic Milan Češka, Brno University of Technology, Brno, Czech Republic Zdeněk Kotásek, Brno University of Technology, Brno, Czech Republic Mojmı́r Křetı́nský, Masaryk University, Brno, Czech Republic Antonı́n Kučera, Masaryk University, Brno, Czech Republic Luděk Matyska, Masaryk University, Brno, Czech Republic Organizing Committee Radek Kočı́, chair, Brno University of Technology, Czech Republic Zdeněk Letko, Brno University of Technology, Czech Republic Jaroslav Rozman, Brno University of Technology, Czech Republic Hana Pluháčková, Brno University of Technology, Czech Republic Lenka Turoňová, Brno University of Technology, Czech Republic Additional Reviewers Kfir Barhum Hans-Joachim Boeckenhauer Yu-Fang Chen Pavel Čeleda Vojtěch Forejt Lukáš Holı́k Ivan Kolesár Jan Křetı́nský Sacha Krug VIII Julio Mariño František Mráz Mads Chr. Olesen Jakub Pawlewicz Martin Plátek Fernando Rosa-Velardo Václav Šimek Marek Trtı́k Table of Contents Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII I II III Invited Lectures—Abstracts Contributed Papers - Abstracts Regular Papers Boosted Decision Trees for Behaviour Mining of Concurrent Programs . . . Renata Avros, Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, Tomáš Vojnar, Zeev Volkovich, and Shmuel Ur 15 LTL model checking of Parametric Timed Automata . . . . . . . . . . . . . . . . . . Peter Bezděk, Nikola Beneš, Jiřı́ Barnat, and Ivana Černá 28 FPGA Accelerated Change-Point Detection Method for 100 Gb/s Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomáš Čejka, Lukáš Kekely, Pavel Benáček, Rudolf B. Blažek, and Hana Kubátová 40 Hardware Accelerated Book Handling with Unlimited Depth . . . . . . . . . . . Milan Dvořák, Tomáš Závodnı́k, and Jan Kořenek 52 Composite Data Type Recovery in a Retargetable Decompilation . . . . . . . Dušan Kolář and Peter Matula 63 Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vlastimil Košař and Jan Kořenek 77 Computational Completeness Resulting from Scattered Context Grammars Working Under Various Derivation Modes . . . . . . . . . . . . . . . . . . Alexander Meduna and Ondřej Soukup 89 Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Jana Pazúriková and Luděk Matyska IX A Case for a Multifaceted Fairness Model: An Overview of Fairness Methods for Job Queuing and Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
113 Šimon Tóth IV Presentations Fault Recovery Method with High Availability for Practical Applications . 127 Jaroslav Borecký, Pavel Vı́t, and Hana Kubátová Verification of Markov Decision Processes using Learning Algorithms . . . . 128 Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelı́k, Vojtěch Forejt, Jan Křetı́nský, Marta Kwiatkowska, David Parker, and Mateusz Ujma CEGAR for Qualitative Analysis of Probabilistic Systems . . . . . . . . . . . . . . 129 Krishnendu Chatterjee, Martin Chmelı́k, and Przemyslaw Daca From LTL to Deterministic Automata: A Safraless Compositional Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 Javier Esparza and Jan Křetı́nský Faster Existential FO Model Checking on Posets . . . . . . . . . . . . . . . . . . . . . . 131 Jakub Gajarský Fully Automated Shape Analysis Based on Forest Automata . . . . . . . . . . . 132 Lukáš Holı́k, Ondřej Lengál, Adam Rogalewicz, Jiřı́ Šimáček, and Tomáš Vojnar Multi-objective Genetic Optimization for Noise-Based Testing of Concurrent Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, and Tomáš Vojnar On Iterpolants and Variable Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 Pavel Jančı́k, Jan Kofroň, Simone Fulvio Rollini, and Natasha Sharygina Finding Terms in Corpora for Many Languages . . . . . . . . . . . . . . . . . . . . . . . 135 Adam Kilgarriff, Miloš Jakubı́ček, Vojtěch Kovář, Pavel Rychlý, and Vı́t Suchomel Hereditary properties of permutations are strongly testable . . . . . . . . . . . . . 137 Tereza Klimošová and Daniel Král’ Paraphrase and Textual Entailment Generation in Czech . . . . . . . . . . . . . . . 138 Zuzana Nevěřilová Minimizing Running Costs in Consumption Systems . . . . . . . . . . . . . . . . . . . 139 Petr Novotný X Testing Fault-Tolerance Methodologies in Electro-mechanical Applications 140 Jakub Podivı́nský and Zdeněk Kotásek A Simple and Scalable Static Analysis for Bound Analysis and Amortized Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 Moritz Sinn, Florian Zuleger, and Helmut Veith Optimal Temporal Logic Control for Deterministic Transition Systems with Probabilistic Penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Mária Svoreňová, Ivana Černá, and Calin Belta Understanding the Importance of Interactions among Job Scheduling Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 Šimon Tóth and Dalibor Klusáček Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 XI XII Part I Invited Lectures—Abstracts Gianni Antichi, Computer Laboratory, University of Cambridge, United Kingdom Hardware accelerated networking systems: practice against theory Computer networks are the hallmark of the 21st century’s society and underpin virtually all infrastructures of the modern world. Building, running and maintaining enterprise networks is getting ever more complicated and difficult. Part of the problem is related to the proliferation of real-time applications (voice, video, gaming), which demand higher bandwidth and low-latency connections pushing network devices to work at higher speeds. 
In this scenario, hardware acceleration comes to aid speeding up time-critical operations. This talk will introduce the most common network processing hardware accelerated operations such as IPlookup, packet classification and network monitoring taking as a reference the widely used NetFPGA platform. We will present a list of technical challenges that must be addressed to transition from a simple layer-2 switching device to a highly accurate network monitoring system, passing by a packet classifier. Derek Groen, Centre for Computational Science, University College London High-performance multiscale computing for modelling cerebrovascular bloodflow and nanomaterials Stroke is a leading cause of adult disability, and responsible for 50,000 deaths in the UK in 2012. Brain haemorrhages are responsible for 15% of all strokes in the UK, and 50% of all strokes in children. Within UCL we have developed the HemeLB simulation environment to try and obtain more understanding about Brain haemorrhages and blood flow in sparse geometries in general. In this talk I will introduce HemeLB and summarize the many research efforts made around it in recent years. I will present a cerebrovascular blood flow simulation which incorporates input from the wider environment in a cerebrovascular network by coupling a 1D discontinuous Galerkin model to a 3D lattice-Boltzmann model, as well as several advances that we have made to improve the performance of our code. These include vectorization of the code, improved domain decomposition techniques and some preliminary results on using non-blocking collectives. I will also present our ongoing work on clay-polymer nanocomposites where we use a three-level multiscale scheme to produce a chemically-specific model of clay-polymer nanocomposites. We applied this approach to study collections of clay mineral tactoids interacting with two synthetic polymers, polyethylene glycol and polyvinyl alcohol. The controlled behaviour of layered materials in a polymer matrix is centrally important for many engineering and manufacturing applications. Our approach opens up a route to computing the properties of complex soft materials based on knowledge of their chemical composition, molecular structure and processing conditions. 3 Jozef Ivanecký, Stephan Mehlhase, European Media Laboratory, Germany Today‘s Challenges fof Embedded ASR Automatic Speech Recognition (ASR) is pervading nowadays to areas unimaginable few years ago. Such a progress in past few years was achieved not because of core embedded ASR technology improvement, but mainly because of massive changes in “Smart phones world” as well as availability of a small powerful and affordable Linux based HW. These changes answered two important questions: 1. How to make ASR always available? 2. How to install local affordable ASR system almost anywhere? In recent years we can also observe grow of freely available ASR systems with acceptable speed and accuracy. Together with changes in mobile world it is possible to embed remote ASR into applications very quickly and without deep knowledge about the speech recognition. What is the future of real embedded ASR systems in this case? The goal of this talk is to present two embedded ASR applications which would not be possible without above mentioned changes over recent years and point out their advantages in contrast to today‘s quick solutions. 
The first one demonstrates how changes in users behaviors allowed to design usable voice enabled house control application accepted by all age groups. The focus of second one is mainly on extremely reliable in-car real time speech recognition system which can use also remote ASR for some specific tasks. Daniel Lokshtanov, University of Bergen, Norway Tree Decompositions and Graph algorithms A central concept in graph theory is the notion of tree decompositions - these are decompositions that allow us to split a graph up into “nice” pieces by “small” cuts. It is possible to solve many algorithmic problems on graphs by decomposing the graph into “nice” pieces, finding a solution in each of the pieces, and then gluing these solutions together to form a solution to the entire graph. Examples of this approach include algorithms for deciding whether a given input graph is planar, the k-Disjoint paths algorithm of Robertson and Seymour, as well as a plenthora of algorithms on graphs of bounded tree-width. By playing with the formal definition of “nice” one arrives at different kinds of decompositions, with different algorithmic applications. For an example graphs of bounded treewidth are graphs that may be decomposed into “small” pieces by “small” cuts. The structure theorem for minor-free graphs of Robertson and Seymour states that minor-free graphs are exactly the graphs that may be decomposed by “small” cuts into pieces that “almost” can be drawn on a surface of small genus. In this talk we will ask the following lofty question: is it possible that every graph has one, “nicest” tree decomposition which simultaneously decomposes the graph into parts that are “as nice as possible” for any reasonable definition of nice? And, if such a decomposition exists, how fast can we find such it algorithmically? 4 Michael Tautschnig, University of London, United Kingdom Automating Software Analysis at Very Large Scale Actual software in use today is not known to follow any uniform normal distribution, whether syntactically—in the language of all programs described by the grammar of a given programming language, or semantically—for example, in the set of reachable states. Hence claims deduced from any given set of benchmarks need not extend to real-world software systems. When building software analysis tools, this affects all aspects of tool construction: starting from language front ends not being able to parse and process real-world programs, over inappropriate assumptions about (non-)scalability, to failing to meet actual needs of software developers. To narrow the gap between real-world software demands and software analysis tool construction, an experiment using the Debian Linux distribution has been set up. The Debian distribution presently comprises of more than 22000 source software packages. Focussing on C source code, more than 400 million lines of code are automatically analysed in this experiment, resulting in a number of improvements in analysis tools on the one hand, but also more than 700 public bug reports to date. Stefan Wörz, University of Heidelberg, Germany 3D Model-Based Segmentation of 3D Biomedical Images A central task in biomedical image analysis is the segmentation and quantification of 3D image structures. A large variety of segmentation approaches have been proposed including approaches based on different types of deformable models. A main advantage of deformable models is that they allow incorporating a priori information about the considered image structures. 
In this contribution we give a brief overview of often used deformable models such as active contour models, statistical shape models, and analytic parametric models. Moreover, we present in more detail 3D analytic parametric intensity models, which enable accurate and robust segmentation and quantification of 3D image structures. Such parametric models have been successfully used in different biomedical applications, for example, for the localization of 3D anatomical point landmarks in 3D MR and CT images, for the quantification of vessels in 3D MRA and CTA images, as well as for the segmentation of cells and subcellular structures in 3D microscopy images. 5 6 Part II Contributed Papers Abstracts Petr Bauch, Vojtěch Havel, and Jiřı́ Barnat, Masaryk University, Brno, Czech Republic LTL Model Checking of LLVM Bitcode with Symbolic Data The correctness of parallel and reactive programs is often easier specified using formulae of temporal logics. Yet verifying that a system satisfies such specifications is more difficult than verifying safety properties: the recurrence of a specific program state has to be detected. This paper reports on the development of a generic framework for automatic verification of linear temporal logic specifications for programs in LLVM bitcode. Our method searches explicitly through all possible interleavings of parallel threads (control non-determinism) but represents symbolically the variable evaluations (data non-determinism), guided by the specification in order to prove the correctness. To evaluate the framework we compare our method with state-of-the-art tools on a set of unmodified C programs. Stephan Beyer and Markus Chimani, University of Osnabrück, Germany Steiner Tree 1.39-Approximation in Practice We consider the currently strongest Steiner tree approximation algorithm that has recently been published by Goemans, Olver, Rothvoß and Zenklusen (2012). It first solves a hypergraphic LP relaxation and then applies matroid theory to obtain an integral solution. The cost of the resulting Steiner tree is at most (1.39 + ε)-times the cost of an optimal Steiner tree where ε tends to zero as some parameter k tends to infinity. However, the degree of the polynomial running time depends on this constant k, so only small k are tractable in practice. The algorithm has, to our knowledge, not been implemented and evaluated in practice before. We investigate different implementation aspects and parameter choices of the algorithm and compare tuned variants to an exact LP-based algorithm as well as to fast and simple 2-approximations. Jan Fiedor, Zdeněk Letko, João Lourenço and Tomáš Vojnar, Brno University of Technology, Czech Republic and Universidade Nova de Lisboa, Portugal On Monitoring C/C++ Transactional Memory Programs Transactional memory (TM) is an increasingly popular technique for synchronising threads in multi-threaded programs. To address both correctness and performance-related issues of TM programs, one needs to monitor and analyse their execution. However, monitoring concurrent programs (including TM programs) can have a non-negligible impact on their behaviour, which may hamper objectives of the intended analysis. In this paper, we propose several approaches for monitoring TM programs and study their impact on the behaviour of the monitored programs. The considered approaches range from specialised lightweight monitoring to generic heavyweight monitoring. 
The implemented 9 monitoring tools are publicly available for further applications, and the implementation techniques used for lightweight monitoring can be used as an inspiration for developing further specialised lightweight monitors. Radek Hrbáček, Brno University of Technology, Czech Republic Bent Functions Synthesis on Intel Xeon Phi Coprocessor A new approach to synthesize bent Boolean functions by means of Cartesian Genetic Programming (CGP) has been proposed recently. Bent functions have important applications in cryptography due to their high nonlinearity. However, they are very rare and their discovery using conventional brute force methods is not efficient enough. In this paper, a new parallel implementation is proposed and the performance is evaluated on the Intel Xeon Phi Coprocessor. Vojtěch Nikl and Jiřı́ Jaroš, Brno University of Technology, Czech Republic Parallelisation of the 3D Fast Fourier Transform Using the Hybrid OpenMP/MPI Decomposition The 3D fast Fourier transform (FFT) is the heart of many simulation methods. Although the efficient parallelisation of the FFT has been deeply studied over last few decades, many researchers only focused on either pure message passing (MPI) or shared memory (OpenMP) implementations. Unfortunately, pure MPI approaches cannot exploit the shared memory within the cluster node and the OpenMP cannot scale over multiple nodes. This paper proposes a 2D hybrid decomposition of the 3D FFT where the domain is decomposed over the first axis by means of MPI while over the second axis by means of OpenMP. The performance of the proposed method is thoroughly compared with the state of the art libraries (FFTW, PFFT, P3DFFT) on three supercomputer systems with up to 16k cores. The experimental results show that the hybrid implementation offers 10-20% higher performance and better scaling especially for high core counts. Juraj Nižnan, Radek Pelánek, and Jiřı́ Řihák, Masaryk University, Brno, Czech Republic Mapping Problems to Skills Combining Expert Opinion and Student Data Construction of a mapping between educational content and skills is an important part of development of adaptive educational systems. This task is difficult, requires a domain expert, and any mistakes in the mapping may hinder the potential of an educational system. In this work we study techniques for improving a problem-skill mapping constructed by a domain expert using student data, particularly problem solving times. We describe and compare different techniques for the task – a multidimensional model of problem solving times 10 and supervised classification techniques. In the evaluation we focus on surveying situations where the combination of expert opinion with student data is most useful. Karel Štěpka and Martin Falk, Masaryk University and Institute of Biophysics of ASCR, Brno, Czech Republic Image Analysis of Gene Locus Positions within Chromosome Territories in Human Lymphocytes One of the important areas of current cellular research with substantial impacts on medicine is analyzing the spatial organization of genetic material within the cell nuclei. Higher-order chromatin structure has been shown to play essential roles in regulating fundamental cellular processes, like DNA transcription, replication, and repair. In this paper, we present an image analysis method for the localization of gene loci with regard to chromosomal territories they occupy in 3D confocal microscopy images. 
We show that the segmentation of the territories to obtain a precise position of the gene relative to a hard territory boundary may lead to undesirable bias in the results; instead, we propose an approach based on the evaluation of the relative chromatin density at the site of the gene loci. This method yields softer, fuzzier "boundaries", characterized by progressively decreasing chromatin density. The method therefore focuses on the extent to which the signals are located inside the territories, rather than a hard yes/no classification.

Vladimír Štill, Petr Ročkai, and Jiří Barnat, Masaryk University, Brno, Czech Republic
Context-Switch-Directed Verification in DIVINE
In model checking of real-life C and C++ programs, both search efficiency and counterexample readability are very important. In this paper, we suggest context-switch-directed exploration as a way to find a well-readable counterexample faster. Furthermore, we allow limiting the number of context switches used in state-space exploration if desired. The new algorithm is implemented in the DIVINE model checker and enables both unbounded and bounded context-switch-directed exploration for models given in LLVM bitcode, which efficiently allows for verification of multi-threaded C and C++ programs.

David Wehner, ETH Zürich, Switzerland
A New Concept in Advice Complexity of Job Shop Scheduling
In online scheduling problems, we want to assign jobs to machines while optimizing some given objective function. In the class we study in this paper, we are given a number m of machines and two jobs that both want to use each of the given machines exactly once in some predefined order. Each job consists of m tasks and each task needs to be processed on one particular machine. The objective is to assign the tasks to the machines while minimizing the makespan, i.e., the processing time of the job that takes longer. In our model, the tasks arrive in consecutive time steps and an algorithm must assign a task to a machine without having full knowledge of the order in which the remaining tasks arrive. We study the advice complexity of this problem, which is a tool to measure the amount of information necessary to achieve a certain output quality. A great deal of research has been carried out in this field; however, this paper studies the problem in a new setting. In this setting, the oracle does not know the exact future anymore but only all possible future scenarios and their probabilities. This way, the additional information becomes more realistic. We prove that the problem is more difficult with this oracle than before. Moreover, in job shop scheduling, we provide a lower bound of 1 + 1/(6√m) on the competitive ratio of any online algorithm with advice and prove an upper bound of 1 + 1/√m on the competitive ratio of an algorithm from Hromkovič et al.

Part III
Regular Papers

Boosted Decision Trees for Behaviour Mining of Concurrent Programs

Renata Avros (2), Vendula Hrubá (1), Bohuslav Křena (1), Zdeněk Letko (1), Hana Pluháčková (1), Tomáš Vojnar (1), Zeev Volkovich (2), and Shmuel Ur (1)

(1) IT4Innovations Centre of Excellence, FIT, Brno University of Technology, Brno, CZ, {ihruba, krena, iletko, ipluhackova, vojnar}@fit.vutbr.cz, [email protected]
(2) Ort Braude College of Engineering, Software Engineering Department, Karmiel, IL, {r avros, vlvolkov}@braude.ac.il

Abstract. Testing of concurrent programs is difficult since the scheduling non-determinism requires one to test a huge number of different thread interleavings.
Moreover, a simple repetition of test executions will typically examine similar interleavings only. One popular way to deal with this problem is to use the noise injection approach, which is, however, parameterized with many parameters whose suitable values are difficult to find. In this paper, we propose a novel application of classification-based data mining for this purpose. Our approach can identify which test and noise parameters are the most influential for a given program and a given testing goal and which values (or ranges of values) of these parameters are suitable for meeting this goal. We present experiments that show that our approach can indeed fully automatically improve noise-based testing of particular programs with a particular testing goal. At the same time, we use it to obtain new general insights into noise-based testing as well.

1 Introduction

Testing of concurrent programs is known to be difficult due to the many different interleavings of actions executed in different threads to be tested. A single execution of available tests used in traditional unit and integration testing usually exercises a limited subset of all possible interleavings. Moreover, repeated executions of the same tests in the same environment usually exercise similar interleavings [2, 3]. Therefore, means for increasing the number of tested interleavings within repeated runs, such as deterministic testing [2], which controls thread scheduling and systematically enumerates different interleavings, and noise injection [3], which injects small delays or context switches into the running threads in order to see different scheduling scenarios, have been proposed and applied in practice.

In order to measure how well a system under test (SUT) has been exercised and hence to estimate how good a given test suite is, testers often collect and analyse coverage metrics. However, one can gain a lot more information from the test executions. One can, e.g., get information on similarities of the behaviour witnessed through different tests, on the behaviour witnessed only within tests that failed, and so on. Such information can be used to optimize the test suite, to help debug the program, etc. In order to get such information, data mining techniques appear to be a promising tool.

In this paper, we propose a novel application of data mining allowing one to exploit information present in data obtained from a sample of test runs of a concurrent program to optimize the process of noise-based testing of the given program. To be more precise, our method employs a data mining method based on classification by means of decision trees and the AdaBoost algorithm. The approach is, in particular, intended to find out which parameters of the available tests and which parameters of the noise injection system are the most influential and which of their values (or ranges of values) are the most promising for a particular testing goal for the given program.

The information obtained by our approach can certainly be quite useful since the efficiency of noise-based testing heavily depends on a suitable setting of the test and noise parameters, and finding such values is not easy [8]. That is why repeated testing based on randomly chosen noise parameters is often used in practice. Alternatively, one can try to use search techniques (such as genetic algorithms) to find suitable test and noise settings [8, 7].
The classifiers obtained by our data mining approach can be easily used to fully automatically optimize the most commonly used noise-based testing with a random selection of parameter values. This can be achieved by simply filtering out randomly generated noise settings that are not considered as promising by the classifier. Moreover, it can also be used to guide and consequently speed up the manual or search-based process of finding suitable values of test and noise parameters (in the latter case, the search techniques would look for a suitable refinement of the knowledge obtained by data mining). Finally, if some of the noise parameters or generic test parameters (such as the number of threads) appear as important across multiple test cases and test goals, they can be considered as important in general, providing a new insight into the process of noise-based testing.

In order to show that the proposed approach can indeed be useful, we apply it for optimizing the process of noise-based testing for two particular testing goals on a set of several benchmark programs. Namely, we consider the testing goals of reproducing known errors and covering rare interleavings which are likely to hide so far unknown bugs. Our experimental results confirm that the proposed approach can discover useful knowledge about the influence and suitable values of test and noise parameters, which we show in two ways: (1) We manually analyse information hidden in the classifiers, compare it with our long-term experience from the field, and use knowledge found as important across multiple case studies to derive some new recommendations for noise-based testing (which are, of course, to be validated in the future on more case studies). (2) We show that the obtained classifiers can be used, in a fully automated way, to significantly improve the efficiency of noise-based testing using a random selection of test and noise parameters.

Plan of the paper. The rest of the paper is structured as follows. Section 2 briefly introduces the techniques that our paper builds on, namely, noise-based testing of concurrent programs, data mining based on decision trees, and the AdaBoost algorithm. Section 3 presents our proposal of using data mining in noise-based testing of concurrent programs. Section 4 provides results of our experiments and presents the newly obtained insights into noise-based testing. Section 5 summarizes the related work. Finally, Section 6 provides conclusions and a discussion of possible future work.

2 Preliminaries

In our previous works, e.g., [8, 10], we have used noise injection to increase the number of interleavings witnessed within the executions of a concurrent program and thus to increase the chance of spotting concurrency errors. Noise injection is a quite simple technique which disturbs thread scheduling (e.g., by injecting, removing, or modifying delays, forcing context switches, or halting selected threads) with the aim of driving the execution of a program into less probable scenarios. The efficiency of noise injection highly depends on the type of the generated noise, on the strength of the noise (which are both determined using some noise seeding heuristics), as well as on the program locations and program executions into which some noise is injected (which is determined using some noise placement heuristics). Multiple noise seeding and noise placement heuristics have been proposed and experimentally evaluated [10].
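The core idea of noise injection is easy to picture. The following is a minimal, self-contained sketch in Python; the authors' actual infrastructure instruments Java programs and offers many more seeding and placement heuristics, so the knob names and values below are purely illustrative (the 0-1000 frequency scale mirrors the parameter description given later in Section 4.1).

```python
import random
import threading
import time

# Illustrative knobs playing the roles of noise frequency, strength, and type.
NOISE_FREQUENCY = 600   # 0 = never inject noise, 1000 = inject at every opportunity
NOISE_STRENGTH = 5      # upper bound on an injected delay in milliseconds
NOISE_TYPE = "sleep"    # "sleep" or "yield"

def maybe_inject_noise():
    """Noise seeding: with probability NOISE_FREQUENCY/1000, disturb scheduling."""
    if random.randint(0, 999) < NOISE_FREQUENCY:
        if NOISE_TYPE == "sleep":
            time.sleep(random.randint(0, NOISE_STRENGTH) / 1000.0)
        else:
            time.sleep(0)  # a crude analogue of yield()

counter = 0  # shared variable accessed without synchronisation

def worker(iterations):
    global counter
    for _ in range(iterations):
        maybe_inject_noise()   # noise placement: before every access to the shared variable
        local = counter        # read-modify-write race window
        maybe_inject_noise()
        counter = local + 1

threads = [threading.Thread(target=worker, args=(300,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without noise, the lost-update error manifests only rarely; with noise injected
# into the race window, runs much more often finish with counter < 600.
print("final counter:", counter)
```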
Searching for an optimal configuration of noise seeding and noise placement heuristics in combination with a selection of available test cases and their parameters has been formalized as the test and noise configuration search problem (TNCS) in [7, 8].

To assess how well tests examine the behaviour of an SUT, the error manifestation ratio and coverage metrics can be used. Coverage metrics successfully used for testing of sequential programs (like statement coverage) are not sufficient for testing of concurrent programs as they do not reflect concurrent aspects of executions. Concurrency coverage metrics [1] are usually tailored to distinguish particular classes of interleavings and/or to capture synchronization events that occur within the execution. Some of the metrics target concurrency issues from a general point of view while some other metrics, e.g., those inspired by particular dynamic detectors of concurrency errors [9], concentrate on selected concurrency aspects only (e.g., on behaviours potentially leading to a deadlock or to a data race). In this work, we, in particular, use the GoldiLockSC∗ coverage metric which measures how many internal states of the GoldiLock data race detector with the fast short circuit checks [5] have been reached [9].

The data mining approach proposed in this paper is based on binary classification. Binary classification problems consist in dividing items of a given collection into two groups using a suitable classification rule. Methods for learning such classifiers include decision trees, Bayesian networks, support vector machines, or neural networks [12]. The use of decision trees is the most popular of these because they have been known for quite some time and can be easily understood. A decision tree can be viewed as a hierarchically structured decision diagram whose nodes are labelled by Boolean conditions on the items to be classified and whose leaves represent classification results. The decision process starts in the root node by evaluating the condition associated with it on the item to be classified. According to the evaluation of the condition, a corresponding branch is followed into a child node. This descent, driven by the evaluation of the conditions assigned to the encountered nodes, continues until a leaf node, and hence a decision, is reached. Decision trees are usually employed as a predictive model constructed via a decision tree learning procedure which uses a training set of classified items.

In the paper, we, in particular, employ the advanced classification technique called Adaptive Boosting (AdaBoost for short) [6] which reduces the natural tendency of decision trees to be unstable (meaning that a minor data oscillation can lead to a large difference in the classification). This technique makes it possible to correct the functionality of many learning algorithms (so-called weak learners) by weighting and mixing their outcomes in order to get the output of the boosted classifier. The method works in iterations (phases). In each iteration, the method aims at producing a new weak classifier in order to improve the consistency of the previously used ones. In our case, AdaBoost uses decision trees as the weak learners with the classification result being −1 or +1. In each phase, the algorithm adds new weighted decision trees obtained by concentrating on items difficult to classify by the so far learnt classifier and updates weights of the previously added decision trees to keep the sum of the weights equal to one.
The resulting advanced classifier then consists of a set of weighted decision trees that are all applied to the item to be classified; their classification results are weighted by the appropriate weights, summed up, and the sign of the result provides the final decision.

3 Classification-based Data Mining in Noise-based Testing

In this section, we first propose our application of AdaBoost in noise-based testing. Subsequently, we discuss how the information hidden in the classifier may be analysed to draw some conclusions about which test and noise parameters are important for particular test cases and test goals or even in general. Finally, we describe two concrete classification properties that are used in our experiments.

3.1 An Application of AdaBoost in Noise-based Testing

First, in order to apply the proposed approach, one has to define some testing goal expressible as a binary property that can be evaluated over test results such that both positive and negative answers are obtained. The requirement of having both positive and negative results can be a problem in some cases, notably in the case of discovering rare errors. In such a case, one has to use a property that approximates the target property of interest (e.g., by replacing the discovery of rare errors by discovering rare behaviours in general). Subsequently, once testing based on settings chosen in this way manages to find some behaviours which were originally not available (e.g., behaviours leading to a rare error), the process can be repeated on the newly available test results to concentrate on a repeated discovery of such behaviours (e.g., for debugging purposes or for the purpose of finding further similar errors).

Once the property of interest is defined, a number of test runs is to be performed using a random setting of test and noise parameters in each run. For each such run, the property of interest is to be evaluated and a couple (x, y) is to be formed where x is a vector recording the test and noise settings used and y is the result of evaluating the property of interest. This process has to be repeated to obtain a set X = {(x1, y1), ..., (xn, yn)} of such couples to be used as the input for learning the appropriate classifier.

Now, the AdaBoost algorithm can be applied. For that, the common practice is to split the set X into two sets, the training set and the testing set, use the training set to get a classifier, and then use the testing set for evaluating the precision of the obtained classifier. To evaluate the precision, one can use the notions of accuracy and sensitivity. Accuracy gives the probability of a successful classification and can be computed as the fraction of the number of correctly classified items and the total number of items. Sensitivity (also called the negative predictive value or NPV) expresses the fraction of correctly classified negative results and can be computed as the number of the items correctly classified negatively divided by the sum of correctly and incorrectly negatively classified items (see, e.g., [12]). Moreover, in order to increase confidence in the obtained results, this process of choosing the training and validation set and of learning and validating the classifier can be repeated several times, allowing one to judge the average values and standard deviation of accuracy and sensitivity.
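The procedure above can be sketched in a few dozen lines. The authors used the GML AdaBoost Matlab Toolbox, so the textbook-style Python version below (boosted height-one decision trees, i.e., stumps, with the usual exponential weight update rather than the toolbox's exact normalisation) is only an illustration of the training step, the sign-of-weighted-sum decision, and the accuracy and sensitivity measures defined above.

```python
import numpy as np

def train_adaboost_stumps(X, y, n_phases=10):
    """Minimal AdaBoost over decision stumps.
    X: (n_runs, n_params) array of test/noise settings; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # item weights
    ensemble = []                                # (feature, threshold, polarity, weight)
    for _ in range(n_phases):
        best = None
        for j in range(d):                       # exhaustively try simple stumps
            for thr in np.unique(X[:, j]):
                for pol in (+1, -1):
                    pred = np.where(X[:, j] < thr, pol, -pol)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak classifier
        pred = np.where(X[:, j] < thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)           # emphasise misclassified runs
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of all stumps; the sign gives the final -1/+1 decision."""
    score = np.zeros(len(X))
    for j, thr, pol, alpha in ensemble:
        score += alpha * np.where(X[:, j] < thr, pol, -pol)
    return np.where(score >= 0, 1, -1)

def accuracy_and_sensitivity(y_true, y_pred):
    """Accuracy and sensitivity (negative predictive value) as defined above."""
    accuracy = np.mean(y_true == y_pred)
    negatives = (y_pred == -1)
    npv = np.mean(y_true[negatives] == -1) if negatives.any() else float("nan")
    return accuracy, npv

# Usage idea: X holds one row of parameter values per test run and y holds +1 if
# the run met the testing goal (e.g., the error manifested) and -1 otherwise.
```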
If the obtained classifier is not validated successfully, one can repeat the AdaBoost algorithm with more boosting phases and/or a bigger set X of data. A successfully validated classifier can subsequently be analysed to get some insight into which test and noise parameters are influential for testing the given program and which of their values are promising for meeting the defined testing goal. Such knowledge can then in turn be used by testers when thinking of how to optimize the testing process. We discuss how such an analysis can be done in Section 3.2 and we apply it in Section 4.3. Moreover, the obtained classifier can also be directly used to improve the performance of noise-based testing based on a random selection of parameters by simply filtering out the settings that get classified as not meeting the considered testing goal. The fact that such an approach does indeed significantly improve the testing process is experimentally confirmed in Section 4.4.

3.2 Analysing Information Hidden in Classifiers

In order to be able to easily analyse the information hidden in the classifiers generated by AdaBoost, we have decided to restrict the height of the basic decision trees used as weak classifiers to one. Moreover, our preliminary experiments showed us that increasing the height of the weak classifiers does not lead to significantly better classification results. A decision tree of height one consists of a root labelled by a condition concerning the value of a single test or noise parameter and two leaves corresponding to positive and negative classification.

AdaBoost provides us with a set of such trees, each with an assigned weight. We convert this set of trees into a set of rules such that we get a single rule for each parameter that appears in at least one decision tree. The rules consist of a condition and a weight, and they are obtained as follows. First, decision trees with negative weights are omitted because they correspond to weak classifiers with the weighted error greater than 0.5. (Note that the AdaBoost methodology suggests that the employed weak classifiers should not be of this kind, but they can appear in practical applications.) Next, the remaining decision trees are grouped according to the parameter about whose value they speak. For each group of the trees, a separate rule is produced such that the conjunction of the decision conditions of the trees from the group is used as the condition of the rule. The weight of the rule is computed by summing the weights of the trees from the concerned group and normalising the result by dividing it by the sum of the weights of all trees from all groups.

The obtained set of rules can be easily used to gain some insight into how the test and noise injection parameters should be set in order to increase the efficiency of the testing process, either for a given program and testing goal or even in general. In particular, one can look for parameters that appear in rules with the highest weights (which speak about parameters whose correct setting is the most important to achieve the given testing goal), for parameters that are important in all or many test cases (and hence can be considered to be important in general), as well as for parameters that do not appear in any rules (and hence appear to be irrelevant).
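The rule-extraction step just described can be sketched as follows. The sketch assumes the (feature, threshold, polarity, weight) stump representation used in the previous listing and conjoins plain threshold conditions only, so it is a simplified illustration of the procedure rather than the authors' tooling (which, as Tables 2 and 3 show, also produces disjunctive interval constraints).

```python
def stumps_to_rules(ensemble, param_names):
    """Turn weighted height-one trees into one weighted rule per parameter.
    param_names maps feature indices to names, e.g. ["x1", "x2", ..., "x12"]."""
    # 1. Omit trees with non-positive weights.
    trees = [t for t in ensemble if t[3] > 0]
    # 2. Group the remaining trees by the parameter they test.
    groups = {}
    for feature, threshold, polarity, alpha in trees:
        groups.setdefault(feature, []).append((threshold, polarity, alpha))
    total = sum(alpha for stumps in groups.values() for _, _, alpha in stumps)
    rules = []
    for feature, stumps in groups.items():
        conditions, weight = [], 0.0
        for threshold, polarity, alpha in stumps:
            # 3. Conjoin the conditions under which the stumps vote "positive".
            if polarity > 0:
                conditions.append(f"{param_names[feature]} < {threshold:g}")
            else:
                conditions.append(f"{threshold:g} <= {param_names[feature]}")
            weight += alpha
        # 4. Normalise by the total weight of all kept trees.
        rules.append((" and ".join(conditions), weight / total))
    return sorted(rules, key=lambda rule: -rule[1])

# Hypothetical output: [("x3 < 3.5 and 0.5 <= x3", 0.55), ("x1 < 275", 0.16), ...]
```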
3.3 Two Concrete Classification Properties

In the experiments described in the next section, we consider two concrete properties according to which we classify test runs. First, we consider the case of finding TNCS solutions suitable for repeatedly finding known errors. In this case, the property of interest is simply the error manifestation property that indicates whether an error manifested during the test execution or not.

Subsequently, we consider the case of finding TNCS solutions suitable for testing rare behaviours in which so far unknown bugs might reside. In order to achieve this goal, we use classification according to a rare events property that indicates whether a test execution covers at least one rare coverage task of a suitable coverage metric; in our experiments, the GoldiLockSC∗ metric is used for this purpose. To distinguish rare coverage tasks, we collect the tasks that were covered in at least one of the performed test runs (i.e., both from the training and validation sets), and for each such coverage task, we count the frequency of its occurrence in all of the considered runs. We define the rare tasks as those that occurred in less than 1 % of the test executions.
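Computing the rare events property from recorded coverage data can be pictured as follows; the data layout and the task identifiers are assumed purely for illustration.

```python
from collections import Counter

def rare_task_labels(runs_coverage, rare_fraction=0.01):
    """Label each run +1 if it covers at least one rare coverage task, else -1.
    runs_coverage: one set per test run with the coverage tasks (e.g. reached
    GoldiLockSC* detector states) observed in that run."""
    n_runs = len(runs_coverage)
    # Count in how many runs each task was covered.
    occurrences = Counter(task for tasks in runs_coverage for task in tasks)
    # A task is rare if it occurred in less than 1 % of all considered runs.
    rare_tasks = {t for t, c in occurrences.items() if c < rare_fraction * n_runs}
    return [1 if tasks & rare_tasks else -1 for tasks in runs_coverage]

# Tiny made-up example: only the last of the 301 runs covers a task ("s404")
# seen in less than 1 % of the runs, so only that run is labelled +1.
labels = rare_task_labels([{"s1", "s2"}, {"s1"}, {"s1", "s97"}] * 100 + [{"s1", "s404"}])
```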
4 Experimental Evaluation

In this section, we first describe the test data which we used for an experimental evaluation of our approach. Then, we describe the precision of the classifiers inferred from this data. Subsequently, we analyse the knowledge hidden in the classifiers, compare it with our previously obtained experience, and derive some new insights about the importance of the different test and noise parameters. Finally, we demonstrate that the use of the proposed data mining approach does indeed improve (in a fully automated way) the process of noise-based testing with a random setting of the parameters.

4.1 Experimental Data

The results presented below are based on 5 multi-threaded benchmark programs that contain a known concurrency error. We use data collected during our previous work [7]. Namely, our case studies are the Airlines (0.3 kLOC), Animator (1.5 kLOC), Crawler (1.2 kLOC), Elevator (0.5 kLOC), and Rover (5.4 kLOC). For each program, we collected data from 10,000 executions with a random test and noise injection setting. We collected various data about the test executions, such as whether an error occurred during the execution (used as our error manifestation property) and various concurrency coverage information, including the GoldiLockSC∗ coverage used for evaluating the rare events property.

In our experiments, we consider vectors of test and noise parameters having 12 entries, i.e., x = (x1, x2, ..., x12). Here, x1 ∈ {0, ..., 1000} represents the noise frequency which controls how often the noise is injected and ranges from 0 (never) to 1000 (always). The x2 ∈ {0, ..., 100} parameter controls the amount of injected noise and ranges from 0 (no noise) to 100 (considerable noise). The x3 ∈ {0, ..., 5} parameter selects one of six available basic noise injection heuristics (based on injecting calls of yield(), sleep(), wait(), using busy waiting, a combination of additional synchronization and yield(), and a mixture of these techniques). The x4, x5, x7, x8, x9 ∈ {0, 1} parameters enable or disable the advanced injection heuristics haltOneThread, timeoutTampering, nonVariableNoise, advSharedVarNoise1, and advSharedVarNoise2, respectively. The x6 ∈ {0, 1, 2} parameter controls the way the sharedVarNoise advanced heuristic behaves (namely, whether it is disabled (0), injects the noise at accesses to one randomly selected shared variable (1), or at accesses to all such variables (2)). A more detailed description of the particular noise injection heuristics can be found in [3, 7, 8, 10]. Furthermore, x10 ∈ {1, ..., 10} and x11, x12 ∈ {1, ..., 100} encode parameters of some of the test cases themselves. In particular, Animator and Crawler are not parametrised, and x10, x11, x12 are not used with them. In the Airlines and Elevator test cases, the x10 parameter controls the number of used threads, and in the Rover test case, the x10 ∈ {0, ..., 6} parameter selects one of the available test scenarios. The Airlines test case is the only one that uses the x11 and x12 parameters, which are in particular used to control how many cycles the test does.

Table 1. Average accuracy and sensitivity of the learnt AdaBoost classifiers.

              Error manifestation                      Rare behaviours
              Accuracy            Sensitivity          Accuracy            Sensitivity
Case study    Mean      Std       Mean      Std        Mean      Std       Mean      Std
Airlines      0.7695    0.0086    0.6229    0.0321     0.9755    0.0056    0.9964    0.0021
Animator      0.937     0.0054    0.9866    0.0052     0.7815    0.0054    0.9071    0.0217
Crawler       0.9975    0.00076   0.999     0.00077    0.7642    0.0402    0.9741    0.0765
Elevator      0.8335    0.0038    0.9982    0.0016     0.6566    0.0051    0.6131    0.027
Rover         0.9714    0.0031    0.9912    0.0012     0.8737    0.1092    0.9687    0.137

Table 2. Inferred weighted rules for the error manifestation classification property.

Airlines:  x1 < 275 (0.16);  x3 < 0.5 or 3.5 < x3 (0.50);  x6 < 1.5 (0.04);  2.5 < x10 (0.18);  73.5 < x12 (0.12)
Animator:  705 < x1 (0.19);  2.5 < x3 < 3.5 (0.55);  x6 < 0.5 (0.26)
Crawler:   x1 < 215 (0.32);  15 < x2 (0.1);  1.5 < x3 < 3.5 or 4.5 < x3 (0.38);  0.5 < x4 (0.05);  x5 < 0.5 (0.08);  x6 < 1.5 (0.07)
Elevator:  x1 < 5 (0.93);  x3 < 0.5 or 3.5 < x3 < 4.5 (0.04);  x7 < 0.5 (0.01);  8.5 < x10 (0.02)
Rover:     515 < x1 (0.21);  2.5 < x3 < 3.5 (0.48);  0.5 < x4 (0.08);  x6 < 0.5 (0.23)

4.2 Precision of the Classifiers

In our experiments, we used the implementation of AdaBoost available in the GML AdaBoost Matlab Toolbox (http://graphics.cs.msu.ru/en/science/research/machinelearning/AdaBoosttoolbox). We have set it to use decision trees of height restricted to one and to use 10 boosting phases. The algorithm was applied 100 times on randomly chosen divisions of the test data into the training and validation groups.

Table 1 summarises the average accuracy and sensitivity of the learnt AdaBoost classifiers. One can clearly see that both the average accuracy and sensitivity are quite high, ranging from 0.61 to 0.99. Moreover, the standard deviation is very low in all cases. This indicates that we always obtained results that provide meaningful information about our test runs.

4.3 Analysis of the Knowledge Hidden in the Obtained Classifiers

We now employ the approach described in Section 3.2 to interpret the knowledge hidden in the obtained classifiers. Tables 2 and 3 show the inferred rules and their weights for the error manifestation property and the rare behaviours property, respectively. For each test case, the tables list the rules as conditions (in the form of interval constraints) followed by the appropriate weights from the interval (0, 1).
In order to interpret the obtained rules, we first focus on rules with the highest weights (corresponding to parameters with the biggest influence). Then we look at the parameters which are present in rules across the test cases (and hence seem to be important in general) and parameters that are specific for particular test cases only. Next, we pinpoint parameters that do not appear in any of the rules and therefore seem to be of a low relevance in general.

As for the error manifestation property (i.e., Table 2), the most influential parameters are x3 in four of the test cases and x1 in the Crawler test case. This indicates that the selection of a suitable noise type (x3) or noise frequency (x1) is the most important decision to be made when testing these programs with the aim of reproducing the errors present in them. Another important parameter is x6 controlling the use of the sharedVarNoise heuristic. Moreover, the parameters x1, x3, and x6 are considered important in all of the rules, which suggests that, for reproducing the considered kind of errors, they are of a general importance. In two cases (namely, Crawler and Rover), the advanced haltOneThread heuristic (x4) turns out to be important. In the Crawler and Rover test cases, this heuristic should be enabled in order to detect an error. This behaviour fits into our previous results [10] in which we show that, in some cases, this unique heuristic (the only heuristic which allows one to exercise thread interleavings which are normally far away from each other) considerably contributes to the detection of an error. Finally, the presence of the x10 and x12 parameters in the rules derived for the Airlines test case indicates that the number of threads (x10) and the number of cycles executed during the test (x12) plays an important role in the noise-based testing of this particular test case. The x10 parameter (i.e., the number of threads) turns out to be important for the Elevator test case too, indicating that the number of threads is of a more general importance. Finally, we can see that the x8, x9 and x11 parameters are not present in any of the derived rules. This indicates that the advSharedVarNoise noise heuristics are of a low importance in general, and the x11 parameter specific for Airlines is not really important for finding errors in this test case.

For the case of classifying according to the rare behaviour property, the obtained rules are shown in Table 3. We can again find the highest weights in rules based on the x3 parameter (Animator, Crawler, Rover) and on the x1 parameter (Airlines). However, in the case of Elevator, the most contributing parameter is now the number of threads used by the test (x10). The rule suggests using certain numbers of threads in order to spot rare behaviours (i.e., it is important to consider not only a high number of threads). The generated sets of rules often contain the x3 parameter controlling the type of noise (all test cases except for Airlines) and the x6 parameter which controls the sharedVarNoise heuristic. These parameters thus appear to be of a general importance in this case.

Table 3. Inferred weighted rules for the rare behaviours classification property.
Airlines Rules x1 < 295 or 745 < x1 < 925 x2 < 35 Weights 0.52 0.06 Animator Rules 0.5 < x3 < 3.5 or 4.5 < x3 Weights 0.8 Crawler Rules 0.5 < x3 < 3.5 or 4.5 < x3 0.5 < x4 Weights 0.46 0.08 Elevator Rules 0.5 < x3 < 3.5 0.5 < x4 0.5 < x5 or 4.5 < x3 Weights 0.22 0.05 0.2 Rover Rules 2.5 < x3 < 3.5 or 4.5 < x3 x4 < 0.5 Weights 0.46 0.26 0.5 < x5 61.5 < x12 < 91.5 0.1 0.32 0.5 < x6 < 1.5 0.2 0.5 < x5 0.2 0.5 < x6 < 1.5 0.26 1.5 < x6 1.5 < x10 < 4.5 or 7.5 < x10 0.06 0.47 x6 < 0.5 0.16 0.5 < x7 0.12 Next, the parameter x12 does again turn out to be important in the Airlines test case, and the x10 parameter is important in the Elevator test case. This indicates that even for testing rare behaviours, it is important to adjust the number of threads or test cycles to suitable values. Finally, the x8 , x9 , and x11 parameters do not show up in any of the rules, and hence seem to be of a low importance in general for finding rare behaviours (which is the same as for reproduction of known errors). Overall, the obtained results confirmed some of the facts we discovered during our previous experimentation such as that different goals and different test cases may require a different setting of noise heuristics [10, 7, 8] and that the haltOneThread noise injection heuristics (x4 ) provides in some cases a dramatic increase in the probability of spotting an error [10]. More importantly, the analysis revealed (in an automated way) some new knowledge as well. Mainly, the type of noise (x3 ) and the setting of the sharedVarNoise heuristic (x6 ) as well as the frequency of noise (x1 ) are often the most important parameters (although the importance of x1 seems to be a bit lower). Further, it appears to be important to suitably adjust the number of threads (x10 ) whenever that is possible. 4.4 Improvement of Noise-based Testing with Random Parameters Finally, we show that the obtained classifiers can be used to fully automatically improve the process of noise-based testing with randomly chosen values of parameters. For that, we reuse the 7,500 test runs out of 10,000 test runs recorded with random parameter values for each of the case studies. In particular, we randomly choose 2,500 test runs as training set for our AdaBoost approach to produce classifiers. Then, from the rest of the test runs, we randomly choose 5,000 test runs to compare our approach with the random approach. From these 5,000 test runs, we first select runs that were performed using settings considered as suitable for the respective testing goals by the classifiers 24 Boosted Decision Trees for Behaviour Mining of Concurrent Programs Table 4. A comparison of the random approach and the newly proposed AdaBoost approach. Error manifestation CaseStudies Airlines Animator Crawler Elevator Rover Rare behaviours Rand. AdaBoost Pos. Impr. Rand. AdaBoost Pos. Impr. 56.26 14.81 0.18 16.75 6.65 75.43 54.05 0.25 27.66 36.25 1,612 901 2,806 1,410 822 1.34 3.65 1.39 1.65 5.45 1.94 39.53 22.41 52.77 10.76 1.64 57.95 31.26 59.51 23.21 2,444 3,258 1,513 1,398 1,620 0.85 1.47 1.39 1.13 2.16 that we have obtained. Then, we compute what fractions of all the runs and what fractions of all the selected runs satisfy the testing goals for the considered case studies, which shows us the efficiency of the different testing approaches. In Table 4, the columns Pos. contain the numbers of test runs (out of the considered 5,000 runs) classified positively by the obtained classifiers for the two considered test goals. The columns Rand. 
give the percentage of runs out of the 5,000 runs performed under purely randomly chosen values of parameters that met the considered testing goals (i.e., found an error or a rare behaviour, respectively). The columns AdaBoost give this percentage for the selected runs (i.e., those whose number is in the columns Pos.). Finally, the columns Impr. present how many times the efficiency of testing with the selected values of parameters is better than that of purely random noise-based testing (i.e., it contains the ratio of the values in the AdaBoost and Rand. columns). The improvement columns clearly show that our AdaBoost technique often brings an improvement (with one exception described bellow), which ranges from 1.13 times in the case of the rare behaviours property and the Elevator test case to 5.45 times in the case of the error manifestation property and the Rover test case. In the case of the Airlines test case and the rare behaviours property, our technique provided worse results (impr. 0.85). This is mostly caused by the simplicity of the case study and hence lack of rare behaviours in the test runs. Therefore, our approach did not have enough samples to construct a successful classifier. Nevertheless, we can conclude that our classification approach can really improve the efficiency of testing in majority of studied cases. 5 Related Work Most of the existing works on obtaining new knowledge from multiple test runs of concurrent programs focus on gathering debugging information that helps to find the root cause of a failure [4, 11]. In [11], a machine learning algorithm is used to infer points in the execution such that the error manifestation probability is increased when noise is injected into them. It is then shown that such places are often involved in the erroneous behaviour of the program. Another approach [4] uses a data mining-like technique, more precisely, the feature selection algorithm, 25 R. Avros et al. to infer a reduced call graph representation of the SUT, which is then used to discover anomalies in the behaviour of the SUT within erroneous executions. There is also rich literature and tool support for data mining test results without a particular emphasis on concurrent programs. The existing works study different aspects of testing, including identification of test suite weaknesses [1] and optimisation of the test suite [13]. In [1], a substring hole analysis is used to identify sets of untested behaviours using coverage data obtained from testing of large programs. Contrary to the analysis of what is missing in coverage data and what should be covered by improving the test suite, other works focus on what is redundant. In [13], a clustering data mining technique is used to identify tests which exercise similar behaviours of the program. The obtained results are then used to prioritise the available tests. 6 Conclusions and Future Work In the paper, we have proposed a novel application of classification-based data mining in the area of noise-based testing of concurrent programs. In particular, we proposed an approach intended to identify which of the many noise parameters and possibly also parameters of the tests themselves are important for a particular testing goal as well as which values of these parameters are suitable for meeting this goal. As we have demonstrated on a number of case studies, the proposed approach can be used to fully automatically improve the noise-based testing approach of a particular program with a particular testing goal. 
Moreover, we have also used our approach to derive new insights into the noise-based testing approach itself. Apart from validating our findings on more case studies, there is plenty of space for further research in the area of applications of data mining in testing of concurrent programs. One can ask many interesting questions and search for the answers using different techniques, such as outliers detection, clustering, association rules mining, etc. For example, many of the concurrency coverage metrics based on dynamic detectors contain a lot of information on the behaviour of the tested programs, and when mined, this information could be used for debugging purposes. Acknowledgement. The work was supported by the bi-national Czech-Israel project (Kontakt II LH13265 by the Czech Ministry of Education and 3-10371 by Ministry of Science and Technology of Israel), the EU/Czech IT4Innovations Centre of Excellence project CZ.1.05/1.1.00/02.0070, and the internal BUT projects FIT-S-12-1 and FIT-S-14-2486 . Z. Letko was funded through the EU/ Czech Interdisciplinary Excellence Research Teams Establishment project CZ.1.07/2.3.00/30.0005. References 1. Yoram Adler, Noam Behar, Orna Raz, Onn Shehory, Nadav Steindler, Shmuel Ur, and Aviad Zlotnick. Code Coverage Analysis in Practice for Large Systems. In Proc. of ICSE’11, pages 736–745. ACM, 2011. 26 Boosted Decision Trees for Behaviour Mining of Concurrent Programs 2. Thomas Ball, Sebastian Burckhardt, Katherine E. Coons, Madanlal Musuvathi, and Shaz Qadeer. Preemption Sealing for Efficient Concurrency Testing. In Proc. of TACAS’10, volume 6015 of LNCS, pages 420–434. Springer-Velrlag, 2010. 3. Orit Edelstein, Eitan Farchi, Evgeny Goldin, Yarden Nir, Gil Ratsaby, and Shmuel Ur. Framework for Testing Multi-threaded Java Programs. Concurrency and Computation: Practice and Experience, 15(3-5):485–499. Wiley, 2003. 4. Frank Eichinger, Victor Pankratius, Philipp W. L. Große, and Klemens Böhm. Localizing Defects in Multithreaded Programs by Mining Dynamic Call Graphs. In Proc. of TAIC PART’10, volume 6303 of LNCS, pages 56–71. Springer-Velrlag, 2010. 5. Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. Goldilocks: A Race and Transaction-aware Java Runtime. In Proc. of PLDI’07, pages 245–255. ACM, 2007. 6. Yoav Freund and Robert E. Schapire. A Short Introduction to Boosting. In In Proc. of IJCAI’99, pages 1401–1406. Morgan Kaufmann, 1999. 7. Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, and Tomáš Vojnar. Multi-objective Genetic Optimization for Noise-based Testing of Concurrent Software. In Proc. of SSBSE’14, volume 8636 of LNCS, pages 107–122. SpringerVerlag, 2014. 8. Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Shmuel Ur, and Tomáš Vojnar. Testing of Concurrent Programs Using Genetic Algorithms. In Proc. of SSBSE’12, volume 7515 of LNCS, pages 152–167. Springer-Velrlag, 2012. 9. Bohuslav Křena, Zdeněk Letko, and Tomáš Vojnar. Coverage Metrics for Saturation-based and Search-based Testing of Concurrent Software. In Proc. of RV’11, volume 7186 of LNCS, pages 177–192. Springer-Velrlag, 2012. 10. Zdeněk Letko, Tomáš Vojnar, and Bohuslav Křena. Influence of Noise Injection Heuristics on Concurrency Coverage. In Proc. of MEMICS’11, volume 7119 of LNCS, pages 123–131, Springer-Velrlag, 2012. 11. Rachel Tzoref, Shmuel Ur, and Elad Yom-Tov. Instrumenting Where It Hurts: An Automatic Concurrent Debugging Technique. In Proc. of ISSTA’07, pages 27–38. ACM, 2007. ACM. 12. Ian H. Witten, Eibe Frank, and Mark A. Hall. 
Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition, 2011. 13. Shin Yoo, Mark Harman, Paolo Tonella, and Angelo Susi. Clustering Test Cases to Achieve Effective and Scalable Prioritisation Incorporating Expert Knowledge. In Proc. of ISSTA’09, pages 201–212. ACM, 2009. 27 LTL Model Checking of Parametric Timed Automata Peter Bezděk, Nikola Beneš? , Jiřı́ Barnat?? , and Ivana Černá?? Faculty of Informatics, Masaryk University, Brno, Czech Republic {xbezdek1,xbenes3,barnat,cerna}@fi.muni.cz Abstract. The parameter synthesis problem for timed automata is undecidable in general even for very simple reachability properties. In this paper we introduce restrictions on parameter valuations under which the parameter synthesis problem is decidable for LTL properties. The proposed problem could be solved using an explicit enumeration of all possible parameter valuations. However, we introduce a symbolic zone-based method for synthesising bounded integer parameters of parametric timed automata with an LTL specification. Our method extends the ideas of the standard automata-based approach to LTL model checking of timed automata. Our solution employs constrained parametric difference bound matrices and a suitable notion of extrapolation. 1 Introduction Model checking [1] is a formal verification technique applied to check for logical correctness of discrete distributed systems. While it is often used to prove the unreachability of a bad state (such as an assertion violation in a piece of code), with a proper specification formalism, such as the Linear Temporal Logic (LTL), it can also check for many interesting liveness properties of systems, such as repeated guaranteed response, eventual stability, live-lock, etc. Timed automata have been introduced in [2] and have emerged as a useful formalism for modelling time-critical systems as found in many embedded and cyber-physical systems. The formalism is built on top of the standard finite automata enriched with a set of real-time clocks and allowing the system actions to be guarded with respect to the clock valuations. In the general case, such a timed system exhibits infinite-state semantics (the clock domains are continuous). Nevertheless, when the guards are limited to comparing clock values with integers only, there exists a bisimilar finite state representation of the original infinite-state real-time system referred to as the region abstraction. A practically efficient abstraction of the infinite-state space came with the so called zones [3]. The zone-based abstraction is much coarser and the number of zones reachable ? ?? 28 The author has been supported by the MEYS project No. CZ.1.07/2.3.00/30.0009 Employment of Newly Graduated Doctors of Science for Scientific Excellence. The authors have been supported by the MEYS project No. LH11065 Control Synthesis and Formal Verification of Complex Hybrid Systems. LTL model checking of Parametric Timed Automata from the initial state is significantly smaller. This in turns allows for an efficient implementation of verification tools for timed automata, see e.g. UPPAAL [4]. Very often the correctness of a time-critical system relates to a proper timing, i.e. it does not only depend on the logical result of the computation, but also on the time at which the results are produced. 
To that end the designers are not only in the need of tools to verify correctness once the system is fully designed, but also in the need of tools that would help them to derive proper time parameters of individual system actions that would make the system as a whole satisfy the required specification. After all this problem of parameter synthesis is more urgent in practice than the verification as such. The problem of the existence of a parameter valuation for a reachability property of a parametric timed automaton has been shown to be undecidable in [5] for a parametric timed automaton with as few as 3 clocks. To obtain a decidable problem we need to restrict parameter valuations to bounded integers. When modelling a real-time system, designers can usually provide practical bounds on time parameters of individual system actions. Therefore, introducing a parameter synthesis method with such a restriction is still reasonable. Our goal is to solve the parameter synthesis problem for linear time properties over parametric timed automata where the parameter valuation function is restricted to bounded range over integer values. As part of our goal, we propose a solution that avoids the parameter scan approach in order to provide a potentially more efficient method. To that end we introduce a finite abstraction over parametric difference bound matrices, which allows us to deploy our solution based on a zone abstraction. An extension of the model checker Uppaal, capable of synthesising linear parameter constraints for correctness of parametric timed automata has been described in [6] together with a subclass of parametric timed automata, for which the emptiness problem is decidable. In [7] authors show that the problem of the existence of bounded integer parameter values such that some TCTL property is satisfied is PSPACE-complete. They also give symbolic algorithms for reachability and unavoidability properties. Contribution We show how to apply the standard automata-based approach to LTL model checking of Vardi and Wolper [8] in the context of an LTL formula, a parametric timed automaton and bounds on parameters. In particular, we show how to construct a Büchi automaton coming from the parametric system under verification using a zone-based abstraction and an extrapolation. Due to space constraints, the proofs of theorems from this paper are given in [9]. 2 Preliminaries and Problem Statement In order to state our main problem formally, we need to describe the notion of a parametric timed automaton. We start by describing some basic notation. 29 P. Bezděk, N. Beneš, J. Barnat, and I. Černá Let P be a finite set of parameters. An affine expression is an expression of the form z0 + z1 p1 + . . . + zn pn , where p1 , . . . , pn ∈ P and z0 , . . . , zn ∈ Z. We use E(P ) to denote the set of all affine expressions over P . A parameter valuation is a function v : P → Z which assigns an integer number to each parameter. Let lb : P → Z be a lower bound function and ub : P → Z be an upper bound function. For an affine expression e, we use e[v] to denote the integer value obtained by replacing each p in e by v(p). We use maxlb,ub (e) to denote the maximal value obtained by replacing each p with a positive coefficient in e by ub(p) and replacing each p with a negative coefficient in e by lb(p). We say that the parameter valuation v respects lb and ub if for each p ∈ P it holds that lb(p) ≤ v(p) ≤ ub(p). We denote the set of all parameter valuations respecting lb and ub by V allb,ub (P ). 
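As a concrete reading of these definitions, the following C++ sketch (with our own, purely illustrative representation) evaluates an affine expression e[v] under a parameter valuation and computes maxlb,ub(e) by substituting upper bounds for positive coefficients and lower bounds for negative ones.

#include <vector>

// Affine expression z0 + z1*p1 + ... + zn*pn over parameters p1..pn.
struct AffineExpr {
    long z0;                   // constant term
    std::vector<long> coeff;   // coeff[i] is the coefficient of parameter p_{i+1}
};

// e[v]: substitute a concrete parameter valuation v (v[i] is the value of p_{i+1}).
long evaluate(const AffineExpr& e, const std::vector<long>& v) {
    long r = e.z0;
    for (size_t i = 0; i < e.coeff.size(); ++i) r += e.coeff[i] * v[i];
    return r;
}

// max_{lb,ub}(e): the maximal value of e over all valuations respecting lb and ub,
// obtained by taking ub for positive coefficients and lb for negative ones.
long maxLbUb(const AffineExpr& e, const std::vector<long>& lb, const std::vector<long>& ub) {
    long r = e.z0;
    for (size_t i = 0; i < e.coeff.size(); ++i)
        r += e.coeff[i] * (e.coeff[i] >= 0 ? ub[i] : lb[i]);
    return r;
}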
In the following, we only consider parameter valuations from V allb,ub (P ). Let X be a finite set of clocks. We assume the existence of a special zero clock, denoted by x0 , that has always the value 0. A guard is a finite conjunction of expressions of the form xi − xj ∼ e where xi , xj ∈ X, e ∈ E(P ) and ∼ ∈ {≤, <}. We use G(X, P ) to denote the set of all guards over a set of clocks X and a set of parameters P . A plain guard is a guard containing only expressions of the form xi −xj ∼ e where xi , xj ∈ X, e ∈ E(P ), ∼ ∈ {≤, <}, and xi = x0 or xj = x0 . We also use G(X, P ) to denote the set of all plain guards over a set of clocks X and a set of parameters P . A clock valuation is a function η : X → R≥0 assigning nonnegative real numbers to each clock such that η(x0 ) = 0. We denote the set of all clock valuations by V al(X). Let g ∈ G(X, P ) and v be a parameter valuation and η be a clock valuation. Then g[v, η] denotes a boolean value obtained from g by replacing each parameter p with v(p) and each clock x with η(x). A pair (v, η) satisfies a guard g, denoted by (v, η) |= g, if g[v, η] evaluates to true. A semantics of a guard g, denoted by JgK, is a set of valuation pairs (v, η) such that (v, η) |= g. For a given parameter valuation v we write JgKv for the set of clock valuations {η | (v, η) |= g}. We define two operations on clock valuations. Let η be a clock valuation, d a non-negative real number and R ⊆ X a set of clocks. We use η + d to denote the clock valuation that adds the delay d to each clock, i.e. (η + d)(x) = η(x) + d for all x ∈ X \ {x0 }. We further use η[R] to denote the clock valuation that resets clocks from the set R, i.e. η[R](x) = 0 if x ∈ R, η[R](x) = η(x) otherwise. We can now proceed with the definition of a parametric timed automaton and its semantics. Definition 2.1 (PTA). A parametric timed automaton (PTA) is a tuple M = (L, l0 , X, P, ∆, Inv ) where – – – – – – 30 L is a finite set of locations, l0 ∈ L is an initial location, X is a finite set of clocks, P is a finite set of parameters, ∆ ⊆ L × G(X, P ) × 2X × L is a finite transition relation, and Inv : L → G(X, P ) is an invariant function. LTL model checking of Parametric Timed Automata g,R We use q −−→∆ q 0 to denote (q, g, R, q 0 ) ∈ ∆. The semantics of a PTA is given as a labelled transition system. A labelled transition system (LTS) over a set of symbols Σ is a triple (S, s0 , →), where S is a set of states, s0 ∈ S is an initial a state and → ⊆ S × Σ × S is a transition relation. We use s − → s0 to denote (s, a, s0 ) ∈ →. Definition 2.2 (PTA semantics). Let M = (L, l0 , X, P, ∆, Inv ) be a PTA and v be a parameter valuation. The semantics of M under v, denoted by JM Kv , is an LTS (SM , s0 , →) over the set of symbols {act} ∪ R≥0 , where – SM = L × V allb,ub (X) is a set of all states, – s0 = (l0 , 0), where 0 is a clock valuation with 0(x) = 0 for all x, and – the transition relation → is specified for all (q, η), (q 0 , η 0 ) ∈ S such that η is a clock valuation as follows: d • (l, η) − → (l0 , η 0 ) if l = l0 , d ∈ R≥0 , η 0 = η + d, and (v, η 0 ) |= Inv (l0 ), g,R act • (l, η) −−→ (l0 , η 0 ) if ∃g, R : l −−→∆ l0 , (v, η) |= g, η 0 = η[R], and (v, η 0 ) |= Inv (l0 ). The transitions of the first kind are called delay transitions, the latter are called action transitions. act act We write s1 −−→d s2 if there exists s0 ∈ SM and d ∈ R≥0 such that s1 −→ d s0 −→ s2 . 
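The two operations on clock valuations and the satisfaction of simple guards can be summarised by the following C++ sketch; the types and names are ours and serve only to make the notation η + d, η[R] and (v, η) |= g concrete. The guard bounds are assumed to be already evaluated under a fixed parameter valuation.

#include <set>
#include <vector>

// A clock valuation; index 0 is the special zero clock x0, which always stays 0.
using ClockVal = std::vector<double>;

// η + d : delay every clock except x0 by d >= 0.
ClockVal delay(ClockVal eta, double d) {
    for (size_t i = 1; i < eta.size(); ++i) eta[i] += d;
    return eta;
}

// η[R] : reset the clocks in R to 0, keep the others.
ClockVal reset(ClockVal eta, const std::set<size_t>& R) {
    for (size_t x : R) eta[x] = 0.0;
    return eta;
}

// One conjunct of a guard: x_i - x_j ≺ e, where e has already been evaluated
// under the parameter valuation v (i.e. e = e[v]); strict == true means '<'.
struct GuardAtom { size_t i, j; long e; bool strict; };

// (v, η) |= g for a conjunction of such atoms.
bool satisfies(const ClockVal& eta, const std::vector<GuardAtom>& guard) {
    for (const auto& g : guard) {
        double diff = eta[g.i] - eta[g.j];
        if (g.strict ? !(diff < g.e) : !(diff <= g.e)) return false;
    }
    return true;
}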
A proper run π of JM Kv is an infinite alternating sequence of delay d 0 and action transitions that begins with a delay transition π = (l0 , η0 ) −→ (l0 , η0 + act d 1 d0 ) −−→ (l1 , η1 ) −→ · · · . A proper run is called a Zeno run if the sum of all its delays is finite. For the rest of the paper, we assume that we only deal with a deadlockfree PTA, i.e. that for each considered parameter valuation v there is no state without a reachable action transition in JAKv . We deal with Zeno runs later. Let M be a PTA, L : L → 2Ap be a labelling function that assigns a set of atomic propositions to each location of M , v be a parameter valuation, and ϕ be an LTL formula. We say that M under v with L satisfies ϕ, denoted by (M, v, L) |= ϕ if for all proper runs π of JM Kv , π satisfies ϕ where atomic prepositions are determined by L. Unfortunately, it is known that the parameter synthesis problem for a PTA is undecidable even for very simple (reachability) properties [5]. Instead of solving the general problem, we thus focus on a more constrained version. We may now state our main problem. Problem 2.3 (The bounded integer parameter synthesis problem). Given a parametric timed automaton M , a labelling function L, an LTL property ϕ, a lower bound function lb and an upper bound function ub, the problem is to compute the set of all parameter valuations v such that (M, v, L) |= ϕ and lb(p) ≤ v(p) ≤ ub(p). 31 P. Bezděk, N. Beneš, J. Barnat, and I. Černá Problem 2.3 is trivially decidable using a region abstraction and parameter scan approach. Unfortunately, the size of the region-based abstraction grows exponentially with the number of clocks and the largest integer number used. As a result, the region-based abstraction is difficult to be used in practice for an analysis of more than academic toy examples, even though it has its theoretical value. Unlike the region-based abstraction, a single state in a zone-based abstraction is no longer restricted to represent only those clock values that are between two consecutive integers. Therefore, the zone-based abstraction is much coarser and the number of zones reachable from the initial state is significantly smaller. In order to avoid the necessity of an explicit enumeration of all parameter valuations we use the zone-based abstraction together with the symbolic representation of parameter valuation sets. Our algorithmic framework which solves Problem 2.3 consists of three steps. As the first step, we extend the standard automata-based LTL model checking of timed automata [8] to parametric timed automata. We employ this approach in the following way. From a PTA M and an LTL formula ϕ we produce a product parametric timed Büchi automaton (PTBA) A. The accepting runs of the automaton A correspond to the runs of M violating the formula ϕ (analogously as in the case of timed automata). As the second step, we employ a symbolic semantics of a PTBA A with a suitable extrapolation. From the symbolic state space of a PTBA A we finally produce a Büchi automaton B. As the last step, we need to detect all parameter valuations such that there exists an accepting run in Büchi automaton B. This is done using our Cumulative NDFS algorithm. Now, we proceed with the definitions of a Büchi automaton, a parametric timed Büchi automaton and its semantics. Definition 2.4 (BA). 
A Büchi automaton (BA) is a tuple B = (Q, q0 , Σ, →, F ), where – – – – – Q is a finite set of states, q0 ∈ Q is an initial state, Σ is a finite set of symbols, →⊆ Q × Σ × Q is a set of transitions, and F ⊆ Q is the set of accepting states (acceptance condition). An ω-word w = a0 a1 a2 . . . ∈ Σ ω is accepting if there is an infinite sequence of ai states q0 q1 q2 . . . such that qi −→ qi+1 for all i ∈ N, and there exist infinitely many i ∈ N such that qi ∈ F . Definition 2.5 (PTBA). A parametric timed Büchi automaton (PTBA) is a pair A = (M, F ) where – M = (L, l0 , X, P, ∆, Inv ) is a PTA, and – F ⊆ L is a set of accepting locations. 32 LTL model checking of Parametric Timed Automata Zeno runs represent non-realistic behaviours and it is desirable to ignore them in analysis. Therefore, we are interested only in non-Zeno accepting runs of a PTBA. There is a well-known transformation to the strongly non-Zeno form [10] of a PTBA, which guarantees that each accepting run is non-Zeno. For the rest, we assume that we have the strongly non-Zeno form of a PTBA, as introduced in [10]. Definition 2.6 (PTBA semantics). Let A = (M, F ) be a PTBA and v be a parameter valuation. The semantics of A under v, denoted by JAKv , is defined as JM Kv = (SM , s0 , →). We say a state s = (l, η) ∈ SM is accepting if l ∈ F . A proper run π = d act d act 1 0 s01 −→ . . . of JAKv is accepting if there exists an infinite s00 −→ s1 −→ s0 −→ set of indices i such that si is accepting. 3 Symbolic Semantics A constraint is an inequality of the form e ∼ e0 where e, e0 ∈ E and ∼ ∈ {>, ≥ , ≤, <}. We define c[v] as a boolean value obtained by replacing each p in c by v(p). A valuation v satisfies a constraint c, denoted v |= c, if c[v] evaluates to true. The semantics of a constraint c, denoted JcK, is the set of all valuations that satisfy c. A finite set of constraints C is called a constraint set. A valuation satisfies a constrain set CTif it satisfies each c ∈ C. The semantics of a constraint set C is given by JCK = c∈C JcK. A constraint set C is satisfiable if JCK 6= ∅. A constraint c covers a constraint set C, denoted C |= c, exactly when JCK ⊆ JcK. As in [6], we identify the relation symbol ≤ with a boolean value true and < with a boolean value false. Then, we treat boolean connectives on relation symbols ≤, < as operations with boolean values. For example, (≤ =⇒ <) = <. Now, we define a parametric difference bound matrix, a constrained parametric difference bound matrix, several operations on them, and a PTBA symbolic semantics. These definitions are introduced in detail in [6]. Definition 3.1. A parametric difference bound matrix (PDBM) over P and X is a set D which contains for all 0 ≤ i, j ≤ |X| a guard of the form xi −xj ≺ij eij where xi , xj ∈ X and eij ∈ E(P ) ∪ {∞} and i = j =⇒ eii = 0. We denote by Dij a guard of the form xi − xj ≺ij eij contained V in D . Given a parameter valuation v, the semantics of D is given by JDKv = J i,j Dij Kv . A PDBM D is satisfiable with respect to v if JDKv is non-empty. If f is a guard of the form xi − xj ≺ e with i 6= j (a proper guard), then D[f ] denotes the PDBM obtained from D by replacing Dij with f . We denote by PDBMS (P, X) the set of all PDBM over parameters P and clocks X. Definition 3.2. A constrained parametric difference bound matrix (CPDBM) is a pair (C, D), where C is a constraint set and D is a PDBM and for each 0 ≤ i ≤ |X| it holds that C |= e0i ≥ 0. The semantics of (C, D) is given by JC, DK = {(v, η) | v ∈ JCK ∧ η ∈ JDKv }. 
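A minimal C++ sketch of one possible in-memory representation of PDBMs and CPDBMs follows; it only mirrors Definitions 3.1 and 3.2 (a matrix of bounds (eij, ≺ij) plus a constraint set) and is not the data structure of any particular tool.

#include <optional>
#include <vector>

// Minimal affine expression z0 + z1*p1 + ... + zn*pn over the parameters.
struct AffineExpr { long z0; std::vector<long> coeff; };

// One entry D_ij of a PDBM: the guard x_i - x_j ≺_ij e_ij.
// An empty expression stands for the bound ∞ (no constraint).
struct Bound {
    std::optional<AffineExpr> e;  // std::nullopt encodes e_ij = ∞
    bool strict = false;          // true: '<', false: '≤'
};

// A PDBM over clocks x0..x_{|X|}: a (|X|+1) x (|X|+1) matrix of bounds, with
// D_ii = (0, ≤) on the diagonal as required by Definition 3.1.
struct PDBM {
    std::vector<std::vector<Bound>> d;
    explicit PDBM(size_t clocks)
        : d(clocks + 1, std::vector<Bound>(clocks + 1, Bound{std::nullopt, false})) {
        for (size_t i = 0; i <= clocks; ++i)
            d[i][i] = Bound{AffineExpr{0, {}}, false};  // x_i - x_i ≤ 0
    }
};

// A constraint e ∼ e' on parameters, and a constrained PDBM (C, D).
enum class Rel { LT, LE, GE, GT };                       // <, ≤, ≥, >
struct Constraint { AffineExpr lhs; Rel rel; AffineExpr rhs; };
struct CPDBM { std::vector<Constraint> C; PDBM D; };      // constraint set plus matrix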
We call (C, D) satisfiable if JC, DK is non-empty. We denote by CPDBMS the set of all CPDBM. A CPDBM (C, D) is in the canonical form iff for all i, j, k, C |= eij (≺ik ∧ ≺kj )eik + ekj . 33 P. Bezděk, N. Beneš, J. Barnat, and I. Černá Definition 3.3 (Applying a guard). Suppose g is a simple guard of the form xi − xj ≺ e. Suppose (C, D) is a constrained PDBM in the canonical form and Dij = (eij , ≺ij ). The application of a guard g on (C, D) generally results in a set of constrained PDBMs and is defined as follows: {(C, D[g])} if C |= ¬eij (≺ij =⇒ ≺)e, {(C, D)} if C |= eij (≺ij =⇒ ≺)e, (C, D)[g] = {(C ∪ {e (≺ =⇒ ≺)e}, D), else, ij ij (C ∪ {¬eij (≺ij =⇒ ≺)e}, D[g])} where D[g] is defined as follows: ( (e, ≺) D[g]kl = Dkl if k = i and l = j, else. We can generalise this definition to conjunctions of simple guards as follows: def D[gi0 ∧ gi1 ∧ . . . ∧ gik ] ⇔ D[gi0 ][gi1 ] . . . [gik ]. Definition 3.4 (Resetting a clock). Suppose D is a PDBM in the canonical form. D with a reset clock xr , denoted as D[xr ], represents a PDBM D after resetting the clock xr and is defined as follows: D0j if i 6= j and i = r, D[xr ]ij = Di0 if i 6= j and j = r, Dij else. We can generalise this definition to reset of a set of clocks as follows: def D[xi0 , xi1 , . . . , xik ] ⇔ D[xi0 ][xi1 ] . . . [xik ]. Definition 3.5 (Time successors). Suppose D is a PDBM in the canonical form. The time successor of D, denoted as D↑ , represents a PDBM with all upper bounds on clocks removed and is defined as follows: ( (∞, <) if i 6= 0 and j = 0, ↑ Dij = Dij else. It follows from the definition that the reset and time successor operations preserve the canonicity. After an application of a guard the canonical form needs to be computed. To compute the canonical form of the given CPDBM we need to derive the tightest constraint on each clock difference. Deriving the tightest constraint on a clock difference can be seen as finding the shortest path in the graph interpretation of the CPDBM [6,11]. The canonisation operation is usually implemented using extended Floyd-Warshall algorithm where on each relaxation a split action on the constraint set can occur. Therefore, the result of the canonisation is a set containing constrained parametric difference bound matrices in the canonical form. 34 LTL model checking of Parametric Timed Automata Definition 3.6 (Canonisation). First, we define a relation −→F W on constrained parametric bound matrices as follows, for all 0 ≤ k, i, j ≤ |X| + 1 – (k, i, j, C1 , D1 ) −→F W (k, i, j + 1, C2 , D2 ) if (C2 , D2 ) ∈ (C1 , D1 )[xi − xj (≺ik ∧ ≺kj )eik + ekj ] – (k, i, |X| + 1, C1 , D1 ) −→F W (k, i + 1, 0, C2 , D2 ) if (C2 , D2 ) ∈ (C1 , D1 )[xi − xj (≺ik ∧ ≺kj )eik + ekj ] – (k, |X| + 1, 0, C1 , D1 ) −→F W (k + 1, 0, 0, C2 , D2 ) if (C2 , D2 ) ∈ (C1 , D1 )[xi − xj (≺ik ∧ ≺kj )eik + ekj ] The relation −→F W can be seen as a representation of the computation steps of the extended nondeterministic Floyd-Warshall algorithm. Now, suppose (C, D) is a CPDBM. The canonical form of (C, D), denoted as (C, D)c , represents a set of CPDBMs with a tightest constraint on each clock difference in D and is defined as follows. (C, D)c = {(C 0 , D0 ) | (0, 0, 0, C, D) −→∗F W (|X| + 1, 0, 0, C 0 , D0 )} Definition 3.7 (PTBA symbolic semantics). Let A = ((L, l0 , X, P, ∆, Inv ), F ) be a PTBA. Let lb and ub be a lower bound function and an upper bound function on parameters. 
The symbolic semantics of A with respect to lb and ub is a transition system (SA , Sinit , =⇒), denoted as JAKlb,ub , where – SA = L × {JC, DK | (C, D) ∈ CP DBM S} is the set of all symbolic states, – the set of initial states S0 is defined as {(l0 , JC, DK) | (C, D) ∈ (∅, E ↑ )[Inv (l0 )]}, where • E is a PDBM with E i,j = (0, ≤) for each i, j, and • for each p ∈ P , the constraints p ≥ lb(p) and p ≤ ub(p) are in C. – There is a transition (l, JC, DK) =⇒ (l0 , JCc0 , Dc0 K) if g,R • l −→∆ l0 and • (C 00 , D00 ) ∈ (C, D)[g] and • (Cc00 , Dc00 ) ∈ (C 00 , D00 )c and • (C 0 , D0 ) ∈ (Cc00 , Dc00 [R]↑ )[Inv(l0 )] and • (Cc0 , Dc0 ) ∈ (C 0 , D0 )c . A symbolic state is represented by a tuple (l, JC, DK) where l is a location, (C, D) is a CPDBM. We say a state S = (l, JC, DK) ∈ SA is accepting if l ∈ F . We say π = S0 =⇒ S1 =⇒ . . . is a run of JAK if S0 ∈ Sinit and for each i Si ∈ SA and Si−1 =⇒ Si . A run respects a parameter valuation v if for each state Si = (li , JCi , Di K) it holds that v ∈ JCi K. A run π is accepting if there exists an infinite set of indices i such that Si is accepting. For the rest of the paper we fix lb, ub and we use JAK to denote JAKlb,ub . The transition system JAK may be infinite. In order to obtain a finite transition system we need to apply a finite abstraction over JAK. Definition 3.8 (Time-abstracting simulation). Given an LTS (S, s0 , →), a time-abstracting simulation R over S is a binary relation satisfying following conditions: 35 P. Bezděk, N. Beneš, J. Barnat, and I. Černá act act – s1 Rs2 and s1 → s01 implies the existence of s2 → s02 such that s01 Rs02 , and d – s1 Rs2 and d1 ∈ R≥0 and s1 →1 s01 implies the existence of d2 ∈ R≥0 and d2 s2 → s02 such that s01 Rs02 . We define the largest simulation relation over S (4S ) in the following way: s 4S s0 if there exists a time-abstracting simulation R and (s, s0 ) ∈ R. When S is clear from the context we shall only use 4 instead of 4S in the following. In the following definition, for a parameter valuation v, a concrete state s1 = (l1 , η) from JAKv , and a symbolic state S2 = (l2 , JC, DK) from JAK we write s1 ∈v S2 if l1 = l2 , v ∈ C, and η ∈ JDKv . Definition 3.9 (PTBA abstract symbolic semantics). Let A = (M, F ) be a PTBA. An abstraction over JAK = (SA , Sinit , =⇒) is a mapping α : SA → 2SA such that the following conditions hold: – (l0 , JC 0 , D0 K) ∈ α((l, JC, DK)) implies l = l0 ∧ JC 0 K ⊆ JCK ∧ JC 0 , DK ⊆ JC 0 , D0 K, – for each v ∈ JCK there exists S1 , S2 such that S2 = (l, JC 0 , D0 K) ∈ α(S1 ) and for each s ∈v S2 there exists a state s0 ∈v S1 satisfying s 4 s0 . An abstraction α is called finite if its image is finite. An abstraction α over JAK induces a new transition system denoted as JAKα = (QA , Qinit , =⇒α ) where – QA = {S | S ∈ α(S 0 ) and S 0 ∈ SA }, – Qinit = {S | S ∈ α(S 0 ) and S 0 ∈ Sinit }, and – Q =⇒α Q0 if there is S ∈ SA such that Q0 ∈ α(S) and Q =⇒ S. An accepting state, a run and an accepting run are defined analogously as in the JAK case. If the α is finite the JAKα can be viewed as a Büchi automaton. Now, we define a parametric extension of the well known k-extrapolation [12]. Definition 3.10. Let A be a PTBA, (l, JC, DK) be a symbolic state of JAK and Dij = xi −xj ≺ij eij for each 0 ≤ i, j ≤ |X|. 
We define the kp-extrapolation αkp V in the following way: (l, JC 0 , D0 K) ∈ αkp ((l, JC, DK)) if C 0 = C ∧ 0≤i,j≤|X| c0ij and for each 0 ≤ i, j ≤ |X|: – – – – it it it it holds holds holds holds that that that that 0 Dij 0 Dij 0 Dij 0 Dij = xi − xj = xi − xj = xi − xj = xi − xj ≺ij eij and c0ij = eij ≤ M (xi ) or < ∞ and c0ij = eij > M (xi ) or ≺ij eij and c0ij = eij ≥ −M (xj ) or < −M (xj ) and c0ij = eij < −M (xj ), where M (x) is the maximum value in {maxlb,ub (e) | e is compared with x in A}. Lemma 3.11. Let A be a PTBA. The kp-extrapolation is a finite abstraction over JAK = (SA , Sinit , =⇒). 36 LTL model checking of Parametric Timed Automata Proof. First, we prove that the kp-extrapolation is an abstraction. It is easy to see that the kp-extrapolation satisfies the first condition (l0 , JC 0 , D0 K) ∈ α((l, JC, DK)) implies l = l0 ∧ JC 0 K ⊆ JCK ∧ JC 0 , DK ⊆ JC 0 , D0 K. The validity of the second condition follows from the following observation. For each v ∈ JCK and each η 0 ∈ JD0 Kv there exists η ∈ JDKv such that for each clock x and each guard g the following implication holds: η 0 (x) |= g =⇒ η(x) |= g. Now, we need to show that the kp-extrapolation is finite. From the definition we have the fact that the number of locations is finite and the number of sets of bounded parameter valuations is finite. We need to show that there are only finitely many sets JC, DK when the kp-extrapolation is applied. This follows from the fact that the kp-extrapolation allows values either from the finite range < −M (xi ), M (xi ) > or the value ∞. t u Theorem 3.12. Let A be a PTBA and α be a finite abstraction. For each parameter valuation v the following holds: there exists an accepting run of JAKv if and only if there exists an accepting run respecting v of JAKα . 4 Parameter Synthesis Algorithm We recall that our main objective is to find all parameter valuations for which the parametric timed automaton satisfies its specification. In the previous sections we have described the standard automata-based method employed under a parametric setup which produces a Büchi automaton. For the rest of this section we denote for each state s = (l, JC, DK) of the Büchi automaton on the input the set of valuations JCK as s.JCK. We say that a sequence of states s1 =⇒ s2 =⇒ . . . =⇒ sn =⇒ s1 is a cycle under the parameter valuation v if each state si in the sequence satisfies v ∈ si .JCK. A cycle is called accepting if there exists 0 ≤ i ≤ n such that si is accepting. Contrary to the standard LTL model checking, it is not enough to check the emptiness of the produced Büchi automaton. Our objective is to check the emptiness of the produced Büchi automaton for each considered parameter valuation. We introduce the Cumulative NDFS algorithm as an extension of the well-known NDFS algorithm. Our modification is based on the set F ound which accumulates all detected parametric valuations such that an accepting cycle under these valuations was found. Contrary to the NDFS algorithm, whenever Cumulative NDFS detects an accepting cycle, parameter valuations are saved to the set F ound and the computation continues with a search for another accepting cycle. Note the fact that whenever we reach a state s0 with s0 .JCK ⊆ F ound we already have found an accepting cycle under all valuations from s0 .JCK and there is no need to continue with the search from s0 . Therefore, we are able to speed up the computation whenever we reach such a state. Now, we mention the crucial property of monotonicity. 
The set of parameter valuations s.JCK cannot grow along any run of the input automaton. Lemma 4.1 states this observation, which follows from the definition of successors in JAKα and the definition of operations on CPDBMs. A clear consequence of Lemma 4.1 is the fact that each state s on a cycle has the same set s.JCK.

Algorithm CumulativeNDFS(G)
    Found ← Stack ← Outer ← Inner ← ∅
    OuterDFS(s_init)
    return Accepted ← Found

Procedure OuterDFS(s)
    Stack ← Stack ∪ {s}
    Outer ← Outer ∪ {s}
    foreach s' such that s → s' do
        if s' ∉ Outer ∧ s' ∉ Stack ∧ s'.JCK ⊈ Found then
            OuterDFS(s')
    if s ∈ Accepting ∧ s.JCK ⊈ Found then
        InnerDFS(s)
    Stack ← Stack \ {s}
    return

Procedure InnerDFS(s)
    Inner ← Inner ∪ {s}
    foreach s' such that s → s' do
        if s' ∈ Stack then
            "Cycle detected"
            Found ← Found ∪ s'.JCK
            return
        if s' ∉ Inner ∧ s'.JCK ⊈ Found then
            InnerDFS(s')
    return

Algorithm 1: Cumulative NDFS

Lemma 4.1. Let A be a PTBA, α be an abstraction and s be a state in JAKα. For every state s' reachable from s it holds that s'.JCK ⊆ s.JCK.

Theorem 4.2. Let A be a PTBA and α an abstraction over JAK. A parameter valuation v is contained in the output of CumulativeNDFS(JAKα) if and only if there exists an accepting run respecting v in JAKα.

We recall that our objective was to synthesise the set of all parameter valuations such that the given parametric timed automaton satisfies the given LTL property. In order to compute this set we employed a zone-based semantics, an extrapolation technique and the Cumulative NDFS algorithm. We have shown how to compute all parameter valuations for which the given LTL formula is not satisfied. Now, as the last step in the solution to Problem 2.3, we need to complement the set Accepted. Thus, the solution to Problem 2.3 is the complement of the set Accepted, more precisely the set V allb,ub (X, P) \ Accepted. To conclude this section, we state that Theorem 4.2 together with Theorem 3.12 imply the correctness of our solution to Problem 2.3.

5 Conclusion and Future Work

We have presented a logical and algorithmic framework for the bounded integer parameter synthesis of parametric timed automata with an LTL specification. The proposed framework avoids the explicit enumeration of all possible parameter valuations. In this paper we have used the parametric extension of a difference bound matrix called a constrained parametric difference bound matrix. To be able to employ a zone-based method successfully, we introduced a finite abstraction called the kp-extrapolation. At the final stage of the parameter synthesis process, the cycle detection itself is performed by the introduced Cumulative NDFS algorithm, which is an extension of the well-known NDFS algorithm. As for future work, we plan to introduce different finite abstractions and compare their influence on the state space size. Another area that can be investigated is the employment of different linear specification logics, e.g. Clock-Aware LTL [13], which enables the use of clock-valuation constraints as atomic propositions.

References

1. Clarke, E., Grumberg, O., Peled, D.: Model Checking. MIT Press (1999)
2. Alur, R., Dill, D.L.: A Theory of Timed Automata. Theor. Comput. Sci. 126(2) (1994) 183–235
3. Daws, C., Tripakis, S.: Model checking of real-time reachability properties using abstractions.
In: Tools and Algorithms for the Construction and Analysis of Systems. Springer (1998) 313–329 4. Behrmann, G., David, A., Larsen, K.G., Möller, O., Pettersson, P., Yi, W.: Uppaal - present and future. In: Proc. of 40th IEEE Conference on Decision and Control, IEEE Computer Society Press (2001) 5. Alur, R., Henzinger, T.A., Vardi, M.Y.: Parametric real-time reasoning. In: Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, ACM (1993) 592–601 6. Hune, T., Romijn, J., Stoelinga, M., Vaandrager, F.: Linear parametric model checking of timed automata. The Journal of Logic and Algebraic Programming 52 (2002) 183–220 7. Jovanovic, A., Lime, D., Roux, O.H.: Synthesis of Bounded Integer Parameters for Parametric Timed Reachability Games. In: Automated Technology for Verification and Analysis (ATVA 2013). Volume 8172 of LNCS., Springer (2013) 87–101 8. Vardi, M., Wolper, P.: An automata-theoretic approach to automatic program verification (preliminary report). In: Proceedings, Symposium on Logic in Computer Science (LICS’86), IEEE Computer Society (1986) 332–344 9. Bezděk, P., Beneš, N., Barnat, J., Černá, I.: LTL Model Checking of Parametric Timed Automata. CoRR abs/1409.3696 (2014) 10. Tripakis, S., Yovine, S., Bouajjani, A.: Checking timed büchi automata emptiness efficiently. Formal Methods in System Design 26(3) (2005) 267–292 11. Dill, D.L.: Timing assumptions and verification of finite-state concurrent systems. In: Automatic verification methods for finite state systems, Springer (1990) 197– 212 12. Bouyer, P.: Forward analysis of updatable timed automata. Formal Methods in System Design 24(3) (2004) 281–320 13. Bezděk, P., Beneš, N., Havel, V., Barnat, J., Černá, I.: On Clock-Aware LTL properties of Timed Automata. In: International Colloquium on Theoretical Aspects of Computing (ICTAC). Volume 8687 of LNCS., Springer (2014) 39 FPGA Accelerated Change-Point Detection Method for 100 Gb/s Networks Tomáš Čejka1 , Lukáš Kekely1 , Pavel Benáček2, Rudolf B. Blažek2, and Hana Kubátová2 1 CESNET a. l. e. Zikova 4, Prague, CZ cejkat,[email protected] 2 CTU in Prague – FIT Thakurova 9, Prague, CZ benacekp,rblazek,[email protected] Abstract. The aim of this paper is a hardware realization of a statistical anomaly detection method as a part of high-speed monitoring probe for computer networks. The sequential Non-Parametric Cumulative Sum (NP-CUSUM) procedure is the detection method of our choice and we use an FPGA based accelerator card as the target platform. For rapid detection algorithm development, a high-level synthesis (HLS) approach is applied. Furthermore, we combine HLS with the usage of Software Defined Monitoring (SDM) framework on the monitoring probe, which enables easy deployment of various hardware-accelerated monitoring applications into high-speed networks. Our implementation of NP-CUSUM algorithm serves as hardware plug-in for SDM and realizes the detection of network attacks and anomalies directly in FPGA. Additionally, the parallel nature of the FPGA technology allows us to realize multiple different detections simultaneously without any losses in throughput. Our experimental results show the feasibility of HLS and SDM combination for effective realization of traffic analysis and anomaly detection in networks with speeds up to 100 Gb/s. 1 Introduction Computer networks are getting larger and faster, and hence the volume of data captured by network monitoring systems increases. 
Therefore, there is a need to analyze more data for detection of network attacks and traffic anomalies. This paper deals with real-time detection of attacks suitable for high-speed computer networks thanks to the direct deployment of detection methods in hardware monitoring probe. Today, monitoring systems usually consist of several probes that capture and preprocess huge amounts of network traffic at wire speed, and one or more collector servers that collect and store network traffic information from these probes. Analysis of network data is traditionally also realized at the collectors. In this 40 FPGA Accelerated Change-Point Detection Method for 100 Gb/s Networks paper, we propose a different approach, where anomaly detection is shifted directly into the monitoring probes. The aim of this approach is to enable real-time analysis even in very large networks with speeds up to 100 Gb/s per Ethernet port and to reduce the latency of anomaly detections. It is virtually impossible to process all network data from the 100 Gb/s link in software using only commodity hardware. The main limitations lay in insufficient bandwidth of communication paths between the network interface card and the software components [1] and in limited performance of the processors. Therefore, hardware acceleration must be used for high-speed networks in order to avoid transferring and processing of all the data in the software. In this paper, we utilize a special network interface card mounted with FPGA chip for hardware acceleration of network traffic processing as a basis for our high-speed probe. The FPGA on the card allows us to realize more advanced data processing features (e.g. anomaly detection methods that use packet level statistics) directly on the card, thus reducing the data load for the software. To demonstrate this approach, we concentrate on a real-time sequential ChangePoint Detection (CPD) method that is designed to minimize the average detection delay (ADD) for a prescribed false alarm rate (FAR) [2,3]. As the basis for the FPGA firmware, we use Software Defined Monitoring (SDM). SDM is a novel monitoring approach proposed in [4], that can be used as a framework for hardware acceleration of various monitoring methods. SDM combines hardware and software modules into a tightly bound co-design that is able to address challenges of monitoring from data link to application layer of the ISO/OSI model in modern network environments at the speeds up to 100 Gb/s. The main contribution of this paper is the evaluation of a statistical realtime detection methods implemented in hardware. The detection methods are extensions of a hardware accelerated monitoring probe designed for 40 Gb/s and 100 Gb/s Ethernet lines. The resulting device is able to analyze unsampled highspeed network traffic without loss. The rest of this paper is organized in the following way. Introduction to the implemented sequential non-parametric change-point detection method (NPCUSUM) can be found in Sec. 2. The used SDM concept is briefly described in Sec. 3. Sec. 4 describes created hardware implementation of detection method. Evaluation of the developed system and the achieved results are presented in Sec. 5. Related work and main differences between existing projects and our implementation are presented in Sec. 6. Sec. 7 summarizes the results presented in this paper and outlines our future work. 2 Change-Point Detection Network attacks, intrusions, or anomalies appear usually at unpredictable points in time. 
The start of an attack is mostly observable as a change of some statistical properties of the network traffic or its specific part. Therefore, methods based on sequential Change-Point Detection theory are suitable for intrusion detection. CPD methods detect the point in time where the distribution of some perpetually observed variables changes. In network security settings, these variables correspond to some relevant, directly observed or calculated network traffic characteristics. The main problem of such an approach is the lack of precise knowledge about the statistical distributions of these traffic characteristics. Ideally, the distributions should be known both before and after the distribution change that corresponds to the anomaly or attack. Therefore, we use a non-parametric CPD method, NP-CUSUM, that was developed in [2,3] and that does not require precise knowledge about these statistical distributions.

NP-CUSUM is inspired by Page's CUSUM algorithm, which is proven to be optimal for detection of a change in the mean (expectation) when the distributions of the observed random variables are known before and after the change [5]. The typical optimality criterion in CPD is to minimize the average detection delay (ADD) among all algorithms whose average false alert rate (FAR) is below a prescribed low level. Page's CUSUM procedure, which is based on the log-likelihood ratio, can for i.i.d. (independent and identically distributed) random variables Xn be rewritten [5] as:

    Un = max{0, Un−1 + log(p1(Xn) / p0(Xn))},    U0 = 0,    (1)

where p0 and p1 are the densities of Xn before and after the change, respectively. The formulation in (1) is the inspiration for the NP-CUSUM method procedure [2,3]. The procedure is applicable to non-i.i.d. data with unknown distributions (i.e. the method is non-parametric). First, Page's CUSUM procedure was generalized as Sn = max{0, Sn−1 + f(Xn)} with some function f. Changes in the mean value of Xn can then be detected using the sequential statistic:

    Sn = max{0, Sn−1 + Xn − µ̂ − εθ̂},    S0 = 0,    (2)

where µ̂ is an estimate of the mean of Xn before the attack, θ̂ is an estimate of the mean after the attack started, and ε is a tuning parameter for optimization. It has been shown in [3] that with an optimal value of ε the NP-CUSUM procedure (2) is asymptotically optimal as FAR decreases. That is, for a small prescribed rate of false alarms, other procedures will have longer detection delays. In fact, the delays can theoretically be exponentially worse [3].

As the input of the NP-CUSUM algorithm, we can use various features Xn of the observed network traffic. As a basic evaluation of our hardware implementation of the method, we have chosen for Xn the ratio of SYN and FIN packets of the Transmission Control Protocol (TCP) in a short time window [6]. During "normal" operation of the network, each connection is opened using two SYN packets and closed using two FIN packets (one in each direction). Therefore, we expect the ratio of SYN and FIN packets to be on average close to 1, or at least constant. A sudden and consistent change of the ratio is suspicious and can be caused by some sort of attack (e.g. a SYN or FIN packet flood) [6]. To demonstrate the scalability and power of our hardware implementation using SDM, we raise the number of observed statistics and add more NP-CUSUM blocks in parallel. The added statistics utilize information about ICMP and RST TCP packets.
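A minimal software sketch of the NP-CUSUM update (2), with the SYN/FIN ratio as the observed feature Xn, is shown below; the parameter values, the window handling and the alert threshold are illustrative only and do not mirror the hardware instruction block.

#include <algorithm>

// NP-CUSUM detector for one observed feature (here: the SYN/FIN ratio per window).
// mu_hat:    estimated pre-change mean of X_n
// theta_hat: estimated post-change mean of X_n
// eps:       tuning parameter from (2)
// threshold: alert when S_n exceeds it (an assumed, illustrative alerting rule)
struct NpCusum {
    double mu_hat, theta_hat, eps, threshold;
    double s = 0.0;  // S_{n-1}, with S_0 = 0

    // One update step S_n = max{0, S_{n-1} + X_n - mu_hat - eps*theta_hat};
    // returns true when the statistic crosses the alert threshold.
    bool update(double x_n) {
        s = std::max(0.0, s + x_n - mu_hat - eps * theta_hat);
        return s > threshold;
    }
};

int main() {
    NpCusum syn_fin{/*mu_hat=*/1.0, /*theta_hat=*/3.0, /*eps=*/0.5, /*threshold=*/20.0};
    // In each time window, feed the detector the SYN/FIN packet-count ratio.
    unsigned syn = 150, fin = 50;
    double ratio = fin ? static_cast<double>(syn) / fin : static_cast<double>(syn);
    bool alert = syn_fin.update(ratio);
    return alert ? 1 : 0;
}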
All measured values are used in the form of ratios in order to avoid the dependency on trends and traffic volumes that could increase the number of false alerts. Finally, thanks to parallelism, the observation of multiple statistics simultaneously does not negatively affect the processing throughput.

3 Software Defined Monitoring System

Software Defined Monitoring (presented in [4,7]) forms the basis for our hardware implementation of detection methods in a monitoring probe. In this section we briefly describe the main architecture of the SDM system and the changes needed to accommodate the implementation of the NP-CUSUM monitoring system. An SDM system consists of two main parts: firmware for the FPGA on the hardware accelerator and software for general processors. The hardware and software components are connected via a PCI-Express bus. Both parts are tightly coupled together to allow precise software control of hardware processing.

The software part of the SDM system consists of monitoring applications and a controller. The monitoring applications can perform advanced monitoring tasks (such as analysis of application protocols) or export information (alerts) to the collector. The controller manages the hardware module by dynamically removing and inserting processing rules into its memory (see Fig. 1). The instructions contained in the rules tell the hardware what actions to perform for each input packet with given characteristics. These rules are defined by the monitoring applications, which insert them into the hardware via the controller. Due to the aforementioned facts, the monitoring application can not only use data coming from the hardware, but it can also manage the details of hardware processing of network traffic. The offloading of traffic processing into the hardware saves both the bandwidth of the communication interface (PCIe) and the CPU processing time. The hardware module can pass information to the software in the form of packet metadata from a single packet, or as aggregated records computed from multiple subsequent packets with common features (such as NetFlow [8] aggregation). Whole received packets or their parts can also be sent to the software for further (deeper) analysis. A graphical representation of the SDM concept is shown in Fig. 1.

Processing of an incoming network packet in the SDM hardware starts with the extraction of its protocol headers. The extracted data are used to search for an adequate rule in memory that specifies the desired processing, possibly supplemented by the address of a record. The selected rule and metadata for each given packet are then passed to the SDM Update block, which is the heart of the SDM concept. This block contains a routing table that is used to forward the incoming processing request to the appropriate update (instruction) blocks for execution. Each of these instruction blocks can perform a specific update operation (realize a specific aggregation type) on the record. Each update operation is delimited by two memory operations: reading the stored record values, and writing back the updated values. Also, new types of updates (aggregations) can be specified simply by implementing a new instruction block and plugging it into the existing Update block infrastructure.

Fig. 1. Software Defined Monitoring (SDM) abstract architecture
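The read–update–write pattern of an SDM instruction block described above can be modelled in software roughly as follows; this is a hypothetical C++ model for illustration only, not the actual SDM hardware or software interface.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical model of the read-update-write pattern of an SDM instruction block.
using Record = std::vector<uint32_t>;   // the per-rule record stored in memory

struct InstructionBlock {
    virtual ~InstructionBlock() = default;
    // Update the record in place using the metadata extracted from one packet;
    // return true if the record should be exported to the software.
    virtual bool update(Record& record, const std::vector<uint32_t>& metadata) = 0;
};

// Example instruction: count packets matching the rule in record[0].
struct PacketCounter : InstructionBlock {
    bool update(Record& record, const std::vector<uint32_t>&) override {
        if (record.empty()) record.resize(1, 0);
        ++record[0];
        return false;  // no export condition in this toy example
    }
};

// A rule maps a packet to a record address and the instruction to run on it.
struct Rule { uint32_t recordAddr; InstructionBlock* instruction; };

struct UpdateBlock {
    std::unordered_map<uint32_t, Record> memory;   // record storage
    void process(const Rule& rule, const std::vector<uint32_t>& metadata) {
        Record& rec = memory[rule.recordAddr];     // read (and lock) the record
        bool doExport = rule.instruction->update(rec, metadata);
        (void)doExport;  // the updated record is written back; on export it would
                         // also be passed to the software part of the SDM system
    }
};

int main() {
    PacketCounter counter;
    UpdateBlock update;
    update.process(Rule{0x10, &counter}, {/* extracted header fields */});
    return 0;
}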
A special type of processing action is an export into the software of the processed packet data, metadata, or stored values from a selected record, optionally followed by clearing of that record. Records can be exported when some special condition is met or in periodical manner. 4 4.1 Implementation The CPD hardware block Our hardware implementation of CPD method is realized as hardware plug-in for the SDM system. More precisely, it is available as a new instruction block for the SDM Update module that is described in the previous section. The SDM design supports access to arbitrary data records stored in memory for instruction blocks. Although, the available data size of a record is limited due to memory block size that can be read or written on each clock cycle – the block size is equal to 288 b. Usage of bigger data records than 288 b would cause unwanted latency increase and lower throughput of the whole monitoring hardware. One CPD instruction block uses available space in memory to store: previous historical value, 2 parameters of the NP-CUSUM algorithm, and 1 threshold value that is used for alerting purposes. Memory should also contain counters with observed features such as the number of SYN or FIN packets, and the packet counter that starts the ratio and NP-CUSUM computation. The data stored in memory is accessible from software and therefore all of the thresholds and parameters can be changed on the fly. The source code of the instruction block allows us to specify the data type size of all values stored in memory. The choice of data type sizes implies the number of hardware blocks that can work in parallel in the same clock cycle with the same memory block. However, the decrease of data type size lowers the value precision and data ranges. The NP-CUSUM parameters, the previous historical value and the threshold are represented as 16 b decimal numbers. The counters are set to 8 b. For one block that analyzes SYN/FIN ratio, the implementation 44 FPGA Accelerated Change-Point Detection Method for 100 Gb/s Networks works with 88 b of memory for one record in total. Configuration with 4 NPCUSUM blocks uses 5 counters (SYN, FIN, RST, ICMP, packet counter) and 4 sets of fixed-point values. In total, 4 NP-CUSUM blocks would use 296 b of memory. Therefore the size of decimal number data type was shortened to 15 b and the total used memory size was decreased to available 280 b. We use a high-level synthesis (HLS) approach [9], to implement the CPD method from Sec. 2 for the FPGA as an instruction block inside the SDM system. The structure of the implemented block is shown at Fig. 2. The main advantage of using HLS approach is faster implementation of new hardware accelerated monitoring and detection methods with minimal loss of efficiency in comparison to traditional coding of FPGA firmware using Hardware Description Languages (HDL) such as VHDL or Verilog. Following the requirements for the SDM instruction block interfaces and general behavior, we have developed the CPD hardware block in the C++ language. Implementation of the CPD hardware block brings a several issues to solve. The most important one is the choice of decimal numbers representation. We try two of the standard approaches: fixed-point and floating-point representation. The main advantage of the floating-point approach is the ability to represent a greater range of values. But on the other hand, hardware realization of floatingpoint arithmetic is very complicated and considerably slower. 
Therefore, fixed-point arithmetic is favored for its better performance and lower resource usage in the instruction block. From the HLS point of view, the most important parameter for our design goals is the achievable Initiation Interval (II). This parameter represents the number of clock cycles that must elapse before the instruction block can accept a new request. Ideally, we require the II to be equal to one, so that a new request can be accepted in each clock cycle and the instruction block achieves full throughput. During our experiments, we discovered the following effect of the decimal number representation on the II: the floating-point version of the instruction block has an II of 11 clock cycles, whereas the fixed-point version has an II of 1. Another very important performance-related parameter of our implementation is latency. It is required to be as small as possible, because records in the memory need to be locked to achieve atomic processing, so high latency leads to delays between repeated processing of the same record. In the end, our experimental timing and performance results indicate that the created implementation is able to handle network traffic on a 100 Gb/s Ethernet line. More detailed results regarding the synthesis and FPGA requirements are discussed in Sec. 5.

Apart from the creation of the CPD instruction block, another important part of the implementation is the connection of the new instruction block to the existing SDM Update block. Thanks to the designed-in extensibility of the SDM Update block, this task is simple and straightforward. All that needs to be done is to wrap the translated HLS implementation of the new block in a VHDL envelope that is responsible for adapting the behavior of all predefined interface signals. The wrapping process is depicted in Fig. 2.

Fig. 2. Implementation of the CPD instruction block (Read and reservation module, CPD module in C++, SDM interface adaptor, SDM output format generator, Write and release module)

The gray blocks in the figure are the parts of the SDM designated for connecting new instruction blocks. The SDM can thus be viewed as a framework for rapid creation of new hardware modules accelerating network monitoring. To finish the implementation of the Change-Point Detection method in the SDM system, a software monitoring application needs to be created. The application communicates with an SDM Controller daemon to manage the detection details in the hardware module (see Fig. 1), and it also receives detected alerts. The main task of the monitoring application is to control the detection process and present its results to human operators.

5 Evaluation

The correct functionality of the created implementation of the CPD block was verified using a reference software application. The reference application is written in plain C and is not meant to be highly optimized for HLS. Its main purpose is only to validate the functionality of the hardware implementation. In addition, the software application has been extended and serves as the base for the measuring and detection application [10] that can be used in slower networks, or for the estimation of configurable parameters for the CPD block.
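For illustration, a software model of one NP-CUSUM step in the spirit of our fixed-point instruction block might look as follows (a minimal C++ sketch; the Q.8 fixed-point split, the parameter names, and the exact recurrence form are our assumptions based on Sec. 2 and the record layout above, not the production code):

#include <algorithm>
#include <cstdint>

// Fixed-point decimal with 8 fractional bits; the paper does not give the
// exact split of its 15/16 b types, so Q.8 is an illustrative choice.
using fixp = int32_t;
constexpr fixp toFix(double v) { return static_cast<fixp>(v * 256.0); }

// State kept in one CPD record: previous historical value, two NP-CUSUM
// parameters, and the alerting threshold (cf. Sec. 4.1).
struct CusumState {
    fixp s;  // cumulative sum (the previous historical value)
    fixp a;  // parameter 1: expected mean of the ratio in normal traffic
    fixp c;  // parameter 2: tuning (drift) constant
    fixp h;  // alerting threshold
};

// One NP-CUSUM step over an observed ratio x (e.g. SYN/FIN packets),
// using a common form of the recurrence: s_n = max(0, s_{n-1} + x - a - c).
// An alert fires when the cumulative sum exceeds the threshold h.
inline bool cusumUpdate(CusumState& st, fixp x) {
    st.s = std::max<fixp>(0, st.s + x - st.a - st.c);
    return st.s > st.h;
}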
We have implemented the hardware-based prototype of the NP-CUSUM detection method as an instruction block for the SDM Update block in an SDM monitoring probe. The prototype is developed for a network interface card with a 100 Gb/s Ethernet port and a Virtex-7 H580T FPGA, which hosts the implemented detection functionality.

A detailed list of all FPGA resources needed for the implementation of one CPD instruction block, which observes one feature, is shown in Tab. 1. The table also contains results for other configurations of the CPD block that contain 1, 2, or 4 instances of the NP-CUSUM algorithm and observe more features in parallel. The total number of available resources on the used chip is 725 600 flip-flops (FF) and 362 800 look-up tables (LUT). The LUTs and FFs utilized by the CPD instruction block itself therefore account for less than 1 % of the available FPGA resources.

Table 1. FPGA resources used for the CPD instruction block in different configurations.

              1 block          2 blocks         4 blocks
Name          FF      LUTs     FF      LUTs     FF      LUTs
Expression    0       458      0       496      0       479
Instance      280     252      560     504      560     504
Multiplexer   -       1842     -       1868     -       2130
Register      2253    -        2377    -        2593    -
ShiftMemory   0       806      0       816      0       814
Total         2533    3358     2937    3684     3178    3982

Performance results for the CPD instruction blocks are shown in Tab. 2 and Tab. 3; Tab. 2 compares the fixed-point and floating-point variants, whereas Tab. 3 shows detailed information about the fixed-point implementation in different configurations. The Initiation Interval is required to be equal to one in order to support processing of 100 Gb/s network traffic at full wire speed (see Sec. 3). The only implementation that fails to satisfy this requirement is the floating-point one. Vivado HLS version 2013.2 was used for the high-level C to VHDL synthesis. Xilinx ISE version 14.7 with enabled synthesis optimization was used for the VHDL to FPGA netlist synthesis. Enabling optimizations such as register duplication leads to a higher clock frequency of the final implementation, but also to higher resource consumption. The tables illustrate that after the optimization, all performance requirements from Sec. 3 have been met by the fixed-point implementation.

Table 2. Comparison of timing results for the synthesized CPD instruction blocks.

Parameter             Reached          Reached          Required
                      Fixed-point      Floating-point
Clock period          4.08 ns          16.48 ns         5 ns
Frequency             245 MHz          60.679 MHz       200 MHz
Latency               12               11               -
Initiation Interval   1                12               1
Bus Width             512 b            512 b            512 b
Achieved Throughput   125 Gb/s         2.5 Gb/s         100 Gb/s

Table 3. Performance results for the CPD instruction blocks in different configurations.

Parameter             Reached     Reached     Reached     Required
                      1 block     2 blocks    4 blocks
Clock period          4.08 ns     4.20 ns     4.20 ns     5 ns
Frequency             245 MHz     238 MHz     238 MHz     200 MHz
Latency               12          12          12          -
Initiation Interval   1           1           1           1
Bus Width             512 b       512 b       512 b       512 b
Achieved Throughput   125 Gb/s    121 Gb/s    121 Gb/s    100 Gb/s

Finally, Tab. 4 shows the total number of FPGA resources required for the whole synthesized SDM system with one CPD hardware plug-in. The table shows that about 87 % of the Virtex-7 H580T resources are still available. Therefore, it is feasible to include several CPD hardware plug-ins in the SDM system for parallel detection of various anomalies, without a significant latency increase or throughput loss.

Table 4. FPGA resources of the SDM system with one CPD hardware plug-in (FPGA xc7vh580thcg1155-2).
Resource Name   Used Resources [-]   Utilization Percentage
LUTs            47731                13 %
Registers       21089                2 %
BRAMs           107                  11 %

6 Related Work

We present a brief overview of related work with regard to how it differs from ours. The section covers two main domains: the first is related to hardware-accelerated detectors, and the second to detection methods.

From the hardware point of view, there are two interesting projects similar to ours – Gorilla and Snabb Switch. The Gorilla project [11] is the closest comparable solution that we found. Gorilla is a methodology for generating FPGA-based solutions especially well-suited for data-parallel applications. The main goal of Gorilla is the same as our goal with SDM Update – to make the hardware design process easier and faster. Our solution is, however, specifically designed for the stateful processing of network packet data. Furthermore, SDM is able to work with the L2–L7 layers of the ISO/OSI model. In addition, the resource consumption of Gorilla is higher than that of our solution. The Snabb Switch project [12] represents a different approach to network packet processing. It uses modified drivers for faster transfer of network packets from the network interface card to the computer's memory; the transferred data are then processed by network applications. There is also a Snabb Lab with an accessible measurement platform, consisting of a Supermicro motherboard with dual Xeon-E5 processors and 20×10 GbE (Intel 82599ES) network cards. This configuration allows processing network traffic at a speed of 200 Gb/s. Large-scale use of this platform is complicated by the large number of network cards. Our solution is able to process network traffic at a speed of 100 Gb/s on one Ethernet line (2 ports allow achieving 200 Gb/s), and our work focuses on full hardware acceleration of network traffic processing using only one 100 Gb/s Ethernet port.

From the detection method point of view, various approaches to anomaly detection exist. Detection of SYN flood attacks has been studied and well described in many papers. However, the issue remains relevant because of the continuing growth of network traffic volumes. Detection based on NP-CUSUM is used in [13] by Wang et al., where the authors present their observations about SYN-FIN pairs in network traffic under normal conditions: (1) there is a strong positive correlation between the SYN and RST packets; (2) the difference between the number of SYN and FIN packets is close to the number of RST packets. The authors provide an experimental evaluation of flood detection using NP-CUSUM; however, they mention a possible disadvantage of aggregated packet counting, which can be spoofed by an attacker emitting mixed packet types. Siris et al. in [14] compare a straightforward adaptive threshold algorithm, which can give satisfactory performance for high-intensity attacks, with an algorithm based on the cumulative sum (CUSUM). The adaptive threshold algorithm uses the difference from a moving average value computed, e.g., by the EWMA algorithm. An alarm is signaled when the measured value exceeds the moving average in the last k consecutive intervals.
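A minimal sketch of such an adaptive threshold detector (the parameter names, the tolerance factor, and the exact update order are illustrative; see [14,15] for the original formulations):

#include <cstddef>

// EWMA-smoothed mean plus consecutive-interval voting, as described above.
class AdaptiveThreshold {
public:
    AdaptiveThreshold(double alpha, double factor, std::size_t k)
        : alpha_(alpha), factor_(factor), k_(k) {}

    // Feed one per-interval measurement (e.g. a SYN packet count);
    // returns true when the value exceeded the tolerated moving average
    // in k consecutive intervals.
    bool observe(double x) {
        bool over = x > factor_ * mean_;
        streak_ = over ? streak_ + 1 : 0;
        mean_ = alpha_ * mean_ + (1.0 - alpha_) * x;  // EWMA update
        return streak_ >= k_;
    }

private:
    double alpha_;         // EWMA weight of the history
    double factor_;        // tolerance above the moving average
    std::size_t k_;        // consecutive intervals required for an alarm
    double mean_ = 0.0;    // current EWMA estimate
    std::size_t streak_ = 0;
};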
The CUSUM variant of the detection algorithm is influenced by the seasonality and trends of network traffic (weekly and daily variations and time correlations). The authors propose to use a prediction method to remove this non-stationary behavior before applying the CUSUM algorithm. However, because such calculations are time-consuming and bring only minor gains, the authors used a simpler approach: applying CUSUM to the difference between the measured value and the result of the Exponentially Weighted Moving Average (EWMA) [15] algorithm. Smoothing of the data signal is important for minimizing the number of false alarms that can be caused by high peaks in the data. Therefore, the data are usually preprocessed to suppress short-time deviations and to detect long-time anomalies. There are various approaches to smoothing the signal; one possibility is to exploit a prediction method such as the moving average, EWMA, Holt-Winters [16], or Box-Jenkins (ARIMA) [17] methods. However, the dependency of an algorithm on historical and current measured values can be dangerous and can lead to overlooking an attack. Self-learning and self-adaptive approaches are studied in our current and future work; however, they are out of the scope of this paper.

Salem et al. presented the currently used methods of network anomaly detection in [18]. The paper evaluates the usage of an extended NP-CUSUM called Multi-chart NP-CUSUM, proposed by Tartakovsky et al. in [19], in combination with a Count-Min Sketch and a Multi-Layer Reversible Sketch (the sketching method is proposed, e.g., in [20]) for data aggregation and anomaly detection. Our paper is focused on the hardware implementation of the detection method, whereas other authors usually rely, more or less, on software processing of aggregated data. Our solution allows the detection method to run in real time, independently of a possibly overloaded software part of the system.

7 Conclusions

In this paper we present the implementation and evaluation of the CPD algorithm (NP-CUSUM) as a hardware plug-in for the Software Defined Monitoring system. We achieve easy and rapid development of detection hardware blocks for the FPGA thanks to the use of high-level synthesis. Also, the creation of a monitoring probe utilizing the newly implemented detection method is very simple and straightforward thanks to the use of SDM as the platform for high-speed packet processing. Moreover, we show the frequency and FPGA resource evaluation of the hardware implementation for the Virtex-7 H580T FPGA, which is large and fast enough to accommodate complex network processing. The results presented in this paper show that our implementation of NP-CUSUM is capable of processing network traffic at speeds of up to 100 Gb/s. The firmware of the whole monitoring probe consumes only 13 % of the available resources of the target FPGA and thus leaves space for several additional CPD (NP-CUSUM) hardware plug-ins that can be used for parallel detection of multiple kinds of network anomalies concurrently. In addition, other existing detection methods can potentially be implemented in a similar way – as hardware SDM plug-ins for the detection of abrupt changes in network traffic characteristics. The limiting factor for deploying detection hardware plug-ins into a monitoring probe is the consumption of FPGA resources. Generally, detection methods with low data storage requirements can be fully implemented as hardware plug-ins. Moreover, SDM allows the creation of a hardware-software co-design where only the most critical parts of a more complex detection algorithm are accelerated.
This partially hardware-accelerated approach can reduce the FPGA resource requirements of advanced detection methods at a moderate performance loss.

Acknowledgment

This work is partially supported by the "CESNET Large Infrastructure" project no. LM2010005 funded by the Ministry of Education, Youth and Sports of the Czech Republic, and by the project TA03010561 funded by the Technology Agency of the Czech Republic.

References

1. Santiago del Rio, P.M., Rossi, D., Gringoli, F., Nava, L., Salgarelli, L., Aracil, J.: Wire-speed statistical classification of network traffic on commodity hardware. In: Proceedings of the 2012 ACM Conference on Internet Measurement Conference. IMC '12, New York, NY, USA, ACM (2012) 65–72
2. Blažek, R.B., Kim, H., Rozovskii, B., Tartakovsky, A.: A novel approach to detection of "denial-of-service" attacks via adaptive sequential and batch-sequential change-point detection methods. In: Proc. 2nd IEEE Workshop on Systems, Man, and Cybernetics, West Point, NY (2001)
3. Tartakovsky, A.G., Rozovskii, B.L., Blažek, R., Kim, H.: A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal Processing 54(9) (2006) 3372–3382
4. Kekely, L., Puš, V., Kořenek, J.: Software defined monitoring of application protocols. In: INFOCOM 2014. The 33rd Annual IEEE International Conference on Computer Communications (2014) 1725–1733
5. Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2) (1954) 100–115
6. Wang, H., Zhang, D., Shin, K.: Detecting SYN flooding attacks. In: INFOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Volume 3 (2002) 1530–1539
7. Puš, V.: Monitoring of application protocols in 40/100 Gb networks. In: Campus Network Monitoring and Security Workshop, Prague, CZ, CESNET (2014)
8. Claise, B.: Cisco Systems NetFlow Services Export Version 9. RFC 3954 (2004)
9. Feist, T.: Vivado design suite. White Paper (2012)
10. Čejka, T.: Fast TCP Flood Detector. http://ddd.fit.cvut.cz/prj/FTFD (2014)
11. Lavasani, M., Dennison, L., Chiou, D.: Compiling high throughput network processors. In: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. FPGA '12, New York, NY, USA, ACM (2012) 87–96
12. Gorrie, L.: Snabb Switch. http://www.snabb.co (2014)
13. Wang, H., Zhang, D., Shin, K.: Detecting SYN flooding attacks. In: INFOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Volume 3 (2002) 1530–1539
14. Siris, V.A., Papagalou, F.: Application of anomaly detection algorithms for detecting SYN flooding attacks. Computer Communications 29(9) (2006) 1433–1442
15. Ye, N., Borror, C., Zhang, Y.: EWMA techniques for computer intrusion detection through anomalous changes in event intensity. Quality and Reliability Engineering International 18(6) (2002) 443–451
16. Brutlag, J.D.: Aberrant behavior detection in time series for network monitoring. In: LISA (2000) 139–146
17. Box, G., Jenkins, G., Reinsel, G.: Time Series Analysis: Forecasting and Control. Wiley Series in Probability and Statistics. Wiley (2013)
18. Salem, O., Vaton, S., Gravey, A.: A scalable, efficient and informative approach for anomaly-based intrusion detection systems: theory and practice. International Journal of Network Management 20(5) (2010) 271–293
19. Tartakovsky, A.G., Rozovskii, B.L., Blažek, R.B., Kim, H.: Detection of intrusions in information systems by sequential change-point methods. Statistical Methodology 3(3) (2006) 252–293
20. Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, ACM (2003) 234–247

Hardware Accelerated Book Handling with Unlimited Depth

Milan Dvorak, Tomas Zavodnik, and Jan Korenek
Brno University of Technology, Brno, Czech Republic,
[email protected], [email protected], [email protected]

Abstract. Strong competition between market participants on electronic exchanges calls for a continuing reduction of the latency of trading systems. Recent efforts have focused on hardware acceleration using FPGA technology and on running trading strategies directly in hardware, to eliminate the high latency of the system bus and of software processing. For any trading system, book handling is an important time-critical operation, which has not yet been accelerated using FPGA technology. We therefore propose a new hybrid hardware-software architecture that processes messages from the exchange and creates the book with the best buy and sell prices. Based on an analysis of transactions on the exchange, we propose to store only the most frequently accessed price levels in hardware and to keep the rest in the host memory, managed by software. This enables handling half of the whole stock universe (4 000 instruments) in a single FPGA. An update of the price levels in hardware takes only 240 ns, which is two orders of magnitude faster than recent software implementations. The throughput of the hardware unit is 75 million messages per second, which is 140 times more than current peak rates of the market data.

1 Introduction

Electronic trading dominates today's financial markets. Market participants communicate with the exchange by sending messages via a computer network. Techniques of algorithmic and high-frequency trading (HFT) are widely adopted. Traders no longer focus on specific trades, but rather on tweaking the parameters of an algorithm that is responsible for the trading itself. HFT traders utilize the latest network technologies to gain an advantage over the rest of the market. Competition is strong even among individual traders, who strive for the lowest latency of their systems, as this is a key factor for their profit. Therefore, a significant effort is being put into accelerating systems for electronic trading by both academic and commercial institutions.

First efforts in the acceleration of trading systems focused on the latency of data transfers from the network interface to the processor, using a dedicated acceleration card [1], [2]. A further reduction of the latency was achieved by accelerating the decoding of messages from the exchange [3], [4]. The latest efforts aim to realize the whole system in the FPGA chip [5], which eliminates the latency of system data bus transfers and ensures the lowest latency possible. Still, many parts of trading systems have not been accelerated in the FPGA yet. For instance, Lockwood [5] does not address book handling, which is crucial for processing the data feed from the exchange.
An architecture for handling an aggregated book with limited depth is proposed in [6]; however, many exchanges (including the stock markets) use a so-called book with unlimited depth (see Section 2), which has not yet been accelerated using FPGA technology. This paper presents a hybrid hardware-software architecture that enables handling of the book with unlimited depth. Only the best N price levels are stored in hardware. These levels can be updated with a latency of only 240 ns, which is two orders of magnitude faster than recent software implementations. Software manages the complete book with all price levels, and a synchronization protocol is used to ensure data consistency in hardware. Further, we discuss the trade-off between the number of price levels stored in hardware, the risk of underflow, and the number of messages transferred on the system bus. The architecture was synthesized for Virtex-7 technology running at 150 MHz. Using two QDR SRAM modules with a total capacity of 144 Mbits, it is possible to store the book for up to 4 thousand financial instruments, which corresponds to half of the whole NASDAQ (National Association of Securities Dealers Automated Quotations) stock exchange. Only two cards are needed to handle the whole exchange.

This paper is divided into six sections. After this introduction follows the problem statement, in which we discuss how financial exchanges work and what the order book is. The third section presents an analysis of the memory requirements of book handling with unlimited depth. The hardware architecture for accelerating this task is described in the fourth section. After that, we show the experimental results. The last section concludes the paper.

2 Problem Statement

A financial exchange is an institution that provides services for trading various financial instruments, for example stocks, derivatives, or commodities. The current price of a traded instrument is usually determined by a mutual auction between the supply (sell) and demand (buy) sides. Market participants send trading orders to the exchange. These orders define what instrument, at which price, and in what quantity they want to trade. An example of such an order is buy 50 shares of Apple stock at 91 dollars or sell 30 shares of Microsoft stock at 42 dollars. For every new order, the exchange tries to find corresponding buy and sell orders and execute a transaction. If no corresponding order is found, the new order is stored in the so-called book. The book contains all open trading orders for financial instruments.

The exchange needs to inform all users about the current state of the market. Usually, the exchange simply forwards the information about individual trading orders to the users. Thus, if a trader places a new order which is not immediately executed, the exchange assigns a unique identifier to the order and sends an ADD message to all users. This message represents the addition of a new order to the book. It usually consists of the order identifier, instrument identifier, required price, quantity, and an indication of whether it is a buy or sell order. If a trader decides to change his/her current order, the exchange generates a MODIFY message. This message usually consists of the order identifier, the changed price, and the changed quantity. It does not contain the instrument identifier or the previous price and quantity, because this information was included in the previous ADD message.
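A minimal sketch of how these feed messages might be laid out (the field widths follow the per-order record sizes discussed below in this section; the names are illustrative and not taken from any particular exchange protocol; the DELETE message is described next):

#include <cstdint>

// Illustrative feed messages with assumed widths: 64 b order identifier,
// 32 b price and quantity, 15 b instrument identifier, 1 b buy/sell flag.
struct AddMessage {
    uint64_t orderId;
    uint16_t instrumentId;   // only 15 bits are significant
    uint32_t price;
    uint32_t quantity;
    bool     isBuy;          // buy/sell flag
};

struct ModifyMessage {       // instrument and side known from the earlier ADD
    uint64_t orderId;
    uint32_t newPrice;
    uint32_t newQuantity;
};

struct DeleteMessage {       // everything else known from ADD/MODIFY history
    uint64_t orderId;
};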
The last message type is a DELETE message. It is created if the user cancels his/her order, or if the order is matched and executed. DELETE messages contain only the order identifier, because the other information is known from the previous ADD and MODIFY messages.

Information about individual orders in the book is not essential for the traders. Trading algorithms usually use the values of the best prices at which the relevant instruments are traded. Therefore, a system for processing messages from the exchange needs to convert the information about individual orders into aggregated information about the best prices. The main principle of this processing is to join the orders with the same price, accumulate their quantities, and sort the resulting price levels according to the price. This is how the aggregated book is created [6]. The number of price levels is called the book depth. This number is unlimited, because the prices are set by users; therefore, the book is called a book with unlimited depth.

An example of an order book is shown in Table 1. Only two instruments (stocks) are shown in the table; a real book can have thousands of instruments. Each instrument has its unique ID (AAPL and MSMT in our case). The table shows the lists of buy and sell orders for both instruments. Each order has a unique order ID (not shown in the table), a price, and a quantity. Orders are sorted according to the price. We can see that multiple orders can have the same price. If we join these orders and sum up their quantities, we get the aggregated price level. For instance, the first level on the buy side of AAPL consists of three orders with price 91.05 and total quantity 15 (8+2+5).

Since some values are omitted in the MODIFY and DELETE messages, it is necessary to store for each order a record with the information from the ADD message, to be able to update the price levels accordingly. For each order, we need to store its identifier (64 bits), price (32 bits), quantity (32 bits), instrument identifier (15 bits), and a buy/sell flag (1 bit). This means that every order needs 144 bits of memory. The aggregated information for each price level contains the price (32 bits), the accumulated quantity (32 bits), and the count of accumulated orders (16 bits), which is 80 bits in total. The total memory requirements for book handling depend on the number of orders placed by users during the day and on the number of price levels. A major stock exchange that needs a book with unlimited depth is the American stock exchange NASDAQ; therefore, we provide a detailed analysis of real data from the NASDAQ exchange.

Table 1. Order Book example

AAPL (ID for Apple)                    MSMT (ID for Microsoft)
Buy Orders         Sell Orders         Buy Orders         Sell Orders
Price   Quantity   Price   Quantity    Price   Quantity   Price   Quantity
91.05   8          91.10   21          42.30   16         42.40   28
91.05   2          91.15   10          42.30   11         42.43   14
91.05   5          91.15   15          42.28   31         42.43   30
91.00   12         91.20   85          42.28   10         42.43   5
90.95   32         ...     ...         ...     ...        42.45   44
...     ...                                               ...     ...

3 Analysis

The goal of the analysis is to determine the memory requirements of the book handling operation. To provide a precise analysis, we use real data from the NASDAQ exchange, captured on 3rd October, 2013. The exchange manages almost 8 000 stocks (instruments). The maximum number of orders in the book during the day is over 1.5 million. With 144 bits per order, at least 206 Mbits are required to store all instruments in the book. This amount of data cannot be stored in the on-chip memory: the largest FPGAs have only 66 Mbits of on-chip memory.
All orders can therefore be stored only in an external memory. Almost 350 000 price levels on the buy and sell sides were created from these orders, i.e. 700 000 price levels in total. That is 88 price levels per instrument on average; however, the maximum book depth is 3 000. With 80 bits per price level, we need 54 Mbits to store the whole book. This amount of data would use most of the on-chip memory capacity even of the largest available FPGA chips, and no memory would be left for the trading strategy and other parts of the trading system.

The analysis of the total memory requirements and of the length of the price level list implies that it is not possible to handle the complete book with unlimited depth in the FPGA. We can store and update only a few best price levels in the FPGA and keep the remaining levels in the host memory, where they can be managed by software with a higher latency. This principle is supported by the typical behavior of traders, who usually use only a few best price levels to make their trading decisions. To evaluate the feasibility of this idea, we performed an analysis of the accesses to the list of price levels.

We created a histogram of the price level indexes accessed by each of the ADD, MODIFY and DELETE operations during one day. The characteristics of the accesses were similar for all operations and also for the buy and sell sides. The accumulated histogram for all operations and both sides is shown in Fig. 1.

Fig. 1. Histogram of access distribution among the price levels (y-axis: accesses percentage; x-axis: price level ranges 1–8, 9–16, 17–24, 25–32, 33–40, 41–3000)

The histogram shows a strong locality of accesses to the price level list. More than 94 % of updates hit the top 24 levels, and more than 97 % of updates are covered by 32 price levels. Only 1.5 % of updates reference levels 41 to 3 000. It has to be noted, however, that the histogram does not provide information about the movement of price levels in time. Individual price levels are added and deleted during the trading day; therefore, it is possible that the current top price level in the list was at a much lower position only a couple of microseconds earlier. The locality of price level accesses thus supports the idea of storing only the best price levels in the hardware. Due to the dynamic nature of the data structure, it is necessary to deal with a possible underflow when records from lower positions of the list are moved to the top of the list.

4 Architecture

Based on the analysis of transactions on the exchange in the previous section, we propose to store only the most frequently accessed price levels in hardware and keep the rest in the host memory, managed by software. The FPGA operates as a hardware cache. It provides the best price levels to the trading strategy as fast as possible, and relevant messages from the exchange are used to update these price levels with low latency. Software processes all messages from the exchange and stores the complete book. Thus, the software is able to detect a possible underflow in hardware caused by the deletion of some price levels; a special message is then generated and sent via the system bus to provide the missing information to the FPGA.
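The synchronization direction from software to hardware can be sketched as follows (a simplified illustration of the protocol described above; the mirror structure, the constant N, and the refill callback are our assumptions):

#include <algorithm>
#include <cstddef>

// Hypothetical software-side mirror of one instrument's hardware state.
// The software keeps the complete book; the FPGA caches the best N levels.
struct HwMirror {
    static constexpr std::size_t N = 32;  // price levels cached in hardware
    std::size_t validLevels = 0;          // levels currently valid in FPGA
};

// After a DELETE shifts levels up in hardware, software detects the
// underflow and sends a refill message with the missing levels over the
// system bus (the transport itself is not shown).
template <typename SendRefill>
void checkUnderflow(HwMirror& hw, std::size_t fullBookDepth,
                    SendRefill sendRefill) {
    std::size_t target = std::min(HwMirror::N, fullBookDepth);
    if (hw.validLevels < target) {
        sendRefill(/*first missing level*/ hw.validLevels,
                   /*count*/ target - hw.validLevels);
        hw.validLevels = target;
    }
}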
The hardware architecture of the book handling consists of three steps, illustrated by Fig. 2.

Fig. 2. High-level architecture of book handling with unlimited depth (messages pass through the Instrument Table and the Order Table; price level updates feed the Price Level Table, which outputs the top price levels)

The first step is the conversion of the instrument identifier to an internal address. The internal address is used in the Price Level Table to store the data of each instrument. The set of instrument identifiers is known in advance and does not change during the processing; therefore, the instrument identifiers can be represented by a static dictionary and implemented by the architecture published in [6]. The dictionary provides the instrument address as a result. The address is then passed to the next step of processing together with the message from the exchange.

The second step is the management of the order table, which stores all orders from the exchange. It is a dynamic table, because individual orders can be created and deleted during the day. Fast look-up in a large number of orders is required; to achieve low latency and high throughput, a hashing scheme has to be used. We propose to use Cuckoo hashing [7], which provides fast look-up together with efficient memory utilization [8], [9], [10]. The Order Table component processes the ADD, MODIFY and DELETE messages from the exchange. Depending on the message type, an order is added to the table, modified, or deleted from the table. As discussed in Section 2, we need to convert the information about individual orders to the price levels at which the instruments are traded on the exchange. Each order message generates an update for the price level table. ADD messages always increase the quantity on the corresponding price level; the size of this increase depends on the quantity in the new order. Similarly, DELETE messages decrease the quantity on the corresponding price level. A MODIFY message can cause either an increase or a decrease of the quantity, depending on how the order was modified.

The last step of the proposed architecture is the update of the price level table. It processes the updates from the order table and stores the price level data. The table can store up to N price levels for every instrument, where N is configurable and has a direct impact on the performance of the architecture. The performance is discussed in detail in Section 5. When an update from the order table is received, all price levels of the instrument are read from the memory. The address of the record was computed in the instrument table; a one-bit buy/sell flag is added to the address in this component, because the price levels are stored separately for the buy and sell sides. Updates from the order table can cause one of the following operations on the price level table (a sequential software reference model is sketched after this list):

– Modification of a price level, when the updated price level is already in the table. The order quantity is added to or subtracted from the existing level.
– Insertion of a price level, when the price level to be increased does not exist. This requires shifting the lower price levels down.
– Deletion of a price level, when the updated price level is decreased to zero quantity. This requires shifting the lower price levels up.
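As a point of reference for the parallel hardware described next, the three operations can be modeled sequentially on a sorted list of levels (a simplified software sketch under our own naming; the hardware achieves the same effect in parallel):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PriceLevel {
    uint32_t price    = 0;
    int64_t  quantity = 0;   // signed, so decreases are easy to apply
    uint16_t orders   = 0;
};

// Apply one update (quantity delta at a given price) to the best-N list,
// kept sorted from best to worst price. better(a, b) encodes the side:
// descending prices for the buy book, ascending for the sell book.
template <typename Better>
void applyUpdate(std::vector<PriceLevel>& levels, uint32_t price,
                 int64_t delta, std::size_t maxLevels, Better better) {
    auto it = std::find_if(levels.begin(), levels.end(),
                           [&](const PriceLevel& pl) {
                               return pl.price == price ||
                                      better(price, pl.price);
                           });
    if (it != levels.end() && it->price == price) {
        it->quantity += delta;                 // modification
        if (it->quantity <= 0)
            levels.erase(it);                  // deletion: shift levels up
    } else if (delta > 0) {
        levels.insert(it, {price, delta, 1});  // insertion: shift levels down
    }
    if (levels.size() > maxLevels)             // keep only the best N levels
        levels.pop_back();
}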
Update operations are implemented in parallel by processing elements (PE) at each price level. In the following text, we denote the price levels $PL_i$ and the corresponding processing elements $PE_i$ for $1 \le i \le N$. Processing element $PE_i$ has 4 data inputs: $PL_{i-1}$, $PL_i$, $PL_{i+1}$, and the new price level $PL_{new}$, which is calculated from the input message. Each element also has a control input $OP$, which denotes the type of update operation, and a control output $cmp_i$, which is the result of the comparison between the current ($PL_i$) and the new ($PL_{new}$) price level. The detailed design of a processing element is shown in Fig. 3. The CMP block compares the input price level $PL_i$ with the new price level $PL_{new}$ and creates the signal $cmp_i$. The MODIFY block implements the increase or decrease of the price level quantity if the new and the current prices are equal; otherwise, it just forwards the new price level. The comparison result and the type of update $OP$ are also used in the SEL LOGIC block to determine the select signal of the output multiplexer MX. The type of update determines the shift direction, and the comparison result determines whether the price level is below the inserted/deleted level and thus needs to be shifted. The multiplexer then simply selects one of its inputs to implement the required update operation.

The interconnection of the processing elements and the memory is shown in Fig. 4. Each element fetches the corresponding price level from the memory and sends it to both neighbors (inputs $PL_{i-1}$ and $PL_{i+1}$). The new price level input is shared by all processing elements and compared with the corresponding price level $PL_i$. All comparison results $cmp_i$ are processed by the control logic block to determine the type of operation $OP$ (modification, insertion, or deletion). The processing elements use the operation type to select the output price level, which is then written back to the memory. The top price levels are also passed to the trading algorithm (not shown in the figure).

5 Experimental results

The proposed architecture has been implemented in VHDL and tested on the FPGA acceleration card COMBO-80G. The card has a fast PCIe interface and eight 10GE ports, and it is equipped with a Virtex-7 XC7VX690T chip and two QDR-II+ SRAM memory modules with a total capacity of 144 Mbits.

Fig. 3. Architecture of a processing element

Fig. 4. Architecture of the price level update block

The VHDL implementation was synthesized by the Xilinx Vivado tool version 2013.4 with 165.5 MHz as the maximum achievable frequency. Nevertheless, all tests on the exchange data were performed at 150 MHz. The architecture is pipelined, and a multi-clock-cycle constraint is used for the parallel update at all price levels. This means that each update takes 2 clock cycles, and the throughput of the unit is 75 million messages per second, which is 140 times more than the peak rate in the used data set. The latency is only 4 clock cycles, because the architecture needs 1 cycle for the memory read, 2 cycles for the update, and 1 cycle for the write back. Thus, the latency of the price level update is 27 ns. The Order Table component has the biggest impact on the overall latency (see Fig. 2): each operation in the Order Table requires a read access to the QDR memory, which takes 180 ns. The overall latency of the whole full order book architecture is 240 ns.
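As a quick sanity check, both headline figures follow directly from the 150 MHz clock:

\[
t_{\mathrm{update}} = \frac{4\ \text{cycles}}{150\ \mathrm{MHz}} \approx 26.7\ \mathrm{ns} \approx 27\ \mathrm{ns},
\qquad
\text{throughput} = \frac{150\ \mathrm{MHz}}{2\ \text{cycles per message}} = 75 \times 10^{6}\ \text{messages/s}.
\]

The remainder of the 240 ns total is dominated by the 180 ns QDR read in the Order Table.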
The card's QDR-II+ SRAM memory has only a limited capacity (144 Mbits), so it was not possible to store the orders for all traded instruments on a single card. To process the whole NASDAQ exchange, we needed to use two cards, each handling half of the instruments (4 000).

We also measured the requirements on the FPGA on-chip memory. The total amount of consumed memory is affected by two parameters: the number of instruments and the number of stored price levels N. The effect of these two parameters on the resource consumption is shown in Table 2.

Number of      4096 instruments                        8192 instruments
price levels   Register    LUT           BRAM          Register    LUT           BRAM
8              740 (0 %)   5551 (1 %)    242 (16 %)    783 (0 %)   5600 (1 %)    483 (32 %)
16             844 (0 %)   8441 (1 %)    482 (32 %)    862 (0 %)   10646 (2 %)   963 (65 %)
24             680 (0 %)   11951 (2 %)   722 (49 %)    680 (0 %)   13393 (2 %)   1443 (98 %)
32             806 (0 %)   15310 (3 %)   962 (65 %)    911 (0 %)   15411 (3 %)   1923 (130 %)

Table 2. Comparison of resource consumption for different numbers of symbols and price levels

It can be seen that the number of used registers and LUTs is very low, even for 8192 instruments and 32 price levels. The on-chip memory consumption increases linearly with both the number of instruments and the number of price levels. Up to 32 price levels can be stored for 4096 instruments, but only 16 price levels are possible for 8192 instruments. The memory required for 8192 instruments and 32 price levels exceeds the capacity of the chip, hence the 130 % value in the table.

We also evaluated the synchronization process between hardware and software on the same captured exchange data as in Section 3. We analysed the number of messages between the acceleration card and the software. Further, we observed the deepest underflow in the hardware (the minimal number of valid price levels) for different numbers of price levels in hardware, and how often the underflow reached a threshold value. We set the threshold value to 5, because this is the usual number of price levels in an aggregated book. The results of the evaluation are shown in Table 3. The number of synchronization messages generated by software decreases with the number of price levels. This can be explained by the fact that a higher number of price levels corresponds to a higher number of instruments that can be fully stored in hardware and thus need no synchronisation messages. The risk of underflow also decreases with a higher number of price levels. Having only 8 price levels is not enough, because underflows occur very often. Underflows reaching the threshold level still happen even for N = 16; however, the minimum number of valid levels is 4. There are no underflows in the case of 24 and 32 levels, and more than 50 % of the price levels in hardware are valid all the time. Our analysis suggests that a higher number of price levels in hardware can help to reduce the risk of underflow and also decrease the utilization of the system bus. The key factor is therefore the amount of memory on the chip; it is possible to increase the number of price levels by reducing the number of symbols, and vice versa.

No. of levels        8         16        24        32
Messages from SW     887270    327581    152218    63251
Lowest valid level   0         4         13        21
Threshold reached    42 487    88        0         0

Table 3. Analysis of underflow risk and system bus utilization for different numbers of price levels

6 Conclusions

We introduced a hybrid hardware-software architecture to accelerate book handling with unlimited depth for low-latency trading systems. The proposed architecture processes messages from the exchange and creates the book with the best buy and sell prices.
Based on the analysis of transactions on the exchange, we propose to store only the most frequently accessed price levels in hardware and to keep the rest in the host memory, managed by software. The software is able to detect a possible underflow in hardware and to provide the missing information to the hardware via the system bus. Moreover, we analysed the impact of the number of price levels in hardware on the system bus utilization and on the risk of underflow. To the best of our knowledge, this is the first published FPGA architecture for book handling with unlimited depth. The proposed architecture has a latency of only 240 ns, which is two orders of magnitude faster than recent software implementations. The throughput of the hardware architecture is 75 million messages per second, which is 140 times more than the current peak rates in available market data.

Acknowledgment

This work was supported by the IT4Innovations Centre of Excellence CZ.1.05/1.1.00/02.0070 and the BUT project FIT-S-14-2297.

References

1. G. W. Morris, D. B. Thomas, and W. Luk, "FPGA accelerated low-latency market data feed processing". In Symposium on High-Performance Interconnects, 2009, pp. 83–89.
2. H. Subramoni, F. Petrini, V. Agarwal, et al., "Streaming, low-latency communication in on-line trading systems". In 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1–8.
3. C. Leber, B. Geib, and H. Litz, "High frequency trading acceleration using FPGAs". In 2011 International Conference on Field Programmable Logic and Applications (FPL), 2011, pp. 317–322.
4. R. Pottathuparambil, J. Coyne, J. Allred, W. Lynch, and V. Natoli, "Low-latency FPGA based financial data feed handler". In 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2011, pp. 93–96.
5. J. W. Lockwood et al., "A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)". In IEEE 20th Annual Symposium on High-Performance Interconnects, 2012, pp. 9–16.
6. M. Dvorak and J. Korenek, "Low Latency Book Handling in FPGA for High Frequency Trading". In 2014 IEEE 17th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2014, pp. 175–178.
7. R. Pagh and F. F. Rodler, "Cuckoo hashing", Journal of Algorithms, vol. 51, no. 2, pp. 122–144, May 2004.
8. L. Kekely, M. Zadnik, J. Matousek, and J. Korenek, "Fast Lookup for Dynamic Packet Filtering in FPGA". In 2014 IEEE 17th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2014.
9. T. Tran and S. Kittitornkun, "FPGA-based cuckoo hashing for pattern matching in NIDS/NIPS", in MNGNS, ser. LNCS, 2007.
10. A. Kirsch, M. Mitzenmacher, and U. Wieder, "More robust hashing: Cuckoo hashing with a stash", in ESA, ser. LNCS. Springer, 2008.

Composite Data Type Recovery in a Retargetable Decompilation

Dušan Kolář and Peter Matula
DIFS FIT BUT Brno, Božetěchova 1/2, 612 66 Brno, Czech Republic
{kolar, imatula}@fit.vutbr.cz

Abstract. Decompilation is a reverse engineering technique performing a transformation of a platform-dependent binary file into a High Level Language (HLL) representation. Despite its complexity, several decompilers have been developed in recent years. They are not yet advanced enough to serve as standalone tools, but combined with traditional disassemblers, they allow a much faster understanding of the analysed machine code.
To achieve the necessary quality, many advanced analyses must be performed. One of the toughest, but most rewarding, is the data type reconstruction analysis. It aims to assign each object a high-level type, preferably the same as in the original source code. This paper presents the composite data type analysis used by the retargetable decompiler developed within the Lissom project at FIT BUT. We design a whole new address expression (AE) creation algorithm, which is both retargetable and suitable for operating on code in the Static Single Assignment (SSA) form. Moreover, we devise new AE aggregation rules that increase the quality of the recovered data types.

1 Introduction

In recent years, there has been a growing threat of new malware attacking a wide range of intelligent devices other than personal computers. Nowadays, our smartphones, routers, televisions, or gaming consoles are no longer safe from computer criminals; washing machines, refrigerators, or microwave ovens are likely to follow in the near future. Since these devices usually have some specific purpose, they often use dedicated processor architectures or operating systems. Because of this, it has become increasingly difficult to analyse potentially dangerous binaries. The solution may be a new kind of tool, a retargetable decompiler, capable of translating platform-dependent executables into some common high-level language. The main tasks of such a program are the reconstruction of high-level control flow, functions, objects, and data types. Data type reconstruction can be further divided into two sub-tasks: simple and composite data type recovery. In this article, we present an advanced composite type recovery algorithm implemented by the Lissom project's retargetable decompiler. Our approach has several advantages over the state-of-the-art competition.

This paper is organised as follows. In Section 2, we discuss related work on the subject. Section 3 briefly presents the Lissom project's decompiler. The basic scheme of our data type recovery system is described in Section 4. The main subject of this paper, the composite data type recovery analysis, is explained in detail in Section 5. Section 6 experiments with our approach, and finally, Section 7 draws a conclusion and outlines our future work.

2 Related Work

Mycroft [1] presents a unification-based simple and composite data type reconstruction algorithm that lays down the basic principles used by most of its successors. Emmerik [2] uses a similar technique optimised for the SSA form. It minimises memory requirements by storing only the types of object definitions; all occurrences of an object and all derived objects only refer to the original definition. Type Inference on Binary Executables [3] is one of the latest papers on the subject. It presents a state-of-the-art approach capable of high-level data type reconstruction using the unification of type terms extended to support subtypes. It emphasises convergence, precision, and conservativeness, and it is able to exploit the results of both static and dynamic analysis. Among existing decompilers, the Hex-Rays [4] plugin to the IDA Pro disassembler [5] is the most used, and arguably the most advanced, reverse engineering tool today. It has both array and structure reconstruction capabilities. It is, however, questionable whether it performs any state-of-the-art type aggregation.
Our work is closest to the composite data type recovery in the SmartDec decompiler [6, 7], which is briefly described in the remainder of this section. Since SmartDec produces output in the C programming language, the composite type recovery aims for structure and array detection; unions are not taken into account. The algorithm is based on a memory access analysis that is divided into two main steps.

(1) Memory access addresses are expressed according to Equation 1. The base b is either a global memory address, or the name of an object holding such an address during the execution. The offset o is the distance of an accessed element from the base. The rest is a multiplicative component (MC) typical for array iterations. A single element represents one multiplication, where C_i is a constant value multiplying the iterator x_i; multiple elements indicate multidimensional accesses. The same concept can be expressed in the form of an Address Expression (AE) in Equation 2. The multiplication pairs are arranged in a list (denoted as [ ]) in descending order by the value of C_i. A non-relevant MC can be represented by the symbol m. [7] computes such AEs for every machine code object using a fix-point data-flow analysis. Most results are thrown away, and only those used in memory access operations are further processed.

$$b + o + \left( \sum_{i=0}^{n} C_i x_i \right) \qquad (1)$$

$$AE = (b, o, [(C_1, x_1), \ldots, (C_n, x_n)]) \qquad (2)$$

(2) At the beginning, each AE gets its own label ae. Then, the algorithm tries to apply aggregation rules to construct equivalence classes; each class represents one composite object. The first rule, in Equation 3, merges two AEs if their bases are the same. The second one, in Equation 4, performs an array aggregation: both bases are the same and the offset difference is lower than the maximal constant C. This indicates an iteration performed over a complex type (i.e. an array of structures).

$$\frac{ae_1 = (b_1, o_1, m_1), \quad ae_2 = (b_2, o_2, m_2), \quad b_1 \equiv b_2}{ae_1 \equiv ae_2} \qquad (3)$$

$$\frac{ae_1 = (b, o_1, [(C, x_{11}), \ldots]), \quad ae_2 = (b, o_2, [(C, x_{21}), \ldots]), \quad |o_1 - o_2| < C}{\mathit{ArrayAggregation}(ae_1, ae_2)} \qquad (4)$$

3 Lissom Project's Retargetable Decompiler

The Lissom project's [8] retargetable decompiler [9] (available online at [10]) is independent of any target architecture, file format, operating system, or compiler. It uses ADL processor models and extensive preprocessing to translate machine code into an HLL representation. Currently, the decompiler supports the MIPS, ARM, x86, PIC32, and PowerPC architectures with the Windows PE and Unix ELF object file formats (OFF).

The decompilation process (Figure 1) starts with a preprocessing phase [11]. It detects the input's OFF, compiler, and a potential packer; if the file was indeed packed, it applies known unpacking routines. Next, a plugin-based converter translates the platform-dependent OFF file into the internal Common Object File Format (COFF). Finally, a generator automatically creates the instruction decoder from the platform's ADL model. The decompiler's core performs the actual reverse compilation. It consists of three main parts built on top of the LLVM Compiler System [12]. (1) The front-end takes an input COFF file and, using the generated instruction decoder, creates an LLVM intermediate representation (IR). Each machine code instruction is translated into a sequence of LLVM IR operations characterising its behaviour in a platform-independent way.
The abstraction level of this representation is further lifted by a sequence of static analyses, which recover local/global variables, functions, parameters/returns, data types, etc. (2) The middle-end takes the LLVM IR code and applies LLVM built-in optimizations along with our own passes; our routines include loop optimisation, constant propagation, and control-flow simplification. (3) The back-end converts the optimised IR to the back-end IR (BIR). It then runs a few more analyses, such as HLL control-flow identification and object name assignment. Finally, the output code is generated. Currently, we support C and a Python-like language, plus assembly code, a call graph, and a control-flow graph.

Fig. 1: The concept of the Lissom project's retargetable decompiler.

4 Data Type Recovery Infrastructure

The type inference analysis (Figure 2) is part of the decompiler's front-end. It operates on the LLVM IR code, in which functions, global variables, and local variables have already been detected. Its goal is to associate every object (i.e. register, global variable, local variable, function parameter, or return value) with a data type, preferably the same as in the original source code. Furthermore, it has to modify the program's code to reflect the changes made to the objects' types. Even though LLVM IR is in the SSA form, the form's condition holds only for the temporary variables used by LLVM micro-operations. Other objects (i.e. global/local variables and registers) are manipulated by load/store operations, violating SSA's single assignment rule. For this reason, the type recovery uses the results of the reaching definition analysis, which provides the definition-use (DU) and use-definition (UD) chains shown in Equation 5. All object identifiers in this paper are in fact uses or definitions—pairs of an object ID and its position in the program. To minimise memory requirements, the type recovery also uses a sparse object representation inspired by Emmerik's [2] SSA-optimised algorithm.

$$\mathit{Defs}(use) = \{def_1, \ldots, def_n\}, \quad \mathit{Uses}(def) = \{use_1, \ldots, use_n\} \qquad (5)$$

Fig. 2: Data type recovery scheme.

Our previous article [13], dealing with the data-flow inference analysis, described the core of the whole type reconstruction system. Using type propagation equations and type propagation rules derived from instruction semantics, it was able to reconstruct objects' simple data types. In [11], we showed how to utilise sources of precise type information, such as debugging data, known library function calls, or dynamic profiling information, to increase the quality of simple data types, or even to recover composite data objects. This paper presents a composite data type analysis algorithm capable of recovering composite objects (arrays, structures) without the use of precise type information, which may not be available for every executable. The technique is based on the analysis of memory access operations and their aggregation into possible composite objects.

5 Composite Data Type Analysis

An object used as an address in a memory access operation is tagged as a pointer by the simple data type inference. Such an object is then passed to the composite recovery analysis, which tries to build an address expression for the address calculation and to aggregate it with other similar address expressions to form a composite object with a complex data type. Like the procedure described in Section 2, it does so in two steps:
1. Address expression (AE, Equation 1) creation.
2. Address expression aggregation—composite object/type construction.

5.1 Running Example

The following sections illustrate their content using the example shown in Figure 3 and the associated Equations 6 through 18 (the equations are explained as the illustration proceeds). Because the composite objects structure and a2d are declared as global variables, address bases are used throughout the example AEs. If the objects were local variables or allocated pointers, as is ls on line 11, it would not be possible to statically determine their locations (addresses). In such cases, symbolic bases are used, as depicted in Equation 12. However, the presented procedure is generally the same.

1  struct s2 { int a2[10]; int e2; };        // size = 44 B
2  struct s1 { int e1; struct s2 a1[10]; };  // size = 444 B
3
4  struct s1 structure;  // starts at address 1000
5  int a2d[10][30];      // starts at address 1444
6
7  a2d[i][j] = X;            // for: i=0..9; j=0..19
8  structure.a1[i].e2 = X;   // for: i=0..9
9  structure.a1[4].e2 = X;   // random array access
10
11 struct s1 *ls = (struct s1 *) malloc(sizeof(struct s1));

Fig. 3: Running example program used to illustrate the described principles.

$$((120 * (0 + r_1)) + 1444) \;\mathit{JOIN}\; (r_2 + 4) \qquad (6)$$

$$(\mathit{JOIN}\ (+\ (*\ (120)\ (+\ (0)\ (r_1)))\ (1444))\ (+\ (r_2)\ (4))) \qquad (7)$$

$$((120 * r_1) + 1444) + (r_2 * 4) \qquad (8)$$

$$(+\ (+\ (*\ (120)\ (r_1))\ (1444))\ (*\ (r_2)\ (4))) \qquad (9)$$

$$(1444, 0, [(120, r_1), (4, r_2)]) \qquad (10)$$

$$(1444, 0, [M_1, M_2]), \quad M_1 = [(120, r_1), (4, r_2)], \quad M_2 = [(120, r_3), (4, r_4)] \qquad (11)$$

$$ae_1 = (ls, 0, [\;]), \quad ae_2 = (ls, 4, [(44, r_1), (4, r_2)]), \quad ae_3 = (ls, 44, [(44, r_1)]) \qquad (12)$$

$$ae_1 = (1000, 0, [\;]) \;(13) \quad \Rightarrow \quad (1000, [0, 0], [\;]) \;(16)$$

$$ae_2 = (1000, 4, [(44, r_1), (4, r_2)]) \;(14) \quad \Rightarrow \quad (1000, [4, 0], [(44, r_1), (4, r_2)]) \;(17)$$

$$ae_3 = (1000, 44, [(44, r_1)]) \;(15) \quad \Rightarrow \quad (1000, [4, 40], [(44, r_1)]) \;(18)$$

5.2 Address Expression Creation

The AE construction used by the SmartDec decompiler, introduced in Section 2, does not suit our needs. It employs a fix-point analysis to compute every object's AE and further processes only those used at memory access operations. Because of the huge number of objects implied by the SSA form of our LLVM IR, such an approach would be highly ineffective. For this reason, we designed a novel approach using the symbolic interpretation of address calculations. Instead of running a full fix-point analysis on all IR objects, it evaluates (builds computation trees for) only those objects which are involved in memory access operations. Such trees are then transformed to equivalent AEs and further aggregated into composite objects. The main idea is to process only the necessary objects, not all IR objects.

Symbolic Interpretation by Binary Trees. When the first pass encounters a memory access operation, it tags the object addr containing the address as a pointer and passes it to the symbolic interpreter. In the following text, (X) represents a unary node, where X is either a value or a symbolic object name. A binary node for the operation X ∘ Y is represented as (∘ (X) (Y)), where (X) and (Y) are unary or binary nodes. At the beginning, the tree consists of a single symbolic node (addr). Then, the algorithm recursively modifies the nodes according to their objects' definitions, until the binary tree expresses the whole computation of addr's value. The operation ∘ is either an addition, a multiplication, or a so-called JOIN. Other instructions (e.g. shifts) are expressed using only these operations; if that is not possible, the symbolic interpretation fails. The JOIN operation is used to merge two object definitions for a single use. It is typical for array iterations, where the first operand (definition) represents the initial iterator value, and the second one is the value added in each iteration to get to the array's next element.
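A rough illustration of how such computation trees might be represented as a data structure, together with the JOIN-removal rewrite applied during the simplification step described below (the type and helper names are ours, not the decompiler's):

#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// One node of the address-computation tree: a constant, a symbolic object
// name (e.g. "r1" or "addr"), or a binary ADD/MUL/JOIN operation.
struct TreeNode {
    enum class Kind { Const, Symbol, Add, Mul, Join } kind;
    int64_t value = 0;                       // valid for Const
    std::string symbol;                      // valid for Symbol
    std::unique_ptr<TreeNode> left, right;   // valid for binary kinds
};

using NodePtr = std::unique_ptr<TreeNode>;

inline NodePtr bin(TreeNode::Kind k, NodePtr l, NodePtr r) {
    auto n = std::make_unique<TreeNode>();
    n->kind = k;
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// JOIN removal from the Binary Tree Simplification step:
// (JOIN (X) (+ (Y) (Z)))  =>  (+ (X) (* (Y) (Z)))
inline void removeJoin(NodePtr& n) {
    using K = TreeNode::Kind;
    if (!n) return;
    removeJoin(n->left);
    removeJoin(n->right);
    if (n->kind == K::Join && n->right && n->right->kind == K::Add) {
        NodePtr mul = bin(K::Mul, std::move(n->right->left),
                                  std::move(n->right->right));
        n = bin(K::Add, std::move(n->left), std::move(mul));
    }
}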
shifts) are expressed using only the operations. If it is not possible, the symbolic interpretation fails. The JOIN operation is used to merge two object definitions for a single use. It is typical for array iterations, where the first operand (definition) represents an initial iterator value, and the second one is a value added each iteration to get to the array’s next element. Running example: Address computation for an array access on line 7 may look like Equation 6 (infix notation). Equation 7 expresses the same calculation in binary tree notation. Note that the original iterators i and j were replaced by registers r1 , r2 . Binary Tree Simplification To make the address expression creation easier, the algorithm repeatedly applies various kinds of binary tree simplification rules like simple arithmetic evaluation or multiplication distribution. The most important procedure is the JOIN operation removal. Since the node (JOIN (X) (+ (Y ) (Z))) denotes an incrementation of initial iterator value X by iterator value Y plus constant increment Z, it can be transformed to (+ (X) (∗ (Y ) (Z))). Running example: Simplified infix and binary tree notations of the example used in previous subsection are depicted in Equation 8 and 9. Binary Tree to AE Conversion The final step is to transform a binary tree into an AE. It is done by associating each node with its AE and running a bottom-up propagation. At first, bottom nodes are initialised with simple AEs equivalent to their values. Then, parent nodes are merging their childrens’ AEs into a single AE, until the root node does not contain address expression representing the whole binary tree. Running example: The resulting address expression created for line 7 array access operation is shown in Equation 10. Value 1444 is AE’s global memory address base. 69 D. Kolář and P. Matula 5.3 Address Expression Aggregation AE’s associated with the root nodes are further processed in the second phase of the composite type reconstruction. The goal is to group AE’s together to represent a single complex object. Conversion of such objects to actual data types is straightforward and it is not presented in this paper. The aggregation rules used by the Lissom decompiler follows. Those adapted from the existing techniques refer to Equations in Section 2. AE Equivalence Aggregation Two AEs are equal (access the same element), if their bases and offsets are equal. The multiplicative component (MC) only determines which iterators are used in array accesses. We redefine original AE definition (Equation 2) to contain a list of multiplicative components according to Equation 19. This allows to aggregate two equal AE’s into one, preserving both MCs. Running example: An access with AE identical to Equation 10 (except using another set of iterators—r3 , r4 ) can be aggregated with Equation 10, forming the AE in Equation 11. AE = (b, o, [M1 , . . . , Mn ]) , Mi∈{1,...,n} = [ (Ci1 , xi1 ), . . . , (Cim , xim ) ] (19) Base Equivalence Aggregation Composite object (CO) in Equation 20 aggregates AEs with the common base. Its maximum size is upperBnd . AEs are contained in AETypeMap ordered list of pairs, which associates AE with its simple data type. In the following text, mathematical structures may be accessed in the C programming language manner (e.g. obj .upperBnd ). The list ordering is based on predicate : (ae 1 < ae 2 ⇔ ae 1 .o < ae 2 .o). COs with address bases are stored (in an ascending order by base) in a single global container GlobComposite. 
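To keep the preceding description concrete, the following Python fragment sketches the AE and CO representations from Equations 19 and 20 and the two aggregations just described. It is only our illustration of how such containers could look: the class and variable names (AddressExpression, CompositeObject, glob_composite) are ours, not the decompiler's API, and simple data types are reduced to plain strings.

class AddressExpression:
    # AE = (base, offset, [M1, ..., Mn]); each Mi is a list of (Ci, xi) pairs.
    def __init__(self, base, offset, mcs):
        self.base = base      # statically known address or symbolic base object
        self.offset = offset  # constant offset o
        self.mcs = mcs        # list of multiplicative components

    def same_element(self, other):
        # AE equivalence: equal base and offset; the multiplicative components
        # only record which iterators were used, so they are not compared.
        return self.base == other.base and self.offset == other.offset

class CompositeObject:
    # CO = (base, upperBnd, AETypeMap); AETypeMap is kept ordered by AE offset.
    def __init__(self, base, upper_bnd=None):
        self.base = base
        self.upper_bnd = upper_bnd
        self.ae_type_map = []  # list of (AE, simple type) pairs

    def insert(self, ae, simple_type):
        for existing, _ in self.ae_type_map:
            if existing.same_element(ae):
                existing.mcs.extend(ae.mcs)  # AE equivalence aggregation (Eq. 19)
                return
        self.ae_type_map.append((ae, simple_type))
        self.ae_type_map.sort(key=lambda item: item[0].offset)  # ae1 < ae2 iff ae1.o < ae2.o

# Single global container for COs whose bases are statically known addresses.
glob_composite = {}  # base address -> CompositeObject

def aggregate(ae, simple_type="int"):
    # Base equivalence aggregation: find or create the CO with the same base.
    co = glob_composite.setdefault(ae.base, CompositeObject(ae.base))
    co.insert(ae, simple_type)
    return co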
COs with symbolic bases are in containers associated to the currently processed function. When algorithm gets AE from the binary tree root node, it finds the corresponding CO based on the base, or creates a new one. AE is inserted in the correct position in AETypeMap (aggregated with equal AE if needed), and associated type is included in the simple data type propagation. Described system implements original aggregation Equation 3, and performs an additional element ordering into the correct composite object layout. Running example: The address expressions in Equations 13 through 15 form an object with the address base 1000. It represents the structure structure. CO = (base, upperBnd , AETypeMap) AETypeMap = [ (AE 1 , type 1 ), . . . , (AE n , type n ) ] (20) High Quality Types Aggregation The composite type recovery is necessary, even if precise data types are inferred from an additional type source (e.g. debug info, known library function call). Object’s type itself is useless, if it is unknown 70 Composite Data Type Recovery in a Retargetable Decompilation where and how it is accessed. This information is provided by the composite objects’ address expressions, which in return initialise their types from the precise source. Set Upper Object Bound Since global objects are in an ascending ordered, their upper bound is determined by the distance to the closest subsequent object. Note: similar principle is used in all papers presented in Section 2. Running example: Object for structure is at address 1000, object for a2d is at 1444. structure’s upper bound is (1444 − 1000) = 444. Composite Object Aggregation All elements of a local composite object are always accessed using the same symbolic base object, or its copy. Therefore, all elements’ AEs are correctly aggregated according to subsection Base Equivalence Aggregation. However, an element of a global composite object may be accessed directly by its real address, instead of CO’s base address and element’s offset. Such access creates a separate composite object and the address based aggregation is not possible. This situation can sometimes be solved by the aggregation based on the upper bound of memory region occupied by any ae1 from the obj 1 object. If there exists object obj 2 whose first AE belongs to this region, then obj 2 is merged with obj 1 . Bound is determined by ae1 ’s base address, offset, multiplication constant and the maximum iterator value. The last one is computed by a value range analysis on loop’s induction variable. Utilisation of a value range analysis is a novel technique introduced by this paper. Running example: AE created for access on line 9 is ae 2 = (1220, 0, [ ]). However, there is ae 3 from Equation 15. Based on line 8, the value range analysis determines the maximum iterator value equals to 9 and that ae 3 in fact contains ae 2 . The old AE is discarded and (1000, 220, [ ]) is added to the structure object. Random Array Access Aggregation So far, we assumed that every array access uses iteration. However, there also might be a number of random access operations. Ideally, iterator and random accesses should be joined together, since they are accessing the same composite object member. Following the logic of the previous aggregation, it is possible to use value range analysis once again to detect that the random access is in fact part of some previous array address expression. 
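Both value-range-based merges described above can be sketched in a few lines of Python. This is only our illustration: the helper names are ours, max_iter stands in for the result of the value range analysis on the loop induction variable, and the iterated AE is assumed to carry a single multiplicative component (C, iterator), as in the running example.

def region_covers(base, offset, c, max_iter, addr):
    # The iterated access (base, offset, [(c, iterator)]) touches the addresses
    # base+offset, base+offset+c, ..., base+offset+c*max_iter.
    low = base + offset
    return low <= addr <= low + c * max_iter

def try_rebase(obj_base, arr_ae, direct_ae, max_iter):
    # If the iterated AE covers a directly addressed AE, express the direct
    # access relative to obj_base so both land in the same composite object.
    base1, off1, c1 = arr_ae       # e.g. (1000, 44, 44) for ae3
    base2, off2 = direct_ae        # e.g. (1220, 0) for the access on line 9
    addr = base2 + off2
    if region_covers(base1, off1, c1, max_iter, addr):
        return (obj_base, addr - obj_base, [])  # rebased AE; the old, separate CO is discarded
    return None                                 # otherwise keep it as a separate CO

With the numbers of the running example above, try_rebase(1000, (1000, 44, 44), (1220, 0), 9) returns (1000, 220, [ ]), the address expression that was added to the structure object.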
Running example: Picking ae3 = (1000, 44, [(44, r1)]) from Equation 15, the address expression (1000, 220, [ ]) created in the previous subsection is further aggregated into ae3 = (1000, 44, [ [(44, r1)], [(44, 4)] ]). Now it is clear that the original AE is in fact a random array access into the fourth element of the structure.a1 member.

Multidimensional Array Aggregation The most complex step is the generalisation of the original Equation 4 for multidimensional arrays, shown in Equation 21. So far, offsets were relative to the object base, i.e. to the first address expression. In this aggregation, each AE's offset o is replaced by the offset list OL = [o, 0, . . . , 0]. Its size is n and it contains o as the first element. Another list CL = [C1, . . . , Cn], which contains all the unique multiplication constants from the AE's multiplicative component in descending order, is also created. Application of Equation 21 on ae1 (associated lists: CL1, OL1) fills the subsequent ae2's offset list OL2 with values relative to ae1. If two offsets o1_i, o2_i in OL1 and OL2 at the same position i are equal, then ae1 and ae2 are part of the same structure at level i. However, they are not the same element, so they must differ at some other position greater than i.

∀ i ∈ {1, . . . , n} : o1_i ∈ OL1, o2_i ∈ OL2, C1_i ∈ CL1, o2_i − o1_i < C1_i :
    o2_{i+1} = o2_i − o1_i,   o2_i = o1_i                                   (21)

Running example: Equations 16 through 18 depict ae1, ae2 and ae3 after the multidimensional array aggregation. Offset values o are replaced by OL lists. The first offset in both ae2 and ae3 is the same, which indicates that these two elements are part of the same nested structure located at offset 4 in the composite object at address 1000. The second offset distinguishes them within this structure. The first iterator, associated with the constant 44, is used to index the nested array of structures named a1. The second iterator in ae2 indicates that a2 is in fact an array of 4-byte elements.

Array Bounds At the end, when none of the above aggregations can be applied, the algorithm infers array bounds in several different ways. If multiple bounds are computed for a single array, the maximum value is preferred. In this case, it is safer to over-approximate, since under-approximation may cause out-of-range array accesses.
1. If the AE's multiplication constant list CL = [C1, . . . , Cn] has at least two elements, and the AE is a multidimensional array (not a nested structure), the nested arrays' bounds are: ∀ i ∈ {2, . . . , n} : the bound associated with Ci is equal to Ci−1/Ci. Running example: a2d's second bound is (C1/C2) = (120/4) = 30.
2. The bound of the first dimension C1 of ae1 is inferred from the subsequent ae2 that is not part of the same nested structure as ae1. If such an AE exists, then the bound associated with C1 is (ae2.o − ae1.o)/C1. Running example: the structure.a1[X].a2 bound is (44 − 4)/4 = 10.
3. If an array expression ae1 is the last in a global object obj1, then the object's upper bound gives an upper array size estimate: (obj1.upperBnd − ae1.o)/C1. Running example: the structure.a1 bound is (444 − 4)/44 = 10.
4. Initialised global arrays have their values stored in data sections. The algorithm iterates over memory slots from the object's start and, based on the element type, determines whether the current slot still belongs to the array. This is feasible only for data types with distinguishable values, such as pointers or strings; otherwise it would not be possible to identify the array's end.
5. Local arrays are generally initialised by one of two methods. (1) Short arrays have their elements filled one by one using direct assignment instructions. Such a sequence can be detected and used to infer the array's size. (2) Larger arrays are initialised by a memory copy routine. It is possible to compute the size of the copied data and the number of array elements it represents.

6 Experimental Results

In this section, we present an accuracy evaluation of the presented composite data type recovery algorithm. We test our solution on a set of 10 programs which contain reasonably complex data types. All the programs were originally written in the C programming language, and we decompile them to C as well. Our results are compared with the outputs of the most widely used decompiler today, the Hex-Rays decompiler (version 1.9, IDA disassembler version 6.6). Each program was compiled for three different architectures (MIPS, x86, ARM), two optimisation levels (-O0 and -O2), and the ELF object file format. The compilers used and their versions are: psp-gcc version 4.3.5 for MIPS; gcc version 4.7.2 for x86; and gnuarm-gcc version 4.4.1 for ARM. We compiled our test suite with debugging information enabled, but used it only for the automatic type comparison.

Detailed results are shown in Table 1. Figure 4 summarises the overall success rate for each combination of architecture and optimisation level. Since Hex-Rays supports only x86 and ARM decompilation, MIPS samples were reversed only by our retargetable decompiler. The composite object count takes into account any local or global object defined as an array or structure, and most of the pointers (those that were processed by the composite type analysis). The results were verified automatically using our similarity comparison tool.

Table 1: Composite data type recovery experiment results. Tests cat.c and wc.c are sources of the well-known UNIX utilities. All the other tests are from the MiBench embedded benchmark suite [14]. LD stands for the Lissom project's retargetable decompiler, HR for the Hex-Rays decompiler.

                             Original   Reconstructed obj. count
Test name          C source  composite  MIPS-O0  MIPS-O2  x86-O0    x86-O2    ARM-O0    ARM-O2
                   lines     obj. count LD       LD       LD   HR   LD   HR   LD   HR   LD   HR
aes.c                   830         15  13       12       12   11   11   11   11   11    9   10
cat.c                   256          3   3        3        3    3    3    3    3    3    2    2
crc_32.c                 84          2   2        2        2    1    2    1    2    2    1    1
dijkstra_large.c        175          8   6        5        6    6    5    6    5    6    5    6
md5sum.c                619         24  19       17       17   16   14   16   15   17   14   16
pbmsrch_large.c        2757          8   7        7        6    7    5    6    7    8    6    7
pgsort.c                492         15  13       12       13   13   11   11   13   12   12   12
sha2.c                  260         14  12       11       10   11    8   10    9   11    8   10
stringsearch_2.c       2993         18  15       14       15   13   13   12   13   14   12   13
wc.c                    270          4   3        3        3    3    2    3    2    3    1    3
Σ                                  111  93       86       87   84   74   79   80   87   70   80
Success rate [%]                        84       77       78   76   67   71   72   78   62   72

Fig. 4: Composite data type recovery success rates comparison between the Lissom project's retargetable decompiler and the Hex-Rays decompiler. (Bar chart; x-axis: architecture and optimisation level, MIPS-O0 through ARM-O2; y-axis: success rate in %.)

We can see that the quality of our type recovery analysis is comparable with that of the state-of-the-art Hex-Rays decompiler. However, there is a notable accuracy decline from MIPS through x86 to ARM, and between the -O0 and -O2 optimisation levels. This is caused by the increased complexity of the analysed LLVM IR code. The ARM and x86 processor models are more complicated than the MIPS model, which causes the decoder to generate harder-to-analyse LLVM IR code; more aggressive optimisation levels have the same effect. The solution is to further improve our symbolic interpreter, since the failure to build a correct binary tree is the most common cause of failed composite data type recovery.

7 Conclusion

This paper presents a composite data type recovery technique based on (1) the symbolic interpretation and (2) the address expression aggregation of memory access addresses. It presents a novel approach for the first task, and significantly expands existing methods of aggregation for the second task. The design is suitable for the analysis of LLVM IR code in the SSA form, which contains a huge number of temporary objects. The presented techniques were implemented in the Lissom project's retargetable decompiler and tested on a set of programs compiled for multiple processor architectures.

Our solution is capable of reconstructing composite data types in the following cases:
– Alternately nested (arrays and structures are alternating) global objects in data sections, whose addresses can be statically computed.
– Alternately nested dynamically allocated objects on the heap.
– Local arrays (and structures/arrays nested in them) on the stack, if they are accessed in iteration.

It is, however, not able to reconstruct the following situations:
– Directly nested structures, since they were inlined by the compiler and their elements use the same base as the parent structure's members.
– Local structures on the stack, since the whole stack behaves like a single structure: all of its elements are accessed using the frame pointer as the common base.

Recovery of such constructs is possible only if additional information such as debug data or known library function signatures is utilised. In the future, we plan to further improve our success rates, especially on the x86 and ARM architectures and at higher optimisation levels. With improved symbolic interpretation, we should be able to reach the MIPS accuracy rates across all of the supported platforms. We also plan to implement reconstruction of high-level composite data types of the C++ object-oriented language (e.g. classes and related mechanisms).

Acknowledgements This work was supported by the BUT FIT grant FIT-S-14-2299, and by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070).

References
1. Mycroft, A.: Type-based decompilation. In: Programming Languages and Systems, 8th European Symposium on Programming, Amsterdam, The Netherlands, 22-28 March, 1999, Proceedings. Volume 1576 of Lecture Notes in Computer Science, Springer (1999) 208–223
2. Emmerik, M.V.: Static single assignment for decompilation. (2007)
3. Lee, J., Avgerinos, T., Brumley, D.: TIE: Principled reverse engineering of types in binary programs. In: NDSS, The Internet Society (2011)
4. Hex-Rays Decompiler. www.hex-rays.com/products/decompiler/ (2013)
5. IDA Disassembler. www.hex-rays.com/products/ida/ (2012)
6. Dolgova, E.N., Chernov, A.V.: Automatic reconstruction of data types in the decompilation problem. Program. Comput. Softw. 35(2) (2009) 105–119
7. Troshina, K., Derevenets, Y., Chernov, A.: Reconstruction of composite types for decompilation.
In: Proceedings of the 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation. SCAM ’10 (2010) 179–188 8. Lissom. http://www.fit.vutbr.cz/research/groups/lissom/ (2013) 9. Ďurfina, L., Křoustek, J., Zemek, P., Kolář, D., Hruška, T., Masařı́k, K., Meduna, A.: Design of a retargetable decompiler for a static platform-independent malware analysis. In: 5th International Conference on Information Security and Assurance (ISA’11). Volume 200 of Communications in Computer and Information Science., Berlin, Heidelberg, DE, Springer-Verlag (2011) 72–86 75 D. Kolář and P. Matula 10. Retargetable Decompiler. http://decompiler.fit.vutbr.cz/ (2014) 11. Křoustek, J., Matula, P., Kolář, D., Zavoral, M.: Advanced preprocessing of binary executable files and its usage in retargetable decompilation. International Journal on Advances in Software 2014(1) (2014) 1–11 12. The LLVM Compiler Infrastructure. http://llvm.org/ (2013) 13. Matula, P., Kolář, D.: Reconstruction of simple data types in decompilation. In: 4th International Masaryk Conference for Ph.D. Students and Young Researchers (MMK’13). (2013) 14. MiBench version 1.0. http://www.eecs.umich.edu/mibench/ (2012) 76 Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA Vlastimil Košař and Jan Kořenek IT4Innovations Centre of Excellence Faculty of Information Technology Brno University of Technology Božetěchova 2, Brno, Czech Republic {ikosar, korenek}@fit.vutbr.cz Abstract. Regular expression matching is a time critical operation for any network security system. The NFA-Split is an efficient hardware architecture to match a large set of regular expressions at multigigabit speed with efficient FPGA logic utilization. Unfortunately, the matching speed is limited by processing only single byte in one clock cycle. Therefore, we propose new multi-stride NFA-Split architecture, which increases achievable throughput by processing multiple bytes per clock cycle. Moreover, we investigate efficiency of mapping DU to the FPGA logic and propose new optimizations of mapping NFA-Split architecture to the FPGA. These optimizations are able to reduce up to 71.85 % of FPGA LUTs and up to 94.18 % of BlockRAMs. 1 Introduction Intrusion Detection Systems (IDS) [1–3] use Regular Expressions (RE) to describe worms, viruses and network attacks. Usually, thousands of REs have to be matched in the network traffic. Current processors don’t provide enough processing power for wire-speed RE matching at multigigabit speed [4]. Therefore, many hardware architectures have been designed to accelerate this time critical operation [5–7]. Usually, hardware architectures are able to achieve high speed only for small sets of REs due to the limited FPGA resources or capacity of available memory. Hardware architectures based on Deterministic Finite Automata (DFA) [4, 8, 5] are limited by the size and speed of the memory, because the determinisation of automaton significantly increases the number of states and size of the transition table. Architectures based on Nondeterministic Finite Automata (NFA) [9, 6, 10] are limited by the size and capacity of FPGA chips since the transition table is mapped directly into the FPGA logic. With the growing amount of attacks, worms and viruses, security systems have to match more and more REs. It means that the amount of required FPGA logic increases not only due to the increasing speed of network links, but also due to the growth of RE sets. 
Therefore, it is important to reduce the amount of consumed FPGA resources to support more REs. A lot of work has been done in this direction. FPGA resources have been decreased by a shared character 77 V. Košař and J. Kořenek decoder [6], infix and suffix sharing [7], by better representation of the counting constraint [10] and by the NFA reduction techniques [11, 12]. High reduction has been achieved by NFA-Split architecture [13, 14], which splits NFA to deterministic and nondeterministic parts in order to optimize mapping to the FPGA. The NFA-Split architecture reduces FPGA logic at the cost of on-chip memory (BlockRAMs). As some kinds of REs can increase the size of transition table and require a lot of on-chip memory, we have recently introduced optimization [15], which uses a k-inner alphabet to reduce on-chip memory requirements. NFA-Split is designed to process only one byte of input stream in every clock cycle. The matching speed can be increased by increasing the operating frequency, but for the FPGA the frequency is limited to hundreds of megahertz. Consequently, the NFA-Split architecture cannot scale the throughput to tens of gigabits. To achieve higher matching speed, it is necessary to improve the architecture to support multi-stride automaton and to accept multiple bytes per clock cycle. Therefore, we propose an NFA-Split architecture for multi-stride automata, which requires significantly less FPGA resources in comparison to other multi-stride architectures. For the largest Snort backdoor module, the proposed architecture was able to reduce the amount of FPGA lookup tables (LUTs) by 58 %. Moreover, we investigate the efficiency of mapping the DU to the FPGA logic and propose new optimizations of mapping deterministic and nondeterministic parts of a NFA to FPGA. Both optimizations are able to reduce up to 71.85 % of FPGA LUTs and up to 94.18 % of FPGA BlockRAMs. The paper consists of six sections. Brief summary of the related work is described after the introduction. Then NFA-Split architecture for multi-stride automata is introduced in the third section. Optimizations of the NFA-Split architecture are described in the section four, experimental results are presented in the section five and conclusions in section six. 2 Related Work One of the first methods of mapping the NFA to the FPGA was published by Sidhu and Prasanna [9]. A dedicated character decoder was assigned to each transition. Clark and Schimmel improved the architecture by shared decoders of input characters and sharing of prefixes [16]. Lin et al. created an architecture for sharing infixes and suffixes, but did not specify a particular algorithm to find them [7]. Sourdis et al. published [10] an architecture that allows sharing of character classes, static subpatterns and introduced components for efficient mapping of constrained repetitions to the FPGA. Current efficient solutions for regular expression matching on common processors (CPUs), graphics processing units (GPUs) and application-specific integrated circuits (ASICs) are based on NFAs. An NFA based architecture for ASICs was recently introduced in [17]. It is capable processing input data at 1 Gbps. A solution for GPUs capable of processing rule-sets of arbitrary complexity and size is based on NFA. [18]. However, this architecture has unpredictable performance (950 Mbps - 3.5 Gbps for 8-stride NFA). A NFA based solution for 78 Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA CPUs was introduced in [19]. 
It provides considerable best-case performance on high-end CPUs (2 - 9.3 Gbps on two Intel Xeon X5680 CPUs with total of 12 cores running on 3.33 GHz). Algorithms based on DFA seek various ways to limit the impact of state explosion of the memory needed to store the transition table. Delay DFA introduced in [20] extended the DFA by default transitions. The default transitions limited a redundancy caused by similarity of output transitions of different states. Content Addressed Delayed Input DFA [5] improved the throughput of the previous methodology by content addressing. The concept of Delay DFA is further refined in [8]. Extended Finite Automaton [21] extends the DFA by a finite set of variables and instructions for their manipulations. Hybrid methods combine DFA and NFA parts to use the best of their respective properties. Becchi introduced hybrid architecture [22] that splits the automaton to a head-DFA and tail-NFAs. The head-DFA contains frequently used states, while the tail-NFA contains the others. NFA-Split architecture [13, 14] is designed for FPGA technology. It utilizes properties of REs in IDS systems and significantly reduces FPGA resources in comparison to other NFA based architectures. As the NFA created from REs has usually only a small subset of states that can be active at the same time, the architecture splits the NFA into several DFA parts and one NFA part. The DFA parts contain only states that cannot be active at once. Therefore, these parts can be efficiently implemented as a standard DFA in a Deterministic Unit (DU) with binary encoded states. States in the NFA part are mapped to Nondeterministic Unit (NU), where every state is represented by a dedicated logic (register and next state logic). Therefore, new state value can be computed in parallel in every clock cycle. We have improved the NFA-Split architecture by k-inner alphabet in [15], which decreases the on-chip memory requirements for matching RE with character classes. Character classes can be specified in REs to define set of characters. In the automaton, the transition on character class has to be represented by a set of transitions on individual characters. This can significantly increase the number of transitions and thus the size of a memory to store the automaton. The k-inner alphabet allows representing a character class by only one internal symbol and only one transition. Thus, less memory is needed to store the automaton. 3 NFA-Split Architecture for Multi-Stride Automata Even while the NFA-Split architecture is highly optimized, requires only reasonable memory and provides high matching speed, it still doesn’t support multi-stride automata to match multiple characters in a single clock cycle. Consequently, it cannot scale its matching speed well. Therefore, we propose an extension of the NFA-Split architecture to support multi-stride automata. Moreover, we provide optimizations of mapping the DFA and NFA parts to further reduce the FPGA logic in order to map larger set of RE to the FPGA. 79 V. Košař and J. Kořenek We propose the necessary modifications of the NFA-Split architecture to support multi-stride automata (SNFA-Split) and to make the matching speed scalable to tens of gigabits. The method of creating the multi-stride automaton is based on performing all consecutive transitions from one state to all states reachable in the number of steps equal to the desired stride. Symbols along these consecutive transitions are merged into one multi-stride symbol. 
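To make the construction concrete, the following toy Python fragment composes transitions over k consecutive steps and concatenates their symbols into one k-tuple. It is only our sketch of the idea stated above: it ignores ε-transitions and the handling of paths that reach a final state in fewer than k steps, which a real construction has to deal with, and all names are ours.

def k_stride(delta, k):
    # delta maps a state to a list of (symbol, next_state) pairs; a symbol may
    # also stand for a character class. The result is a transition map in which
    # a single transition consumes k input symbols at once.
    strided = {}
    for state in delta:
        paths = [((), state)]                    # (consumed symbols, reached state)
        for _ in range(k):
            paths = [(syms + (sym,), nxt)
                     for syms, cur in paths
                     for sym, nxt in delta.get(cur, [])]
        strided[state] = paths
    return strided

# A two-state fragment accepting "ab": with k = 2, the single 2-stride
# transition from state 0 consumes ('a', 'b') at once.
delta = {0: [('a', 1)], 1: [('b', 2)], 2: []}
print(k_stride(delta, 2)[0])                     # [(('a', 'b'), 2)]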
Multi-stride automata usually have more transitions due to the larger number of symbols [23, 11].

Algorithm 1: Compute all pairs of simultaneously active states. The algorithm uses the intersection operation ∩k.
Input: NFA M = (Q, Σ, δ, s, F)
Output: Set of pairs of simultaneously active states concurrent ⊆ {(p, q) | p, q ∈ Q, p ≠ q}

normalize(q1, q2) = (q1 < q2) ? (q1, q2) : (q2, q1);
concurrent = {(s, s)};
workplace = {(s, s)};
while ∃(q1, q2) ∈ workplace do
    workplace = workplace \ {(q1, q2)};
    foreach q3 ∈ δ(q1, a) do
        foreach q4 ∈ δ(q2, b) do
            if a ∩k b ≠ (∅, ∅, ..., ∅) then
                if ((q5, q6) = normalize(q3, q4)) ∉ concurrent then
                    concurrent = concurrent ∪ {(q5, q6)};
                    workplace = workplace ∪ {(q5, q6)};
return concurrent \ {(p, p) | p ∈ Q}

To accept multiple characters at once, we have to change the construction of the NFA-Split architecture. First, it is necessary to modify the algorithm from [15], which identifies the simultaneously active states in the NFA. The algorithm tests whether two symbols are equal. To perform this operation, character classes have to be expanded to individual characters. Then the number of transitions can be increased up to 2^n times, where n is the data width of input characters (usually n = 8). The situation is even worse for the multi-stride automaton, where the number of transitions can be increased up to 2^(kn) times, where n is the data width of one input character and k is the number of input characters accepted at once. To avoid this transition growth, we can preserve character classes in symbols and replace the exact comparison by the intersection operation ∩k. The inputs of the operation ∩k are two multi-stride symbols defined as k-tuples (A1, A2, ..., Ak) and (B1, B2, ..., Bk), where the items Ai, Bi are subsets of the input alphabet, Ai, Bi ⊆ Σ. A subset Ai or Bi can represent an individual character or a character class. The result of the operation is the k-tuple C = (C1, C2, ..., Ck) defined by Eq. 1.

(C1, C2, ..., Ck) = (A1 ∩ B1, A2 ∩ B2, ..., Ak ∩ Bk)   (1)

The k-tuple (C1, C2, ..., Ck) contains the sets of input symbols that are included in both input k-tuples. If any item Ci is equal to ∅ (the empty set), then the two input symbols A and B cannot be expanded to the same k-tuple of characters. This means that no pair of expanded symbols from the k-tuples A and B is equal. We denote this situation as ∩k = (∅, ∅, ..., ∅) in Algorithm 1, which is the modified algorithm for detecting the simultaneously active states in a multi-stride automaton. The identification of deterministic and nondeterministic parts in the NFA can then remain the same as in the original NFA-Split architecture.

The DU architecture must also be modified to support multi-stride automata. In this paper, we consider the architecture of the DU introduced in [15], which utilizes k-inner alphabets in order to reduce memory requirements. This architecture can be easily extended to support multi-stride automata. As can be seen in Fig. 1, the architecture remains the same except for the first component, which transforms input symbols to k inner alphabets. The component is marked by a dotted line and is able to join input symbols. For the automaton A = (Q, Σ, δ, q0, F), two symbols a, b ∈ Σ can be joined only if ∀q ∈ Q : δ(q, a) = δ(q, b). The proposed architecture uses BlockRAMs [23] as tables that provide an efficient transformation of input symbols to the k-inner alphabets. Fig. 1.
Overview of DU architecture for SNFA-Split with k-inner alphabets and n input symbols accepted at once. Dotted line is used to mark the new component to support multi-stride automata. The nondeterministic part of the NFA-Split architecture for multi-stride automata is based on shared decoder architecture for multi-stride NFA as has been introduced in [6]. 81 V. Košař and J. Kořenek 4 Optimizations of the NFA-Split Architecture In the previous paper [15], we have presented reduction of memory and time complexity of the NFA-Split architecture. In this section, we consider optimizations of DUs and NUs in terms of efficiency of FPGA resource utilization. First, we analyze the efficiency of state representation in DU and NU. Then we propose an optimization of mapping the states to DU and several optimizations of mapping NU to the FPGA. 4.1 Optimization of Deterministic Parts of NFA In this paper, we investigate the efficiency of mapping the DU to FPGA logic. The main factor influencing the resource utilization of the DU is the size of input encoding and output decoding logic. The size of the logic depends on the number of transitions to/from the DU. To analyze the number of transitions to/from the DU, we define continuous parts of the automaton as sets of states if: 1. All states are represented by the DU. 2. All states are reachable from some input state of the DU. 3. All states together with input/output transitions form a continuous graph of transitions. As NFA-Split doesn’t use continuous parts to derive mapping of states to DU and NU, we have to define a procedure how to identify continuous parts. First, we have to select an input state of the DU and then traverse along the transitions until a final state or a transition to NU is reached. If some state has more than one input transition, we have to traverse backwards until input transition to the DU is reached. Similarly, if some state has more than one output transition, then we have to traverse forward through all output transition until final state or transition to NU is reached. This procedure is finished if no new state can be added. The input states in already recognized continuous parts are not used for detection of next continuous parts. We have analyzed the size of continuous parts for Snort backdoor rule set. The result is a histogram in Fig. 2. The x-axis represents the size of continuous parts and the y-axis represents the number of parts of that size. It can be seen that many continuous parts are very small (usually 1 to 3 states). Resource utilization for input encoders and output decoders associated with those very small continuous parts can be larger than the amount of logic resources needed for implementation of those parts in NU. This holds also for continuous part of any size with large number of inputs and outputs. Therefore, we define the Eq. 2 to have a simple condition when it is better to include continuous part pi in the DU. costinputs (pi ) + costoutputs (pi ) < costN U (pi ) (2) This means that the cost of input/output encoding has to be lower than the cost of implementation in the NU. The cost function costinputs (pi ) computes 82 Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA 50 Original 50 Eliminated 25 40 20 30 30 15 20 20 10 10 10 5 Count 40 Optimized 00 10 20 30 40 50 60 00 10 20 30 40 50 60 00 10 20 30 40 50 60 Size of continuous part Fig. 2. Histogram of continuous parts in DU for Snort backdoor rule set. 
Distribution according to the size is provided for all parts in DU (Original), parts in DU after optimization (Optimized) and parts removed from DU because of inefficiency (Eliminated). number of LUTs necessary to implement one-hot to binary encoding for input transitions. The cost function costoutputs (pi ) computes number of LUTs necessary to implement binary to one-hot encoding for output transitions and the cost function costN U (pi ) computes number of LUTs needed to map the continuous part pi into NU. Application of the Eq. 2 on the DU has direct impact to efficiency of the DU. Therefore, continuous parts violating Eq. 2 are better to be kept in NFA and represented by NU. Sizes of eliminated and optimized continuous parts are shown on histograms in Fig. 2. Eliminated parts are removed from the DU to the NU. Optimized continuous parts remain in the DU. Characteristics of continuous parts of DU for various sets of RE before and after the DU optimization as well as characteristics of eliminated parts are shown in Table 1. Column Original contains characteristics before the optimization, column Optimized contains characteristics after the optimization and column Eliminated contains characteristics of continuous parts removed by the optimization. Three characteristics were measured: Number of continuous parts (Parts), Average size of continuous part (AS) and Average ratio between inputs/outputs and size of continuous part (AIOS). The sets of REs come from the Snort IDS [1] modules and from the L7 decoder [24]. Optimized DU has larger average size of continuous part and smaller average ratio between inputs/outputs and size of continuous part. 83 V. Košař and J. Kořenek Table 1. Characteristics of continuous parts of DU for various sets of REs before and after the DU optimization. RE set Original Optimized Eliminated Parts AS AIOS Parts AS AIOS Parts AS AIOS [-] [-] [-] [-] [-] [-] [-] [-] [-] L7 selected 53 10.66 0.28 37 14.51 0.21 16 1.75 1.57 L7 all 369 9.09 0.85 192 13.05 0.11 177 4.79 3.04 backdoor 284 9.96 0.32 208 12.37 0.13 76 3.39 2.30 web-php 21 10.67 0.30 19 11.68 0.14 2 1.00 18.00 ftp 42 3.55 1.26 19 4.89 0.39 23 2.43 2.70 netbios 40 7.72 0.28 13 20.46 0.08 27 1.59 1.53 voip 61 14.23 0.29 37 21.68 0.11 24 2.75 2.53 web-cgi 16 35.25 0.09 9 61.67 0.03 7 1.29 3.89 4.2 Efficient Encoding of the Nondeterministic Part The encoding of the nondeterministic part into the NU utilizes the shared decoder architecture [25]. As we have shown in the previous chapter, some specific continuous parts of the DU can be moved into the NU to improve efficiency of mapping. Therefore, it is possible to utilize the properties of eliminated continuous parts and use more efficient encoding of the NU. We propose to use the At-most two-hot encoding (AMTH) introduced in [26], because it implements small 3 states parts efficiently with two LUTs and two flip-flops (FFs). The mapping of constrained repetitions in the FPGA architecture also requires optimization. In the current NFA-Split architecture the constrained repetitions can be encoded either in the DUs or in the NU, depending on the type of the repetition and the NFA structure. The encoding of the Perl compatible RE (PCRE) into the DU is inefficient, because the counting constraint is represented by many states and transitions and size of the automaton is increased significantly. For example, PCRE /^[abc]{100}/ in DU needs about 300 rows of the transition table. The NU architecture with a shared decoder is also not efficient. 
It consumes 100 FFs and 100 LUTs for the same example. The usage of the AMTH improves the efficiency (67 FFs and 67 LUTs). However, special subcomponents for constrained repetitions introduced in [10] are more efficient in logic utilization. Therefore, we propose to represent constrained repetition by this dedicated component in NU. 5 Evaluation We performed the evaluation mizations on a selected set of decoder [24]. All used sets of Netbench framework has been 84 of proposed SNFA-Split architecture and optiSnort IDS [1] modules and set of REs from L7 REs come from Netbench framework [27]. The used to implement the proposed architecture to- Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA gether with optimizations and to make a comparison with other FPGA based multi-stride architectures. Table 2. FPGA logic utilization of SNFA-Split and Clark multi-stride architectures. The results are for multi-stride automata with two and four input characters accepted at once. Statistics Rules L7 selected backdoor web-cgi misc ftp L7 selected backdoor web-cgi misc ftp Stride Clark SNFA-Split Inner REs Symbols LUT FF LUT FF BRAM Alphabets [-] [-] [-] [-] [-] [-] [-] [-] 29 2 2744 673 1884 182 8 4 154 2 7178 4383 3004 815 20 6 10 2 3456 1332 2688 738 9 2 17 2 3200 1294 2551 944 10 4 35 2 3774 1944 3104 1595 10 4 29 4 4776 678 3631 192 24 8 154 4 13509 4881 6137 1007 44 11 10 4 5608 1360 4598 738 12 4 17 4 5322 1331 4219 951 20 6 35 4 6060 2081 4166 1601 20 6 First, we have evaluated FPGA logic utilization of SNFA-Split architecture. The results for two and four input characters accepted at once are compared in Table 2 to the multi-stride architecture with shared decoder of input characters. The amount of utilized LUTs, FFs and 18 Kb BlockRAMs was estimated for the Xilinx Virtex-5 architecture. However, the SNFA-Split architecture is suitable for any FPGA. Column Statistics presents the number of REs in particular set of REs. Column Clark shows the estimated utilization for the multi-stride architecture with shared decoder of input characters. Column SNFA-Split indicates the estimated utilization for the SNFA-Split architecture. It can be seen that the SNFA-Split architecture is able to reduce the amount of utilized LUTs by 58 % for the largest backdoor module. The table also indicates how many inner alphabets were used, because the utilization of FPGA resources depends on the number of inner alphabets. We have also evaluated both proposed optimizations of NFA-Split architecture. Table 3 shows results of DU and NU optimizations. The amount of utilized FPGA resources is estimated. Column Statistics presents the number of REs in particular set of REs. Column Original shows the estimated utilization for the original NFA-Split architecture. Column Reduction indicates the reduction of FPGA resources by the proposed optimizations. It can be seen that the optimizations were able to achieve significant reduction of BlockRAMs and FPGA logic: 71.85 % LUTs for the nntp module and 94.18 % BlockRAMs for the voip module. This reduction is caused primarily by relocation of constrained repe85 V. Košař and J. Kořenek titions from the DU into the optimized NU. The dedicated subcomponents for the constrained repetitions are more efficient. It can be seen in the results that the reduction mainly depends on the presence of counting constraints (e.g., L7 does not contain any, while voip does and the big ones were placed in the DU) and structure of the automaton. 
The last row of the Table 3 presents average reduction of utilized resources for 22 sets of REs from both Snort IDS and L7 project. Table 3. Reduction of FPGA logic utilization by optimized DU and NU for in the NFA-Split Architecture. Statistics Original REs LUT FF BlockRAM Rules [-] [-] [-] [-] L7 selected 29 1003 182 4 L7 all 143 8035 2945 8 backdoor 154 1696 727 10 dos 3 803 119 2 ftp 35 2284 1590 2 misc 17 1651 941 2 nntp 12 3133 2483 2 web-cgi 10 1651 736 4 voip 38 1936 834 34 22 RE Sets 548 33421 12726 94 Reduction LUT FF BlockRAM [%] [%] [%] 1.10 -2.74 50 15.08 2.11 25 7.05 1.15 20 -4.36 12.61 0 51.16 89.62 0 37.72 88.42 0 71.85 96.69 0 39.83 88.59 50 35.18 77.46 94.18 21.93 56.67 42.55 Four-stride SNFA-Split architecture running at 150 MHz has worst-case (Malicious network traffic) throughput of 4.8 Gbps. It outperforms GPU based solution presented in [18]. Even single stride architecture with throughput of 1.2 Gbps outperforms the GPU solution for rule-set L7 all. The efficient CPU solution [19] outperforms four-stride SNFA-Split architecture when running on high-end CPUs. However, the results in [19] are measured for best-case situation (Regular network traffic). 6 Conclusion The paper has introduced the NFA-Split architecture optimization for multi-stride automata. The proposed architecture is able to process multiple bytes in one clock cycle. Therefore, RE matching speed can be increased despite frequency limits of current FPGAs. As can be seen in the Results section, the proposed multi-stride architecture utilizes up to 58 % less LUTs than multi-stride FPGA architectures with shared decoder. Consequently, additional REs can be supported. Moreover, we have proposed several optimizations of the NFA-Split architecture in order to further reduce FPGA resources. First optimization is focused on 86 Multi-Stride NFA-Split Architecture for Regular Expression Matching Using FPGA the overhead of encoding logic in DU. The optimization moves states from DU to NU, if the cost of encoding logic is higher than the cost of logic in the NU. The second proposed optimization is focused on NU mapping to the FPGA. At-most two-hot encoding and specific subcomponents for constrained repetitions are used to represent states and transitions relocated from DU to NU. Both optimizations are able to reduce up to 71.85 % of LUTs and up to 94.18 % of BlockRAMs. As future work, we want to investigate efficient pattern matching on 100-Gigabit Ethernet. Acknowledgment This work was supported by the IT4Innovations Centre of Excellence CZ.1.05/1.1.00/02.0070 and the BUT project FIT-S-14-2297. References 1. Snort: Project WWW Page. http://www.snort.org/ (2014) 2. The Bro Network Security Monitor: Project WWW Page. http://www.bro.org/ (2014) 3. Koziol, J.: Intrusion Detection with Snort. Sams, Indianapolis, IN, USA (2003) 4. Becchi, M., Crowley, P.: Efficient Regular Expression Evaluation: Theory to Practice. In: ANCS ’08: Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ACM (2008) 50–59 5. Kumar, S., Turner, J., Williams, J.: Advanced Algorithms for Fast and Scalable Deep Packet Inspection. In: ANCS ’06: Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, ACM (2006) 81–92 6. Clark, C.R., Schimmel, D.E.: Scalable Pattern Matching for High Speed Networks. In: FCCM ’04: Proceedings of the 12th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, IEEE Computer Society (2004) 249– 257 7. 
Lin, C.H., Huang, C.T., Jiang, C.P., Chang, S.C.: Optimization of Pattern Matching Circuits for Regular Expression on FPGA. IEEE Trans. Very Large Scale Integr. Syst. 15(12) (2007) 1303–1310 8. Becchi, M., Crowley, P.: A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization 10(1) (2013) 4:1–4:26 9. Sidhu, R., Prasanna, V.K.: Fast Regular Expression Matching Using FPGAs. In: FCCM ’01: Proceedings of the 9th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines, IEEE Computer Society (2001) 227– 238 10. Sourdis, I., Bispo, J., Cardoso, J.M.P., Vassiliadis, S.: Regular Expression Matching in Reconfigurable Hardware. Journal of Signal Processing Systems 51(1) (2008) 99–121 11. Becchi, M., Crowley, P.: Efficient Regular Expression Evaluation: Theory to Practice. In: Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ANCS ’08, New York, NY, USA, ACM (2008) 50–59 87 V. Košař and J. Kořenek 12. Košař, V., Žádnı́k, M., Kořenek, J.: NFA Reduction for Regular Expressions Matching Using FPGA. In: Proceedings of the 2013 International Conference on Field Programmable Technology, IEEE Computer Society (2013) 338–341 13. Kořenek, J., Košař, V.: Efficient Mapping of Nondeterministic Automata to FPGA for Fast Regular Expression Matching. In: Proceedings of the 13th IEEE International Symposium on Design and Diagnostics of Electronic Circuits and Systems DDECS 2010, IEEE Computer Society (2010) 6 14. Kořenek, J., Košař, V.: NFA Split Architecture for Fast Regular Expression Matching. In: Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, Association for Computing Machinery (2010) 2 15. Košař, V., Kořenek, J.: On NFA-Split Architecture Optimizations. In: 2014 IEEE 17th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS), IEEE Computer Society (2014) 274–277 16. Clark, C., Schimmel, D.: Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In: Field Programmable Logic and Application, 13th International Conference, Lisbon, Portugal (2003) 956–959 17. Dlugosch, P., Brown, D., Glendenning, P., Leventhal, M., Noyes, H.: An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing. IEEE Transactions on Parallel and Distributed Systems PP(99) (2014) 18. Cascarano, N., Rolando, P., Risso, F., Sisto, R.: iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Comput. Commun. Rev. 40(5) (2010) 20–26 19. Valgenti, V.C., Chhugani, J., Sun, Y., Satish, N., Kim, M.S., Kim, C., Dubey, P.: GPP-Grep: High-speed Regular Expression Processing Engine on General Purpose Processors. In: Proceedings of the 15th International Conference on Research in Attacks, Intrusions, and Defenses. RAID’12, Berlin, Heidelberg, Springer-Verlag (2012) 334–353 20. Kumar, S., Dharmapurikar, S., Yu, F., Crowley, P., Turner, J.: Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection. In: SIGCOMM ’06: Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ACM (2006) 339–350 21. Smith, R., Estan, C., Jha, S., Kong, S.: Deflating the Big Bang: Fast and Scalable Deep Packet Inspection With Extended Finite Automata. SIGCOMM Comput. Commun. Rev. 38(4) (2008) 207–218 22. 
Becchi, M., Crowley, P.: A Hybrid Finite Automaton for Practical Deep Packet Inspection. In: Proceedings of the 2007 ACM CoNEXT Conference. CoNEXT ’07, New York, NY, USA, ACM (2007) 23. Brodie, B.C., Taylor, D.E., Cytron, R.K.: A Scalable Architecture For HighThroughput Regular Expression Pattern Matching. SIGARCH Computer Architecture News 34(2) (2006) 191–202 24. L7 Filter: Project WWW Page. http://l7-filter.sourceforge.net/ (2014) 25. Kořenek, J.: Fast Regular Expression Matching Using FPGA. Information Sciences and Technologies Bulletin of the ACM Slovakia 2(2) (2010) 103–111 26. Yun, S., Lee, K.: Optimization of Regular Expression Pattern Matching Circuit Using At-Most Two-Hot Encoding on FPGA. International Conference on Field Programmable Logic and Applications 0 (2010) 40–43 27. Pus, V., Tobola, J., Kosar, V., Kastil, J., Korenek, J.: Netbench: Framework for Evaluation of Packet Processing Algorithms. Symposium On Architecture For Networking And Communications Systems (2011) 95–96 88 Computational Completeness Resulting from Scattered Context Grammars Working Under Various Derivation Modes Alexander Meduna and Ondřej Soukup Brno University of Technology, Faculty of Information Technology Centre of Excellence, Božetěchova 1/2, 612 66 Brno, Czech Republic [email protected], [email protected] Abstract. This paper introduces and studies a whole variety of derivation modes in scattered context grammars. These grammars are conceptualized just like classical scattered context grammars except that during the applications of their rules, after erasing n nonterminals, they can insert new substrings possibly at different positions than the original occurrence of the erased nonterminal. The paper concentrates its attention on investigating the generative power of scattered context grammars working under these derivation modes. It demonstrates that all of them are computationally complete– that is, they characterize the family of recursively enumerable languages. Keywords: scattered context grammars; alternative derivation modes; generative power; computational completeness. 1 Introduction The present section informally sketches scattered context grammars working under various new derivation modes and explains the reason why they are introduced. This section also describes how the paper is organized. At present, processing information in a discontinuous way represents a common computational phenomenon. Indeed, consider a process p that deals with information i. Typically, during a single computational step, p (1) reads n pieces of information, x1 through xn , in i, (2) erases them, (3) generate n new pieces of information, y1 through yn , and (4) inserts them into i possibly at different positions than the original occurrence of x1 through xn , which was erased. To explore computation like this systematically and rigorously, computer science obviously needs formal models that reflect it in an adequate way. Traditionally, formal language theory has always provided computer science with language-defining models to explore various information processors mathematically, so it should do so for the purpose sketched above as well. However, the classical versions of these models, such as grammars, work on words so they 89 A. Meduna and O. Soukup erase and insert subwords at the same position, hence they can hardly serve as appropriate models of this kind. 
Therefore, a proper formalization of processors that work in the way described above needs an adaptation of some classical well-known grammars so they reflect the above-described computation more adequately. At the same time, any adaptation of this kind should conceptually maintain the original structure of these models as much as possible so computer science can quite naturally base its investigation upon these newly adapted grammatical models by analogy with the standard approach based upon their classical versions. Simply put, while keeping their structural conceptualization unchanged, these grammatical models should work on words in newly introduced ways, which more properly reflect the above-mentioned modern computation. The present paper discusses this topic in terms of scattered context grammars, which definitely represent important language-generating grammatical models of computation. Indeed, the paper introduces a whole variety of derivation modes in scattered context grammars so they reflect the above-sketched computation in a more adequate way than the standard derivation mode. Recall that the notion of a scattered context grammar G represents a languagegenerating rewriting system based upon an alphabet of symbols and a finite set of rules. The alphabet of symbols is divided into two disjoint subalphabets—the alphabet of terminal symbols and the alphabet of nonterminal symbols. In G, a rule r is of the form (A1 , A2 , . . . , An ) → (x1 , x2 , . . . , xn ), for some positive integer n. On the left-hand side of r, the As are nonterminals. On the right-hand side, the xs are strings. G can apply r to any string u of the form u = u0 A1 u1 . . . un−1 An un where us are any strings. Notice that A1 through An are scattered throughout u, but they occur in the order prescribed by the left-hand side of r. In essence, G applies r to u so (1) it deletes A1 , A2 , . . . , An in u, after which (2) it inserts x1 , x2 , . . . , xn into the string resulting from the deletion (1). By this application, G makes a derivation step from u to a string v of the form v = v0 x1 v1 . . . vn−1 xn vn Notice that x1 , x2 , . . . , xn are inserted in the order prescribed by the right-hand side of r. However, they are inserted in a scattered way—that is, in between the inserted xs, some substrings vs occur. This paper partially introduces the results of larger study, which is currently being in progress and will be hopefully published soon. In this study, 9 derivation modes of scattered context grammars are defined, however, due to shortage of space, only a few selected modes are presented here. For consistence, their 90 Computational Completeness Resulting from Scattered Context Grammars numbering is preserved. The chosen modes are mutually dual or complementary to the others. (1) Mode 1 requires that ui = vi for all i = 0, . . . , n in the above described derivation step. (3) Mode 3 obtains v from u so it changes u by performing (3a) through (3c), described next: (a) A1 , A2 , . . . , An are deleted; (b) x1 and xn are inserted into u0 and un , respectively; (c) x2 through xn−1 are inserted in between the newly inserted x1 and xn . (5) In mode 5, v is obtained from u by (5a) through (5e), given next: (a) A1 , A2 , . . . , An are deleted; (b) a central ui is nondeterministically chosen, for some 0 ≤ i ≤ n; (c) x1 and xn are inserted into u0 and un , respectively; (d) xj is inserted between uj−2 and uj−1 , for all 1 < j ≤ i; (e) xk is inserted between uk and uk+1 , for all i + 1 ≤ k < n. 
(7) Mode 7 obtains v from u performing the steps stated below: (a) A1 , A2 , . . . , An are deleted; (b) a central ui is nondeterministically chosen, for some 0 ≤ i ≤ n; (c) xj is inserted between uj−2 and uj−1 , for all 1 < j ≤ i; (d) xk is inserted between uk and uk+1 , for all i + 1 ≤ k < n. This paper is organized as follows. Section 2 gives all the necessary notation and terminology to follow the rest of the paper. Then, Section 3 formally introduces all the new derivation modes in scattered context grammars. After that, Section 4 demonstrates that scattered context grammars working under any of the newly introduced derivation modes are computationally complete–that is, they characterize the family of recursively enumerable languages. 2 Preliminaries We assume that the reader is familiar with formal language theory (see [1, 2]). For a set W , card(W ) denotes its cardinality. Let V be an alphabet (finite nonempty set). V ∗ is the set of all strings over V. Algebraically, V ∗ represents the free monoid generated by V under the operation of concatenation. The unit of V ∗ is denoted by ε. Set V + = V ∗ − {ε}. Algebraically, V + is thus the free semigroup generated by V under the operation of concatenation. For w ∈ V ∗ , |w| and reversal(w) denote the length of w and the reversal of w, respectively. For L ⊆ V ∗ , reversal(L) = {reversal(w) | w ∈ L}. The alphabet of w, denoted by alph(w), is the set of symbols appearing in w. For v ∈ Σ and w ∈ Σ ∗ , occur(v, w) equals the number of occurrences of v in w. Let % be a relation over V ∗ . The transitive and transitive and reflexive closure of % are denoted %+ and %∗ , respectively. Unless explicitly stated otherwise, we write x % y instead (x, y) ∈ %. The families of regular languages, context-free languages, and recursively enumerable languages are denoted by REG, CF, and RE, respectively. Recall that scattered context grammars characterize RE (see [3]). 91 A. Meduna and O. Soukup 3 Definitions In this section, we define scattered context grammars and the following new derivation modes in scattered context grammars. Then, we illustrate them by examples. Definition 1. A scattered context grammar (an SCG for short) is a quadruple G = (V, T, P, S) where • V is an alphabet; • T ⊂V; • set N = V − T ; ∞ S ∗ ∗ ∗ • P ⊆ N1 × N2 × · · · × Nm × V1 × V2 × · · · × Vm m=1 is finite, where each Nj = N , Vj = V , 1 ≤ j ≤ m; • S ∈ N. V , T and N are called the total alphabet, the terminal alphabet and the nonterminal alphabet, respectively. P is called the set of productions. Instead of (A1 , A2 , . . . , An , x1 , x2 , . . . , xn ) ∈ P where Ai ∈ N , xi ∈ V ∗ , for 1 ≤ i ≤ n, for some n ≥ 1, we write (A1 , A2 , . . . , An ) → (x1 , x2 , . . . , xn ) S is the start symbol. t u Definition 2. Let G = (V , T , P , S) be an SCG, and let % be a relation over V ∗ . Set L(G, %) = {x | x ∈ T ∗ , S %∗ x} L(G, %) is said to be the language that G generates by %. Set SC(%) = {L(G, %) | G is an SCG} SC(%) is said to be the language family that SCGs generate by %. t u Definition 3. Let G = (V , T , P , S) be an SCG. Next, we define the following direct derivation relations 1⇒ through 9⇒ over V ∗ . First, let (A) → (x) ∈ P and u = w1 Aw2 ∈ V ∗ . Then, w1 Aw2 i⇒ w1 xw2 , i ∈ {1, 2, . . . , 9} Second, let (A1 , A2 , . . . , An ) → (x1 , x2 , . . . , xn ) ∈ P and u = u0 A1 u1 . . . An un , z, z 0 , ui , vi , wi ∈ V ∗ , for all 0 ≤ i ≤ n and 1 ≤ j ≤ n − 1, for some n ≥ 2, and u0 u1 . . . un = v0 v1 . . . vn . 
Then, 92 Computational Completeness Resulting from Scattered Context Grammars (1) u0 A1 u1 A2 u2 . . . An un 1⇒ u0 x1 u1 x2 v2 . . . xn un ; (3) u0 A1 u1 A2 u2 . . . An un 3⇒ v0 x1 v1 x2 v2 . . . xn vn , where u0 = v0 z, un = z 0 vn ; (5) u0 u1 A1 u2 A2 . . . ui−1 Ai ui Ai+1 ui+1 . . . An un z 5⇒ u0 x1 u1 x2 u2 . . . xi ui−1 ui ui+1 xi+1 . . . un xn z; (7) u0 A1 u2 A2 . . . ui−1 Ai ui Ai+1 ui+1 . . . An un 7⇒ u0 x2 u2 . . . xi ui−1 ui ui+1 xi+1 . . . un ; t u To illustrate the above-introduced notation, let G = (V , T , P , S) be an SCG; then, L(G, 5⇒) = {x | x ∈ T ∗ , S 5⇒∗ x} and SC(5⇒) = {L(G, 5⇒) | G is an SCG}. To give another example, SC(1⇒) denotes the family of all scattered context languages. 4 Generative Power In this section, for each defined derivation mode we investigate the generative power of SCGs using this mode. Lemma 1. Let L ⊆ Σ ∗ be any recursively enumerable language. Then, L can be represented as L = h(L1 ∩ L2 ), where h : T ∗ → Σ ∗ is a morphism and L1 and L2 are two context-free languages. For a proof, see [4]. 4.1 Mode 1 We prove that SCGs with mode 1 derivations characterize the family of recursively enumerable languages. Theorem 1. [3] SC(1⇒) = RE. Since SC(1⇒) ⊆ RE follows directly from the Church-Turing thesis, we only have to prove the opposite inclusion. Proof. Construction. Recall Lemma 1. By the closure properties of context-free languages, there are context-free grammars G1 and G2 that generate L1 and reversal(L2 ), respectively. More precisely, let Gi = (Vi , T, Pi , Si ) for i = 1, 2. Let T = {a1 , . . . , an } and 0, 1, $, S ∈ / (V1 ∪ V2 ∪ Σ) be the new symbols. Without any loss of generality, assume that V1 ∩ V2 = ∅. Define the new morphisms (4) f : ai 7→ h(ai )c(ai ); (1) c : ai 7→ 10i 1; ∗ (5) t : Σ ∪ {0, 1, $} → Σ, (2) C : V ∪ T → V ∪ Σ ∪ {0, 1} , 1 1 1 a 7→ a, a ∈ Σ, A 7→ A, A ∈ V1 , A→ 7 ε, A ∈ / Σ; a 7→ f (a), a ∈ T ; 0 ∗ (6) t : Σ ∪ {0, 1, $} → {0, 1}, (3) C : V ∪ T → V ∪ {0, 1} , 2 2 2 a 7→ a, a ∈ {0, 1}, A 7→ A, A ∈ V2 , A 7→ ε, A ∈ / {0, 1}. a 7→ c(a), a ∈ T ; Finally, let G = (V, Σ, P, S) be SCG, with V = V1 ∪ V2 ∪ {S, 0, 1, $} and P containing the rules 93 A. Meduna and O. Soukup (1) (2) (3) (4) (S) → ($S1 1111S2 $); (A) → (Ci (w)), for all A → w ∈ Pi , where i = 1, 2; ($, a, a, $) → (ε, $, $, ε), for a = 0, 1; ($) → (ε). Claim 1. L(G, 1⇒) = L. Proof. Basic idea. First the starting rule from (1) is applied. The starting nonterminals S1 and S2 are inserted into the current sentential form. Then, by using the rules from (2) G simulates derivations in both G1 and G2 and generates the sentential form w = $w1 1111w2 $. Suppose S 1⇒∗ w, where alph(w) ∩ (V1 ∪ V2 ) = ∅. If t0 (w1 ) = reversal(w2 ), then t(w1 ) = h(v), where v ∈ L1 ∩ L2 and h(v) ∈ L. In other words, w represents a successful derivation of both G1 and G2 , where the both grammars have generated the same sentence v, therefore G must generate the sentence h(v). The rules from (3) serve to check, whether the simulated grammars have generated the identical words. Binary codings of the generated words are erased while checking the equality. Each time the leftmost and the rightmost symbols are erased, otherwise some symbol is skipped. If the codings do not match, some 0 or 1 cannot be erased and no terminal string can be generated. Finally, the symbols $ are erased with the rule from (4), and if G1 , G2 , respectively, generated the same sentence and both codings were successfully erased, then the G has generated the terminal sentence h(v). t u For a rigorous proof, see [3]. 
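To make the coding and the equality check more tangible, the following Python sketch is a small illustration of our own (it is not part of the construction above or of [3], and the helper names are invented): it encodes a terminal sentence with c(a_i) = 1 0^i 1 and mimics the outside-in comparison that the rules from (3) carry out on the two codings.

```python
# Illustration only: the coding c(a_i) = 1 0^i 1 and the outside-in check
# performed by the rules ($, a, a, $) -> (e, $, $, e) from (3).
def c(symbol, alphabet):
    """Binary coding c(a_i) = 1 0^i 1, where i is the 1-based index of the symbol."""
    i = alphabet.index(symbol) + 1
    return "1" + "0" * i + "1"

def outside_in_check(code_left, code_right):
    """Repeatedly erase the leftmost symbol of code_left together with the
    rightmost symbol of code_right, provided they are equal; succeed only if
    both codings are erased completely (i.e. code_left = reversal(code_right))."""
    left, right = list(code_left), list(code_right)
    while left and right:
        if left[0] != right[-1]:
            return False          # a mismatched 0 or 1 can never be erased
        left.pop(0)
        right.pop()
    return not left and not right

alphabet = ["a", "b"]
v = "ab"                                             # sentence generated by G1
w1 = "".join(c(s, alphabet) for s in v)              # the 0/1 projection of w1
w2 = "".join(c(s, alphabet) for s in reversed(v))    # G2 generates reversal(L2)
# Each c(a_i) is a palindrome, so w1 = reversal(w2) exactly when both grammars
# produced the same sentence; the check therefore succeeds here.
print(outside_in_check(w1, w2))   # True
```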
Since L is an arbitrary recursively enumerable language, by Claim 1 the proof of Theorem 1 is completed. t u 4.2 Mode 3 In this section, we prove the family of languages generated by SCGs with mode 3 derivations coincides with the family of recursively enumerable languages. Theorem 2. SC(3⇒) = RE. Since SC(3⇒) ⊆ RE follows directly from the Church-Turing thesis, we only have to prove the opposite inclusion. Proof. Let G = (V, Σ, P, S) be the SCG constructed in the proof of Theorem 1. Next, we modify G to a new SCG G0 such that L(G, 1⇒) = L(G0 , 1⇒). Finally, we prove L(G0 , 3⇒) = L(G0 , 1⇒). Construction. Let G0 = {V, Σ, P 0 , S} be SCG with P 0 containing (1) (2) (3) (4) 94 (S) → (S1 11$$11S2 ); (A) → (Ci (w)) for A → w ∈ Pi , where i = 1, 2; (a, $, $, a) → ($, ε, ε, $), for a = 0, 1; ($) → (ε). Computational Completeness Resulting from Scattered Context Grammars We establish the proof of Theorem 2 by the following two claims. Claim 2. L(G0 , 1⇒) = L(G, 1⇒). Proof. G0 is closely related to G, only the rules from (1) and (3) are slightly modified. As a result the correspondence of the sentences generated by the simulated G1 , G2 , respectively, is not checked in the direction from the outermost to the central symbols but from the central to the outermost symbols. Again, if the current two symbols do not match, they can not be erased both and the derivation blocks. t u Claim 3. L(G0 , 3⇒) = L(G0 , 1⇒). Proof. Without any loss of generality, we can suppose the rules from (1) and (2) are used only before the first usage of the rule from (3). The context-free rules work unchanged with mode 3 derivations. Then, for every derivation S 1⇒∗ w = w1 11$$11w2 generated only by the rules from (1) and (2), where alph(w) ∩ (V1 ∪ V2 ) = ∅, there is the identical derivation S 3⇒∗ w and vice versa. Since w 1⇒∗ w0 , w0 ∈ Σ ∗ if and only in t0 (w1 ) = reversal(w2 ), we can complete the proof of the previous claim by the following one. Claim 4. Let the sentential form w be generated only by the rules from (1) and (2). Without any loss of generality, suppose alph(w) ∩ (V1 ∪ V2 ) = ∅. Consider S 3⇒∗ w = w1 11$$11w2 Then, w 3⇒∗ w0 , where w0 ∈ Σ ∗ , if and only if t0 (w1 ) = reversal(w2 ). For better readability, in the next proof we omit all symbols of w1 from Σ— we consider only nonterminal symbols, which are to be erased. Basic idea. The rules from (3) are the only with 0s and 1s on their left hand sides. These symbols are simultaneously erasing to the left and to the right of $s checking the equality in reverse. While proceeding from the center to the edges, when there is any symbol skipped, which is remaining between $s, there is no way, how to erase it, and no terminal string can be generated. Consider the mode 3 derivations. Even when the symbols are erasing one after another from the center to the left and right, the derivation mode can potentially shift left one $ to the left and right one $ to the right skipping some symbols. Also in this case the symbols between $s can not be erased anymore. 95 A. Meduna and O. Soukup Proof. If. Recall w = 10m1 110m2 1 . . . 10mo 111$$1110mo 1 . . . 10m2 110m1 1 Suppose the check works properly not skipping any symbol. Then w 3⇒∗ w0 = $$ and twice applying the rule from (4) the derivation finishes. t u Proof. Only if. If w1 6= reversal(w2 ), though the check works properly, w 1⇒∗ w0 = w10 x$$x0 w20 and x, x0 ∈ {0, 1}, x 6= x0 . Continuing the check with application of the rules from (3) will definitely skip x or x0 . Consequently, no terminal string can be generated. 
We showed, that G0 can generate the terminal string from the sentence form w, if only if t0 (w1 ) = reversal(w2 ), and the claim holds. t u Since S 1⇒∗ w, w ∈ Σ ∗ , if and only if S 3⇒∗ w, Claim 3 holds. t u We proved L(G, 1⇒) = L, L(G0 , 1⇒) = L(G, 1⇒) and L(G0 , 3⇒) = L(G0 , 1⇒), therefore L(G0 , 3⇒) = L holds. Since L is an arbitrary recursively enumerable language, the proof of Theorem 2 is completed. t u 4.3 Mode 5 This section investigates mode 5 derivations. It proves the family of languages SCGs with mode 5 derivations generates corresponds to the family of recursively enumerable languages. Theorem 3. SC(5⇒) = RE. Since SC(5⇒) ⊆ RE follows directly from the Church-Turing thesis, we only have to prove the opposite inclusion. Proof. Let G = (V, Σ, P, S) be the SCG constructed in the proof of Theorem 1. Next, we modify G to a new SCG G0 so L(G, 1⇒) = L(G0 , 5⇒). Construction. Introduce four new symbols—D,E,F and ◦. Set N = {D,E,F ,◦}. Let G0 = (V 0 , Σ, P 0 , S) be SCG, with V 0 = V ∪ N and P 0 containing the rules (1) (2) (3) (4) (5) (6) 96 (S) → ($S1 1111S2 $ ◦ E ◦ F ); (A) → (Ci (w)) for A → w ∈ Pi , where i = 1, 2; (F ) → (F F ); ($, a, a, $, E, F ) → (ε, ε, $, $, ε, D), for a = 0, 1; (◦, D, ◦) → (ε, ◦E◦, ε); ($) → (ε), (E) → (ε), (◦) → (ε). Computational Completeness Resulting from Scattered Context Grammars Claim 5. L(G, 1⇒) = L(G0 , 5⇒). Proof. Context-free rules are not influenced by the derivation mode. The rule from (3) must generate precisely as many F s as the number of applications of the rule from (4). Context-sensitive rules of G0 correspond to context-sensitive rules of G, except the special rule from (5). We show, the construction of G0 forces context-sensitive rules to work exactly in the same way as the rules of G do. Every application of the rule from (4) must be followed by the application of the rule from (5), to rewrite D back to E, which requires the symbol D between two ◦s. It ensures the previous usage of context-sensitive rule selected the center to the right of the rightmost affected nonterminal and all right hand side strings changed their positions with the more left ones. The leftmost right hand side string is then shifted randomly to the left, but it is always ε. The derivation mode has no influence on the rule from (5). ¿From the construction of G0 , it works exactly in the same way as G does. t u L(G, 1⇒) = L(G0 , 5⇒) and L(G, 1⇒) = L, therefore L(G0 , 5⇒) = L. Since L is an arbitrary recursively enumerable language, the proof of Theorem 3 is completed. t u 4.4 Mode 7 This section investigates mode 7 derivations and proves SCGs with this mode derivations are Turing-complete. Theorem 4. SC(7⇒) = RE. Since SC(7⇒) ⊆ RE follows directly from the Church-Turing thesis, we only have to prove the opposite inclusion. Proof. Let G = (V, Σ, P, S) be the SCG constructed in the proof of Theorem 1. Next, we modify G to a new SCG G0 so L(G, 1⇒) = L(G0 , 7⇒). Construction. Introduce four new symbols—E,F ,G and |. Set N = {E,F ,G,|}. Let G0 = (V 0 , Σ, P 0 , S) be SCG, with V 0 = V ∪ N and P 0 containing the rules (1) (2) (3) (4) (5) (6) (S) → (F GS1 11$|$11S2 ); (A) → (Ci (w)) for A → w ∈ Pi , where i = 1, 2; (F ) → (F F ); (a, $, $, a) → (ε, E, E, ε), for a = 0, 1; (F, G, E, |, E) → (G, $, |, $, ε); ($) → (ε), (G) → (ε), (|) → (ε). Claim 6. L(G, 1⇒) = L(G0 , 7⇒). 97 A. Meduna and O. Soukup Proof. The behaviour of context-free rules remains unchanged under mode 7 derivations. 
Since the rules of G0 simulating the derivations of G1 , G2 , respectively, are identical to the ones of G simulating both grammars, for every derivation of G S 1⇒∗ $w1 1111w2 $ = w where w was generated only using the rules from (1) and(2) and alph(w) ∩ (V1 ∪ V2 ) = ∅, there is S 7⇒∗ F Gw1 11$|$11w2 = w0 in G0 , generated by the corresponding rules from (1) and (2), and vice versa. Without any loss of generality, we can consider such a sentence form in every successful derivation. Additionally, in G w 1⇒∗ v, v ∈ Σ ∗ if and only if t0 (w1 ) = reversal(w2 ). Note, then v = t(w). Therefore, we have to prove w0 4⇒∗ v 0 , v 0 ∈ Σ ∗ if and only if t0 (w1 ) = reversal(w2 ). Then obviously v 0 = v and we can complete the proof by the following claim. Claim 7. In G0 , for S 7⇒∗ w = F i Gw1 $|$w2 E where w was generated only using the rules from (1) through (3) and alph(w) ∩ (V1 ∪ V2 ) = ∅. Then w 7⇒∗ w0 where w0 ∈ Σ ∗ , if and only if t0 (w1 ) = reversal(w2 ), for some i ≥ 1. The new rule from (3) may potentially arbitrarily multiply the number of F s to the left. Then, F s are erasing using the rule from (5). Thus, without any loss of generality, suppose i equals the number of the future usages of the rule from (5). For better readability, in the next proof we omit all symbols of w1 from Σ—we consider only nonterminal symbols, which are to be erased. Proof. If. Suppose w1 = reversal(w2 ), then w 7 ⇒∗ ε. We prove this by the induction on the length of w1 , w2 , where |w1 | = |w2 | = k. Then, obviously i = k. By the construction of G0 , the least k equals 2, but we prove the claim for all k ≥ 0. Basis. Let k = 0. Then By the rules from (6) and the basis holds. 98 w = G$|$ G$|$ 7⇒∗ ε Computational Completeness Resulting from Scattered Context Grammars Induction Hypothesis. Suppose there exists k ≥ 0 such that the claim holds for all m, where w = F m Gw1 $|$w2 , |w1 | = |w2 | = m, 0 ≤ m ≤ k Induction Step. Consider G0 generates w, where w = F k+1 Gw1 $|$w2 , |w1 | = |w2 | = k + 1 Since w1 = reversal(w2 ) and |w1 | = |w2 | = k + 1, w1 = w10 a, w2 = aw20 . The symbols a can be erased by application of the rules from (4) and (5) under several conditions. First, when the rule from (4) is applied, the center for interchanging right hand side strings must be chosen between the two $s, otherwise both Es appear on the same side of the symbol | and the rule from (5) is not applicable. Next, no 0 or 1 may be skipped, while proceeding in the direction from center to the edges. Finally, when the rule from (5) is applied, the center must be chosen to the left of F , otherwise G is erased and the future application of this rule is excluded. F k+1 Gw10 a$|$aw20 7⇒ F k+1 Gw10 D|Dw20 7⇒ F k Gw10 $|$w20 = w0 By induction hypothesis w0 7⇒∗ ε, which completes the proof. Only if. Suppose w1 6= reversal(w2 ), then, there is no w0 , where w 7⇒∗ w0 and w0 = ε. Since w1 6= reversal(w2 ), w1 = uav, w2 = va0 u0 and a 6= a0 . Suppose both vs are correctly erased and no symbol is skipped producing the sentential form F i Gua$|$a0 u0 Next the rule from (4) can be applied to erase innermost 0s or 1s. However, since a 6= a0 , even if the center if chosen properly between the two $s, there is 0 or 1 between inserted Es and thus unable to be erased, which completes the proof. We showed, that G0 can generate the terminal string from the sentence form w, if only if t0 (w1 ) = reversal(w2 ), and the claim holds. t u We proved S 1 ⇒∗ w, w ∈ Σ ∗ , in G, if and only if S 7 ⇒∗ w in G0 , hence L(G, 1⇒) = L(G0 , 7⇒) and the claim holds. 
t u Since L(G, 1⇒) = L(G0 , 7⇒), L(G, 1⇒) = L and L is an arbitrary recursively enumerable language, the proof of Theorem 4 is completed. t u 5 Conclusion The modern trend in information processing is parallel access to typically distributed data, however, traditionally, in formal languages theory automata and 99 A. Meduna and O. Soukup grammars process information in continuous and often sequential way. Such models are not entirely suitable for the study of modern approaches in the data processing, where the data are frequently simultaneously read from and written to the different parts of the memory space, sometimes physically separated. For modelling the parallel data processing the usage of the scattered context grammars that have been investigated already in a long series of studies, which brought a number of important results—especially their computational completeness—, seems to be appropriate. Though, even the scattered context grammars are not a perfect model of the modern data processing and the whole variety of suitable modifications can be established. However, the aim of this study was not the attempt to modify the model itself, only the way it generates the terminal string. The main motivation was to break the usual approach of the rewriting and try to divide the process into deletion and insertion, which may not necessarily take place at the same part of the sentence form. The mutual relation of these two now separated actions is then defined by the constraints resulting from the definition of the used derivation mode. Despite the fact that the additional nondeterminism is brought into the computational process, it has been proven that the generative power of the model is not reduced and it is still as powerful as Turing machines. Acknowledgments This work was supported by the following grants: BUT FIT FIT-S-14-2299, MŠMT CZ1.1.00/02.0070, and TAČR TE01010415. References 1. Rozenberg, G., Salomaa, A., eds.: Handbook of Formal Languages, Vol. 1: Word, Language, Grammar. Springer, New York (1997) 2. Salomaa, A.: Formal Languages. Academic Press, London (1973) 3. Fernau, H., Meduna, A.: A simultaneous reduction of several measures of descriptional complexity in scattered context grammars. Information Processing Letters 86(5) (2003) 235–240 4. Harrison, M.A.: Introduction to Formal Language Theory. 1st edn. AddisonWesley Longman Publishing Co., Inc., Boston, MA, USA (1978) 5. Meduna, A., Zemek, P.: Regulated Grammars and Their Transformations. Faculty of Information Technology, Brno University of Technology (2010) 6. Meduna, A., Techet, J.: Scattered Context Grammars and their Applications. WIT Press (2010) ISBN: 978-1-84564-426-0. 100 Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations Jana Pazúriková1 and Luděk Matyska1,2 1 2 Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic Institute of Computer Science, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic {pazurikova,ludek}@ics.muni.cz Abstract. Parallel and distributed computations based on the spatial decomposition of the problem are beginning to fail to saturate large supercomputers with their limited strong scalability. An application of the high performance computing, a molecular dynamics simulation, shows this limit especially in experiments with longer simulation times. When the parallelism in space does not suffice, the parallelism in time could cut the wallclock time at the expense of additional computational resources. 
The parareal algorithm that decomposes the temporal domain has been extensively researched and already applied to molecular dynamics simulations, however with rather modest results. We propose a novel modification that uses the coarse function with a simpler physics model, not a longer timestep as in state-of-the-art methods. The evaluation with a prototype implementation indicates that our method provides a rather long time window for the parallel-in-time computation, a reasonable convergence and stable properties of the simulated system. 1 Introduction High performance computing largely relies on the parallel and distributed computation. The decomposition divides the problem almost always along the spatial domain, i.e. into subspaces or subsets. Therefore, the scalability of computation depends on the problem size. When solving the fixed-size problem on an increasing amount of computational resources, one can observe the limit in the strong scaling3 : after a certain number of resources, adding more does not shorten the wallclock time of a computation. This limit makes it impossible to cut the time to result down at the expense of the computational power that 3 Strong scaling is a function that maps the increasing number of computational resources that solve the problem of the fixed size to the wallclock time of a computation; it shows how the wallclock time (usually) decreases as the resources grow but the problem remains the same. Weak scaling is a function that maps the increasing number of computational resources that solve the problem of the increasing size to the wallclock time of a computation; it shows how the wallclock time ideally remains the same as the resources grow and the problem grows. 101 J. Pazúriková and L. Matyska is becoming more and more available. The key is to increase the level of parallelism. Many simulations capture changes in space over time. Parallel-in-time methods simultaneously calculate the results in several time points [1–3]. One of the common methods, the parareal method [2], first approximates the results with a coarse function and then iteratively corrects them with a fine function in parallel while enforcing the continuity. Molecular dynamics simulations [4, 5] require approaches of the high performance computing due to the large number of computationally expensive steps. These in silico experiments offer a high resolution view on many scientifically relevant natural processes such as the protein folding [6], the drug and nanomaterial interaction [7, 8] or a phenomenon occurrence [9]. Their implications reach to chemistry, biology, pharmacy, medicine, even advanced materials. The relevance of the simulated process often increases with the longer timescale of the simulation (or both the size of the system and the longer timescale). Parallel implementations of MD codes all rely on the spatial decomposition [10] although a few attempts of the temporal decomposition have been made [11–14]. With the parareal method applied the simulation speeds up rather poorly. The speedup of the parareal method depends on two conditions [15]. First, the coarse function has to be significantly cheaper than the fine function. In almost all published experiments, authors have chosen the method with a longer integration timestep, further referred to as the longer-timestep method, that has a quite small cost ratio. We propose to found the coarse function on the simpler physics model that, according to our assessments, should provide a much higher speedup. 
As for the second condition, the number of iterations required for the convergence has to be significantly lower than the number of time points. We evaluated the convergence of our modification before the implementation of the parallel version and evaluation of the speedup to determine if it presents an approach worth further researching. We present the results of the convergence evaluation in this paper. Four more sections follow. First, we shortly introduce molecular dynamics and describe the parareal method. Second, we present our application of the parareal method on molecular dynamics. Third, we evaluate the convergence of our method and finally, we discuss the limitations of the current implementation and suggestions to the future work. 2 2.1 Background Molecular Dynamics Simulations Molecular dynamics, a tool of computational chemistry, computes movements of particles due to their interactions over time [4, 5]. In the context of this work, we consider the model of molecular mechanics that approximates atoms as electrically charged points of mass and their interactions as empirically determined functions. The model represents an electrostatic N-body problem. The simulation takes input data—types of atoms, the topology, charges qi , positions ri and velocities vi —and then iteratively repeats the following steps: 102 Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations 1. calculate the potential Uall with the empirical functions and evaluate the force Fi exerted on an atom i ∂Uall Fi = − (1) ∂ri 2. move the particles Fi = − ∂Uall v0 r00 = mi ai = mi i = mi i2 ; ∂ri dt dt (2) 3. update the time, optionally generate an output. [5] The output of the simulation includes the trajectories of particles, forces and the energy of the system. Other properties of the system can be processed from the output data by applying statistical mechanics and other functions from physics and chemistry. Simulations of molecular dynamics have rather high computational needs due to two reasons: the large number of demanding steps. The typical timestep of the integration scheme of the equation (2) is 2 fs (2.10−15 s), in comparison, the timescale of the villin headpiece folding process takes up to 500 µs (500.10−9 s). Production simulations range from pico- to microseconds, from thousands to billions of steps. The bottleneck of each step lies in the calculation of the potential, especially the long-range, electrostatic potential [16]. Standard MD packages [17–20] decompose the spatial domain for their parallel run [10], all show almost perfect weak scaling. The highest limit in the strong scaling with only parallel-in-space computation has been reached in two experiments. Andoh et al. [21] have conducted a simulation with 107 atoms on over half a million cores, one step was evaluated in 5 ms. Richards et al. [22] have achieved 85% efficiency in the simulation with ∼ 109 particles on up to 294000 cores, one step evaluated in 550 ms. 2.2 Parallel-in-Time Computation The obvious sequentiality of time has been overcome by several methods that reveal the possibility of parallelism along the temporal domain. The projects Copernicus [11] and Folding@Home [23] build the Markov model of process’s different metastable states that gradually explores the whole state space by many simultaneous short simulations. They use a highly distributed framework and achieve near-linear strong scaling and fine problem granularity, e.g. in the simulation of ∼ 104 -atom system on ∼ 5000 AMD cores. 
Protein folding presents an appropriate use case for this form of the coarse-grained time parallelism as it consists of many metastable states connected with short transitions. Yu et al. in [12] apply the data from prior related simulations to guide the system and predict its changes, however it heavily depends on an almost perfect success rate of the prediction algorithm. The experiment simulated a nanotube reacting to an external force. 103 J. Pazúriková and L. Matyska Since 1960s, mathematicians have developed various methods to calculate parallel-in-time in the fine granularity: they compute results of a time dependent differential equation in different time points simultaneously [3]. The first such method by Nievergelt, 1964, [1] later became known as the multiple shooting method. The time-parallel approach to iterative methods for solving partial differential equations with implicit integration schemes [24] were followed by applying the multigrid methods for an acceleration [25–27]. In the last years, the parareal method [2] is gaining the popularity. 2.3 Parareal Algorithm Lions, Maday and Turinici devised the parareal in time method [2] in 2001, since then it has been extensively researched [3, 15, 28, 29] and applied to diverse simulations [30–33]. Most notably, Speck et al. [30] have developed a modification of the parareal algorithm that made it possible to run a gravitational N-body simulation with notable strong scaling. In the traditional, sequential-in-time computation, the accurate yet usually expensive function F determines the results λt+1 in time t + 1 by known results λt in time t < T , as Figure 1 shows. t1 λ1 = v t2 F λ2 t3 F λ3 t4 F λ4 t5 F λ5 t6 F λ6 t7 F λ7 t8 F λ8 t9 F λ9 t10 F λ10 Fig. 1. Sequential-in-time computation The parareal method requires the second method: the coarse, yet computationally cheap function G. It can be based on a longer timestep, a coarser spatial decomposition or a simpler model. The parareal scheme shifts the sequential nature of time from the expensive function F to the cheap G and approximations made by the coarse function iteratively improves with the fine function in parallel. It can be viewed as a form of the predictor-corrector scheme [34, 35]. The function G roughly assesses the initial approximation of the results. The difference between the results from the precise calculation and from the coarse calculation on the same data presents the error that is included into the calculation in the next iteration. The continuity of the corrected results is again enforced sequentially by G. The parareal method running for T time points and K iterations builds in parallel the sequence λkt that rapidly converges to λt as k increases and each λ is calculated as: k+1 k k k+1 k λn+1 = G(λk+1 n ) + F(λn ) − G(λn ) = G(λn ) + ∆n (3) Figure 2 depicts the sequential calculation of cheap G (horizontal, successive arrows) and the parallel calculation of expensive F (vertical arrows without 104 Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations data dependencies). The speedup of the parareal method relies on the significant difference between the computational complexity of F and G and the fast convergence so that K T . The convergence depends on the chosen functions, the correction term and the problem properties. After λ210 from Figure 2 is calculated, the computational window shifts to the right on the time axis and λ210 presents the new initial condition. 
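Written out, the garbled display (3) is the usual parareal update λ_{n+1}^{k+1} = G(λ_n^{k+1}) + F(λ_n^k) − G(λ_n^k) = G(λ_n^{k+1}) + ∆_n^k. The following Python sketch shows the scheme generically; it is our own illustration (not the authors' C/LAMMPS prototype), and F, G and the toy propagators are placeholders for the fine and coarse one-step solvers.

```python
import numpy as np

def parareal(lam0, F, G, T, K, tol=None):
    """Generic parareal iteration: lam[n] approximates the state at time point n.
    F, G are the fine and coarse one-step propagators (state -> state)."""
    # Iteration k = 0: purely sequential coarse prediction.
    lam = [lam0]
    for n in range(T):
        lam.append(G(lam[n]))

    for k in range(K):
        # The fine corrections are independent across n and could run in parallel.
        delta = [F(lam[n]) - G(lam[n]) for n in range(T)]
        new = [lam0]
        for n in range(T):
            # lam_{n+1}^{k+1} = G(lam_n^{k+1}) + F(lam_n^k) - G(lam_n^k)
            new.append(G(new[n]) + delta[n])
        if tol is not None and max(np.linalg.norm(a - b) for a, b in zip(new, lam)) < tol:
            lam = new
            break   # converged: successive iterations no longer change the states
        lam = new
    return lam

# Toy usage: dx/dt = -x, fine = ten small explicit Euler substeps, coarse = one big step.
fine   = lambda x: x * (1 - 0.01) ** 10
coarse = lambda x: x * (1 - 0.1)
states = parareal(np.array([1.0]), fine, coarse, T=20, K=5)
```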
Fig. 2. Computational flow of the parareal algorithm (schematic omitted; it shows the coarse G applied sequentially along the time points t1–t10 and, within each iteration k, the fine F corrections ∆ computed independently in parallel, ending with λ_10^2)

3 Parareal Scheme in Molecular Dynamics Simulations

The parareal algorithm has been applied to molecular dynamics simulations in several experiments [13, 14, 36, 37]. The coarse function based on a longer timestep in the integration scheme failed to provide reasonable convergence and speedup. As Bulin states in [14], coarse functions based on a simpler model are more appropriate. We have set the fine function as one step of MD integrated with the Verlet scheme and with the long-range potential evaluated by the multilevel summation method (MSM) [38]. We have examined coarse functions built upon several concepts in terms of the theoretical speedup and the ability to run in parallel [39]. The first concept, a further simplification of the model, reduces the cost by abstracting from the physics of the problem. Discrete or coarse-grained MD would produce completely different trajectories than the MSM; it would be challenging to deal with the large correction term, which would probably lead to an instability. The cheap and inaccurate methods for the evaluation of the demanding long-range electrostatics, such as the cutoff method or the Wolf summation method, are worth researching. The second concept, different parameters of the method for the evaluation of long-range interactions (such as fewer iterations in Fourier methods or a coarser grid in MSM), offers only a small theoretical speedup, as these methods remain quite accurate and therefore not cheap. Finally, different parameters of the integration scheme (such as a longer timestep or a different scheme) do not give any promising results. The longer-timestep method has been evaluated many times without success. As for a different integration scheme, MD simulations usually apply the Verlet or leapfrog schemes, which have proven their suitability; the simpler Euler method would introduce a large error, and a Runge-Kutta method would rapidly increase the cost [19]. The theoretical speedup of the parareal method depends on the cost ratio between the fine and the coarse function, Q_{F/G}, and on the ratio between the number of time points T and the number of iterations K:

  speedup = Q_{F/G} / (1 + (K/T) Q_{F/G})    (4)

We set F as one step of MD with the Verlet integration scheme and MSM for long-range interactions (a = 12 Å, h = 2 Å, h* = 1 Å, m = 2, p = 3, as described in [39]); G as one step of MD with the Verlet integration scheme and the cutoff method for long-range interactions (r_cutoff = 12 Å). For our choice of F and G, Q_{F/G} reaches 60, so we suppose we can achieve a speedup of an order of magnitude. With the coarse function based on a longer timestep, the cost ratio equals the ratio between the timestep in G and the timestep in F; however, a timestep longer than a few femtoseconds quickly leads to the simulation's blowup, preventing the ratio from getting higher than 5. Apart from the choice of the fine and the coarse function, the definition of the correction term ∆_n^k = F(λ_n^k) − G(λ_n^k) in the context of an MD simulation also determines the behavior of the parareal method. We have set ∆_n^k as the absolute difference between the atomic positions and velocities of the two results.
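To make the bound in (4) concrete, a small helper can be used; the sketch below is ours and the sample values are purely illustrative, not measurements from this study.

```python
def parareal_speedup(q_fg, T, K):
    """Theoretical parareal speedup from equation (4):
    q_fg = cost(F) / cost(G), T time points, K iterations."""
    return q_fg / (1 + (K / T) * q_fg)

# Illustrative values only: with a cost ratio around 60 (simpler-physics coarse
# function), the bound exceeds an order of magnitude only if K stays well below T.
print(parareal_speedup(60, T=120, K=10))   # ~10.0
print(parareal_speedup(5, T=120, K=10))    # ~3.5, typical of a longer-timestep coarse function
```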
We have implemented a prototype of the parareal method and the correction term in C. Both fine and coarse functions are evaluated by LAMMPS [20] in one-step NPT simulation (with constant number of atoms, pressure and temperature). We have verified the functionality by running an experiment with T time points and K = T iterations. By definition of the parareal method, the results in the last time point of the last iteration should be the same as in sequentialin-time experiment. Apart from a rounding error, we have obtained the same results. 4 Convergence and Stability Evaluation We evaluated the convergence of the parareal method applied on a molecular dynamics simulation through our prototype implementation. The experiment followed this procedure. First, we ran a sequential-in-time simulation of a 32000-atom solvated protein rhodopsin [40]. CHARMM force field [41] determined the parameters of all atoms and functions for potential evaluation. 106 Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations The close-range potentials included all standard bonded interactions and van der Waals interactions by Lennard-Jones potential. The electrostatic interactions were approximated by MSM with maximum relative force error 10− 4. The simulation ran for T timesteps. Second, we ran our parareal simulation for T timesteps and K iterations on the same input. The function F corresponded exactly to one step of sequential-in-time simulation. The function G differed from F in the evaluation of electrostatic potential: it used smoothed cutoff potential defined by CHARMM. In the convergence evaluation, we examined three aspects: (i) the longest possible computational window, (ii) the number of iterations needed for the reasonable convergence and (iii) the difference between the trajectories. The computational window represents the number of time points we can calculate parallelin-time. In too long windows the thermostat and the barostat may not be able to keep the system in a viable state. As a common production simulation consists of more than a few computational windows, we need the results of the sequential-intime (λT , as in Figure 1) and the parallel-in-time (λkT for several k, as in Figure 2) simulation after T time steps to be almost the same. We suppose that for the reasonable convergence of the whole production simulation, the root mean square distance (RMSD) should be less than 0.1 Å as the atomic positions of two such close results virtually do not differ. Frames of obtained trajectories are compared also by RMSD. In the stability evaluation, we examined two aspects: the temperature and the pressure of the system. Any steep changes may suggest that the correction term is causing instabilities. Our modification of the parareal method can have the computational window long as much as 40 steps, i.e. 80 fs. We found that with T time points computing in parallel, we need roughly T /4 or T /3 iterations for close results. Figure 3 shows how the RMSD between λT and λkT decreases as the number of iterations, k, increases. The first k that gets RMSD below 0.1 Å is marked by full square. The distance of λ0t , the initial G approximation, increases with the number of timesteps in the computational window. The RMSD decreases with the increasing number of iterations in an asymptotically linear manner. In Figure 4, we can see the trajectories of the sequential-in-time simulation and the parareal simulation with T = 40 after different iterations. 
The error uptake after the first correction appears from the early time points, not just at the end. The results of the 15th iteration almost perfectly copy those of the sequential simulation not only in the latest time point, but along the whole trajectory. The temperature of the system in the last time point also gradually decreases over the iterations, as Figure 5 shows for the simulation with T = 40. It converges to the 300 K set by the NPT environment around the 15th iteration, the same point at which the results converge to the accurate ones. The pressure of the system also gradually decreases, although its initial uptake is much higher than for the temperature.

Fig. 3. The results of the parallel-in-time computation converge to the results of the sequential-in-time computation (RMSD [Å] against the number of iterations, for T = 10, 20, 30, 40; the first iteration reaching RMSD below 0.1 Å is marked).

Fig. 4. The difference between the trajectories of the sequential-in-time computation (bottom full line) and the trajectories of several iterations (k = 0, 6, 12, 15), T = 40 (RMSD [Å] against timesteps).

Fig. 5. The temperature [K] of the system in the parallel-in-time simulation in the 40th timestep, plotted against the number of iterations.

Fig. 6. The pressure [atm] of the system in the parallel-in-time simulation in the 40th timestep, plotted against the number of iterations.

5 Discussion and Future Work

In this paper, we proposed a modification to the parareal method in the context of molecular dynamics simulations. We found that it leads to a rather satisfactory convergence and stability and offers a quite long computational window. Therefore, it is worth exploring the main advantage of the simpler-physics method over the longer-timestep methods: the speedup. If the calculation proceeds in a pipeline manner, i.e. the computation of F starts immediately after its λ has been computed, the upper bound of the theoretical speedup for T = 40, K = 15 is 59.6. The modest convergence may be hastened if the correction term can be extrapolated. We will further analyze ∆ and experiment with other possibilities: the difference only between selected atoms, the difference in forces, or a further correction of the temperature and the pressure. The latter may prolong the computational window, as the simulation explodes after 45 steps due to the sudden pressure and temperature changes. The speedup of the parareal method is limited by the cost ratio between the fine and the coarse function. The efficiency is limited by 1/K, although Speck et al. [30] devised a modified parareal method called PFASST that amortizes the cost of the correction steps and thus increases the efficiency. Therefore, to evaluate the speedup of our modification, we want to incorporate it into Speck's method. The evaluation of the convergence of the parareal method in MD simulations is the first step in what could lead to improving the strong scaling and cutting the wallclock time of molecular dynamics experiments.

References

1. J Nievergelt. Parallel methods for integrating ordinary differential equations. Communications of the ACM, 7(12):731–733, 1964.
2. JL Lions, Y Maday, and G Turinici. Résolution d'EDP par un schéma en temps pararéel. Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 332(7):661–668, April 2001.
3.
MJ Gander and S Vandewalle. Analysis of the Parareal Time-Parallel Time-Integration Method. SIAM Journal on Scientific Computing, 29(2):556–578, January 2007. 4. E Lewars. Computational Chemistry: Introduction to the Theory and Applications of Molecular and Quantum Mechanics. Springer, 2nd edition, 2010. 5. F Jensen. Introduction to computational chemistry. John Wiley & Sons Ltd, Great Britain, 2nd edition, 2007. 6. C Lee and S Ham. Characterizing amyloid-beta protein misfolding from molecular dynamics simulations with explicit water. Journal of Computational Chemistry, 32(2):349–355, January 2011. 7. L Boechi, CAF de Oliveira, I Da Fonseca, K Kizjakina, P Sobrado, JJ Tanner, and JA McCammon. Substrate-dependent dynamics of UDP-galactopyranose mutase: Implications for drug design. Protein Science, 22(11):1490–1501, November 2013. 110 Convergence of Parareal Algorithm Applied on Molecular Dynamics Simulations 8. D Lau and R Lam. Atomistic Prediction of Nanomaterials: Introduction to Molecular Dynamics Simulation and a Case Study of Graphene Wettability. IEEE Nanotechnology Magazine, 6(1):8–13, March 2012. 9. G Zhao, JR Perilla, EL Yufenyuy, X Meng, B Chen, J Ning, J Ahn, AM Gronenborn, K Schulten, C Aiken, and P Zhang. Mature HIV-1 capsid structure by cryo-electron microscopy and all-atom molecular dynamics. Nature, 497(7451):643–6, May 2013. 10. KJ Bowers, RO Dror, and DE Shaw. Overview of neutral territory methods for the parallel evaluation of pairwise particle interactions. Journal of Physics” Conference Series, 16:300, 2005. 11. S Pronk, P Larsson, I Pouya, GR Bowman, IS Haque, K Beauchamp, B Hess, VS Pande, PM Kasson, and E Lindahl. Copernicus: a new paradigm for parallel adaptive molecular dynamics. In Proceedings of Supercomputing, SC ’11, pages 60:1—-60:10, New York, NY, USA, 2011. ACM. 12. Y Yu, A Srinivasan, and N Chandra. Scalable Time-Parallelization of Molecular Dynamics Simulations in Nano Mechanics. In Conference on Parallel Processing, pages 119–126. Ieee, 2006. 13. L Baffico, S Bernard, Y Maday, G Turinici, and G Zérah. Parallel-in-time molecular-dynamics simulations. Physical Review E, 66:057701:1–057701:4, 2002. 14. J Bulin. Large-scale time parallelization for molecular dynamics problems. Technical report, Royal Institute Of Technology, Stockholm, Stockholm, 2013. 15. Y Maday. The parareal in time algorithm. Technical Report R08030, Université Pierre et Marie Curie, pages 1–24, 2008. 16. P Koehl. Electrostatics calculations: latest methodological advances. Current opinion in structural biology, 16(2):142–151, April 2006. 17. B Hess, C Kutzner, D van der Spoel, and E Lindahl. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. Journal of Chemical Theory and Computation, 4(3):435–447, 2008. 18. DA Case, TE Cheatham, T Darden, H Gohlke, R Luo, KM Merz, A Onufriev, C Simmerling, B Wang, and RJ Woods. The Amber biomolecular simulation programs. Journal of computational chemistry, 26(16):1668–1688, December 2005. 19. JC Phillips, R Braun, W Wang, J Gumbart, E Tajkhorshid, E Villa, C Chipot, RD Skeel, L Kalé, and K Schulten. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26(16):1781–1802, December 2005. 20. SJ Plimpton. Fast Parallel Algorithms for Short Range Molecular Dynamics. Journal of Computational Physics, 117:1–19, 1995. 21. 
Y Andoh, N Yoshii, K Fujimoto, K Mizutani, H Kojima, A Yamada, S Okazaki, K Kawaguchi, H Nagao, K Iwahashi, F Mizutani, K Minami, S Ichikawa, H Komatsu, S Ishizuki, Y Takeda, and M Fukushima. MODYLAS: A Highly Parallelized General-Purpose Molecular Dynamics Simulation Program for Large-Scale Systems with Long-Range Forces Calculated by Fast Multipole Method (FMM) and Highly Scalable Fine-Grained New Parallel Processing Algorithms. Journal of Chemical Theory and Computation, 9(7):3201–3209, July 2013. 22. DF Richards, JN Glosli, B Chan, MR Dorr, EW Draeger, JL Fattebert, WD Krauss, T Spelce, FH Streitz, MP Surh, and JA Gunnels. Beyond homogeneous decomposition: scaling long-range forces on Massively Parallel Systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 60:1—-60:12, New York, NY, USA, 2009. ACM. 111 J. Pazúriková and L. Matyska 23. SM Larson, CD Snow, M Shirts, and VS Pande. Folding @ Home and Genome @ Home : Using distributed computing to tackle previously intractable problems in computational biology. Technical report, 2002. 24. A Deshpande, S Malhotra, CC Douglas, and MH Schultz. A rigorous analysis of time domain parallelism. Parallel Algorithms and Applications, 6(1):53–62, 1995. 25. G Horton. The time-parallel Multigrid Method. Communications in Applied Numerical Methods, 8:585–595, 1992. 26. S Vandewalle and E van de Velde. Space-time concurrent multigrid waveform relaxation. Annals of Numerical Mathematics, 1(1-4):335–346, 1994. 27. G Horton and S Vandewalle. A space-time multigrid method for parabolic partial differential equations. SIAM Journal on Scientific Computing, 16(4):848–864, 1995. 28. Y Maday and G Turinici. The Parareal in Time Iterative Solver: a Further Direction to Parallel Implementation. In Domain Decomposition Methods in Science and Engineering, pages 441–448, 2005. 29. E Aubanel. Scheduling of tasks in the parareal algorithm. Parallel Computing, 37(3):172–182, March 2011. 30. R Speck, D Ruprecht, R Krause, M Emmett, M Minion, M Winkel, and P Gibbon. A massively space-time parallel N-body solver. In Proceedings of Supercomputing, pages 92:1–92:11, 2012. 31. D Samaddar, DE Newman, and R Sánchez. Parallelization in time of numerical simulations of fully-developed plasma turbulence using the parareal algorithm. Journal of Computational Physics, 229(18):6558–6573, September 2010. 32. AE Randles. Modeling Cardiovascular Hemodynamics Using the Lattice Boltzmann Method on Massively Parallel Supercomputers. PhD thesis, Harvard University, 2013. 33. A Baudron, J Lautard, Y Maday, and O Mula. The parareal in time algorithm applied to the kinetic neutron diffusion equation. In International Conference on Domain Decomposition Methods, 2013. 34. WL Miranker and W Liniger. Parallel methods for the numerical integration of ordinary differential equations. Mathematics of Computation, 21(99):303–320, 1967. 35. CW Gear. The automatic integration of ordinary differential equations. Communications of the ACM, 14(3):176–179, March 1971. 36. A Srinivasan and N Chandra. Latency tolerance through parallelization of time in scientific applications. Parallel Computing, 31(7):777–796, July 2005. 37. A Nakano, P Vashishta, and RK Kalia. Parallel multiple-time-step molecular dynamics with three-body interaction. Computer Physics Communications, 77:303—-312, 1993. 38. DJ Hardy. Multilevel summation for the fast evaluation of forces for the simulation of biomolecules. 
PhD thesis, University of Illinois at Urbana-Champaign, 2006. 39. J Pazúriková. Large-Scale Molecular Dynamics Simulations for Highly Parallel Infrastructures. Technical report, 2014. http://arxiv.org/abs/1402.7216. 40. LAMMPS. Rhodopsin Benchmark. http://lammps.sandia.gov/bench.html#rhodo. 41. WD Cornell, P Cieplak, CI Bayly, IR Gould, KM Merz, DM Ferguson, DC Spellmeyer, T Fox, JW Caldwell, and PA Kollman. A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. Journal of the American Chemical Society, 117(19):5179–5197, May 1995. 112 A Case for a Multifaceted Fairness Model: An Overview of Fairness Methods for Job Queuing and Scheduling Šimon Tóth Faculty of Informatics, Masaryk University Botanická 68a, Brno, Czech Republic [email protected] Abstract. Job scheduling for HPC and Grid-like systems, while being a heavily studied subject, suffers from a particular disconnect between theoretical approaches and practical applications. Most production systems still rely on a small set of rather conservative scheduling policies. One of the areas that tries to bridge the world of scientific research and practical application is the study of fairness. Fairness in a system has strong implications on customer satisfaction, with psychological studies suggesting that perceived fairness is generally even more important than the quality of service. This paper provides an overview of different approaches for handling fairness in a job scheduling/queuing system. We start with analytic approaches that rely on statistical modeling and try to provide strong categorization and ordering of various scheduling policies according to their fairness. Following that we provide an overview of recent advancements that rely on simulations and use high resolution analysis to extract fairness information from realistic job traces. As a conclusion to this article, we propose a new direction for research. We propose a novel multifaceted fairness approach, i.e., a combination of different fairness models inside a single system, that could better capture the heterogeneous fairness-related requirements of different users in the system. It could serve as a solution to the shortcomings of some of the methods presented in this paper. 1 Introduction Job scheduling is a very active research field, that has progressed significantly in the last 20 years [15]. One aspect that did not change significantly through the years is a particular disconnect between the theory and practical applications, as has been repeatedly noted in some of the published works through the years [14, 17, 37]. One research topic that tries to bridge the world of theory and the practical applications is the study of fairness. In the context of job scheduling/queuing we are concerned with two main types of fairness. Seniority fairness which takes into account the position of the customer/user in the queue and proportional fairness 113 Š. Tóth which takes into account the length/size of the users’ requests [8]. To demonstrate this distinction, let us consider a model situation from a physical queuing system: “Mr. Short arrives at the supermarket counter with only a single item in his shopping cart. Directly in front of him, in the queue he finds Mrs. Long with the shopping cart completely full.” From one point of view, it would make sense to allow Mr. Short to overtake Mrs. Long in the queue as he only has a single shopping item and will therefore delay Mrs. Long for a very short time. Respectively processing Mrs. 
Longs shopping cart will delay Mr. Short significantly. This decision is based on the proportional fairness, as we judge the situation in proportion to the length/size of the requests. Second approach takes into account that Mrs. Long has already been waiting in the queue when Mr. Short arrived. As we do not have additional information in regards to how long has Mrs. Long been already waiting, it makes more sense to maintain the order in the queue and do not allow Mr. Short to overtake Mrs. Long. This approach falls into the category of seniority fairness, as we try to maintain the seniority (order) of the users in the system. As these two types of fairness are strictly contradicting1 , determining which type of fairness to include in a scheduling system is a very important step. One possible approach is to experimentally measure the psychological impacts of waiting in queuing systems. These studies show that customers have a strong bias towards the perceived fairness in a system [31, 32], even to the point of preferring a queue configuration which provides worse performance characteristics, i.e., they are willing to wait longer if they perceive that they are being treated fairly in respect to other users in the system. These studies also show that users have a particular distaste for multi-queue configurations where the queues are not processed in a round-robin fashion. This is certainly distressing as multi-queue configuration is present in many production systems. In this paper we provide an overview of methods for the analysis, measurement and classification of fairness in queuing systems. In particular we will concentrate on methods that are applicable for production systems. We base this distinction on our first-hand experiences [23] with the implementation of these methods in the Czech National Grid – MetaCentrum [29]. We conclude this paper with a proposal of a new direction for research. A novel multifaceted approach to fairness management that combines different fairness approaches to accommodate the different requirements of various users in the system. 2 Analytical Approaches to Fairness Analytical approaches to fairness analysis rely on heavily sanitized models of the systems, such as M/M/1 [34] or M/GI/1 [46, 43]. Both models represent a single queue system with job arrivals modeled using a Poisson process [20]. In case of M/M/1 the service times of jobs have exponential distribution, in 1 Seniority and proportionality cannot be maintained at the same time, unless all users arrive in the order dictated by proportional fairness. 114 A Case for a Multifaceted Fairness Model: An Overview of Fairness Methods case of M/GI/1 the service times of jobs have general (unknown) distribution. Under such simplification, analytical approaches using statistical analysis can provide strong categorization and/or ordering of scheduling policies according to the defined fairness model. 2.1 Proportional Fairness Proportional fairness relates to fairness of the system in relation to a particular parameter of a job. Most analytical approaches are concerned with the steadystate response (length) of a job. Wierman [46, 43–45] proposes a criterion based on slowdown S(x) = T (x)/x [16, 9], where x is the size of job and T (x) is the steady state response (length). This criterion classifies scheduling policies into fair and unfair based on the whether the expected slowdown for the class of jobs of size x is proportional to the load of the system ρ under the classified policy: E[S(x)]P = 1/(1 − ρ). 
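For instance (an illustration added here, using the standard M/GI/1 formulas rather than a result of this overview): under FCFS every job waits the same expected time W regardless of its own size, so E[S(x)] = (x + W)/x = 1 + W/x grows without bound as x shrinks; short jobs are thus systematically discriminated and such a policy cannot satisfy the criterion.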
Only if the expected slowdown E[S(x)]P is proportional to the system load ρ for all size classes x of jobs, the scheduling policy is considered fair. This rules out any scheduling policies that are either non-preemptive, or do not make decisions based on the jobs size. 2.2 Temporal Fairness Temporal fairness relates to the seniority of the jobs in the system, that is, jobs that arrived earlier should be satisfied before jobs that arrived later. Wierman [44, 45] proposes a politeness criterion as the fraction of the jobs response time during which was the seniority of the job respected. Clearly for the First Come First Served (FCFS) policy the P ol(x)P = 1, as FCFS always respects the seniority of jobs (see Fig. 1). A policy is determined to be polite if E[P ol(x)]P = 1 − ρ. Resc1 Job1 Resc2 Job2 Job3 Job5 Job4 Job6 Job8 Job7 Job9 Fig. 1. An example of a polite schedule (following the FCFS policy), constructed using jobs from Table 1. 2.3 Combined Approaches The Resource-Allocation Queuing Fairness Measure (RAQFM) [34, 7, 33] is based on the notion that all users in the system are equal and therefore they should 115 Š. Tóth Arrival 1 1 1 1 3 3 5 5 5 Job1 Job2 Job3 Job4 Job5 Job6 Job7 Job8 Job9 Runtime 1 1 1 2 1 2 1 1 1 Owner Green (dashed) Green (dashed) Purple (solid) Orange (dotted) Purple (dashed) Orange (dotted) Green (dashed) Green (dashed) Purple (solid) Table 1. Job information from schedule example (see Fig. 1). receive an equal share of resources at each point in time. Based on this notion the measure defines the resulting discrimination for a particular job i as follows: Rd Di = aii (si (t) − 1/N (t))dt. That is, the particular jobs discrimination is defined as the difference between the service received si (t) and service desired 1/N (t) (where N (t) represents the number of concurrent jobs at that point of time), integrated over time (from arrival to departure). For an example of user-aware schedule constructed using RAQFM see Fig. 2. Resc1 1/4 Resc2 1/4 −1/12 −1/3 −1/3 1/6 1/6 −5/6 1/6 Fig. 2. An example of an user-aware schedule constructed using RAQFM. Values inside the jobs are the calculated discrimination amounts. This particular schedule provides reasonably low variance across users: green −1/6, orange 1/12, purple −1/2 (see Section 4.1). The Discrimination Frequency Measure (DF) [36] is based on the two previously mentioned principles of fairness. The seniority principle is captured by one formula: ni = |{j : (aj ≥ ai ∧ dj ≤ di )}|, that is the number of jobs that arrived no earlier than job i but departed no later than job i. The proportionality principle is captured by another formula: mi = |{j : (di ≥ dj > ai ∧ s0j (ai ) ≥ si }|, that is the number of jobs that at arrival of the job i have at least as much remaining service requirement as i and depart no later than job i. The discrimination frequency of a particular job is then DFi = mi + ni (see Fig. 3). The Slowdown Queuing Fairness Measure (SQF) [6] is based on the slowdown metric [16, 9], assuming that a system is only fair when all jobs in the system receive the same slowdown: Ti = cxi , that is response time is equal to the size of the job multiplied by a constant, for all jobs i in the system. The individual 116 A Case for a Multifaceted Fairness Model: An Overview of Fairness Methods Resc1 Job1 Resc2 Job2 Job3 Job5 Job6 Job4 Job9 Job7 Job8 Fig. 3. An example of a schedule generated using discrimination frequency measure. Equivalent to a polite schedule as this schedule does not contain any discriminated jobs. 
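Reading the broken integral as D_i = ∫ from a_i to d_i of (s_i(t) − 1/N(t)) dt, the measure is straightforward to evaluate on a discretized schedule. The Python sketch below is our own illustration; the per-slice shares are invented and do not reproduce the schedule of Fig. 2.

```python
def raqfm_discrimination(slices, dt=1.0):
    """Approximate D_i = integral over [a_i, d_i] of (s_i(t) - 1/N(t)) dt.
    'slices' is a list of per-time-slice dicts mapping every job *present* in
    that slice to the share of the resource it receives (0.0 while it waits)."""
    disc = {}
    for present in slices:
        n = len(present)                    # N(t): number of jobs present in the slice
        for job, share in present.items():
            disc[job] = disc.get(job, 0.0) + (share - 1.0 / n) * dt
    return disc

# Two unit-length jobs arrive together, but only one can run at a time:
# the job served first is favoured, the one left waiting is discriminated.
slices = [{"J1": 1.0, "J2": 0.0},   # slice 1: J1 runs, J2 waits (N(t) = 2)
          {"J2": 1.0}]              # slice 2: J2 runs alone (N(t) = 1)
print(raqfm_discrimination(slices))  # {'J1': 0.5, 'J2': -0.5}
```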
discrimination of job is then the deviation from this ideal state Di = Ti − cxi . PN SQF is then expressed as SQF = i=1 (cxi − Ti )2 (see Fig. 4). Resc1 1 Resc2 1 2 1 1.5 1.5 2 1 2 Fig. 4. An example of a schedule generated using SQF measure. Values represent the computed slowdown for each job. For c = 1.5, this schedule has SQF = 1.75. 2.4 Applicability of Analytical Approaches Analytical approaches provide relatively simple criterions and measures that have very well defined behavior. For systems with fairness requirements fitting one of these measures, they offer an easy implementation option with predictable results. However, as these metrics are based on heavily simplified models of queuing systems, they lack several important features. For one, these metrics are user-agnostic, that is they assume an equality of users and jobs and do not consider the possibility of a single user being represented by multiple jobs in the system. These metrics are also resource agnostic, as they do not consider the possibility of different jobs requesting different amounts of resource, e.g., CPU cores. 3 Trace Based Approaches Job traces represent the record of jobs that were processed by a production system, generally containing information about job arrivals, start times, completion times, amounts of resources requested, etc. These traces can be used for the evaluation of scheduling policies directly in the context of a production system [13]. Job traces can be easily shared [11], which allows researchers to determine how their new scheduling policies perform with respect to various real workloads. 117 Š. Tóth 3.1 Trace Analysis Trace analysis provides an alternative approach to experimental psychological studies. Instead of a complicated and expensive experiments, trace analysis relies on the extraction of session information from the provided traces [48]. A session is an uninterrupted interaction between the user and the resource management system, where the user actively waits for his/hers jobs to finish and then submits more jobs. In this context the users satisfaction correlates with the response time of jobs instead of slowdown [39], which is the generally preferred metric. 3.2 Trace Modeling Utilizing a pre-recorded workload however has its own shortcomings — particularly due to the static nature of the workload it cannot capture any dynamic behavior in the system. Trace modeling offers a solution to this problem with additional benefits, such as the ability to change the load of the system on demand [19, 28]. From the perspective of fairness evaluation, it is very important to create a model that realistically matches the behavior of users in the system. This covers everything from simple daily, weekly and yearly cycles, where users naturally submit less jobs during the nighttime, weekends and holidays [27, 40, 12], to the simulation of user sessions in the system [47]. Including these features in a workload model significantly improves the evaluation precision [38]. 3.3 Applicability of Trace Approaches Trace based analysis of user sessions provides a very different perspective on user behavior, as it essentially divides users into two categories “interactive” and “non-interactive”, with users transitioning from the interactive category as they reach their tolerance for waiting. In such environment it makes sense to prefer users that are still in interactive mode, as they are willing to submit more jobs into the system. Trace modeling on the other hand provides a deceptively simple premise. 
Trace modeling, on the other hand, rests on a deceptively simple premise. Its purpose is to improve the quality and flexibility of job traces. While the premise itself is quite simple, the road to this goal is quite complicated, as it includes both the detection and the modeling of workload cycles (daily, weekly, yearly) and of session boundaries.

4 Simulation Based Approaches

Simulation based approaches try to improve measurement precision by simulating a real system. This can be facilitated by specialized grid simulators [10, 22] or even by the special simulation modes of production schedulers [1, 2]. In this context it is important to mention the commonly implemented fairness measure: fairshare. Fairshare is an ordering policy that dynamically changes the order of jobs based on the historical resource usage of their owners (see Fig. 5 and Fig. 6). Fairshare-based ordering policies are supported by many production resource management systems, such as PBS [30], TORQUE [3], Moab [2], Maui [1], Quincy [18] or the Hadoop Fair and Capacity Schedulers [5, 4].

Fig. 5. An example of a schedule generated using the fairshare variant where jobs are accounted once completed.

Fig. 6. An example of a schedule generated using the fairshare variant where jobs are accounted once started.

4.1 Statistical Approaches

One possibility for analyzing fairness in a system is to choose a desired performance metric (wait time, response time, slowdown) and then analyze the statistical properties of this variable across users. In particular, we can look at the sample variance n·(E[x^2] − E^2[x])/(n − 1), the standard deviation √Variance, the coefficient of variation CV = STD/E[x], or the fairness index 1/(1 + CV^2) proposed by Vasupongayya [42].

The previously mentioned fairshare, being an ordering policy, is not directly usable for fairness measurement; there are, however, fairshare-inspired metrics. One example is a variance-based metric built on the normalized user wait time NUWT_o = TUWT_o / TUSA_o, where TUWT_o is the total summed wait time of a particular user o and TUSA_o is the total summed resource usage of that user. The fairness of a system can then be expressed as F = Σ_o (UWT − NUWT_o)^2, where UWT is the average normalized wait time across all users [21] (see Fig. 7).

Fig. 7. An example of a schedule generated using the NUWT metric. Values represent NUWT_o for the owner of the job at the job's start.

A slightly different approach is offered by the Fair Start Time (FST) metric [35, 26]. FST measures the unfairness caused by later arriving jobs on the start time of a currently waiting job by constructing a schedule under the assumption that no later jobs arrive. FST then represents the difference between the computed start time and the actual start time of the job.
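As a concrete illustration of the variance-based metric built on normalized user wait times, the following sketch computes F from a list of finished jobs. The dictionary-based job records and the choice of CPUs × runtime as the resource usage TUSA_o are assumptions of this sketch.

```python
from collections import defaultdict

def nuwt_fairness(jobs):
    """Variance-style fairness F = sum over users o of (UWT - NUWT_o)^2, where
    NUWT_o = (total wait time of user o) / (total resource usage of user o).

    `jobs` is an iterable of dicts with keys: owner, submit, start, runtime, cpus.
    Resource usage is taken as cpus * runtime (an illustrative choice).
    """
    wait = defaultdict(float)    # TUWT_o
    usage = defaultdict(float)   # TUSA_o
    for j in jobs:
        wait[j["owner"]] += j["start"] - j["submit"]
        usage[j["owner"]] += j["cpus"] * j["runtime"]

    nuwt = {o: wait[o] / usage[o] for o in wait if usage[o] > 0}
    if not nuwt:
        return 0.0
    uwt = sum(nuwt.values()) / len(nuwt)          # average normalized wait time
    return sum((uwt - v) ** 2 for v in nuwt.values())
```

Lower values of F mean that the normalized wait times of individual users sit closer to the system-wide average, which is exactly the notion of fairness this metric encodes.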
4.2 High Resolution Approaches

All the methods presented so far are generally designed to provide a single comparable value. While this may be sufficient, recent advances show that a significant amount of information is lost when such an approach is chosen. High resolution analysis [25] has recently shown that this information can be very important for policy decision making, and while the results of high resolution analysis are certainly harder to interpret, quantitative and qualitative comparisons are still possible [24]. Even more importantly, the high resolution approach enables completely novel measures that would lose their meaning when represented as a single number.

One of these measures is the Expected End Time (EET) measure [41]. It builds on the notion that each user entering the system has an expectation of the amount of resources he or she will receive at any time. By modeling this expectation, the measure computes an Expected End Time for each job submitted into the system by that user. The EET can then be compared with the actual end time of the particular job, and the cases where the EET was not achieved can be plotted as a heatmap; for an example of such a heatmap, see Fig. 8.

Fig. 8. Heatmaps showing the number of violated EETs; the shade represents the number of violations, the x-axis represents time and the y-axis represents users. The bottom part of the graph contains CPU and memory utilization histograms.
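The concrete expectation model behind EET is defined in [41] and only summarized above, so the following sketch implements one simplified, plausible instantiation purely to illustrate how EET violations for such a heatmap could be counted: at every unit time step, each user with pending work is expected to receive an equal share of the machines, and this expected capacity is consumed by the user's jobs in submission order. The job-record layout and the unit-time grid are likewise assumptions of this sketch.

```python
def eet_violations(jobs, machines=2):
    """Count jobs whose actual end time exceeds their Expected End Time (EET)
    under a simple equal-share expectation model.

    `jobs` is a list of dicts with keys: owner, submit, runtime, end.
    """
    remaining = [float(j["runtime"]) for j in jobs]   # expected work still to cover
    eet = [None] * len(jobs)
    t = 0
    while any(r > 0 for r in remaining):
        # users that currently have submitted but not yet "expectedly finished" work
        active = {jobs[i]["owner"] for i, r in enumerate(remaining)
                  if r > 0 and jobs[i]["submit"] <= t}
        if active:
            share = machines / len(active)            # expected capacity per user and step
            for user in active:
                budget = share
                queue = sorted((i for i, j in enumerate(jobs)
                                if j["owner"] == user and remaining[i] > 0
                                and j["submit"] <= t),
                               key=lambda i: jobs[i]["submit"])
                for i in queue:
                    spent = min(budget, remaining[i])
                    remaining[i] -= spent
                    budget -= spent
                    if remaining[i] == 0:
                        eet[i] = t + 1                # expected to finish by the end of this step
                    if budget == 0:
                        break
        t += 1
    return sum(1 for i, j in enumerate(jobs) if j["end"] > eet[i])
```

Aggregating such violation counts per user and per time window is what produces the kind of heatmap shown in Fig. 8.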
4.3 Applicability of Simulation Approaches

Simulation approaches lie at the opposite end of the spectrum from analytical approaches. Where analytical approaches provide metrics that are simple to analyze but hard to apply to more complex systems, simulation approaches match the behavior of the evaluated system very well. Their results can, however, be very hard to analyze [41]. Even when statistical and high resolution approaches are employed, the post-analysis of these results can be highly nontrivial [25, 24]. For example, which policy is better: the one providing a better average but higher variance, or the one with a worse average but lower variance?

5 Conclusion

In this paper we have presented an overview of methods for classifying, measuring and facilitating the measurement of fairness. Analytical approaches offer the best solution for as-is deployment, as they have very clearly defined semantics and simple evaluation. They are, however, based on heavily simplified models of systems, which limits their scope. Trace based approaches provide an interesting alternative to psychological analyses of fairness, and trace modeling provides an important foundation for high-precision simulations by closely modeling the behavior of users in the system. Simulation based approaches represent the most complex approach to fairness analysis; they can be used to precisely evaluate fairness in even the most complicated and dynamic systems. This complexity, however, comes with its own problems, as the results of simulations can be very hard to analyze.

5.1 Case for Multifaceted Fairness

All the methods presented in this paper have one common shortcoming: they are all designed either to provide, or to facilitate the discovery of, an "ultimate" fairness model. While this simplifies the selection of a fitting fairness model for a production system, it does not address the dynamically evolving requirements these systems usually face. Each change in the users' workloads may fall into a category that is not well handled by the selected fairness model. In such a case, the user can either be heavily penalized for his or her new workload or, even worse, the new workload may cause quality deterioration for all other users in the system. For this reason, we propose a new research avenue. Instead of concentrating on a single fairness model, we should research the possibility of combining a set of fairness models inside a single scheduler. This would allow users either to select a fairness model that matches their expectations, or even to let the scheduler determine this categorization automatically from the user's workload style. In such a model, when a user's workload changes, he or she would simply be reassigned to the proper fairness group. The same would be possible for users newly entering the system. If a completely new (currently unsupported) use case were encountered, one would only have to design a fairness group matching this particular use case and integrate it into the framework. This is much simpler than the current process of completely redesigning the entire fairness model. We invite the reader to seek out our future publications, which will explore this idea further.

Acknowledgments. We highly appreciate the support of the Grant Agency of the Czech Republic under grant No. P202/12/0306.

References

1. Adaptive Computing Enterprises, Inc. Maui Scheduler Administrator's Guide, version 3.2, January 2014. http://docs.adaptivecomputing.com.
2. Adaptive Computing Enterprises, Inc. Moab Workload Manager Administrator's Guide, version 7.2.6, January 2014. http://docs.adaptivecomputing.com.
3. Adaptive Computing Enterprises, Inc. TORQUE Administrator Guide, version 4.2.6, January 2014. http://docs.adaptivecomputing.com.
4. Apache.org. Hadoop Capacity Scheduler, January 2014. http://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html.
5. Apache.org. Hadoop Fair Scheduler, January 2014. http://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html.
6. B. Avi-Itzhak, E. Brosh, and H. Levy. SQF: A slowdown queueing fairness measure. Performance Evaluation, 64(9):1121–1136, 2007.
7. B. Avi-Itzhak, H. Levy, and D. Raz. A resource allocation queueing fairness measure: properties and bounds. Queueing Syst., 56(2):65–71, 2007.
8. B. Avi-Itzhak, H. Levy, and D. Raz. Quantifying fairness in queuing systems. Probability in the Engineering and Informational Sciences, 22(04):495–517, 2008.
9. P. Brucker and P. Brucker. Scheduling algorithms, volume 3. Springer, 2007.
10. R. Buyya and M. Murshed. GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience, 14(13-15):1175–1220, 2002.
11. D. Feitelson. Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload.
12. D. Feitelson and E. Shmueli. A case for conservative workload modeling: Parallel job scheduling with daily cycles of activity. In Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium on, pages 1–8, Sept 2009.
13. D. G. Feitelson. Packing schemes for gang scheduling. In Job Scheduling Strategies for Parallel Processing, pages 89–110. Springer, 1996.
14. D. G. Feitelson and L. Rudolph. Parallel job scheduling: Issues and approaches. In Job Scheduling Strategies for Parallel Processing, pages 1–18. Springer, 1995.
15. D. G. Feitelson, L. Rudolph, and U. Schwiegelshohn. Parallel job scheduling: a status report. In Job Scheduling Strategies for Parallel Processing, pages 1–16. Springer, 2005.
16. D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing, pages 1–34. Springer, 1997.
17. E. Frachtenberg and D. G. Feitelson.
Pitfalls in parallel job scheduling evaluation. In Job Scheduling Strategies for Parallel Processing, pages 257–282. Springer, 2005. 18. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 261–276, 2009. 19. J. Jann, P. Pattnaik, H. Franke, F. Wang, J. Skovira, and J. Riordan. Modeling of workload in mpps. In Job Scheduling Strategies for Parallel Processing, pages 95–116. Springer, 1997. 20. D. G. Kendall. Stochastic processes occurring in the theory of queues and their analysis by the method of the imbedded markov chain. The Annals of Mathematical Statistics, pages 338–354, 1953. 21. D. Klusácek and H. Rudová. Performance and fairness for users in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing, pages 235–252. Springer, 2013. 22. D. Klusáček and H. Rudová. Alea 2: job scheduling simulator. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, page 61. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2010. 23. D. Klusáček and Š. Tóth. On interactions among scheduling policies: Finding efficient queue setup using high-resolution simulations. In F. Silva, I. Dutra, and V. S. Costa, editors, Euro-Par 2014, volume 8632 of LNCS, pages 138–149. Springer, 2014. 24. D. Krakov and D. G. Feitelson. Comparing performance heatmaps. In Job Scheduling Strategies for Parallel Processing. Citeseer, 2013. 25. D. Krakov and D. G. Feitelson. High-resolution analysis of parallel job workloads. In Job Scheduling Strategies for Parallel Processing, pages 178–195. Springer, 2013. 26. V. J. Leung, G. Sabin, and P. Sadayappan. Parallel job scheduling policies to improve fairness: a case study. Technical Report SAND2008-1310, Sandia National Laboratories, 2008. 27. V. Lo and J. Mache. Job scheduling for prime time vs. non-prime time. In Cluster Computing, 2002. Proceedings. 2002 IEEE International Conference on, pages 488– 493. IEEE, 2002. 28. U. Lublin and D. G. Feitelson. The workload on parallel supercomputers: modeling the characteristics of rigid jobs. Journal of Parallel and Distributed Computing, 63(11):1105–1122, 2003. 29. MetaCentrum, January 2014. http://www.metacentrum.cz/. 30. PBS Works. PBS Professional 12.1, Administrator’s Guide, January 2014. http: //www.pbsworks.com. 31. A. Rafaeli, G. Barron, and K. Haber. The effects of queue structure on attitudes. Journal of Service Research, 5(2):125–139, 2002. 32. A. Rafaeli, E. Kedmi, D. Vashdi, and G. Barron. Queues and fairness: A multiple study experimental investigation. Manuscript under review, 2005. 33. D. Raz, B. Avi-Itzhak, and H. Levy. Classes, priorities and fairness in queueing systems. RUTCOR, Rutgers University, Tech. Rep. RRR-21-2004, 2004. 123 Š. Tóth 34. D. Raz, H. Levy, and B. Avi-Itzhak. A resource-allocation queueing fairness measure. In ACM SIGMETRICS Performance Evaluation Review, volume 32, pages 130–141. ACM, 2004. 35. G. Sabin, G. Kochhar, and P. Sadayappan. Job fairness in non-preemptive job scheduling. In Parallel Processing, 2004. ICPP 2004. International Conference on, pages 186–194. IEEE, 2004. 36. W. Sandmann. A discrimination frequency based queueing fairness measure with regard to job seniority and service requirement. In Next Generation Internet Networks, 2005, pages 106–113. IEEE, 2005. 37. U. Schwiegelshohn. How to design a job scheduling algorithm. 
In Job Scheduling Strategies for Parallel Processing, 2014. 38. E. Shmueli and D. Feitelson. Using site-level modeling to evaluate the performance of parallel system schedulers. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006. MASCOTS 2006. 14th IEEE International Symposium on, pages 167–178, Sept 2006. 39. E. Shmueli and D. G. Feitelson. Uncovering the effect of system performance on user behavior from traces of parallel systems. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2007. MASCOTS’07. 15th International Symposium on, pages 274–280. IEEE, 2007. 40. E. Shmueli and D. G. Feitelson. On simulation and design of parallel-systems schedulers: are we doing the right thing? Parallel and Distributed Systems, IEEE Transactions on, 20(7):983–996, 2009. 41. Š. Tóth and D. Klusáček. User-aware metrics for measuring quality of parallel job schedules. In Job Scheduling Strategies for Parallel Processing, 2014. 42. S. Vasupongayya and S.-H. Chiang. On job fairness in non-preemptive parallel job scheduling. In IASTED PDCS, pages 100–105, 2005. 43. A. Wierman. Fairness and classifications. ACM SIGMETRICS Performance Evaluation Review, 34(4):4–12, 2007. 44. A. Wierman. Scheduling for today’s computer systems: Bridging theory and practice. PhD thesis, Carnegie Mellon University, 2007. 45. A. Wierman. Fairness and scheduling in single server queues. Surveys in Operations Research and Management Science, 16(1):39–48, 2011. 46. A. Wierman and M. Harchol-Balter. Classifying scheduling policies with respect to unfairness in an M/GI/1. In ACM SIGMETRICS Performance Evaluation Review, volume 31, pages 238–249. ACM, 2003. 47. N. Zakay and D. G. Feitelson. Preserving user behavior characteristics in tracebased simulation of parallel job scheduling. 2014. 48. J. Zilber, O. Amit, and D. Talby. What is worth learning from parallel workloads?: a user and session based analysis. In Proceedings of the 19th annual international conference on Supercomputing, pages 377–386. ACM, 2005. 124 Part IV Presentations Fault Recovery Method with High Availability for Practical Applications∗ Jaroslav Borecký, Pavel Vı́t, and Hana Kubátová Department of Digital Design, Faculty of Information Technology Czech Technical University in Prague, Technická 9, Prague, Czech Republic {borecjar, pavel.vit, hana.kubatova}@fit.cvut.cz Our research is focused on mission critical applications using SRAM based Field Programmable Gate Arrays (FPGAs). The main goal is to reach higher availability and dependability and low power using unreliable components (FPGAs) with respect to highest safety according to strict Czech standards [1]. Our methodology is designed for fast applications and rapid prototyping of modular systems, which are useful for fast development thanks to its regulars structure. The methodology combines Concurrent Error Detection (CED) techniques, FPGA dynamic reconfigurations and our previously designed Modified Duplex System (MDS) architecture. The methodology tries minimizes area overhead. It is aimed for practical applications of modular systems, which are composed from blocks. We applied and tested it on the safety railway station system. The proposed method is based on static and partial dynamic reconfiguration of totally self-checking blocks which allows a full recovery from a Single Even Upset (SEU). The method is based on two independent FPGA boards with the same design, it decreases the development time. 
Each FPGA is divided into two main parts: a reconfiguration area (RA) and a static area (SA). The whole system is placed in reconfigurable partitions (RP) of the RA. The SA checks failure signals and immediately repairs soft errors in RPs. This reduces recovery time, because it uses partial reconfiguration often, while the whole FPGA reconfiguration only in critical situations. Every block is designed as TSC, also the static area satisfies TSC property. This paper was presented at DSD 2014. The main advantage is in usage of the partial reconfiguration. This allows faster detection and correction of faults. Reconfiguration of only one RP is faster than to reconfigure the whole FPGA. It leads to availability and security increase within minimal area overhead. Smaller overhead leads to smaller FPGA and low power consumption. In comparison with TMR, it uses less area, it is faster, cheaper, and with shorter development time. References 1. ČSN EN 50126, Czech Technical Norm ”http://nahledy.normy.biz/nahled.php?i=59709”, 2011 ∗ This research has SGS14/105/OHK3/1T/18. been partially supported by the project 127 Verification of Markov Decision Processes using Learning Algorithms? Tomáš Brázdil1 , Krishnendu Chatterjee2 , Martin Chmelı́k2 , Vojtěch Forejt3 , Jan Křetı́nský2 , Marta Kwiatkowska3 , David Parker4 , and Mateusz Ujma3 1 3 Masaryk University, Brno, Czech Republic 2 IST Austria University of Oxford, UK 4 University of Birmingham, UK We present a general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs). The primary goal of these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations. The first assumes that full knowledge of the MDP is available, and performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP, and yields probabilistic guarantees, again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. The latter is the first extension of statistical model checking for unbounded properties in MDPs. In contrast with other related techniques, our approach is not restricted to time-bounded (finite-horizon) or discounted properties, nor does it assume any particular properties of the MDP. We also show how our methods extend to LTL objectives. We present experimental results showing the performance of our framework on several examples. The paper has been accepted to ATVA 2014. ? This research was funded in part by the European Research Council (ERC) under grant agreement 267989 (QUAREM), 246967 (VERIWARE) and 279307 (Graph Games), by the EU FP7 project HIERATIC, by the Austrian Science Fund (FWF) projects S11402-N23 (RiSE), S11407-N23 (RiSE) and P23499-N23, by the Czech Science Foundation grant No P202/12/P612, by the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013) under REA grant agreement n 291734, by EPSRC project EP/K038575/1 and by the Microsoft faculty fellows award. 128 CEGAR for Qualitative Analysis of Probabilistic Systems ? Krishnendu Chatterjee, Martin Chmelı́k, and Przemyslaw Daca IST Austria We consider Markov decision processes (MDPs) which are a standard model for probabilistic systems. 
We focus on qualitative properties for MDPs that can express that desired behaviors of the system arise almost-surely (with probability 1) or with positive probability. We introduce a new simulation relation to capture the refinement relation of MDPs with respect to qualitative properties, and present discrete graph theoretic algorithms with quadratic complexity to compute the simulation relation. We present an automated technique for assumeguarantee style reasoning for compositional analysis of MDPs with qualitative properties by giving a counterexample guided abstraction-refinement approach to compute our new simulation relation. We have implemented our algorithms and show that the compositional analysis leads to significant improvements. Compositional analysis and CEGAR. One of the key challenges in analysis of probabilistic systems (as in the case of non-probabilistic systems) is the state explosion problem, as the size of concurrent systems grows exponentially in the number of components. One key technique to combat the state explosion problem is the assume-guarantee style composition reasoning, where the analysis problem is decomposed into components and the results for components are used to reason about the whole system, instead of verifying the whole system directly. This simple, yet elegant asymmetric rule is very effective in practice, specially with a counterexample guided abstraction-refinement (CEGAR) loop. Our contributions. In this work we focus on the compositional reasoning of probabilistic systems with respect to qualitative properties, and our main contribution is a CEGAR approach for qualitative analysis of probabilistic systems. We consider the fragment of pCTL∗ that is relevant for qualitative analysis, and refer to this fragment as QCTL∗ . The details of our contributions are as follows: 1. To establish the logical relation induced by QCTL∗ we consider the logic ATL∗ for two-player games and the two-player game interpretation of an MDP where the probabilistic choices are resolved by an adversary. In case of non-probabilistic systems and games there are two classical notions for refinement, namely, simulation and alternating-simulation. We first show that the logical relation induced by QCTL∗ is finer than the intersection of simulation and alternating simulation. We then introduce a new notion of simulation, namely, combined simulation, and show that it captures the logical relation induced by QCTL∗ . 2. We show that our new notion of simulation, which captures the logic relation of QCTL∗ , can be computed using discrete graph theoretic algorithms in quadratic time. We present a CEGAR approach for the computation of combined simulation. ? The paper was accepted to CAV 2014 129 From LTL to Deterministic Automata: A Safraless Compositional Approach Javier Esparza and Jan Křetı́nský Institut für Informatik, Technische Universität München, Germany IST Austria Linear temporal logic (LTL) is the most popular specification language for linear-time properties. In the automata-theoretic approach to LTL verification, formulae are translated into ω-automata, and the product of these automata with the system is analyzed. Therefore, generating small ω-automata is crucial for the efficiency of the approach. In quantitative probabilistic verification, LTL formulae need to be translated into deterministic ω-automata. 
Until recently, this required to proceed in two steps: first translate the formula into a nondeterministic Büchi automaton (NBA), and then apply Safra’s determinization or its variants. This is also the approach adopted in PRISM, a leading probabilistic model checker. Since automata produced in this way are often very large, we presented an algorithm that directly constructs a generalized DRA (GDRA) for the fragment of LTL containing only the temporal operators F and G [2]. The GDRA can be either (1) de-generalized into a standard DRA, or (2) used directly in the probabilistic verification process [1]. In both cases we get much smaller automata for many formulae. For instance, the standard approach translates a conjunction of three fairness constraints into an automaton with over a million states, while the algorithm of [2] yields a GDRA with one single state, and a DRA with 462 states. In this paper we present a novel approach able to handle full LTL, and even the alternation-free linear-time µ-calculus. The approach is compositional: the automaton is the parallel composition of a master automaton and an array of slave automata, one for each G-subformula of the original formula. Intuitively, the master monitors the formula that remains to be fulfilled and takes care of checking safety and reachability properties. A slave for each subformula of the form Gψ checks whether Gψ eventually holds, i.e., whether FGψ holds. Experimental results show improvement in the sizes of the resulting automata compared to existing methods. The paper was accepted and presented at CAV 2014. References 1. Krishnendu Chatterjee, Andreas Gaiser, and Jan Křetı́nský. Automata with generalized Rabin pairs for probabilistic model checking and LTL synthesis. In CAV, pages 559–575, 2013. 2. Jan Křetı́nský and Javier Esparza. Deterministic automata for the (F,G)-fragment of LTL. In CAV, pages 7–22, 2012. 130 Faster Existential FO Model Checking on Posets Jakub Gajarský Faculty of Informatics, Masaryk University [email protected] The model checking problem, i.e. the problem to decide whether a given logical sentence is true in a given structure, is one of the fundamental problems in theoretical computer science. For the familiar first order logic, there is a well-established line of study of the model checking problem on combinatorial structures, culminating in the recent result of Grohe, Kreutzer and Siebertz (STOC 2014) for the class of nowhere-dense graphs. In contrast, not much is known once we focus on finite algebraic structures. Recently, Bova, Ganian and Szeider (LICS 2014) investigated the complexity of the model checking problem for FO and partially ordered sets. They show that the model checking problem for the existential fragment of FO can be solved in time f (|φ|).ng(w) , where n is the size of a poset and w its width, i.e. the size of its largest antichain. In the parlance of parameterized complexity, this means that the problem is FPT (fixed-parameter tractable) in the size of the formula, but only XP in the width of the poset. The proof is a bit involved, and goes by first showing that the model checking problem (for the existential fragment of FO) is equivalent to the embedding problem for posets, and then reducing the embedding problem to a suitable family of instances of the homomorphism problem of certain semilattice structures. In this talk we improve upon (and simplify) the result of Bova et al. 
by showing that the model-checking problem is FPT in both the size of the formula and the width of the poset. We give two different, fixed-parameter algorithms solving the embedding problem. The first algorithm is a natural, and easy to understand, polynomial-time reduction to a CSP instance closed under min polymorphisms, giving us O(n4 ) dependence of the running time on the size of the poset. The second algorithm has even better, quadratic time complexity and works by reducing the embedding problem to a restricted variant of the multicoloured clique problem, which is then efficiently solved. To complement the previous fixed-parameter tractability results, we also investigate possible kernelization of the embedding problem for posets. We show that the embedding problem (and therefore the existential FO model checking problem) does not have a polynomial kernel, unless coNP ⊆ NP/poly, which is thought to be unlikely. This means the embedding problem cannot be efficiently reduced to an equivalent instance of size polynomial in the parameter. Presented work is a joint collaboration with Petr Hliněný, Jan Obdržálek and Sebastian Ordyniak accepted to ISAAC 2014. 131 Fully Automated Shape Analysis Based on Forest Automata Lukáš Holı́k, Ondřej Lengál, Adam Rogalewicz, Jiřı́ Šimáček, and Tomáš Vojnar FIT, Brno University of Technology, IT4Innovations Centre of Excellence, Czech Republic Forest automata (FAs) have recently been proposed as a tool for shape analysis of complex heap structures. FAs encode sets of tree decompositions of heap graphs in the form of tuples of tree automata. In order to allow for representing complex heap graphs, the notion of FAs allowed one to provide user-defined FAs (called boxes) that encode repetitive graph patterns of shape graphs to be used as alphabet symbols of other, higher-level FAs. In the presented work, we describe a newly developed technique of automatically learning the FAs to be used as boxes that avoids the need of providing them manually. Further, we propose a significant improvement of the automata abstraction used in the analysis. The result is an efficient, fully-automated analysis that can handle even as complex data structures as skip lists, with the performance comparable to state-of-theart fully-automated tools based on separation logic, which, however, specialise in dealing with linked lists only. This presentation is based on a paper with the same name that appeared in the proceedings of CAV 2013. Acknowledgement. This work was supported by the Czech Science Foundation (projects P103/10/0306, 13-37876P), the Czech Ministry of Education, Youth, and Sports (project MSM 0021630528), the BUT FIT project FIT-S-12-1, and the EU/Czech IT4Innovations Centre of Excellence project CZ.1.05/1.1.00/ 02.0070. 132 Multi-objective Genetic Optimization for Noise-Based Testing of Concurrent Software Vendula Hrubá, Bohuslav Křena, Zdeněk Letko, Hana Pluháčková, and Tomáš Vojnar IT4Innovations Centre of Excellence, FIT, Brno University of Technology, Czech Rep., {ihruba, krena, iletko, ipluhackova, vojnar}@fit.vutbr.cz Testing of multi-threaded programs is a demanding work due to the many possible thread interleavings one should examine. The noise injection technique helps to increase the number of thread interleavings examined during repeated test executions provided that a suitable setting of noise injection heuristics is used. 
The problem of finding such a setting, i.e., the so called test and noise configuration search problem (TNCS problem), is not easy to solve according to ”Testing of Concurrent Programs Using Genetic Algorithms.” (Hrubá, V., Křena, B., Letko, Z., and Vojnar, T., SSBSE’12). In this paper, we show how to apply a multi-objective genetic algorithm (MOGA) to the TNCS problem. In particular, we focus on generation of TNCS solutions that are suitable for regression testing where tests are executed repeatedly. Consequently, we are searching for TNCS candidate solutions that cover a high number of distinct interleavings (especially those which are rare) and provide stable results in the same time. To achieve this goal, we study suitable metrics and ways how to suppress effects of non-deterministic thread scheduling on the proposed MOGA-based approach. We also discuss a choice of a MOGA and its parameters suitable for our setting. Finally, we show on a set of benchmark programs that our approach provides better results when compared to the commonly used random approach as well as to the sooner proposed use of a single-objective genetic approach. The presentation is based on the paper ”Multi-objective Genetic Optimization for Noise-Based Testing of Concurrent Software” (Hrubá, V., Křena, B., Letko, Z., Pluháčková, H., and Vojnar, T., SSBSE’14). Acknowledgement. We thank Shmuel Ur and Zeev Volkovich for many valuable comments on the work presented in this paper. The work was supported by the Czech Ministry of Education under the Kontakt II project LH13265, the EU/Czech IT4Innovations Centre of Excellence project CZ.1.05/1.1.00/02.0070, and the internal BUT projects FIT-S-11-1 and FIT-S-12-1. Zdeněk Letko was funded through the EU/Czech Interdisciplinary Excellence Research Teams Establishment project (CZ.1.07/2.3.00/30.0005). 133 On Interpolants and Variable Assignments? Pavel Jancik2 , Jan Kofroň2 , Simone Fulvio Rollini1 , and Natasha Sharygina1 1 2 University of Lugano, Switzerland, {name.surname}@usi.ch D3S, Faculty of Mathematics and Physics, Charles University, Czech Rep., {name.surname}@d3s.mff.cuni.cz Craig interpolants are widely used in program verification as a means of abstraction. For propositional logic the interpolants can be computed by wellestablished McMillan’s and symmetric Pudlák’s interpolation systems, which are generalized by the Labeled Interpolation Systems (LISs) (D’Silva, 2010). In the area of Abstract Reachability Trees resp. Abstract Reachability Graphs (ARG) interpolants play an important role. In ARG each graph node has a label (representing an over-approximation of program states reachable at the node) assigned. The node labels are typically derived from (node) interpolants. A safe, complete, and well-labeled ARG can be used to show correctness of the corresponding program. In order to obtain a well-labeled ARG, the computed node interpolants have to be inductive. To compute node interpolants the ARG has to be first converted into a formula, which is then passed to a solver. If the formula is satisfiable, the program is not safe and the error trace can be derived from the variable assignment. Otherwise node interpolants can be derived from the refutation proof; to this end the input formula is split into two parts – A and B. Even though it is possible to use a standard interpolation systems, this suffers from various drawbacks; the interpolant over-approximates all states on the boundary between A and B, which can include many ARG nodes. 
Furthermore, the shared-variables occurring in the interpolant may not be in-scope at a given node; thus a post-processing steps (involving, e.g., quantifier elimination) are needed to derive a node label from the interpolant. To face the aforementioned issues, (i) we introduce the concept of Partial Variable Assignment Interpolants (PVAIs) as a generalization of Craig interpolants. A variable assignment focuses the computed interpolant via restricting the set of clauses taken into account during interpolation. In the scope of ARGs, a variable assignment is used to exclude some paths from the set being considered by the (node) interpolant, thus specializing the interpolant to the relevant ones, i.e., only to those going via the corresponding node. Due to this specialization it is possible to guarantee that unwanted out-of-scope variables (coming from ignored paths) do not appear in the interpolant. Furthermore, (ii) we present a way to compute PVAIs for propositional logic based on an extension of the LISs. The extension uses variable assignment to omit irrelevant parts of the resolution proofs (thus reducing the interpolant size) as well as to modify the locality constraints to omit the out-out-scope variables. Last, (iii) we show that the ex? This work is partially supported by the Grant Agency of the Czech Republic project 14-11384S, and Charles University Foundation grant 203-10/253297. 134 Finding Terms in Corpora for Many Languages Adam Kilgarriff† , Miloš Jakubı́ček‡† , Vojtěch Kovář‡† , Pavel Rychlý‡† , and Vı́t Suchomel‡† ‡ NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic † Lexical Computing Ltd., Brighton, United Kingdom [email protected],{xjakub,xkovar3,pary,xsuchom2}@fi.muni.cz Term candidates for a domain, in a language, can be found by taking a corpus for the domain, and a reference corpus for the language, identifying the grammatical shape of a term in the language, tokenising, lemmatising and POStagging both corpora, identifying and counting the items in each corpus which match the grammatical shape, and for each item in the domain corpus, comparing its frequency with its frequency in the refence corpus. Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. In this abstract we describe how we addressed the stages above. We make the simplifying assumption that terms are noun phrases (in their canonical form, without leading articles: the term is base station, not the base stations.) Then the task is to write a noun phrase grammar for the language. Within the Sketch Engine we already have machinery for shallow parsing, based on a ’Sketch Grammar’ of regular expressions over part-of-speech tags, written in Corpus Query Language. Our implementation is mature, stable and fast, processing million-word corpora in seconds and billion-word corpora in a few hours. The machinery has most often been used to find <grammatical-relation, word1, word2> triples for lexicography and related research. It was modified to find, and count, the items having the appropriate shape for a term. The challenge of identifying the best candidate terms for the domain, given their frequency in the domain corpus and the reference corpus, is a variant on the challenge of finding the keywords in a corpus. A good method is simply to take the ratio of the normalised frequency of the term in the domain corpus to its normalised frequency in a reference corpus. 
Candidate terms are then presented to the user in a sorted list, with the best candidates – those with the highest domain:reference ratio – at the top. Each item in the list is clickable: the user can click to see a concordance for the term, in either the domain or the reference corpus. The current challenge we face is to get the correct cannonical form of the term. In English one (almost) always wants to present each word in the term candidate in its canonical, dictionary form. But in gender sensitive languages one does not. A gender respecting lemma turns out necessary in such cases. Another challenge is to keep the same processing chains for all corpora, regardless the size. The reference corpus is processed in batch mode, and we hope not to upgrade it more than once a year. The domain corpus is processed at runtime. For term-finding, we have had to look carefully at the tools, separating 135 A. Kilgarriff et al. each out into an independent module, so that we can be sure of applying the same versions throughout. We have undertaken a first evaluation using the GENIA corpus , in which all terms have been manually identified. Keyword and term extraction was performed to obtain the top 2000 keywords and top 1000 multi-word terms. Terms manually annotated in GENIA as well as terms extracted by our tool were normalized before comparison (lower case, spaces and hyphens removed) and then GENIA terms were looked up in the extraction results. 61 of the top 100 GENIA terms were found by the system. The terms not found were not English words: most were acronyms, e.g. EGR1, STAT-6. We have built a system for finding terms in a domain corpus in Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, Spanish. We will extend the coverage of languages and improve the system according to further feedback from users in 2014. This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013. This work was presented in KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ a Vı́t SUCHOMEL. Finding Terms in Corpora for Many Languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conferencethe European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: The Association for Computational Linguistics, 2014. s. 53-56, 4 s. ISBN 978-1-937284-75-6. 136 Hereditary properties of permutations are strongly testable Tereza Klimošová and Daniel Král’ Institute of Mathematics, University of Warwick, Coventry CV4 7AL, UK. E-mail: [email protected], [email protected]. Property testing is a topic with growing importance with many connections to various areas of mathematics and computer science. A property tester is an algorithm that decides whether a large input object has the considered property by querying only a small sample of it. Since the tester is presented with a part of the input structure, it is necessary to allow an error based on the robustness of the tested property of the input. The most investigated area of property testing is testing graph properties. One of the most significant results in this area is that of Alon and Shapira asserting that every hereditary graph property, i.e., a property preserved by taking induced subgraphs, is testable with respect to the edit distance. 
Hoppen, Kohayakawa, Moreira, Sampaio obtained the similar result for permutations, showing, that every hereditary property of permutations is weakly testable, i.e., testable with respect to the rectangular distance, and they asked whether the same is true for a finer measure than the rectangular distance, the Kendall’s tau distance. We resolve this problem in the positive way. The Kendall’s tau distance is considered to correspond to the edit distance of graphs. For two permutations π, σ on N elements, it is defined as the minimum number of swaps of consecutive elements transforming π to σ, divided by N2 . The Kendall’s tau distance of a permutation π from a permutation property P is the minimum distance of π from an element of P. A property P is hereditary if it is closed under taking subpermutations, i.e., if π ∈ P, then any subpermutation of π is in P. We say that a property P is strongly testable through subpermutations there exists a tester A that is presented with a random subpermutation of the input permutation of size bounded by a function of ε independent of the input permutation and such that if the input permutation has the property P, then A accepts with probability at least 1 − ε, and if the input permutation is ε-far from P with respect to the Kendall’s tau distance, then A rejects with probability at least 1 − ε. We have proven that every hereditary permutation property is testable through subpermutations with respect to the Kendall’s tau distance. Unlike the algorithm of Hoppen et al. which is based regularity decompositions of permutations, our algorithm is based on a direct combinatorial argument, which yields better dependance on parameters of the problem. The result is a joint work with Dan Král’ and was presented at SODA’14 (T. Klimošová and D. Král’: Hereditary properties of permutations are strongly testable, in: Proc. SODA’14, SIAM, Philadelphia, PA, 2014 1164–1173). 137 Paraphrase and Textual Entailment Generation in Czech Zuzana Nevěřilová Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic The presentation covers automatic paraphrase and textual entailment generation. We focus on Czech language but most of the concepts and ideas are language independent. A paraphrase, i.e. a sentence with the same meaning, conveys a certain piece of information with new words and new syntactic structures. Textual entailment, i.e. an inference that humans will judge most likely true, can employ real-world knowledge in order to make some implicit information explicit. Paraphrases can also be seen as mutual entailments, i.e. if a sentence s1 entails a sentence s2 and vice versa then s1 and s2 are paraphrases. Paraphrase and textual entailment generation can support natural language processing (NLP) tasks that simulate text understanding, e.g. text summarization (express an idea using different words), plagiarism detection (express somebody’s ideas using different words), or question answering (the answer can be retrieved from a text if the question is reformulated using different words). In addition, paraphrase generation is similar to the task of machine translation except that only one language is processed. We present a new system that generates paraphrases and textual entailments from a given text in the Czech language. First, the process is rule-based, i.e. 
the system analyzes the input text, produces its inner representation, transforms it according to particular transformation rules, and generates new sentences. The domain of the input text is not restricted, therefore the generation process demands huge language resources. Second, the generated sentences are ranked according to a statistical model and only the best ones are output. The models are based on corpus data and on previous annotations. The evaluation whether a paraphrase or textual entailment is correct or not is left to humans. For this purpose we designed an annotation game based on a conversation between a detective (the human player) and his assistant (the system). The result of such annotation is a collection of annotated pairs text– hypothesis. Currently, the system and the game are intended to collect data in the Czech language. However, the idea can be applied for other languages. So far, we have collected 3,321 text–hypothesis pairs. From these pairs, 1,563 were judged correct (47.06 %), 1,238 (37.28 %) were judged incorrect entailments, and 520 (15.66 %) were judged non-sense or unknown. The results are presented on CICLing 2014 and on 17th International Conference on Text, Speech and Dialogue. 138 Minimizing Running Costs in Consumption Systems Petr Novotný Faculty of Informatics, Masaryk University, Brno, Czech Republic A standard approach to optimizing long-run running costs of discrete systems is based on minimizing the mean-payoff, i.e., the long-run average amount of resources (“energy”) consumed per transition. More precisely, a system is modelled as a finite directed graph C, where the set of states S corresponds to configurations, and transitions model the discrete computational steps. Each transition is labeled by a non-negative integer specifying the amount of energy consumed by a given transition. Then, to every run % in C one can assign the associated mean-payoff, which is the limit of average energy consumption per transition computed for longer and longer prefixes of %. A basic algorithmic task is to find a suitable controller for a given system which minimizes the mean-payoff. Recently, the problem has been generalized by requiring that the controller should also achieve a given linear time property ϕ, i.e., the run produced by a controller should satisfy ϕ while minimizing the mean-payoff (Chatterje et al, 2005). This is motivated by the fact that the system is usually required to achieve some functionality, and not just “run” with minimal average costs. However, the above approach inherently assumes that all transitions are always enabled, i.e., the amount of energy consumed by a transition is always available. This is not always realistic. For example, an autonomous robotic device has a battery of finite capacity that has to be recharged periodically, and the total amount of energy consumed between two successive charging cycles is bounded by the capacity. Hence, a controller minimizing the mean-payoff must obey this capacity restriction. In this paper we study the controller synthesis problem for consumption systems with a finite battery capacity, where the task of the controller is to minimize the mean-payoff while satisfying the capacity restriction and preserving the functionality of the system encoded by a given linear-time property. We show that an optimal controller always exists, and it may either need only finite memory or require infinite memory (it is decidable in polynomial time which of the two cases holds). 
Further, we show how to compute an effective description of an optimal controller in polynomial time. Finally, we consider the limit values achievable by larger and larger battery capacity, show that these values are computable in polynomial time, and we also analyse the corresponding rate of convergence. To the best of our knowledge, these are the first results about optimizing the long-run running costs in systems with bounded energy stores. The presentation is based on the paper “Minimizing Running Costs in Consumption Systems” by T. Brázdil, D. Klaška, A. Kučera and P. Novotný, which was accepted for publication in proceedings of CAV 2014. 139 Testing Fault-Tolerance Methodologies in Electro-mechanical Applications ⋆ Jakub Podivinsky and Zdenek Kotasek Faculty of Information Technology, Brno University of Technology, Czech Republic {ipodivinsky,kotasek}@fit.vutbr.cz The aim of the presentation is to introduce a new platform under development for estimating the fault-tolerance quality of electro-mechanical applications based on FPGAs. In several areas, such as aerospace and space applications or automotive safety-critical applications, fault tolerant electro-mechanical (EM) systems are highly desirable. In these systems, the mechanical part is controlled by its electronic controller. Currently, a trend is to add even more electronics into EM systems. We have identified two areas that we would like to focus on in our research of fault-tolerant systems: The first one is that methodologies are validated and demonstrated only on simple electronic circuits implemented in FPGAs. However, in real systems different types of blocks must be protected against faults at the same time and must communicate with each other. Therefore, a general evaluation platform for testing, analysis and comparison of aloneworking or cooperating fault-tolerance methodologies is needed. As for the second area of the research and the main contribution of our work, we feel that it must be possible to check the reactions of the mechanical part of the system if the functionality of its electronic controller is corrupted by faults. In the presentation, a working example of such EM application will be demonstrated that was evaluated using our platform: the mechanical robot and its electronic controller in FPGA. Different building blocks of the electronic robot controller allow to model different effects of faults on the whole mission of the robot (searching a path in a maze). In the experiments, the mechanical robot is simulated in the simulation environment where the effects of faults injected into its controller can be seen. In this way, it is possible to differentiate between the fault that causes the failure of the system and the fault that only decreases the performance. Further extensions of the platform focus on the interconnection of the platform with the functional verification environment working directly in FPGA that allows automation and speed-up of checking the correctness of the system after the injection of faults. The original work was accepted on MEDIAN 2014 [1] and DSD 2014 [2]. References 1. Podivinsky, J., Simkova, M., Kotasek, Z.: Complex Control System for Testing FaultTolerance Methodologies. In: Proceedings of The Third Workshop on MEDIAN. COST (2014). 2. Podivinsky, J., Cekan, O., Simkova, M., Kotasek, Z.: The Evaluation Platform for Testing Fault-Tolerance Methodologies in Electro-mechanical Applications. In: 17th Euromicro Conference on Digital Systems Design. Verona (2014). 
⋆ This work was supported by the following projects: National COST LD12036, project Centrum excellence IT4Innovations (ED1.1.00/02.0070), EU COST Action IC1103 - MEDIAN and BUT project FIT-S-14-2297. 140 A Simple and Scalable Static Analysis for Bound Analysis and Amortized Complexity Analysis Moritz Sinn, Florian Zuleger, and Helmut Veith TU Vienna Automatic methods for computing bounds on the resource consumption of programs are an active area of research [11, 8, 3, 9, 13, 4, 1, 10, 2]. We present the first scalable bound analysis for imperative programs that achieves amortized complexity analysis. Our techniques can be applied for deriving upper bounds on how often loops can be iterated as well as on how often a single or several control locations can be visited in terms of the program input. The majority of earlier work on bound analysis has focused on mathematically intriguing frameworks for bound analysis. These analyses commonly employ general purpose reasoners such as abstract interpreters, software model checkers or computer algebra tools and therefore rely on elaborate heuristics to work in practice. Our work takes an orthogonal approach that complements previous research. We propose a bound analysis based on a simple abstract program model, namely lossy vector addition systems with states. We present a static analysis with four well-defined analysis phases that are executed one after each other: program abstraction, control-flow abstraction, generation of a lexicographic ranking function and bound computation. A main contribution of our work is a thorough experimental evaluation. We compare our approach against recent bounds analysis tools [3, 1, 2, 5], and show that our approach is faster and at the same time achieves better results. Additionally, we demonstrate the scalability of our approach by a comparison against our earlier tool [13], which to the best of our knowledge represents the only tool evaluated on a large publicly available benchmark of C programs. We show that our new approach achieves better results while increasing the performance by an order of magnitude. Moreover, we discuss on this benchmark how our tool achieves amortized complexity analysis in real-world code. Our technical key contribution is a new insight how lexicographic ranking functions can be used for bound analysis. Earlier approaches such as [3] simply count the number of elements in the image of the lexicographic ranking function in order to determine an upper bound on the possible program steps. The same idea implicitly underlies the bound analyses [6, 8, 7, 9, 13, 2, 5]. However, this reasoning misses arithmetic dependencies between the components of the lexicographic ranking function. In contrast, our analysis calculates how much a lexicographic ranking function component is increased when another component is decreased. This enables amortized analysis. The talk presents work [12] published at CAV 2014. 141 M. Sinn, F. Zuleger, and H. Veith References 1. Elvira Albert, Puri Arenas, Samir Genaim, German Puebla, and Damiano Zanardini. Cost analysis of object-oriented bytecode programs. Theor. Comput. Sci., 413(1):142–159, 2012. 2. Elvira Albert, Samir Genaim, and Abu Naser Masud. On the inference of resource usage upper and lower bounds. ACM Trans. Comput. Log., 14(3):22, 2013. 3. Christophe Alias, Alain Darte, Paul Feautrier, and Laure Gonnord. Multidimensional rankings, program termination, and complexity bounds of flowchart programs. In SAS, pages 117–133, 2010. 4. 
Diego Esteban Alonso-Blas and Samir Genaim. On the limits of the classical approach to cost analysis. In SAS, pages 405–421, 2012. 5. Marc Brockschmidt, Fabian Emmes, Stephan Falke, Carsten Fuhs, and Juergen Giesl. Alternating runtime and size complexity analysis of integer programs. In TACAS, page to appear, 2014. 6. Bhargav S. Gulavani and Sumit Gulwani. A numerical abstract domain based on expression abstraction and max operator with application in timing analysis. In CAV, pages 370–384, 2008. 7. Sumit Gulwani, Sagar Jain, and Eric Koskinen. Control-flow refinement and progress invariants for bound analysis. In PLDI, pages 375–385, 2009. 8. Sumit Gulwani, Krishna K. Mehra, and Trishul M. Chilimbi. Speed: precise and efficient static estimation of program computational complexity. In POPL, pages 127–139, 2009. 9. Sumit Gulwani and Florian Zuleger. The reachability-bound problem. In PLDI, pages 292–304, 2010. 10. Jan Hoffmann, Klaus Aehlig, and Martin Hofmann. Multivariate amortized resource analysis. ACM Trans. Program. Lang. Syst., 34(3):14, 2012. 11. Martin Hofmann and Steffen Jost. Static prediction of heap space usage for firstorder functional programs. In POPL, pages 185–197, 2003. 12. Moritz Sinn, Florian Zuleger, and Helmut Veith. A simple and scalable static analysis for bound analysis and amortized complexity analysis. In Armin Biere and Roderick Bloem, editors, CAV, volume 8559 of Lecture Notes in Computer Science, pages 745–761. Springer, 2014. 13. Florian Zuleger, Sumit Gulwani, Moritz Sinn, and Helmut Veith. Bound analysis of imperative programs with the size-change abstraction. In SAS, pages 280–297, 2011. 142 Optimal Temporal Logic Control for Deterministic Transition Systems with Probabilistic Penalties? Mária Svoreňová1 , Ivana Černá1 , and Calin Belta2 1 2 Faculty of Informatics, Masaryk University, Brno 60200, Czech Republic [email protected], [email protected] Dep. of Mechanical Engineering, Boston University, Boston, MA 02215, USA [email protected] While optimal control theory is a mature discipline, control of systems from a temporal logic specification has gained considerable attention in control literature only recently. The combination of the two areas, where the goal is to optimize the behavior of a system subject to correctness constraints, is a largely open area with a potentially high impact in applications. In this work, we employ formal methods such as automata-based model checking and games to solve an optimal temporal logic control problem motivated by robotic applications. As an example, consider a mobile robot involved in a complex mission under tight fuel and time constraints. We assume such a system being modeled as a weighted deterministic transition system required to satisfy a Linear Temporal Logic (LTL) formula over its labels. Every state of the system is associated with a time-varying, locally sensed penalty modeled as a Markov chain (MC) that can be used to encode environmental phenomena with known statistics, such as energy or time demands for the mobile robot that change according to traffic load. Motivated by persistent surveillance robotic missions, our goal in this work is to minimize the expected average cumulative penalty incurred between consecutive satisfactions of a desired property, while at the same time satisfying an additional temporal logic constraint. We provide two solutions to this problem. 
The abstract is based on the following published work and its extension:

Svoreňová, M., Černá, I., Belta, C.: Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Constraints. American Control Conference (ACC), 2013, 4399–4404.
Svoreňová, M., Černá, I., Belta, C.: Optimal Temporal Logic Control for Deterministic Transition Systems with Probabilistic Penalties. IEEE Transactions on Automatic Control, accepted.

⋆ The work was partially supported at MU by grants GAP202/11/0312, LH11065, and at BU by ONR grants MURI N00014-09-1051, MURI N00014-10-10952 and by NSF grant CNS-1035588.

Understanding the Importance of Interactions among Job Scheduling Policies⋆

Šimon Tóth¹ and Dalibor Klusáček²
¹ Faculty of Informatics, Masaryk University, Brno, Czech Republic
² CESNET a.l.e., Prague, Czech Republic
[email protected], [email protected]

Many studies in the past two decades have focused on the problem of efficient job scheduling in large computational systems. While many new scheduling algorithms have been proposed, mainstream resource management systems and schedulers still use only a limited set of scheduling policies. For example, the core of the system is generally based on the simple First Come First Served (FCFS) approach, while backfilling (a trivial optimization of FCFS to increase utilization) is typically the most advanced option available. Since backfilling was proposed as far back as 1995, there is clearly some disconnect between the research community and system administrators concerning “what is really important”.

In this work [3] — recently presented at the Euro-Par conference — we show that the problem of operating a production scheduler is far more complex than just choosing a proper scheduling algorithm. Using our experience from the Czech National Grid Infrastructure MetaCentrum, we explain several additional challenges that appear when searching for a functional solution. These problems are related to the fact that real systems must meet far more complicated requirements than those typically considered in classical research papers [1]. In fact, production systems need to balance various policies that are set in place to satisfy both resource providers and users. Among them — according to our findings — the most important policies are often those that define how queues are ordered and prioritized and what their corresponding limits are. Queue limits define, e.g., the number of CPUs that can be used at a given moment by a given class of jobs. The major problem in this area is that although many works address these separate policies, e.g., fairshare for fair resource allocation, complex interactions between policies are not properly discussed in the literature. In our work [3] we describe how to approach these interactions when developing site-specific policies. Notably, we describe how (priority) queues interact with scheduling algorithms, fairshare and anti-starvation mechanisms.

Importantly, we considered a real-life problem in which we were searching for a new scheduling configuration for MetaCentrum. To achieve this, we used detailed, high-resolution simulations in the advanced job scheduling simulator Alea [2]. One of the most important findings was that a minor, “conservative” modification of the existing system configuration — which was initially considered safe — may produce unforeseen chain reactions in the system, leading to much poorer performance. The newly developed configuration for MetaCentrum, which has significantly increased its overall performance, is a rather complex modification of the previous setup. The whole queue configuration has been modified, introducing new queues with new limits. At the same time, fairness was given more weight, which required modifications of the queue ordering scheme. Finally, the overall throughput was increased by optimizing the existing job anti-starvation mechanism. The newly developed configuration has been used in production within MetaCentrum’s TORQUE resource manager since January 2014 without any major problems. In fact, the number of utilized CPU hours has increased by 23% compared to the same period before the new solution was deployed, and the number of processed jobs has increased by 87%. Importantly, even with the higher throughput, job wait times remained reasonable; in fact, they decreased significantly, as shown in Fig. 1 (more jobs now wait for a shorter time than before).

Fig. 1. The comparison of waiting times for the second half of 2013 (old configuration) and the first half of 2014 (new configuration).
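As a toy illustration of why such policies cannot be tuned in isolation, the following C sketch is our own simplification; it is neither MetaCentrum’s actual configuration nor code from the Alea simulator [2], and all names in it are hypothetical. It shows a backfill eligibility test that must respect free capacity, a per-queue CPU limit, and the reservation made for the first blocked job, so that changing any one of these policies changes the effect of the others.

    #include <stdbool.h>

    /* Simplified model: each job belongs to a queue, and a queue limit
       caps the number of CPUs that queue may hold at any one time. */
    typedef struct {
        int  cpus;      /* CPUs requested by the job             */
        int  queue;     /* index of the queue the job belongs to */
        long runtime;   /* user-estimated runtime in seconds     */
    } job_t;

    typedef struct {
        int cpu_limit;   /* max CPUs this queue may use at once  */
        int cpus_in_use; /* CPUs currently held by this queue    */
    } queue_t;

    /* A short job may be started out of FCFS order only if it fits into
       the free CPUs, does not break its queue limit, and does not delay
       the reserved start time of the first waiting (blocked) job. */
    bool can_backfill(const job_t *job, const queue_t *queues,
                      int free_cpus, long now, long reservation_start)
    {
        const queue_t *q = &queues[job->queue];

        if (job->cpus > free_cpus)
            return false;                 /* not enough free CPUs         */
        if (q->cpus_in_use + job->cpus > q->cpu_limit)
            return false;                 /* would exceed the queue limit */
        if (now + job->runtime > reservation_start)
            return false;                 /* would delay the blocked job  */
        return true;
    }

Even in this toy form, tightening one queue’s cpu_limit changes which jobs are eligible for backfilling, which in turn changes utilization and the wait times seen by every other queue; this is the kind of cross-policy interaction the paper studies on real MetaCentrum workloads.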
References

1. Eitan Frachtenberg and Dror G. Feitelson. Pitfalls in parallel job scheduling evaluation. In Dror G. Feitelson, Eitan Frachtenberg, Larry Rudolph, and Uwe Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, volume 3834 of LNCS, pages 257–282. Springer Verlag, 2005.
2. Dalibor Klusáček and Hana Rudová. Alea 2 – job scheduling simulator. In 3rd International ICST Conference on Simulation Tools and Techniques. ICST, 2010.
3. Dalibor Klusáček and Šimon Tóth. On interactions among scheduling policies: Finding efficient queue setup using high-resolution simulations. In Fernando Silva, Inês Dutra, and Vítor Santos Costa, editors, Euro-Par 2014, volume 8632 of LNCS, pages 138–149. Springer, 2014.

⋆ This work was kindly supported by the LM2010005 project funded by the Ministry of Education, Youth, and Sports of the Czech Republic and by grant No. P202/12/0306 funded by the Grant Agency of the Czech Republic.

Author Index

Avros, R., 15
Barnat, J., 28
Belta, C., 143
Benáček, P., 40
Beneš, N., 28
Bezděk, P., 28
Blažek, R. B., 40
Borecký, J., 127
Brázdil, T., 128
Čejka, T., 40
Černá, I., 28, 143
Chatterjee, K., 128, 129
Chmelík, M., 128, 129
Daca, P., 129
Dvořák, M., 52
Esparza, J., 130
Forejt, V., 128
Gajarský, J., 131
Holík, L., 132
Hrubá, V., 15, 133
Jakubíček, M., 135
Jančík, P., 134
Kekely, L., 40
Kilgarriff, A., 135
Klimošová, T., 137
Klusáček, D., 144
Kofroň, J., 134
Kolář, D., 63
Kořenek, J., 52, 77
Košař, V., 77
Kotásek, Z., 140
Kovář, V., 135
Král’, D., 137
Křena, B., 15, 133
Křetínský, J., 128, 130
Kubátová, H., 40, 127
Kwiatkowska, M., 128
Lengál, O., 132
Letko, Z., 133
Matula, P., 63
Matyska, L., 101
Meduna, A., 89
Nevěřilová, Z., 138
Novotný, P., 139
Parker, D., 128
Pazúriková, J., 101
Pluháčková, H., 15, 133
Podivínský, J., 140
Rogalewicz, A., 132
Rollini, S. F., 134
Rychlý, P., 135
Sharygina, N., 134
Šimáček, J., 132
Sinn, M., 141
Soukup, O., 89
Suchomel, V., 135
Svoreňová, M., 143
Tóth, Š., 113, 144
Ujma, M., 128
Ur, S., 15
Veith, H., 141
Vít, P., 127
Vojnar, T., 15, 132, 133
Volkovich, Z., 15
Závodník, T., 52
Zuleger, F., 141

Organisers

Faculty of Informatics, Masaryk University, Botanická 68a, Brno, Czech Republic, http://www.fi.muni.cz
Faculty of Information Technology, Brno University of Technology, Božetěchova 2, Brno, Czech Republic, http://www.fit.vutbr.cz

Workshop Sponsors

Petr Hliněný, Zdeněk Dvořák, Jiří Jaroš, Jan Kofroň, Jan Kořenek, Petr Matula, Karel Pala (Eds.)
MEMICS 2014: Ninth Doctoral Workshop on Mathematical and Engineering Methods in Computer Science
Printing and publishing: NOVPRESS s.r.o., nám. Republiky 725/15, 614 00 Brno
Edition: first
Year of publication: 2014
Typesetting: camera ready by paper authors and PC members, data conversion and design by Jaroslav Rozman
Cover design: Tomáš Staudek
ISBN 978-80-214-5022-6