Skip to content

Latest commit

 

History

History
93 lines (57 loc) · 4.35 KB

README.md

File metadata and controls

93 lines (57 loc) · 4.35 KB

Aho & Corasick - C++ implementation

License Codacy Badge c++ version

GitHub last commit

Introduction

This code is a C++ implementation of Aho-Corasick patern matching algorithm.

In their paper published in 1975, Alfred V. Aho and Margaret J. Corasick describe this algorithm like this:

[Aho-Corasick is an] efficient algorithm to locate all occurences of any of finite number of keywords in a string of text.

Nowaday, this algorithm remains the best one in termes of memory usage and complexity to search lot of patterns simultaneously in one pass. The complexity is linear in the length of the patterns plus the length of the searched text plus the number of output matches.

The famous example is GNU Grep that uses Aho-Corasick algorithm to match multiple fixed patterns (cf. GNU Grep Performance page).

Why another Aho & Corasick implementation?

Reading this page means you had made some few searches about Aho & Corasick algorithm, and you would certainly agree with that there are a lot of Aho & Corasick algorithm implementations in many languages.

The main aim of this code is to propose a light, simple and easy to read Aho & Corasick algorithm implementation using a lexicographic tree.

Using a lexicographic tree is a such of naïve approch, but that allows to:

  • remain close to what we figure out when we are thinking about this algorithm,
  • take avantage of the tree structure to implement some algorithm's parts in recursive way (e.g. failure nodes computation or goto/output fonction)

In addition, with this code you will find:

  • An easy to use, adapt and embed object-oriented implementation.
  • An uncommon recursive implementation of calculating failure nodes (definitely needs improvement).
  • A function to convert tree to a DataViz graph file. This file can be converted in picture, that is helpful to understand how the aho-coracick algorithm works.

Usage

This implementation is header only. To compile the code, your compiler must support C++17 standard features.

The following will create a lexicographic tree with a couple of keywords.

 aho_corasick::LexicographicTree lt;
 
 lt.addKeyword( "he" );
 lt.addKeyword( "she" );
 lt.addKeyword( "his" );
 lt.addKeyword( "hers" );

 lt.finalize();

Once the tree structure is finalized, it is ready to be used to find the keywords: just give the characters one by one to the processAndGetOutput methode like this:

std::string text("ushers");

for(const auto& current_char : text) {
  for( const auto* str : lt.processAndGetOutput(current_char)) {
      std::cout << *str << std::endl;
  }

Each call to processAndGetOutput methode returns a "set" of keywords. This set may or may not be empty, depending on whether the last current_char reached none, one or several keywords.

How to visualize lexicographic tree

The methode

const std::string DisplayTools::getGraphVizDescription( const LexicoNode * lnode, bool graphSuffix = true, bool graphWord = true  )

can be used to transform the tree in to DataViz file that would be used later to draw the tree.

Then use GraphViz dot command like this:

> dot -Tpng graph1.dot > graph1.png

The result is:

Graph2

To view GraphViz diagrams online, visit this project and try to use GraphvizOnline online here.

Further reading