SimString

Documentation for SimString.

A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching. This package is particularly useful for natural language processing tasks that require retrieving strings/texts from very large corpora (large amounts of text). Currently, it supports both character-based and word-based n-gram feature generation, and there are plans to open the package up to custom user-defined feature generation methods.

CPMerge Paper: https://aclanthology.org/C10-1096/
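
To make the idea of character n-gram features concrete, here is a minimal, self-contained sketch. The helper `char_ngrams` is hypothetical and is not the package's internal implementation. It assumes the string is padded on both sides with `n - 1` copies of the padding string, and that repeated n-grams are numbered so they stay distinct; both assumptions are consistent with the counts shown in the Usage section but are not confirmed by the source.

```julia
# Hypothetical helper (not part of SimString): character n-gram features.
function char_ngrams(s::AbstractString, n::Int, pad::AbstractString)
    # Assumption: pad both ends with n-1 copies of `pad`.
    chars = collect(repeat(pad, n - 1) * s * repeat(pad, n - 1))
    seen = Dict{String,Int}()
    features = Tuple{String,Int}[]
    for i in 1:length(chars)-n+1
        gram = String(chars[i:i+n-1])
        # Number repeated n-grams so duplicates remain distinct features.
        seen[gram] = get(seen, gram, 0) + 1
        push!(features, (gram, seen[gram]))
    end
    return features
end

sizes = [length(char_ngrams(s, 2, " ")) for s in ["foo", "bar", "fooo"]]
println(sizes)       # [4, 4, 5]
println(sum(sizes))  # 13
```

Working on collected characters rather than byte indices keeps the sketch safe for multi-byte Unicode input.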

Features

  • [X] Fast algorithm for string matching
  • [X] 100% exact retrieval
  • [X] Unicode support
  • [X] Support for building databases directly from text files
  • [X] Mecab-based tokenizer support for Japanese
  • [ ] Support for persistent databases like MongoDB

Supported String Similarity Measures

  • [X] Dice coefficient
  • [X] Jaccard coefficient
  • [X] Cosine coefficient
  • [X] Overlap coefficient
  • [X] Exact match
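
As a rough illustration of the measures listed above, the sketch below computes each one from two feature sets using the standard set-based definitions. The function names are hypothetical and are not the package's types (`Dice()` and friends); the `exact` definition in particular is an assumption about what "exact match" means here.

```julia
# Standard set-based definitions; x and y are collections of unique features.
dice(x, y)    = 2 * length(intersect(x, y)) / (length(x) + length(y))
jaccard(x, y) = length(intersect(x, y)) / length(union(x, y))
cosine(x, y)  = length(intersect(x, y)) / sqrt(length(x) * length(y))
overlap(x, y) = length(intersect(x, y)) / min(length(x), length(y))
# Assumption: "exact match" scores 1.0 only for identical feature sets.
exact(x, y)   = Set(x) == Set(y) ? 1.0 : 0.0

# Bigram features of "foo" and "fooo", with duplicates numbered.
x = [(" f", 1), ("fo", 1), ("oo", 1), ("o ", 1)]
y = [(" f", 1), ("fo", 1), ("oo", 1), ("oo", 2), ("o ", 1)]

println(dice(x, y))     # 0.8888888888888888
println(jaccard(x, y))  # 0.8
println(overlap(x, y))  # 1.0
```

The Dice value agrees with the score reported for "fooo" in the Usage section.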

Installation

You can grab the latest stable version of this package from the Julia registries by running the command below.

NB: Don't forget to enter Julia's package manager mode with ] first.

pkg> add SimString

The brave few can try the current experimental features by adding the master branch to their development environment, after entering the package manager with ]:

pkg> add SimString#master

You are good to go with bleeding edge features and breakages!

To revert to a stable version, you can simply run:

pkg> free SimString

Usage

using SimString

# Initialise the database and add some strings
db = DictDB(CharacterNGrams(2, " "));
# OR: db = DictDB(WordNGrams(2, " ")); for word-based n-grams
# OR: db = DictDB(MecabNGrams(2, " ", Mecab())); for Japanese n-grams. Requires Mecab to be installed
push!(db, "foo");
push!(db, "bar");
push!(db, "fooo");

# A convenient approach is to use an array of strings for multiple entries: `append!(db, ["foo", "bar", "fooo"]);`

# OR: Build the database from a text file: `append!(db, "YOUR_FILE_NAME.txt");`

# Retrieve the closest match(es)
res = search(Dice(), db, "foo"; α=0.8, ranked=true)
# 2-element Vector{Tuple{String, Float64}}:
#  ("foo", 1.0)
#  ("fooo", 0.8888888888888888)

# Describe a working database collection
desc = describe_collection(db)
# (total_collection = 3, avg_size_ngrams = 4.5, total_ngrams = 13)

TODO: Benchmarks

Release History

  • 0.1.0 Initial release.
  • 0.2.0 Added Unicode support.
  • 0.3.0 Added Japanese support via Mecab.

SimString.DictDB - Method
DictDB(x::CharacterNGrams)

Initialize a dict DB with additional containers and Metadata for CharacterNGrams

Arguments

  • x: CharacterNGrams object

Example

db = DictDB(CharacterNGrams(2, " "))

Returns

  • DictDB: A DictDB object with additional containers and Metadata for CharacterNGrams

SimString.DictDB - Method
DictDB(x::MecabNGrams)

Initialize a dict DB with additional containers and Metadata for MecabNGrams

Arguments

  • x: MecabNGrams object

Example

db = DictDB(MecabNGrams(2, " ", Mecab()))

Returns

  • DictDB: A DictDB object with additional containers and Metadata for MecabNGrams

SimString.DictDB - Method
DictDB(x::WordNGrams)

Initialize a dict DB with additional containers and Metadata for WordNGrams

Arguments

  • x: WordNGrams object

Example

db = DictDB(WordNGrams(2, " ", " "))

Returns

  • DictDB: A DictDB object with additional containers and Metadata for WordNGrams

Base.append! - Method
append!(db::AbstractSimStringDB, file::AbstractString)

Add items in bulk from a file to a new or existing collection of strings, using the custom AbstractSimStringDB type.

Arguments:

  • db: AbstractSimStringDB - The database to add the items to
  • file: AbstractString - Path to the file to read from

Example:

db = DictDB(CharacterNGrams(2, " "));
append!(db, "./data/test.txt")

Returns:

  • db: AbstractSimStringDB - The database with the items added
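
The file format is not spelled out above. As a hedged sketch, the snippet below assumes one entry per line and shows how such a file could be turned into a vector of strings, which could then be passed to the vector form `append!(db, entries)`. The temporary file and the blank-line filtering are illustrative assumptions, not the package's documented behaviour.

```julia
# Assumption: the text file holds one entry per line.
path, io = mktemp()
write(io, "foo\nbar\n\nfooo\n")
close(io)

# Read the file line by line, trimming whitespace and skipping blank lines.
entries = [String(strip(line)) for line in eachline(path) if !isempty(strip(line))]
println(entries)  # ["foo", "bar", "fooo"]

# With SimString loaded, these could be added via: append!(db, entries)
rm(path)
```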

Base.append! - Method
append!(db::AbstractSimStringDB, str::Vector)

Add bulk items to a new or existing collection of strings using the custom AbstractSimStringDB type.

Arguments:

  • db: AbstractSimStringDB - The database to add the strings to
  • str: Vector of AbstractString - Vector/Array of strings to add to the database

Example:

db = DictDB(CharacterNGrams(2, " "));
append!(db, ["foo", "foo", "fooo"]);

Returns:

  • db: AbstractSimStringDB - The database with the new strings added

Base.push! - Method
push!(db::AbstractSimStringDB, str::AbstractString)

Add a new item to a new or existing collection of strings using the custom AbstractSimStringDB type.

Arguments:

  • db: AbstractSimStringDB - The collection of strings to add to
  • str: AbstractString - The string to add to the collection

Example:

db = DictDB(CharacterNGrams(2, " "));
push!(db, "foo")
push!(db, "bar")
push!(db, "fooo")

Returns:

  • db: AbstractSimStringDB - The collection of strings with the new string added

SimString.describe_collection - Method
describe_collection(db::DictDB)

Basic summary stats for the DB

Arguments

  • db: DictDB object

Example

db = DictDB(CharacterNGrams(2, " "));
append!(db, ["foo", "bar", "fooo"]);
describe_collection(db)
# (total_collection = 3, avg_size_ngrams = 4.5, total_ngrams = 13)

Returns

  • NamedTuple: Summary stats for the DB

SimString.search! - Method

Search for strings in a custom DictDB string collection using the SimString algorithm and a similarity measure.


SimString.search - Method
search(measure::AbstractSimilarityMeasure, db_collection::AbstractSimStringDB, query::AbstractString;
    α=0.7, ranked=true)

Search for strings in a string collection using the SimString algorithm and a similarity measure.

Arguments:

  • measure::AbstractSimilarityMeasure - The similarity measure to use.
  • db_collection::AbstractSimStringDB - The database collection to search.
  • query::AbstractString - The query string to search for.
  • α::Float64 - The similarity threshold α for the SimString algorithm; only strings scoring at least α are returned (default: 0.7).
  • ranked::Bool - Whether to return the results in ranked order (default: true).
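
To show how α is used, the sketch below follows the conditions given for the Dice coefficient in the CPMerge paper: for a query with `qsize` features, candidates need a feature count within a size range, and must share at least a minimum number of features with the query. The function names and the small tolerance `tol` are illustrative assumptions and are not part of the package API.

```julia
# Dice bounds as in the CPMerge paper; `tol` guards against floating-point
# rounding when the exact value lands on an integer boundary.
dice_min_size(qsize, α; tol=1e-9) = ceil(Int, α / (2 - α) * qsize - tol)
dice_max_size(qsize, α; tol=1e-9) = floor(Int, (2 - α) / α * qsize + tol)
dice_min_overlap(qsize, ysize, α; tol=1e-9) =
    ceil(Int, 0.5 * α * (qsize + ysize) - tol)

# Query "foo" has 4 bigram features; α = 0.8 as in the example.
println(dice_min_size(4, 0.8))        # 3
println(dice_max_size(4, 0.8))        # 6
println(dice_min_overlap(4, 5, 0.8))  # 4
```

With these numbers, a 5-feature string such as "fooo" must share at least 4 features with "foo" to reach a Dice score of 0.8, which it does.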

Example

db = DictDB(CharacterNGrams(2, " "));
append!(db, ["foo", "bar", "fooo"]);

search(Dice(), db, "foo"; α=0.8, ranked=true)
# 2-element Vector{Tuple{String, Float64}}:
#  ("foo", 1.0)
#  ("fooo", 0.8888888888888888)

Returns

  • A Vector of results, where each element is a Tuple of the form (string, similarity measure score).
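
For intuition about what the call returns, here is a brute-force reference sketch that scores every stored string with the Dice coefficient and keeps those at or above α. It is deliberately not CPMerge (it does a linear scan with no index), and `features` and `naive_search` are hypothetical names; the padding and duplicate-numbering scheme is an assumption chosen to match the scores shown in the example.

```julia
# Hypothetical helper: padded character n-grams, duplicates numbered.
function features(s::AbstractString; n::Int=2, pad::AbstractString=" ")
    chars = collect(repeat(pad, n - 1) * s * repeat(pad, n - 1))
    counts = Dict{String,Int}()
    out = Tuple{String,Int}[]
    for i in 1:length(chars)-n+1
        gram = String(chars[i:i+n-1])
        counts[gram] = get(counts, gram, 0) + 1
        push!(out, (gram, counts[gram]))
    end
    return out
end

# Linear scan over all strings; keeps those with Dice score >= α.
function naive_search(strings, query; α=0.7, ranked=true)
    q = features(query)
    results = Tuple{String,Float64}[]
    for s in strings
        f = features(s)
        score = 2 * length(intersect(q, f)) / (length(q) + length(f))
        score >= α && push!(results, (s, score))
    end
    # Highest score first when ranking is requested.
    ranked && sort!(results; by=last, rev=true)
    return results
end

res = naive_search(["foo", "bar", "fooo"], "foo"; α=0.8)
println(res)
```

A scan like this is only practical for small collections; the point of CPMerge is to avoid scoring every string.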