SimString
Documentation for SimString.
A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching. This package is be particulary useful for natural language processing tasks which require the retrieval of strings/texts from a very large corpora (big amounts of texts). Currently, this package supports both Character and Word based N-grams feature generations and there are plans to open the package up for custom user defined feature generation methods.
CPMerge Paper: https://aclanthology.org/C10-1096/
Features
- [X] Fast algorithm for string matching
- [X] 100% exact retrieval
- [X] Support for unicodes
- [X] Support for building databases directly from text files
- [X] Mecab-based tokenizer support for Japanese
- [ ] Support for persistent databases like MongoDB
Suported String Similarity Measures
- [X] Dice coefficient
- [X] Jaccard coefficient
- [X] Cosine coefficient
- [X] Overlap coefficient
- [X] Exact match
Installation
You can grab the latest stable version of this package from Julia registries by simply running;
NB: Don't forget to invoke Julia's package manager with ]
pkg> add SimString
The few (and selected) brave ones can simply grab the current experimental features by simply adding the master branch to your development environment after invoking the package manager with ]
:
pkg> add SimString#master
You are good to go with bleeding edge features and breakages!
To revert to a stable version, you can simply run:
pkg> free SimString
Usage
using SimString
# Inilisate database and some strings
db = DictDB(CharacterNGrams(2, " "));
# OR: db = DictDB(WordNGrams(2, " ")); for word based ngrams
# OR db = DictDB(MecabNGrams(2, " ", Mecab())) for Japanese ngrams. Requires installation of Mecab
push!(db, "foo");
push!(db, "bar");
push!(db, "fooo");
# Convinient approach is to use an array of strings for multiple entries: `append!(db, ["foo", "bar", "fooo"]);`
# OR: Build database from text files: `append!(db, "YOUR_FILE_NAME.txt");
# Retrieve the closest match(es)
res = search(Dice(), db, "foo"; α=0.8, ranked=true)
# 2-element Vector{Tuple{String, Float64}}:
# ("foo", 1.0)
# ("fooo", 0.8888888888888888)
# Describe a working database collection
desc = describe_collection(db)
# (total_collection = 3, avg_size_ngrams = 4.5, total_ngrams = 13)
TODO: Benchmarks
Release History
- 0.1.0 Initial release.
- 0.2.0 Added support for unicodes
- 0.3.0 Added Japanese support via Mecab
SimString.AbstractSimStringDB
SimString.AbstractSimilarityMeasure
SimString.CharacterNGrams
SimString.Cosine
SimString.Dice
SimString.DictDB
SimString.DictDB
SimString.DictDB
SimString.DictDB
SimString.ExactMatch
SimString.FeatureExtractor
SimString.Jaccard
SimString.MecabNGrams
SimString.Overlap
SimString.WordNGrams
Base.append!
Base.append!
Base.push!
Base.show
SimString.cummulative_ngram_count
SimString.describe_collection
SimString.extract_features
SimString.extract_features
SimString.extract_features
SimString.generate_base_dict_db
SimString.init_ngrams
SimString.init_ngrams
SimString.lookup_feature_set_by_size_feature
SimString.make_zero_index_circular_array
SimString.maximum_feature_size
SimString.maximum_feature_size
SimString.maximum_feature_size
SimString.maximum_feature_size
SimString.maximum_feature_size
SimString.minimum_feature_size
SimString.minimum_feature_size
SimString.minimum_feature_size
SimString.minimum_feature_size
SimString.minimum_feature_size
SimString.minimum_overlap
SimString.minimum_overlap
SimString.minimum_overlap
SimString.minimum_overlap
SimString.minimum_overlap
SimString.n_grams
SimString.overlap_join
SimString.pad_string
SimString.pad_string
SimString.rank_search_results
SimString.search
SimString.search!
SimString.similarity_score
SimString.similarity_score
SimString.similarity_score
SimString.similarity_score
SimString.similarity_score
SimString.tokenize
SimString.AbstractSimStringDB
— TypeBase type for all custom db collections.
SimString.AbstractSimilarityMeasure
— TypeAbstract base type for all string similarity measures.
SimString.CharacterNGrams
— TypeFeature extraction on character-level ngrams
SimString.Cosine
— TypeCosine Similarity Measure.
SimString.Dice
— TypeDice Similarity Measure.
SimString.DictDB
— TypeCustom DB collection for storing SimString data using base Dictionary Dict
SimString.DictDB
— MethodDictDB(x::CharacterNGrams)
Initialize a dict DB with additional containers and Metadata for CharacterNGrams
Arguments
x
: CharacterNGrams object
Example
db = DictDB(CharacterNGrams(2, " "))
Returns
DictDB
: A DictDB object with additional containers and Metadata for CharacterNGrams
SimString.DictDB
— MethodDictDB(x::MecabNGrams)
Initialize a dict DB with additional containers and Metadata for MecabNGrams
Arguments
x
: MecabNGrams object
Example
db = DictDB(MecabNGrams(2, " ", Mecab()))
Returns
DictDB
: A DictDB object with additional containers and Metadata for MecabNGrams
SimString.DictDB
— MethodDictDB(x::WordNGrams)
Initialize a dict DB with additional containers and Metadata for WordNGrams
Arguments
x
: WordNGrams object
Example
db = DictDB(WordNGrams(2, " ", " "))
Returns
DictDB
: A DictDB object with additional containers and Metadata for WordNGrams
SimString.ExactMatch
— TypeExact Match Similarity Measure.
SimString.FeatureExtractor
— TypeAbstract type for feature extraction structs
SimString.Jaccard
— TypeJaccard Similarity Measure.
SimString.MecabNGrams
— TypeFeature extraction based on mecab word-level ngrams
SimString.Overlap
— TypeOverlap Similarity Measure.
SimString.WordNGrams
— TypeFeature extraction based on word-level ngrams
Base.append!
— Methodappend!(db::AbstractSimStringDB, file::AbstractString)
Add bulk items to a new or existing collection of strings using from a file using the custom AbstractSimStringDB type.
Arguments:
db
`: AbstractSimStringDB - The database to add the items tofile
: AbstractString - Path to the file to read from
Example:
db = DictDB(CharacterNGrams(2, " "));
append!(db, "./data/test.txt")
Returns:
db
: AbstractSimStringDB - The database with the items added
Base.append!
— Methodappend!(db::AbstractSimStringDB, str::Vector)
Add bulk items to a new or existing collection of strings using the custom AbstractSimStringDB type.
Arguments:
- db: AbstractSimStringDB - The database to add the strings to
- str: Vector of AbstractString - Vector/Array of strings to add to the database
Example:
db = DictDB(CharacterNGrams(2, " "));
append!(db, ["foo", "foo", "fooo"]);
Returns:
- db: AbstractSimStringDB - The database with the new strings added
Base.push!
— Methodpush!(db::AbstractSimStringDB, str::AbstractString)
Add a new item to a new or existing collection of strings using the custom AbstractSimStringDB type.
Arguments:
db
: AbstractSimStringDB - The collection of strings to add tostr
: AbstractString - The string to add to the collection
Example:
julia db = DictDB(CharacterNGrams(2, " ")); push!(db, "foo") push!(db, "bar") push!(db, "fooo")
`
Returns:
db
: AbstractSimStringDB - The collection of strings with the new string added
Base.show
— MethodPretty print summary stats for the DB
SimString.cummulative_ngram_count
— MethodInternal function to count and pad generated character-level ngrams (including duplicates)
SimString.describe_collection
— Methoddescribe_collection(db::DictDB)
Basic summary stats for the DB
Arguments
db
: DictDB object
Example
db = DictDB(CharacterNGrams(2, " "));
append!(db, ["foo", "bar", "fooo"]);
describe_collection(db)
(total_collection = 3, avg_size_ngrams = 4.5, total_ngrams = 13)
# Returns
* NamedTuples: Summary stats for the DB
SimString.extract_features
— MethodInternal function to generate character-level ngrams features from an AbstractString
SimString.extract_features
— MethodInternal function to generate Mecab word-level ngrams features from an AbstractString
SimString.extract_features
— MethodInternal function to generate word-level ngrams features from an AbstractString
SimString.generate_base_dict_db
— MethodInternal function for generating a base DictDB object for WordNGrams and MecabNGrams
SimString.init_ngrams
— MethodInternal function to generate intial uncounted word ngrams on a word level
SimString.init_ngrams
— MethodInternal function to generate intial uncounted ngrams on a character level
SimString.lookup_feature_set_by_size_feature
— MethodInternal function to lookup feature sets by size and feature
SimString.make_zero_index_circular_array
— MethodInternal function to make zero indexed circular arrays
SimString.maximum_feature_size
— MethodCalculate maximum feature size for Cosine similarity measure.
SimString.maximum_feature_size
— MethodCalculate maximum feature size for Dice similarity measure.
SimString.maximum_feature_size
— MethodCalculate maximum feature size for ExactMatch similarity measure.
SimString.maximum_feature_size
— MethodCalculate maximum feature size for Jaccard similarity measure.
SimString.maximum_feature_size
— MethodCalculate maximum feature size for Overlap similarity measure.
SimString.minimum_feature_size
— MethodCalculate minimum feature size for Cosine similarity measure.
SimString.minimum_feature_size
— MethodCalculate minimum feature size for Dice similarity measure.
SimString.minimum_feature_size
— MethodCalculate minimum feature size for ExactMatch similarity measure.
SimString.minimum_feature_size
— MethodCalculate minimum feature size for Jaccard similarity measure.
SimString.minimum_feature_size
— MethodCalculate minimum feature size for Overlap similarity measure.
SimString.minimum_overlap
— MethodCalculate the minimum overlap (τ) for a query size, candidate size, and α using Cosine similarity measure.
SimString.minimum_overlap
— MethodCalculate the minimum overlap (τ) for a query size, candidate size, and α using Dice similarity measure.
SimString.minimum_overlap
— MethodCalculate the minimum overlap (τ) for a query size, candidate size, and α using ExactMatch similarity measure.
SimString.minimum_overlap
— MethodCalculate the minimum overlap (τ) for a query size, candidate size, and α using Jaccard similarity measure.
SimString.minimum_overlap
— MethodCalculate the minimum overlap (τ) for a query size, candidate size, and α using Overlap similarity measure.
SimString.n_grams
— MethodInternal function to create counted ngrams
SimString.overlap_join
— MethodInternal function which performs the overlap join
SimString.pad_string
— MethodInternal function to pad AbstractString types with specified padder
SimString.pad_string
— MethodInternal function to pad AbstractVector types with specified padder
SimString.rank_search_results
— MethodInternal function which ranks the results of a search using the specified similarity measure.
SimString.search!
— MethodSearch for strings in custom DictDB string collection using the SimString algorithm and a similarity measure.
SimString.search
— Methodsearch(measure::AbstractSimilarityMeasure, db_collection::AbstractSimStringDB, query::AbstractString;
α=0.7, ranked=true)
Search for strings in a string collection using the SimString algorithm and a similarity measure.
Arguments:
measure
::AbstractSimilarityMeasure - The similarity measure to use.db_collection
::AbstractSimStringDB - The database collection to search.query
::AbstractString - The query string to search for.α
::float - The α parameter for the SimString algorithm.ranked
::Boolean - Whether to return the results in ranked order.
Example
db = DictDB(CharacterNGrams(2, " "));
append!(db, ["foo", "bar", "fooo"]);
search(Dice(), db, "foo"; α=0.8, ranked=true)
# 2-element Vector{Tuple{String, Float64}}:
# ("foo", 1.0)
# ("fooo", 0.8888888888888888)
Returns
- A Vector of results, where each element is a Tuple of the form (
string
,similarity measure score
).
SimString.similarity_score
— MethodCalculate similarity score between X and Y using Cosine similarity measure.
SimString.similarity_score
— MethodCalculate similarity score between X and Y using Dice similarity measure.
SimString.similarity_score
— MethodCalculate similarity score between X and Y using ExactMatch similarity measure.
SimString.similarity_score
— MethodCalculate similarity score between X and Y using Jaccard similarity measure.
SimString.similarity_score
— MethodCalculate similarity score between X and Y using Overlap similarity measure.
SimString.tokenize
— MethodInternal function to tokenize a string using Mecab