various improvements #31

Closed
wants to merge 53 commits into from
53 commits
c6db850
add missing abs() call in test
jnqnfe Nov 3, 2018
fe97094
move tests out to tests directory
jnqnfe Nov 3, 2018
2788a60
purge "works" tests
jnqnfe Nov 3, 2018
9644ed1
reorganise tests into separate files
jnqnfe Nov 3, 2018
23aac76
add copyright+license header blocks
jnqnfe Nov 3, 2018
2b67d0d
doc clarity fix
jnqnfe Nov 3, 2018
4de9169
formatting fixes
jnqnfe Nov 3, 2018
d710e2a
add note that implementations are scalar value based
jnqnfe Nov 3, 2018
69dd8d6
tidy func descriptions & add links
jnqnfe Nov 3, 2018
b918eac
smart quotes
jnqnfe Nov 3, 2018
2e1e713
simplify hamming tests
jnqnfe Nov 3, 2018
1e999a7
simplify and improve jaro/jaro-winkler tests
jnqnfe Nov 3, 2018
594748c
fix formatting
jnqnfe Nov 3, 2018
6128de6
nicer range syntax
jnqnfe Nov 3, 2018
b1fb470
rename a_len/b_len to a_numchars/b_numchars
jnqnfe Nov 3, 2018
6364c44
jaro optimisation #1
jnqnfe Nov 3, 2018
0e42654
levenshtein optimisation #1
jnqnfe Nov 3, 2018
4db35f9
osa optimisation #1
jnqnfe Nov 3, 2018
533028b
osa optimisation #2
jnqnfe Nov 3, 2018
9e9f6d3
d-l optimisation #1
jnqnfe Nov 3, 2018
e01dd0d
add helper functions
jnqnfe Nov 3, 2018
92c1a24
levenshtein optimisation #2
jnqnfe Nov 3, 2018
1421477
levenshtein optimisation #3
jnqnfe Nov 3, 2018
88e2747
jaro optimisation #2
jnqnfe Nov 3, 2018
287b944
drop leftover appveyor artefacts
jnqnfe Nov 3, 2018
fba3c80
remove unnecessary submod wrapper in benches
jnqnfe Nov 3, 2018
6bb3daa
remove unnecessary references in benches
jnqnfe Nov 3, 2018
8c11d12
Add note to jaro-winkler fn doc about unlimited prefix length
jnqnfe Nov 3, 2018
ce21126
osa optimisation #3
jnqnfe Nov 3, 2018
2e6087d
d-l optimisation #2
jnqnfe Nov 3, 2018
f8b719a
reduce excessive normalised l/d-l examples
jnqnfe Nov 3, 2018
9ce8df6
osa optimisation #4
jnqnfe Nov 4, 2018
feb6683
osa optimisation #5
jnqnfe Nov 4, 2018
afc7524
d-l optimisation #3
jnqnfe Nov 4, 2018
960ff55
jaro optimisation #3
jnqnfe Nov 4, 2018
56a2286
osa optimisation #6
jnqnfe Nov 4, 2018
53a54c6
d-l optimisation #4
jnqnfe Nov 4, 2018
a790a9f
jaro simplification
jnqnfe Nov 4, 2018
04928f7
jaro further simplification
jnqnfe Nov 4, 2018
f2e5e16
osa optimisation #7
jnqnfe Nov 4, 2018
501a5a4
d-l optimisation #5
jnqnfe Nov 4, 2018
c1a0964
clarify jaro with use of `std::ops::Range`
jnqnfe Nov 4, 2018
711b58b
update changelog per optimisation work
jnqnfe Nov 3, 2018
960f164
credit myself in changelog
jnqnfe Nov 3, 2018
ed8c930
typo
jnqnfe Nov 5, 2018
654e1a4
tests: convert `assert_approx_eq_f64` helper to macro
jnqnfe Nov 7, 2018
e432594
j-w optimisation #1
jnqnfe Nov 7, 2018
c776489
add helper inlining hints
jnqnfe Nov 7, 2018
0d6eaa6
have helpers count chars
jnqnfe Nov 8, 2018
99b6b22
j-w optimisation #2
jnqnfe Nov 8, 2018
871252b
tests: add j/j-w 'same-one'char' tests
jnqnfe Nov 8, 2018
f3a91a4
jaro optimisation #4 & j-w optimisation #3
jnqnfe Nov 8, 2018
c68a55a
add some inline hints
jnqnfe Nov 8, 2018
17 changes: 15 additions & 2 deletions CHANGELOG.md
@@ -2,6 +2,19 @@
This project attempts to adhere to [Semantic Versioning](http://semver.org).

## [Unreleased]
Most of the improvements of this release are thanks to [@jnqnfe](https://github.com/jnqnfe)

### Changed
- Optimisations to metric implementations:
  - Avoided char counting where unnecessary
  - Avoided comparing portions of strings twice in Levenshtein variants with
    equal length but non-identical strings
  - Avoided repeated char counting with `normalized_levenshtein`
  - Avoided using floats for counting in Jaro, converting to float at end instead
- Moved tests out to test directory and reorganised
- Simplified the Hamming tests
- Simplified and improved failure output of the Jaro/Jaro-Winkler tests
- Tidied up documentation
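
To illustrate the Jaro item in the list above (count with integers, convert to a float only at the end), here is a minimal sketch. It is not the code from this pull request: `jaro_sketch` is a hypothetical name, and collecting both strings into `Vec<char>` is a simplification made for readability rather than speed.

```rust
/// Illustrative Jaro similarity. `jaro_sketch` is a hypothetical name and
/// this is not the crate's implementation.
fn jaro_sketch(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }

    // Two chars can match only if their positions differ by at most `window`.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);

    // Record which chars of `b` have been matched, and the matched chars of
    // `a` in the order they occur in `a`.
    let mut b_used = vec![false; b.len()];
    let mut a_matched: Vec<char> = Vec::new();
    for (i, &a_char) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_used[j] && b[j] == a_char {
                b_used[j] = true;
                a_matched.push(a_char);
                break;
            }
        }
    }

    // All counting stays in `usize`.
    let matches: usize = a_matched.len();
    if matches == 0 {
        return 0.0;
    }

    // Walk the matched chars of `b` in order and count those that disagree
    // with the matched chars of `a`; each transposition is counted twice.
    let mut out_of_order: usize = 0;
    let mut a_iter = a_matched.iter();
    for (j, &b_char) in b.iter().enumerate() {
        if b_used[j] && a_iter.next() != Some(&b_char) {
            out_of_order += 1;
        }
    }
    let transpositions = out_of_order / 2;

    // Convert to floating point once, at the very end.
    let m = matches as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (matches - transpositions) as f64 / m) / 3.0
}
```

Keeping the counters as `usize` means the inner loops do only integer increments; the three divisions at the end are the only floating-point work.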

## [0.8.0] - (2018-08-19)
### Added
@@ -12,8 +25,8 @@ This project attempts to adhere to [Semantic Versioning](http://semver.org).
- Faster Levenshtein implementation (thanks [@wdv4758h](https://github.com/wdv4758h))

### Removed
- Remove the "against_vec" functions. They are one-liners now, so they don't
seem to add enough value to justify making the API larger. I didn't find
- Remove the “against_vec” functions. They are one-liners now, so they don’t
seem to add enough value to justify making the API larger. I didn’t find
anybody using them when I skimmed through a GitHub search. If you do use them,
you can change the calls to something like:
```rust
3 changes: 1 addition & 2 deletions Cargo.toml
@@ -12,8 +12,7 @@ keywords = ["string", "similarity", "Hamming", "Levenshtein", "Jaro"]
homepage = "https://github.com/dguo/strsim-rs"
repository = "https://github.com/dguo/strsim-rs"
documentation = "https://docs.rs/strsim/"
exclude = ["/.travis.yml", "/appveyor.yml", "/dev"]
exclude = ["/.travis.yml", "/dev"]

[badges]
travis-ci = { repository = "dguo/strsim-rs" }
appveyor = { repository = "dguo/strsim-rs" }
2 changes: 1 addition & 1 deletion README.md
@@ -60,7 +60,7 @@ fn main() {

## Contributing

If you don't want to install Rust itself, you can run `$ ./dev` for a
If you don’t want to install Rust itself, you can run `$ ./dev` for a
development CLI if you have [Docker] installed.

Benchmarks require a Nightly toolchain. Run `$ cargo +nightly bench`.
142 changes: 72 additions & 70 deletions benches/benches.rs
@@ -1,84 +1,86 @@
// Copyright 2015 Danny Guo
//
// Licensed under the MIT license. You may not copy, modify, or distribute this
// file except in compliance with said license. You can find a copy of this
// license either in the LICENSE file, or alternatively at
// <http://opensource.org/licenses/MIT>.

//! Benchmarks for strsim.

#![feature(test)]

extern crate strsim;
extern crate test;
use self::test::Bencher;

mod benches {
use super::*;

extern crate test;
use self::test::Bencher;

#[bench]
fn bench_hamming(bencher: &mut Bencher) {
let a = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGG";
let b = "CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGC";
bencher.iter(|| {
strsim::hamming(&a, &b).unwrap();
})
}
#[bench]
fn bench_hamming(bencher: &mut Bencher) {
let a = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGG";
let b = "CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGC";
bencher.iter(|| {
strsim::hamming(a, b).unwrap();
})
}

#[bench]
fn bench_jaro(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::jaro(&a, &b);
})
}
#[bench]
fn bench_jaro(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::jaro(a, b);
})
}

#[bench]
fn bench_jaro_winkler(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::jaro_winkler(&a, &b);
})
}
#[bench]
fn bench_jaro_winkler(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::jaro_winkler(a, b);
})
}

#[bench]
fn bench_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::levenshtein(&a, &b);
})
}
#[bench]
fn bench_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::levenshtein(a, b);
})
}

#[bench]
fn bench_normalized_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::normalized_levenshtein(&a, &b);
})
}
#[bench]
fn bench_normalized_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::normalized_levenshtein(a, b);
})
}

#[bench]
fn bench_osa_distance(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::osa_distance(&a, &b);
})
}
#[bench]
fn bench_osa_distance(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::osa_distance(a, b);
})
}

#[bench]
fn bench_damerau_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::damerau_levenshtein(&a, &b);
})
}
#[bench]
fn bench_damerau_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::damerau_levenshtein(a, b);
})
}

#[bench]
fn bench_normalized_damerau_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::normalized_damerau_levenshtein(&a, &b);
})
}
#[bench]
fn bench_normalized_damerau_levenshtein(bencher: &mut Bencher) {
let a = "Philosopher Friedrich Nietzsche";
let b = "Philosopher Jean-Paul Sartre";
bencher.iter(|| {
strsim::normalized_damerau_levenshtein(a, b);
})
}
75 changes: 75 additions & 0 deletions src/helpers.rs
@@ -0,0 +1,75 @@
// Copyright 2018 Lyndon Brown
//
// Licensed under the MIT license. You may not copy, modify, or distribute this
// file except in compliance with said license. You can find a copy of this
// license either in the LICENSE file, or alternatively at
// <http://opensource.org/licenses/MIT>.

/// Checks both strings for a common prefix, splitting them after it.
///
/// It returns a tuple consisting of the prefix, the two suffixes, and the
/// `char` count of the prefix: `(prefix, a-suffix, b-suffix,
/// prefix-char-count)`.
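#[inline(always)]
pub(crate) fn split_on_common_prefix<'a, 'b>(a: &'a str, b: &'b str)
    -> (&'a str, &'a str, &'b str, usize)
{
    let (i, cc) = get_diverge_indice(a, b);
    // SAFETY: `i` is the byte offset just past the last `char` of the common
    // prefix. Equal `char`s encode to equal bytes, so `i` lies on a `char`
    // boundary, and within bounds, in both `a` and `b`.
    unsafe {
        (a.get_unchecked(..i), a.get_unchecked(i..), b.get_unchecked(i..), cc)
    }
}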

/// Finds the byte offset of the next `char` following a prefix common to both
/// strings, and returns this along with the count of `char`s that make up the
/// prefix.
#[inline(always)]
pub(crate) fn get_diverge_indice(a: &str, b: &str) -> (usize, usize) {
    let mut char_count = 0;
    let indice = a.char_indices()
        .zip(b.char_indices())
        .take_while(|&((_, a_char), (_, b_char))| a_char == b_char)
        .inspect(|_| char_count += 1)
        .last()
        .map_or(0, |((a_indice, a_char), (_, _))| a_indice + a_char.len_utf8());
    (indice, char_count)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_split_on_common_prefix() {
        assert_eq!(("", "", "", 0), split_on_common_prefix("", ""));
        assert_eq!(("", "a", "", 0), split_on_common_prefix("a", ""));
        assert_eq!(("", "", "a", 0), split_on_common_prefix("", "a"));
        assert_eq!(("a", "", "", 1), split_on_common_prefix("a", "a"));
        assert_eq!(("", "thank", "you", 0), split_on_common_prefix("thank", "you"));
        assert_eq!(("", "hello world!", "foo bar", 0), split_on_common_prefix("hello world!", "foo bar"));
        assert_eq!(("hello w", "orld!", "urld?", 7), split_on_common_prefix("hello world!", "hello wurld?"));
        assert_eq!(("kit", "ten", "es", 3), split_on_common_prefix("kitten", "kites"));
        assert_eq!(("kitten", "", "", 6), split_on_common_prefix("kitten", "kitten"));
        assert_eq!(("ki", "香ten", "tten", 2), split_on_common_prefix("ki香ten", "kitten"));
        assert_eq!(("ki", "tten", "香ten", 2), split_on_common_prefix("kitten", "ki香ten"));
        assert_eq!(("ki香ten", "", "s", 6), split_on_common_prefix("ki香ten", "ki香tens"));
        assert_eq!(("ki香", "ten", "zen", 3), split_on_common_prefix("ki香ten", "ki香zen"));
    }

    #[test]
    fn test_get_diverge_indice() {
        assert_eq!((0, 0), get_diverge_indice("", ""));
        assert_eq!((0, 0), get_diverge_indice("a", ""));
        assert_eq!((0, 0), get_diverge_indice("", "a"));
        assert_eq!((1, 1), get_diverge_indice("a", "a"));
        assert_eq!((0, 0), get_diverge_indice("thank", "you"));
        assert_eq!((0, 0), get_diverge_indice("hello world!", "foo bar"));
        assert_eq!((7, 7), get_diverge_indice("hello world!", "hello wurld?"));
        assert_eq!((3, 3), get_diverge_indice("kitten", "kites"));
        assert_eq!((6, 6), get_diverge_indice("kitten", "kitten"));
        assert_eq!((2, 2), get_diverge_indice("ki香ten", "kitten"));
        assert_eq!((2, 2), get_diverge_indice("kitten", "ki香ten"));
        assert_eq!((8, 6), get_diverge_indice("ki香ten", "ki香tens"));
        assert_eq!((5, 3), get_diverge_indice("ki香ten", "ki香zen"));
    }
}
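
As a rough illustration of how a prefix-splitting helper like the one above could feed a distance function, the sketch below strips the common prefix before running a plain single-row Levenshtein. Removing a shared prefix does not change the Levenshtein distance, so the table only has to cover the differing suffixes. Whether the metrics in this pull request call the helper exactly this way is an assumption. All three function names are hypothetical, and the splitter is a safe variant built on `str::split_at` rather than the `unsafe` slicing in `src/helpers.rs`.

```rust
/// Safe, illustrative counterpart of the helper above (hypothetical name).
/// Returns `(prefix, a_suffix, b_suffix, prefix_char_count)`.
fn split_on_common_prefix_safe<'a, 'b>(
    a: &'a str,
    b: &'b str,
) -> (&'a str, &'a str, &'b str, usize) {
    let mut byte_len = 0;
    let mut char_count = 0;
    for (a_char, b_char) in a.chars().zip(b.chars()) {
        if a_char != b_char {
            break;
        }
        byte_len += a_char.len_utf8();
        char_count += 1;
    }
    // `byte_len` is a char boundary in both strings because the prefixes
    // are identical; `split_at` would panic otherwise.
    let (prefix, a_rest) = a.split_at(byte_len);
    let (_, b_rest) = b.split_at(byte_len);
    (prefix, a_rest, b_rest, char_count)
}

/// Plain single-row Levenshtein over `char`s (hypothetical name).
fn levenshtein_plain(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // `row[j]` holds the distance between the processed part of `a`
    // and the first `j` chars of `b`.
    let mut row: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, a_char) in a.chars().enumerate() {
        let mut prev_diag = row[0];
        row[0] = i + 1;
        for (j, &b_char) in b_chars.iter().enumerate() {
            let cost = if a_char == b_char { 0 } else { 1 };
            let next = (prev_diag + cost) // substitution or match
                .min(row[j] + 1) // insertion
                .min(row[j + 1] + 1); // deletion
            prev_diag = row[j + 1];
            row[j + 1] = next;
        }
    }
    row[b_chars.len()]
}

/// Levenshtein that skips the common prefix first (hypothetical name).
fn levenshtein_trimmed(a: &str, b: &str) -> usize {
    let (_, a_rest, b_rest, _) = split_on_common_prefix_safe(a, b);
    levenshtein_plain(a_rest, b_rest)
}
```

For `"ki香ten"` and `"ki香zen"`, the splitter yields the suffixes `"ten"` and `"zen"`, so the distance computation only touches three chars per string.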