strgrp: Add cosine similarity filter

The cosine similarity measure[1] (O(m + n)) reduces runtime by roughly
5x on the test set described below when used as a cheap filter ahead of
more expensive algorithms such as LCS[2] (O(m * n)).
A private test set of 3500 strings was used to quantify the improvement.
The shape of the test set is described by Python's Pandas module as:
    >>> frames.apply(len).describe()
    count    3500.000000
    mean       47.454286
    std        14.980197
    min        10.000000
    25%        37.000000
    50%        45.000000
    75%        61.000000
    max       109.000000
    dtype: float64
The tests were performed on a lightly loaded Lenovo X201s with an
Intel Core i7 L 640 @ 2.13GHz CPU and 3.7 GiB of memory. The test was
compiled with GCC 4.9.3:
    $ gcc --version
    gcc (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3
    Copyright (C) 2015 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Using the test program outlined below, ten runs were taken for each
input set size, increasing in steps of 500 strings, before the cosine
similarity filter was added:
     500: 0.61, 0.25, 0.08, 0.07, 0.07, 0.07, 0.09, 0.07, 0.07, 0.07
    1000: 0.33, 0.32, 0.34, 0.32, 0.32, 0.33, 0.32, 0.32, 0.34, 0.32
    1500: 0.72, 1.53, 0.72, 0.70, 0.72, 0.70, 0.72, 0.71, 1.46, 0.71
    2000: 1.22, 1.20, 1.22, 1.98, 1.20, 1.20, 1.22, 1.94, 1.20, 1.20
    2500: 1.97, 2.72, 1.94, 2.33, 2.44, 1.94, 2.82, 1.93, 1.94, 2.69
    3000: 2.69, 3.41, 2.66, 3.38, 2.67, 3.42, 2.63, 3.44, 2.65, 3.39
    3500: 4.18, 3.65, 4.21, 4.21, 3.56, 4.21, 4.16, 3.63, 4.20, 4.13
After adding the cosine similarity filter the runtimes became:
     500: 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02
    1000: 0.08, 0.07, 0.07, 0.07, 0.08, 0.07, 0.07, 0.08, 0.07, 0.07
    1500: 0.16, 0.16, 0.15, 0.16, 0.16, 0.15, 0.15, 0.15, 0.16, 0.16
    2000: 0.26, 0.26, 0.25, 0.26, 0.26, 0.26, 0.25, 0.26, 0.26, 0.26
    2500: 0.41, 0.41, 0.41, 0.40, 0.42, 0.42, 0.42, 0.41, 0.41, 0.41
    3000: 0.58, 0.56, 0.57, 0.56, 0.58, 0.57, 0.56, 0.56, 0.57, 0.55
    3500: 0.75, 0.74, 0.73, 0.74, 0.74, 0.73, 0.72, 0.75, 0.75, 0.75
As such, on average the runtime improvements are:
    N       Avg Pre   Avg Post   Improvement (Pre / Post)
    500     0.145     0.02       7.25
    1000    0.326     0.073      4.47
    1500    0.869     0.156      5.57
    2000    1.358     0.258      5.26
    2500    2.272     0.412      5.51
    3000    3.034     0.566      5.36
    3500    4.014     0.74       5.42
The test driver is shown below. Both it and its dependencies were
compiled with 'CFLAGS=-O2 -fopenmp' and linked with 'LDFLAGS=-fopenmp',
'LDLIBS=-lm':
    $ cat test.c
    #include "config.h"

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #include "ccan/strgrp/strgrp.h"

    int main(void) {
        FILE *f;
        char *buf;
        struct strgrp *ctx;
        f = fdopen(0, "r");
    #define BUF_SIZE 512
        buf = malloc(BUF_SIZE);
        ctx = strgrp_new(0.85);
        while(fgets(buf, BUF_SIZE, f)) {
            buf[strcspn(buf, "\r\n")] = '\0';
            if (!strgrp_add(ctx, buf, NULL)) {
                printf("Failed to classify %s\n", buf);
            }
        }
        strgrp_free(ctx);
        free(buf);
        fclose(f);
        return 0;
    }
[1] https://en.wikipedia.org/wiki/Cosine_similarity
[2] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem