numpy - Find minimum cosine distance between two matrices -


I have two 2D np.arrays, let's call them a and b, both having the same shape. For every vector in the 2D array a, I need to find the vector in matrix b that has the minimum cosine distance. To do this I have a double loop, inside of which I try to find the minimum value. Basically I do the following:

    from scipy.spatial.distance import cosine

    l, res = a.shape[0], []
    for i in range(l):
        # (distance, index) pairs compare by distance first, so min() picks the closest row of b
        minimum = min((cosine(a[i], b[j]), j) for j in range(l))
        res.append(minimum[1])

In the code above, one of the loops is hidden behind a comprehension. Everything works fine, but the double loop makes it too slow (I tried to rewrite it as a double comprehension, which made things a little bit faster, but it is still slow).

I believe there is a numpy function that can achieve the following faster (using some linear algebra).

So is there a way to achieve what I want faster?

From the cosine docs we have the following info -

scipy.spatial.distance.cosine(u, v) : Computes the cosine distance between 1-D arrays.

The cosine distance between u and v is defined as

$$1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

where u⋅v is the dot product of u and v.
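As a quick illustration of that definition, here is a minimal NumPy-only sketch. The helper name `cosine_distance` is made up for this example and is not part of SciPy; it just spells out the formula for two 1-D arrays with non-zero norm.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two 1-D arrays with non-zero L2 norm."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # 1 - (u . v) / (||u||_2 * ||v||_2)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors
d_parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])     # same direction
d_opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])    # opposite direction
```

Up to floating-point rounding, perpendicular vectors give a distance of 1, vectors pointing the same way give 0, and vectors pointing in opposite directions give 2.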

Using the above formula, we have one vectorized solution using NumPy's broadcasting capability, like so -

    import numpy as np

    # Get the dot products, L2 norms and thus the cosine distances
    dots = np.dot(a, b.T)
    l2norms = np.sqrt(((a**2).sum(1)[:, None]) * ((b**2).sum(1)))
    cosine_dists = 1 - (dots / l2norms)

    # Get the min values (if needed) and the corresponding indices along the rows for res.
    # Take care of zero L2 norm values by using nanmin and nanargmin
    minval = np.nanmin(cosine_dists, axis=1)
    cosine_dists[np.isnan(cosine_dists).all(1), 0] = 0
    res = np.nanargmin(cosine_dists, axis=1)
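To make the broadcasting step easier to follow, here is a small sketch on toy data (the arrays below are invented for illustration). It shows the shapes involved when the squared row norms of a and b are combined into one matrix of norm products.

```python
import numpy as np

a = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])        # 3 row vectors
b = np.array([[0.0, 5.0],
              [1.0, 1.0]])        # 2 row vectors

sq_a = (a ** 2).sum(1)[:, None]   # shape (3, 1): squared norms of the rows of a
sq_b = (b ** 2).sum(1)            # shape (2,):   squared norms of the rows of b

# (3, 1) * (2,) broadcasts to (3, 2): entry [i, j] is ||a[i]||^2 * ||b[j]||^2
l2norms = np.sqrt(sq_a * sq_b)

dots = np.dot(a, b.T)             # shape (3, 2): entry [i, j] is a[i] . b[j]
cosine_dists = 1 - dots / l2norms
nearest = cosine_dists.argmin(axis=1)   # index of the closest row of b per row of a
```

Here `nearest` comes out as `[1, 0, 1]`: the rows `[1, 0]` and `[3, 3]` are closest in direction to `[1, 1]`, while `[0, 2]` points the same way as `[0, 5]`.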

Runtime tests -

    In [81]: def org_app(a,b):
        ...:    l, res, minval = a.shape[0], [], []
        ...:    for i in range(l):
        ...:        minimum = min((cosine(a[i], b[j]), j) for j in range(l))
        ...:        res.append(minimum[1])
        ...:        minval.append(minimum[0])
        ...:    return res, minval
        ...:
        ...: def vectorized(a,b):
        ...:     dots = np.dot(a,b.T)
        ...:     l2norms = np.sqrt(((a**2).sum(1)[:,None])*((b**2).sum(1)))
        ...:     cosine_dists = 1 - (dots/l2norms)
        ...:     minval = np.nanmin(cosine_dists,axis=1)
        ...:     cosine_dists[np.isnan(cosine_dists).all(1),0] = 0
        ...:     res = np.nanargmin(cosine_dists,axis=1)
        ...:     return res, minval
        ...:

    In [82]: a = np.random.rand(400,500)
        ...: b = np.random.rand(400,500)
        ...:

    In [83]: %timeit org_app(a,b)
    1 loops, best of 3: 10.8 s per loop

    In [84]: %timeit vectorized(a,b)
    10 loops, best of 3: 145 ms per loop

Verify results -

    In [86]: x1, y1 = org_app(a, b)
        ...: x2, y2 = vectorized(a, b)
        ...:

    In [87]: np.allclose(np.asarray(x1),x2)
    Out[87]: True

    In [88]: np.allclose(np.asarray(y1)[~np.isnan(np.asarray(y1))],y2[~np.isnan(y2)])
    Out[88]: True
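The NaN masks in the last comparison only matter when some row has zero L2 norm, which is unlikely with random data like the above. As a rough sketch of that corner case, the toy arrays below (invented for illustration) include one all-zero row in a. The warning filter is an addition for this example, there only to keep the output quiet.

```python
import warnings
import numpy as np

a = np.array([[0.0, 0.0],     # zero vector: its L2 norm is 0
              [1.0, 1.0]])
b = np.array([[2.0, 0.0],
              [3.0, 3.0]])

with warnings.catch_warnings():
    # Silence the 0/0 and "All-NaN slice" RuntimeWarnings for this demo
    warnings.simplefilter("ignore", RuntimeWarning)
    dots = np.dot(a, b.T)
    l2norms = np.sqrt(((a ** 2).sum(1)[:, None]) * ((b ** 2).sum(1)))
    cosine_dists = 1 - (dots / l2norms)        # row 0 is NaN in every column
    minval = np.nanmin(cosine_dists, axis=1)   # NaN for the zero row

# nanargmin raises ValueError on an all-NaN row, so such rows get a
# placeholder value in column 0 first; their reported index is then 0.
cosine_dists[np.isnan(cosine_dists).all(1), 0] = 0
res = np.nanargmin(cosine_dists, axis=1)
```

For the zero row, `minval` is NaN and `res` falls back to index 0, so an index of 0 in `res` should be read together with `minval` to tell a real match from the placeholder.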
