numpy - Find minimum cosine distance between two matrices
I have two 2D np.arrays, let's call them a and b, both having the same shape. For every vector in the 2D array a, I need to find the vector in matrix b that has the minimum cosine distance to it. To do this I have a double loop, inside of which I try to find the minimum value. It looks like the following:
    from scipy.spatial.distance import cosine

    l, res = a.shape[0], []
    for i in range(l):
        # (distance, index) pairs compare by distance first, so min() picks the closest row of b
        minimum = min((cosine(a[i], b[j]), j) for j in range(l))
        res.append(minimum[1])
In the code above, one of the loops is hidden behind a comprehension. Everything works fine, but the double loop makes it slow (I tried to rewrite it with a double comprehension, which made things a little bit faster, but it is still slow).
I believe there is a numpy function that can achieve this faster (using some linear algebra).
So is there a way to achieve what I want faster?
From the cosine docs we have the following info:
scipy.spatial.distance.cosine(u, v): Computes the cosine distance between 1-D arrays.
The cosine distance between u and v is defined as

$$1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

where u⋅v is the dot product of u and v.
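To make the formula concrete, here is a minimal pure-Python sketch of the cosine distance for a single pair of vectors. The function name cosine_distance is made up for illustration, and returning NaN for a zero-length vector is an assumption chosen to mirror what the NumPy division produces in that case; it is not necessarily what scipy does.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (||u||_2 * ||v||_2) for two equal-length sequences."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Assumption: report NaN when either vector has zero length,
    # since the distance is undefined there.
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```

Orthogonal vectors give a distance of 1.0, and parallel vectors give a distance of (approximately) 0, which is why the smallest distance identifies the most similar direction.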
Using the above formula, we have one vectorized solution using NumPy's broadcasting capability:
    import numpy as np

    # Dot products, L2 norms and cosine distances
    dots = np.dot(a, b.T)
    l2norms = np.sqrt(((a**2).sum(1)[:, None]) * ((b**2).sum(1)))
    cosine_dists = 1 - (dots / l2norms)

    # Min values (if needed) and corresponding indices along rows for res.
    # Take care of zero L2 norm values, using nanmin and nanargmin
    minval = np.nanmin(cosine_dists, axis=1)
    cosine_dists[np.isnan(cosine_dists).all(1), 0] = 0
    res = np.nanargmin(cosine_dists, axis=1)
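As a small worked example of the same idea, the sketch below uses a slightly different formulation: it scales every row to unit length first, so that a single matrix product yields all pairwise cosine similarities. This is a variant for illustration, not the code from the answer; the name nearest_by_cosine is hypothetical, and the sketch assumes no row is all zeros.

```python
import numpy as np

def nearest_by_cosine(a, b):
    """For each row of a, return the index of the row of b with the smallest cosine distance."""
    # Scale every row to unit length (assumption: no row is all zeros).
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) is the cosine similarity between a[i] and b[j].
    similarity = np.dot(a_unit, b_unit.T)
    # The smallest distance 1 - similarity corresponds to the largest similarity.
    return np.argmax(similarity, axis=1)

a = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([[0.0, 5.0], [3.0, 3.0], [4.0, 0.0]])
print(nearest_by_cosine(a, b))
```

For these inputs the rows of a are matched to rows 2, 0 and 1 of b, since each pair points in the same direction regardless of length.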
Runtime tests:
    In [81]: def org_app(a, b):
        ...:     l, res, minval = a.shape[0], [], []
        ...:     for i in range(l):
        ...:         minimum = min((cosine(a[i], b[j]), j) for j in range(l))
        ...:         res.append(minimum[1])
        ...:         minval.append(minimum[0])
        ...:     return res, minval
        ...:
        ...: def vectorized(a, b):
        ...:     dots = np.dot(a, b.T)
        ...:     l2norms = np.sqrt(((a**2).sum(1)[:, None]) * ((b**2).sum(1)))
        ...:     cosine_dists = 1 - (dots / l2norms)
        ...:     minval = np.nanmin(cosine_dists, axis=1)
        ...:     cosine_dists[np.isnan(cosine_dists).all(1), 0] = 0
        ...:     res = np.nanargmin(cosine_dists, axis=1)
        ...:     return res, minval
        ...:

    In [82]: a = np.random.rand(400,500)
        ...: b = np.random.rand(400,500)
        ...:

    In [83]: %timeit org_app(a,b)
    1 loops, best of 3: 10.8 s per loop

    In [84]: %timeit vectorized(a,b)
    10 loops, best of 3: 145 ms per loop
Verify results:
    In [86]: x1, y1 = org_app(a, b)
        ...: x2, y2 = vectorized(a, b)
        ...:

    In [87]: np.allclose(np.asarray(x1),x2)
    Out[87]: True

    In [88]: np.allclose(np.asarray(y1)[~np.isnan(np.asarray(y1))],y2[~np.isnan(y2)])
    Out[88]: True