Recently I've been working on a pull request for fancy indexing in sparse matrices. Most of the work has already been done by Daniel Smith (thanks!). I've been working on fixing some bugs that were causing test failures. And cleaning up the interface.
Below is some data I will be working with throughout this post.
import numpy as np import scipy.sparse as sp A = np.asmatrix(np.arange(50).reshape(5, 10)) Asp = sp.csr_matrix(A)
Python is pretty forgiving in what it allows for indexing with slices. For example it does not raise an error when the upper bound of a slice is outside of the object. Things like
list(range(3))[:9000] are perfectly legal. Here I adopted python's forgiveness when implementing how sparse matrices should be have when indexing should return something empty.
What should happen when indexing should return an empty matrix? With ndarrays, a matrix of "size" 0 is returned. Where is size is all the elements of shape multiplied:
(matrix(, shape=(0, 10), dtype=int64), matrix(, shape=(0, 10), dtype=int64))
scipy.sparse there is currently no way to create a size 0 matrix. There is an explicit requirement in the sparse matrix base class that both entries of shape be
>= 1. So the next best thing I could do is to return a size 1 matrix with a
(1,1) shape and a single zero element. This is currently supported by all sparse matrix formats. The result of fancy indexing is sometimes created using the coordinate sparse matrix format (COO). A small change in its
__init__ method now has it return a
(1, 1) empty COO matrix when you try to create a COO matrix which would haze size 0. There was also a change to how slices are checked to allow the start and stop arguments to be equal, and if so to return an empty matrix as described above. So now we have:
(<1x1 sparse matrix of type '<class 'numpy.int64'>' with 0 stored elements in Compressed Sparse Row format>, <1x1 sparse matrix of type '<class 'numpy.float64'>' with 0 stored elements in Compressed Sparse Row format>)
Sparse Boolean Indexing¶
I also add a new feature to the fancy indexing. Since Boolean sparse matrices work now, I added support for indexing with them. So this works:
Asp[Asp > 30] = 42 Asp.todense()
matrix([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [30, 42, 42, 42, 42, 42, 42, 42, 42, 42], [42, 42, 42, 42, 42, 42, 42, 42, 42, 42]], dtype=int64)
Supporting this was mostly a matter of allowing existing code paths to accept the sparse matrices.
However the boolean indexing is not completely analogous to NumPy. Since sparse matrices cannot be one dimensional. So there is no way to do something like.
bool_vector = np.random.randint(2, size=5).astype(np.bool) A[bool_vector]
matrix([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]])
CSC's .nonzero method¶
There was still a bit of work to do with CSC. The CSC
__getitem__ was simplified to just two cases. The first is the case where a sparse matrix is returned, the second is the case where a sequence of elements is returned. However there was still a bug which was causing problems. The
csc_matrix.nonzero method was return its indices sorted incorrectly. NumPy states that
ndarray.nonzero should sort it's indices C-style, with the last entry varying the fastest. However the indices in this case were sorted with the first index varying the fastest.
I fixed this by defining a
.nonzero method in the
csc_matrix class to override the method defined in the base class (
spmatrix). I tried to fix this in a few other ways. Like by resorting the indices in
_cs_matrix.tocoo which is used to define the base
.nonzero but this caused a lot of round off problems for some reason I could not figure out.