I wrote some tests to look a little further into Trac ticket #1533. I wanted to know which sparse matrix types cannot call .toarray() when the dtype is bool. Surprisingly, the only type that passed was the Lists of Lists (lil) type. Why is this?
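A check of this kind can be sketched roughly as follows. This is not the test code I ran for the ticket, just a minimal illustration: the helper name bool_toarray_support and the small 3 by 3 matrix are made up for the example, and what it reports depends on the SciPy version installed.

```python
import numpy as np
import scipy.sparse as sp

# Format names understood by spmatrix.asformat().
FORMATS = ["bsr", "coo", "csc", "csr", "dia", "dok", "lil"]


def bool_toarray_support(formats=FORMATS):
    """Return {format: True/False} for whether a bool matrix survives .toarray().

    Hypothetical helper for illustration. A failure while converting to
    the format is also counted as False.
    """
    dense = np.array([[True, False, False],
                      [False, False, True],
                      [False, True, False]])
    results = {}
    for fmt in formats:
        try:
            matrix = sp.coo_matrix(dense).asformat(fmt)
            out = matrix.toarray()
            # Pass only if the dtype is still bool and the values round-trip.
            results[fmt] = bool(out.dtype == np.bool_
                                and np.array_equal(out, dense))
        except Exception:
            results[fmt] = False
    return results


results = bool_toarray_support()
```

Printing results gives one pass/fail entry per format, which is enough to see which types need attention.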

Lists of Lists

Looking around in the sparse package, every toarray method except lil's basically does this:

def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    return self.tocoo(copy=False).toarray(order=order, out=out)

The Coordinate list (coo) matrix's toarray is:

def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    B = self._process_toarray_args(order, out)
    fortran = int(B.flags.f_contiguous)
    if not fortran and not B.flags.c_contiguous:
        raise ValueError("Output array must be C or F contiguous")
    M,N = self.shape
    coo_todense(M, N, self.nnz, self.row, self.col, self.data,
                B.ravel('A'), fortran)
    return B

The coo toarray calls the coo_todense function, which just fills a dense matrix with the data, but it doesn't support the bool dtype. This is a C function defined in coo.h.
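To make the call above a little more concrete, here is a pure Python sketch of what a routine like coo_todense does with its arguments. It assumes the compiled version adds each entry into the flat output buffer (so duplicate coordinates are summed), which is my reading rather than something shown in this post. The name coo_todense_sketch is made up, and the real function writes into B.ravel('A') in place instead of returning a list.

```python
def coo_todense_sketch(n_rows, n_cols, rows, cols, data, fortran=False):
    """Fill a flat dense buffer from coordinate triplets.

    Hypothetical stand-in for the compiled routine. `fortran` selects
    column-major indexing, mirroring the flag passed by coo's toarray.
    """
    flat = [0] * (n_rows * n_cols)
    for r, c, value in zip(rows, cols, data):
        # Column-major: walk down columns; row-major: walk along rows.
        index = c * n_rows + r if fortran else r * n_cols + c
        # Assumption: entries are accumulated, so duplicates add up.
        flat[index] += value
    return flat


# 2 x 3 matrix with a duplicate entry at (1, 0).
c_order = coo_todense_sketch(2, 3, [0, 1, 1], [2, 0, 0], [5, 1, 2])
f_order = coo_todense_sketch(2, 3, [0, 1, 1], [2, 0, 0], [5, 1, 2],
                             fortran=True)
```

The point is that the work happens in one typed loop, so every supported dtype has to be handled in the compiled code.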

But why didn't lil fail? Looking at its toarray:

def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    d = self._process_toarray_args(order, out)
    for i, row in enumerate(self.rows):
        for pos, j in enumerate(row):
            d[i, j] = self.data[i][pos]
    return d

It is not using any of these C functions. Why not? Python is slow and C is fast, so is lil taking a performance hit?
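The loop in lil's toarray only copies stored values into place, so it never needs to know the dtype. A small stand-alone sketch shows this with bool data. It uses plain nested lists instead of a NumPy array, and the function name lil_toarray_sketch and the sample rows/data are made up for illustration.

```python
def lil_toarray_sketch(shape, rows, data, fill=False):
    """Build a dense nested list from lil-style `rows` and `data`.

    rows[i] holds the column indices of row i and data[i] holds the
    matching values, as in the toarray loop above.
    """
    n_rows, n_cols = shape
    dense = [[fill] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(rows):
        for pos, j in enumerate(row):
            # Plain assignment: works for bool or any other value type.
            dense[i][j] = data[i][pos]
    return dense


rows = [[1], [], [0, 2]]              # column indices per row
data = [[True], [], [True, True]]     # values per row
dense = lil_toarray_sketch((3, 3), rows, data)
```

Because each value is assigned rather than passed through a typed C loop, bool works the same way as any other dtype here.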

lil's .toarray() benchmark

I wrote some code to benchmark lil's .toarray() performance against the other types. A typical result with a 3000 by 3000 matrix with around 5 random nonzero values per row is:

$ python lil_benchmark.py

It's not much slower at all. But dia and dok are really slow, probably because they are converted to the coo type before doing .toarray(). This is promising: maybe the toarray methods can be written in Python, without much of a performance hit, instead of trying to tack bool support onto the existing C code.
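A benchmark along these lines can be sketched as below. This is not the lil_benchmark.py script itself, only a minimal version of the same idea. The function name benchmark_toarray and its parameters are made up, it relies on scipy.sparse.random being available, and the timings it gives will vary by machine and SciPy version.

```python
import timeit

import scipy.sparse as sp

FORMATS = ["bsr", "coo", "csc", "csr", "dia", "dok", "lil"]


def benchmark_toarray(n=3000, nnz_per_row=5, repeat=3, formats=FORMATS):
    """Time .toarray() for each format on the same random n x n matrix.

    Hypothetical helper. Returns {format: best time in seconds}.
    """
    density = min(1.0, nnz_per_row / n)
    base = sp.random(n, n, density=density, format="coo", random_state=0)
    timings = {}
    for fmt in formats:
        matrix = base.asformat(fmt)  # conversion cost is not timed
        # Best of `repeat` single calls, to reduce noise.
        timings[fmt] = min(timeit.repeat(matrix.toarray,
                                         number=1, repeat=repeat))
    return timings
```

Calling benchmark_toarray() and printing the dictionary sorted by time gives a quick ranking of the formats. A smaller n is handy for a first run, since the dia conversion of a random matrix can use a lot of memory.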

