week 5 - Various bool stuff

I spent this week tying up loose ends relating to the sparse package.

DeprecationWarning

A recent change to NumPy caused a DeprecationWarning to be raised whenever there was (potentially) implicit casting between incompatible dtypes.

 DeprecationWarning: Implicitly casting between incompatible kinds. 
In a future numpy release, this will raise an error. Use
casting="unsafe" if this is intentional.

This was happening a lot in the sparse test suite, where in-place division and multiplication are tested. Since this behavior is being deprecated, I removed the tests for the appropriate cases. But not without some trouble first.

Recall that in Python 3, / is always true division, so there is always the potential to change the dtype from int to float. With in-place division on an integer array the result is written back into the integer buffer, so when every quotient is exactly representable as an integer there is no difference between the input and result dtypes. NumPy still throws the warning in that case, so simply checking for a dtype difference between input and result was not enough to programmatically remove the deprecated tests (which I noticed when my first patch did not work). Eventually I just added special cases to the tests for different data.
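As a minimal illustration (my own, not taken from the test suite) of the kind of case that triggers this, take in-place true division on an integer array where every quotient is exact; older NumPy warns about the implicit cast, current NumPy raises instead:

import numpy as np
import warnings

a = np.arange(0, 8, 2)              # int dtype, every element divisible by 2
try:
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        a /= 2                      # true division: float result into an int buffer
except (DeprecationWarning, TypeError) as exc:
    print(type(exc).__name__, exc)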

Output Type for SWIG Routines

The compressed types of sparse matrices (BSR, CSC, CSR) each have a set of routines for performing binary operations (binops) between two sparse matrices. These routines, however, only returned data of the same type as the input data, so boolean operations had to have their result cast to bool at the Python level. I modified the sparsetools routines to output bool data where appropriate.

Implementation

This involved changes at every level of the codebase (C++, SWIG, Python), so it is a good demonstration of how these levels work together.

C++

The binop routines are implemented as function templates in C++, where the template arguments are (including the new one I added):

  • I: Always int for storing nnz (number of non-zeros), vector length etc.

  • T: The input data type.

  • T2: The output data type, which I added.

Adding this new argument required minimal changes to the existing code. Basically I just changed T to T2 in a few places.

Since the output type used here was bool, I also had to add a conversion from the complex wrapper class to bool.

SWIG

At the SWIG level, typemaps define the input and output types of the function templates. I had to add new typemaps to instantiate functions with boolean output.

Python

Since the new binop routines now output the correct dtype, I removed all the casting to bool. However, in the cases where boolean data must be returned, Python has to pass the binop routines an empty matrix with the correct dtype, so _binopt now checks which operation it is performing to decide what kind of empty matrix to create.
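Here is a hypothetical sketch of the kind of decision _binopt has to make (the operation names and the helper are illustrative, not the actual SciPy code): comparison operations get a bool result matrix, while arithmetic operations keep the upcast input dtype.

import numpy as np

COMPARISON_OPS = {'_ne_', '_lt_', '_gt_', '_le_', '_ge_'}   # illustrative names

def result_dtype(op, dtype_a, dtype_b):
    # Comparisons produce a bool matrix; everything else upcasts the inputs.
    if op in COMPARISON_OPS:
        return np.dtype(np.bool_)
    return np.promote_types(dtype_a, dtype_b)

print(result_dtype('_ne_', np.int64, np.float64))    # bool
print(result_dtype('_plus_', np.int64, np.float64))  # float64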

Sparse Bool Wrapper

Pull request. Previously, the C++ routines in sparsetools defined the bool dtype with:

typedef npy_int8 npy_bool_wrapper;

This was so that a bool value would be one byte. But since the value is stored as an int8, it rolls over when the underlying integer wraps around at 256. Like this:

In [2]: a = sp.csr_matrix([True, False])

In [3]: for _ in range(8):
   ...:     a = a + a
   ...:     print(a.todense())
   ...:     
[[ True False]]
[[ True False]]
[[ True False]]
[[ True False]]
[[ True False]]
[[ True False]]
[[ True False]]
[[False False]]        # <----- !

So I rewrote npy_bool_wrapper as a class with a single char data member (recall that in C++ a char is one byte) and added the arithmetic operator overloads required for boolean algebra, e.g. 1 + 1 = 1.
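A small Python sketch (not the C++ wrapper itself) of the semantics this gives: addition saturates at 1, i.e. it behaves like logical OR, instead of wrapping around the way int8 arithmetic does.

def bool_add(a, b):
    # Boolean algebra: 1 + 1 = 1, so repeated addition can never wrap to 0.
    return 1 if (a or b) else 0

x = 1
for _ in range(8):
    x = bool_add(x, x)   # stays 1; with the int8 typedef this wrapped to 0
print(bool(x))           # True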

Build Failures

After modifying complex_ops.h in my pull request for sparse matrix inequalities, people were getting build failures with the Clang and Intel compilers. This was related to ill-defined overloads of the boolean comparison operators. As an example, previously we had:

bool operator !=(const c_type& B) const{
    return npy_type::real != B || npy_type::imag != c_type(0);
}

Here c_type is a template argument used as the type of the underlying real and imag members, so comparisons with anything that was not a c_type were ambiguous. I redefined the boolean comparisons as member function templates like:

template<class T>
bool operator !=(const T& B) const{
    return npy_type::real != B || npy_type::imag != T(0);
}

This should handle all comparisons, within reason.

week 4 - Sparse Matrix Inequalities

This week I worked on adding support for inequalities to sparse matrices. The pull request is here (still pending). I also modified the C++ routines to produce boolean output, pull request here, so that these inequalities don't have to be cast to bool in Python. Once both are accepted, the inequalities will use the C++ routines that produce boolean output.

These are separate but related pull requests, so one will need to be rebased once the other is accepted, or I may combine them.

Each of the four basic inequalities has its own particular quirks which make it more or less efficient with sparse matrices. Here A is a sparse matrix, and we consider the cases where B is a scalar or another sparse matrix. (By scalar we effectively mean a matrix the same size as A with every element equal to the scalar value B.)

A < B

For scalar B, this operation is only efficient if B < 0; otherwise the resulting matrix will be dense.

For sparse A and B, this operation is also efficient.
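A quick illustration of the scalar case, run against a current SciPy build (the exact warnings emitted may vary between versions):

import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.eye(3))

print((A < -1).nnz)   # 0 stored entries: only the stored nonzeros of A could match
print((A < 1).nnz)    # many stored entries: every implicit zero of A matches,
                      # so the result is effectively dense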

A > B

For scalar B the efficiency is the opposite of the less-than operator; that is, B > 0 gives efficient output. As with <, this operation is efficient with other sparse matrices.

A <= B

This operation is pretty much useless. The only case which makes it efficient is when B < 0, but that case is already covered by <, which can also be used efficiently with sparse matrices. So why use <= at all? I'll discuss the pros and cons of having these non-strict inequalities in a moment.

Another disadvantage of non-strict inequalities like these is that they raise NotImplementedError when comparing with 0. This is due to the nature of the C++ routines: when comparing pairs of elements, they never consider the cases where both elements are zero.

A >= B

Like <=, this operation is only efficient in one case, when B > 0, and also like <=, it is not very useful.

Why have non-strict inequalities?

I can't really think of uses for >= and <=, but they might exist; I added them because they were easy to implement once I had done > and <. In practice they don't slow anything down, but removing the non-strict inequalities would remove 24,990 lines of code.

Implementation

The inequality operations are implemented as C++ routines, which are wrapped with SWIG to provide a Python interface; the various inequalities are then added appropriately at the Python level.

C++

The inequality routines were implemented by reusing existing routines like csr_binop_csr. Comparison operator overloads also had to be added to complex_ops.h.

The binop routines had to be altered: another template parameter, T2, was added for the output data type. In theory this could be something other than bool.

SWIG

To handle boolean output data, a new SWIG macro was added to create typemaps that use npy_bool_wrapper as the output type. I then used these to instantiate the boolean operations.

Python

Here the Python special methods corresponding to the inequalities were added in compressed.py. These operations can be used with a scalar, a dense matrix, or a sparse matrix. NumPy-style broadcasting is not implemented.

To produce bool output by default for boolean comparisons, I altered the _binopt function to pass the C++ routines an output matrix with a bool dtype.

Problems

When testing the inequalities with dense matrices, I could not get the proper behavior with a dense matrix on the left-hand side. Previously, for != and ==, I modified the __bool__ method, but that did not work in this case. This comes down to the same problem I will be dealing with in the next stage of my proposal: interactions with NumPy ufuncs.

week 3 - Boolean Operations, != and == for CSR, CSC, and BSR

This week focused on implementing support for the == and != boolean comparisons for the BSR, CSC, CSR sparse matrix types. The culmination of my work this week is this pull request.

!=

For sparse matrices, the != (not equal to) comparison generally has an output that can be efficiently represented by sparse matrices, with the exception of comparing the whole matrix with a nonzero scalar.

==

The equal to comparison in general has an output that is not efficiently represented by sparse matrices, since all the zero entries compare True. The exception is comparison with a nonzero scalar.

Adding a C++ routine for this operation with the existing code would be problematic. The binop routines only apply the given binop to element pairs in which at least one element is nonzero, so all the co-occurring zero elements would not return True like they should. Because of this, and because == is in general inefficient anyway, I did not bother to implement == in sparsetools. Instead, == is computed by applying != and inverting the result.
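A hypothetical sketch of that idea (the real implementation differs in its details); it also shows why == is dense in general, since inverting the sparse != pattern turns every co-occurring zero into True:

import scipy.sparse as sp

def sparse_eq(A, B):
    neq = (A != B).toarray()    # != is the sparse-friendly comparison
    return sp.csr_matrix(~neq)  # inverting it densifies the result

A = sp.csr_matrix([[1, 0], [0, 2]])
B = sp.csr_matrix([[1, 0], [3, 2]])
print(sparse_eq(A, B).toarray())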

Implementation

C++ & SWIG level

Implementing the routines for != and == in sparsetools/ was pretty easy using the existing code. There are handy csr_binop_csr and bsr_binop_bsr functions which take a C++ functor as an argument for the binop. The only problem is that the output type is the same as the input type, so the output is not bool by default.

Python level

In Python, the != and == operations can be overridden by defining the methods __ne__ and __eq__. I did this to handle the cases where the other object is a scalar, a dense matrix, or a sparse matrix.

I defined these methods in compressed.py, which defines the base class for BSR, CSC, and CSR.

I also defined them in base.py, where they just convert the sparse matrix to CSR and perform the desired operation.

__bool__

There was, however, some complication with how these worked when a dense ndarray is the first argument. Ideally NumPy would ask the sparse matrix what to do, but this was not happening whenever the sparse matrix's __bool__ method (__nonzero__ on Python 2.x) returned True or False; I still don't know why. It did work when __bool__ raised an error or returned something other than True/False. Since NumPy's ndarrays raise an error when __bool__ is called, I thought this was a reasonable change to make to sparse matrices' __bool__. Ideally, though, I'd like to better understand what is going on so I don't have to make an incompatible change to SciPy's API.
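A sketch (not the actual SciPy code) of the truth-value behavior adopted: mimic ndarray and raise for anything bigger than 1x1, so that NumPy ends up deferring to the sparse matrix instead of silently truth-testing it.

class SparseBoolSketch:
    """Illustrative stand-in for a sparse matrix's truth-value protocol."""
    def __init__(self, shape, nnz):
        self.shape, self.nnz = shape, nnz

    def __bool__(self):              # __nonzero__ on Python 2.x
        if self.shape == (1, 1):
            return self.nnz != 0
        raise ValueError("The truth value of a matrix with more than one "
                         "element is ambiguous.")

print(bool(SparseBoolSketch((1, 1), 1)))   # True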

This __bool__ change broke one test in scipy/io/matlab/tests/test_mio.py, and there could be more serious issues related to it; they are currently being discussed on the pull request.

SciPy Workflow

There are a few things that have taken some getting used to in working with SciPy. These things are aggregated here:

Keeping my SciPy repo up-to-date

I need to update my copy of SciPy regularly from upstream to avoid merge conflicts. I have done this plenty of times, but for some reason I forget how every time. There is a good how-to on Stack Overflow about this. From said article:

# Add the remote, call it "upstream":

git remote add upstream git://github.com/whoever/whatever.git

# Fetch all the branches of that remote into remote-tracking branches,
# such as upstream/master:

git fetch upstream

# Make sure that you're on your master branch:

git checkout master

# Rewrite your master branch so that any commits of yours that
# aren't already in upstream/master are replayed on top of that
# other branch:

git rebase upstream/master

Pushing new local branches to a remote Git repo

Every time I make a new local branch, I forget how to push it to GitHub as a new branch. Google always brings me to this article. Basically, if I have a new local branch, call it new-feature, and I want to push it to my GitHub repo and create a new-feature branch there too, I do:

git push -u origin new-feature

and if I wanted to delete this remote branch, I could do:

git push origin :new-feature

and delete it locally with:

git branch -D new-feature

SciPy's commit messages

SciPy requires commit messages to start with one of these standard acronyms:

API: an (incompatible) API change
BLD: change related to building numpy
BUG: bug fix
DEP: deprecate something, or remove a deprecated object
DEV: development tool or utility
DOC: documentation
ENH: enhancement
MAINT: maintenance commit (refactoring, typos, etc.)
REV: revert an earlier commit
STY: style fix (whitespace, PEP8)
TST: addition or modification of tests
REL: related to releasing numpy

The acronyms are easy to remember, but adding them is easy to forget, which brings me to my next topic.

Changing commit messages, after you have pushed

This is not the simplest thing to do in git; fortunately my mentor Pauli showed me a great way to do it, assuming no one has pulled from your repo yet:

  • First do git log and look for the last COMMIT before the commit messages you want to change.

  • Do git rebase -i COMMIT

  • Replace pick with r (reword) for each commit message you want to change.

  • Then Git will prompt you to change the commit messages one at a time.

  • Push your changes with git push -f

Week 1 - bool dtype support

I got accepted to Google Summer of Code! Thank you everyone who has helped me get this far, and Amber for making me a congratulatory google cake!

google cake

On my proposal timeline I gave myself two weeks to add dtype=bool support to sparse matrices. So far dtype=bool works; the implementation was not too hard, and most of my time was spent figuring out how SWIG works. However, there is still a lot of work to do in testing bool support, which will be more time consuming than it should be because the sparse test suite is messy.

bool dtype implementation

sparsetools

The sparsetools/ directory contains several *.cxx, *.h, *.i, and *.py files. The header files (*.h) contain C++ routines, which are wrapped by SWIG according to instructions in the *.i interface files. SWIG uses the header and interface files to generate Python wrappers (the *.py files) and C++ glue code (the *.cxx files). Once these *.cxx files are compiled and linked, the functions they define can be called from the generated *.py files.

Where do I add bool support?

Contained in sparsetools/ is a file called complex_ops.h, which translates NumPy's complex types to C++ classes. Well, that is what I want to do, except for NumPy's bool type instead! My new file bool_ops.h is much simpler, though. It basically does:

typedef npy_int8 npy_bool_wrapper;

Yep, that's it (for now anyway). This file basically says to treat the npy_bool_wrapper type like npy_int8, because boolean algebra is almost integer algebra.
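For instance, integer addition and boolean "addition" (logical OR) agree on everything except that 1 + 1 should stay 1:

print(int(True) + int(False))   # 1, same as True or False
print(int(True) + int(True))    # 2, whereas True or True is still True (1)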

npy_bool_wrapper is used in numpy.i to define a typemap from numpy's npy_bool type to our npy_bool_wrapper.

In sparsetools.i we have to declare the new data type and add it to the INSTANTIATE_ALL macro, which is used in all the *.i files to instantiate the functions we want from the *.h files.

What else?

Adding bool to supported_dtypes in sputils.py prevents it from being upcast. Then tests need to be added, but this isn't easy: the test suite does not have a way to parametrize over general data or dtypes, so I am going through and adding tests one by one where it seems appropriate. I'm also trying to add tests for other dtypes, because there don't seem to be any tests for anything other than float64 and int64.
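This is roughly the kind of dtype-parametrized check I would like the suite to support (a hypothetical sketch; the real tests are written case by case):

import numpy as np
import scipy.sparse as sp

def check_roundtrip(dtype):
    # Construct a CSR matrix with the requested dtype and make sure the
    # dtype survives a round trip back to a dense array.
    dense = np.array([[1, 0], [0, 1]], dtype=dtype)
    assert sp.csr_matrix(dense, dtype=dtype).toarray().dtype == dtype

for dt in (np.bool_, np.int8, np.uint16, np.float64):
    check_roundtrip(dt)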

While I was doing this I discovered a few bugs relating to how sparse handles uint dtypes.

Once I'm done adding tests and fixing whatever does not pass, I'll move on to the next stage in my proposal: adding support for boolean operations.

Google Summer of Code Proposal

Improvements to the sparse package of Scipy: support for bool dtype and better interaction with NumPy

Abstract

The SciPy sparse package is used extensively in other projects like scikit-learn, and is integrated with NumPy, SymPy, and Sage. However, it leaves a lot to be desired. There is no support for boolean sparse matrices, and boolean operations on sparse matrices return inconsistent results. See these examples. Sparse matrices also have problems when used with NumPy ufuncs and ndarrays (see this ticket, which I did some work on, and this ticket). The second part of this proposal is to remedy this, so that writing code combining sparse and dense matrices will be easier and the sparse codebase will be easier to maintain.

The two main parts of this proposal are:

  1. Add support for bool dtype to sparse matrices, as well as boolean operations so that sparse matrices behave more like NumPy ndarrays.

  2. Improve the interaction of sparse matrix objects with other types, especially NumPy ndarray objects and ufuncs. This will make numpy.dot, .multiply, etc. work with sparse matrices, making generic sparse and dense matrix code much easier to write. This will also involve adding new functions/methods (similar to NumPy ufuncs) to the sparse package, which will improve its usability.

Both of these will be implemented according to separate specifications (written with community input) and test suites.

About me

I've been programming with SciPy/NumPy for around three years for research and homework assignments, and have always wanted to give back. My previous work with SciPy:

There are a lot of projects on my GitHub, many incomplete; here are a select few:

CV

Timeline

I've shifted the Coding period to start on May 27th (with the consent of my mentor) and end earlier so I can return to school.

Before May 27th.
  • Familiarizing myself with scipy.sparse by investigating bug reports, reading documentation, etc.
  • Building rapport with the Scipy community and my mentor by trying to be helpful in #scipy and on the mailing list.
  • Get a head start! (my last day of school is May 3rd.)
  • Get community feedback on bool handling specification and ndarray handling specification. Write said specifications. (I've already started this.)
bool handling -- May 27th to July 1st (5 weeks)

Implementation will be done according to this specification which is currently being written. In order:

  • implement support for the bool datatype (2 weeks)
  • implement boolean operations (2 weeks)
  • code review and unit tests (1 week)
NumPy interactions -- July 1st to August 26th (8 weeks)

See this specification (under construction). In order:

  • Modify NumPy universal function objects (ufuncs) to be aware of sparse matrix types. (1 week)
  • Refactor and consolidate sparse code to clean up the interface (1 week)
  • Add sparse matrix functions (or methods) to handle calls from ufuncs (5 weeks)
    • Implement ufuncs that should just operate on the dense form of the sparse matrix.
    • Add C++ code for row-wise or column-wise broadcasting.
    • Implement sparse forms of the binary ufuncs that should broadcast.
    • Test suite for all the sparse matrix methods which correspond to ufuncs.
  • Mid-term evaluation, July 29th -- deliverables are:
    • bool spec implemented
    • Have at least one sparse matrix method that a NumPy ufunc will dispatch to.
  • Wrap up: test suite for NumPy ufunc and sparse matrix interactions for sparse/dense and sparse/sparse combinations. (1 week)
Onward -- post August 27th

My school begins on August 29th, so I'll cut my work down to around 1/3 to 1/4 of what it was before. I will continue with a few features, bug fixes, etc. for the remaining weeks at a slower pace. Final evaluation, Sept 23rd -- deliverables are:

  • bool spec implemented, along with documentation and test suite.
    • Support for bool dtype.
    • Support for boolean operations.
  • NumPy interaction spec implemented.
    • Many NumPy ufuncs work with sparse matrices and sparse/dense combinations, making sparse/dense code easier to write.
    • A nicer interface and a more maintainable codebase with less repeated code.

When all this is done I will use my now vast knowledge of Scipy to be a productive member of the community :)

.toarray()

I wrote some tests to look a little further into trac ticket #1533. I wanted to know which sparse matrix types cannot .toarray() with a bool dtype. Surprisingly, the only type that passed was the Lists of Lists (lil) type. Why is this?

Lists of Lists

Looking around in the sparse package, every toarray method except for lil's basically does this.

def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    return self.tocoo(copy=False).toarray(order=order, out=out)

Where the Coordinate list (coo) matrix's toarray is

def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    B = self._process_toarray_args(order, out)
    fortran = int(B.flags.f_contiguous)
    if not fortran and not B.flags.c_contiguous:
        raise ValueError("Output array must be C or F contiguous")
    M,N = self.shape
    coo_todense(M, N, self.nnz, self.row, self.col, self.data,
                B.ravel('A'), fortran)
    return B

The coo toarray calls the coo_todense function, which just fills in a dense matrix with the data, but it doesn't support the bool dtype. This is a C++ function defined in coo.h.

But why didn't lil fail? Looking at its toarray:

 def toarray(self, order=None, out=None):
    """See the docstring for `spmatrix.toarray`."""
    d = self._process_toarray_args(order, out)
    for i, row in enumerate(self.rows):
        for pos, j in enumerate(row):
            d[i, j] = self.data[i][pos]
    return d

It is not using any of these C++ functions. Why not? Python is slow and C++ is fast, so is lil taking a performance hit?

lil's .toarray() benchmark

I wrote some code to benchmark lil's .toarray() performance compared with the other types. A typical result for a 3000 by 3000 matrix with around 5 random nonzero values per row is:

$ python lil_benchmark.py
lil
0.0481379032135
dok
0.113062858582
dia
0.530215024948
csr
0.045028924942
csc
0.0446619987488
bsr
0.0462918281555
coo
0.0453071594238

It's not much slower at all. But dia and dok are really slow; this is probably from converting them to the coo type before doing .toarray(). This is promising: maybe the toarray methods can be written in Python, instead of trying to tack bool support onto the existing C++ code, without much of a performance hit.
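For reference, a rough sketch of the kind of benchmark used above (not the actual lil_benchmark.py script; timings will differ from those quoted):

import time
import scipy.sparse as sp

# Random 3000x3000 matrix with about 5 nonzeros per row.
A = sp.rand(3000, 3000, density=5.0 / 3000, format='coo', random_state=0)

for fmt in ('lil', 'dok', 'dia', 'csr', 'csc', 'bsr', 'coo'):
    B = A.asformat(fmt)
    t0 = time.time()
    B.toarray()
    print(fmt, time.time() - t0)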

Building Numpy and Scipy on Ubuntu 12.04

This post should help you build Numpy and Scipy from source on Ubuntu 12.04.

If you already have a Numpy/Scipy installation from the Ubuntu repos, uninstall them.

$ sudo apt-get remove python-scipy python-numpy

Get the latest source from github.

$ git clone https://github.com/scipy/scipy.git

$ git clone https://github.com/numpy/numpy.git

Get all the tools needed to build NumPy and SciPy, except for Cython, which we will get next.

$ sudo apt-get install python-dev python-nose libblas-dev liblapack-dev build-essential gfortran gcc

The Cython in the Ubuntu repos is too old to build SciPy with, so we add a PPA with a more recent version and then update apt.

$ sudo apt-add-repository ppa:cython-dev/master-ppa
...
$ sudo apt-get update
...
$ sudo apt-get install cython

You need to build Numpy first.

$ cd ~/numpy
$ sudo python setup.py install
...

Then Scipy.

$ cd ~/scipy
$ sudo python setup.py install
...

Then try it out.

$ python

>>> import numpy
>>> import scipy

Google Summer of Code Proposal Preparation

Improving the sparse matrix package in Scipy

Among other things, I'm preparing a proposal to improve the handling of the bool dtype in SciPy's sparse matrix package. Once this type is handled consistently, improving interactions between spmatrix objects and other NumPy/SciPy types should be easier, so fixing things like trac ticket #1598 would come next.

So first I would need to write a specification for how bools are handled between spmatrix objects and other kinds of objects like ndarrays. But what is wrong now? Here is one thing.

Bool problems: Making sparse bool matrices

Not every sparse matrix format supports the bool dtype. Try instantiating any class which inherits from _cs_matrix with an ndarray of bool dtype, and it will upcast it to int8. See my comments on trac ticket #1533.

In [3]: A = np.array([[True, False],[False, True]], dtype=bool)

In [4]: B = sp.csr_matrix(A)

In [5]: B.data
Out[5]: array([1, 1], dtype=int8)

But we can get an spmatrix with a bool dtype if we pass the kwarg dtype=bool:

In [6]: C = sp.csr_matrix(A, dtype=bool)

In [7]: C.data
Out[7]: array([ True,  True], dtype=bool)

This seems inconsistent, but now we have other problems:

In [8]: C.toarray()

...

    171     """
--> 172     return _coo.coo_todense(*args)
    173 
    174 def coo_matvec(*args):


TypeError: Array of type 'byte' required.  Array of type 'bool' given

I think the type is upcast here because there is no support for bool in the coo_todense function. This is defined in coo.h, where the data type is the template parameter class T, which apparently cannot be bool.

So to add support for bool, this class T needs to have support for it added. Once that is done, the toarray() method should work with dtype=bool.

I don't yet understand how this SWIG wrapper works, and I'm not sure where this class T is coming from. Any comments would be helpful.

First

Testing testing...

I'm switching to this new static site generator, nikola, because:

  • python based
  • Markdown support
  • static
  • looks clean
  • small simple codebase
  • actively developed