Integrating Python For High Throughput Science

Eclipse has always been a platform that unites different technologies and lets developers use the tools of their trade, such as language editors, debuggers, profilers and version control, all in one environment. Until relatively recently, scientists and researchers were at the mercy of fragmented suites of tools or closed-source workbenches. Many still are. But now, with the increased adoption of Eclipse in the science sector, they too can look forward to the benefits of tool integration and interoperability, as well as the resulting gains in productivity.

When it comes to tools of the trade for scientists, Python is high on the list. This is in large part thanks to its fast, powerful libraries such as NumPy and SciPy. The dynamic, easy-to-use nature of the language also lends itself well to the exploration that experimental work requires. The accessibility of learning resources further contributes to Python being the language of choice for scientists embracing programming.

So how do you go about providing a tight integration of Python in Eclipse, one that allows a user to move data around seamlessly and use it both interactively and within views? This was the challenge faced by Diamond Light Source as they developed their science workbench, DAWN. Diamond is a synchrotron in the UK countryside which regularly hosts scientists and researchers from diverse fields doing groundbreaking research on all manner of things: viruses, retinas, dinosaur bones and chocolate, to name a few. The users collect huge amounts of data for their samples and then perform extensive analysis to arrive at their conclusions. Analysis of the data from tomography, crystallography and other experiments is done in DAWN, the Eclipse-based data analysis workbench developed by Diamond.

The video below shows some of the power of the integrated environment. From the interactive console, users can plot arrays in views and access regions of interest that were created graphically. Knowledge of the types being used further enhances the experience: for example, dragging a file node to the console automatically pastes the code needed to load that node from file. Achieving this level of integration between Python and DAWN required two key pieces of enabling technology: PyDev and AnalysisRPC.

PyDev

Using Eclipse gives access to the huge Eclipse ecosystem, which includes PyDev. PyDev is a feature-rich, popular Python IDE. It is made up of a set of mature Eclipse plug-ins which were straightforward to customise and integrate into DAWN. This immediately gave DAWN features such as context-sensitive code completion, an IPython interactive console and rich debugging. That last item is particularly powerful: for instance, a user can debug their script and see their array values in the variable view. Arrays are hard to make sense of as a list of values, so in DAWN you can right-click on the value and plot the array to check visually whether it looks right.

Figure: Visual debugging using PyDev and DAWN

Adopting PyDev also provided an easy way to use Jython, which gives scripts direct access to Java objects. However, Jython cannot use C-based libraries such as NumPy. Users still needed to perform plotting operations in the workbench views from the interactive console while making use of NumPy functionality. Solving this problem gave rise to an in-house solution known as AnalysisRPC.

AnalysisRPC

AnalysisRPC is a Python-to-Java bridge that provides a generic way to call Python functions from Java as well as Java functions from Python. Its real value-add comes from its deep understanding of complex types such as data sets (ndarrays) or regions of interest. It provides a consistent API to call functions regardless of whether they are local or remote calls by using a mechanism referred to as ‘flattening’. It was also important to provide a robust way of handling exceptions so information was not lost when something went wrong between the two layers of Java and Python.

Figure: Illustration of a plot operation calling Java from Python using AnalysisRPC
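
To make the idea of flattening more concrete, here is a minimal sketch in plain Python. It is not AnalysisRPC's actual code or wire format: the `__type__` marker, the `RectangularROI` class, the `RemoteError` class and the `flatten`/`unflatten` functions are all hypothetical names chosen for illustration. The sketch only shows the general technique of reducing rich objects, including exceptions, to dictionaries, lists and primitives that a simple transport can carry, and then rebuilding them on the other side.

```python
from dataclasses import asdict, dataclass

# Hypothetical marker key used to tag flattened objects (an assumption,
# not the key AnalysisRPC really uses).
TYPE_KEY = "__type__"


@dataclass
class RectangularROI:
    """Illustrative stand-in for a region of interest."""
    x: float
    y: float
    width: float
    height: float


class RemoteError(Exception):
    """Raised locally to represent an exception that happened remotely."""


def flatten(obj):
    """Reduce an object to dicts, lists and primitives."""
    if isinstance(obj, RectangularROI):
        return {TYPE_KEY: "RectangularROI", **asdict(obj)}
    if isinstance(obj, Exception):
        # Keep the exception type and message so nothing is lost in transit.
        return {TYPE_KEY: "Exception",
                "kind": type(obj).__name__,
                "message": str(obj)}
    if isinstance(obj, (list, tuple)):
        return [flatten(item) for item in obj]
    if isinstance(obj, dict):
        return {key: flatten(value) for key, value in obj.items()}
    return obj  # primitives pass through unchanged


def unflatten(data):
    """Rebuild rich objects from their flattened form."""
    if isinstance(data, dict):
        kind = data.get(TYPE_KEY)
        if kind == "RectangularROI":
            fields = {k: v for k, v in data.items() if k != TYPE_KEY}
            return RectangularROI(**fields)
        if kind == "Exception":
            return RemoteError(f"{data['kind']}: {data['message']}")
        return {key: unflatten(value) for key, value in data.items()}
    if isinstance(data, list):
        return [unflatten(item) for item in data]
    return data
```

Because both directions go through the same pair of functions, the calling code does not need to know whether the value came from a local call or crossed a process boundary.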

The initial requirements, such as scripting plotting operations, only needed CPython to call Java code. However, the implementation, which uses XML-RPC as the transport mechanism, lent itself equally well to calling Python from Java. It was not long before useful cases for Java calling Python emerged, for example in workflows. It is now fairly seamless to enhance a workflow with a custom algorithm written in Python.

Figure: Illustration of calling an algorithm in Python from Java using AnalysisRPC
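
For readers unfamiliar with XML-RPC, the sketch below shows the kind of round trip involved, using only Python's standard library. It is an illustration under stated assumptions, not AnalysisRPC's API: in DAWN the caller would be Java, whereas here both client and server are Python so the example is self-contained, and the `scale` function, `start_server` and `call_scale` are hypothetical names.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def scale(values, factor):
    """Hypothetical 'custom algorithm': multiply every value by a factor."""
    return [value * factor for value in values]


def start_server():
    """Expose scale() over XML-RPC on a free local port, in a background thread."""
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(scale, "scale")
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


def call_scale(port, values, factor):
    """Play the role of the remote caller (Java, in DAWN's case)."""
    with ServerProxy(f"http://127.0.0.1:{port}") as proxy:
        return proxy.scale(values, factor)


if __name__ == "__main__":
    server = start_server()
    try:
        port = server.server_address[1]
        print(call_scale(port, [1.0, 2.0, 3.0], 2.0))
    finally:
        server.shutdown()
        server.server_close()
```

XML-RPC itself only carries simple values such as numbers, strings, lists and structs, which is why a bridge built on it needs a flattening step for richer types.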

By combining the existing technology of PyDev with the newly developed AnalysisRPC, DAWN provides an ideal environment for facing the challenges of high throughput science. Since its initial implementation, the technology has continued to evolve and to use other pieces of technology, such as Py4J, where they fit best. AnalysisRPC itself is evolving and is finding applications in non-science areas where data needs to be exchanged between Java and Python. To this end, work is being planned to improve the framework for reuse with custom data types in custom environments, so that other areas can enjoy the same benefits of integration. This is the true spirit of the Eclipse ecosystem.

To find out more or keep abreast of the latest developments, sign up to the Eclipse Science mailing list.
