This post will give you a quick update on the status of the collaboration between IBM and the SETI Institute in Mountain View, CA. Last year we began an effort to increase the SETI Institute’s computational power by utilizing IBM’s Spark and Object Store services. We wrote about our initial plans around SETI and Spark back in September 2015.
Recently, on Jan 19-20, 2016, the SETI Institute in Mountain View hosted a two-day workshop to focus our efforts. The topics discussed centered on current and new analysis algorithms, incorporating SystemML machine learning, and building a real-time or near-real-time Spark-based microservice for preliminary data analysis and feedback to the Allen Telescope Array (ATA). In attendance were:
- Jill Tarter (SETI)
- Jon Richards (SETI)
- Gerald Harp (SETI)
- Jeff Scargle (NASA)
- Chris Henze (NASA)
- Ian Morrison (Swinburne University)
- Francois Luus (IBM)
- Ed Elze (IBM)
- Graham Mackintosh (IBM)
- Jim Smith (IBM)
- Nick Poore (IBM)
- Rhonda Edwards (IBM)
- Adam Cox (IBM)
- Berthold Reinwald (IBM)
- Faraz Makari Manshadi (IBM)
As a reminder, the data we are working with are the complex-amplitude measurements of the radio signals recorded by the ATA as it was pointed toward a particular location in the sky. The ATA data acquisition system observes and records data in chunks of time over a particular range of frequencies. Only data in which a signal greater than the average noise was observed are digitized and recorded to disk in a set of raw data files. Then, a first-pass analysis of the triggered data takes place. This preliminary examination extracts a handful of quantities and provides an estimate of the type of event. Generally, it categorizes the event as interesting or as some type of noise. These results, along with the metadata for the associated raw data files, are stored in the SignalDB database. There are about 20 million raw data files (data taken from July 2013 to present), and for each file there is a row in SignalDB.
Optimizing for Spark
Nick Poore and Francois Luus have been optimizing our use of the OpenStack Swift Object Storage and Spark service offerings at IBM (available through a Bluemix account). Initially, our Spark-based analysis accessed each raw data file individually, and each file is only around 70 kB. However, it turns out the optimal file size for analytics on Spark is in the range of 64 to 128 MB. Fortunately, the set of raw data files for a particular date and “activity” can be serialized into objects of approximately 10 to 40 MB. (An “activity” is SETI lingo for a group of data associated with an observed signal.) Though not quite in the optimal range, grouping these data has significantly decreased the time it takes to process them on our Spark cluster.
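To make the grouping idea concrete, here is a minimal Python sketch of bundling many small raw files into one object per date and activity. The post does not say which serialization format the team used, so the tar container, the record layout, and the function names (group_by_activity, pack_group, unpack_group) are all assumptions made for illustration.

```python
import io
import tarfile
from collections import defaultdict


def group_by_activity(records):
    """Group (date, activity_id, filename, payload) records by (date, activity_id).

    The record layout is hypothetical; it is not the SignalDB schema.
    """
    groups = defaultdict(list)
    for date, activity_id, filename, payload in records:
        groups[(date, activity_id)].append((filename, payload))
    return groups


def pack_group(members):
    """Serialize a list of (filename, payload) pairs into one in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for filename, payload in members:
            info = tarfile.TarInfo(name=filename)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack_group(blob):
    """Recover a {filename: payload} mapping from an archive made by pack_group."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
        }
```

Each packed blob could then be uploaded as a single object, so that a Spark task reads one larger object instead of hundreds of 70 kB files.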
Raw Data Feature Extraction
Jeff Scargle, from NASA, gave us an update on his spectrogram analysis. In addition to characterizing the spectrograms with standard deviation measures along the time and frequency axes, he’s adding measurements of higher moments and other metrics to characterize the signal. This set of measures for each spectrogram may eventually be used as a set of features for machine learning algorithms. We are still exploring which features to choose. Of course, the difficulty in the SETI analysis is the lack of known signal characteristics. Although we can make some educated guesses, we don’t know what E.T. may be transmitting. As such, the plan is to use unsupervised machine learning algorithms to classify the signals. We hope that, with more clearly classified data, we can home in on the most interesting signals that could be interpreted as E.T. communication. For example, might we find a signal that looks similar to one of our own spacecraft (click on Spacecraft Tracking), but emanating from a Kepler planet?
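As a rough illustration of moment-based features, the Python sketch below collapses a time-frequency power array onto each axis and computes the mean, standard deviation, skewness, and excess kurtosis of each profile. This is one plausible reading of "moments along the time and frequency axes", not a description of Jeff Scargle's actual analysis; the function names and the feature dictionary are hypothetical.

```python
import math


def profile_moments(profile):
    """Treat a non-negative 1-D profile as weights over its indices.

    Returns (mean, std, skewness, excess kurtosis) of the index distribution.
    """
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must contain positive power")
    weights = [p / total for p in profile]
    mean = sum(i * w for i, w in enumerate(weights))
    variance = sum(w * (i - mean) ** 2 for i, w in enumerate(weights))
    std = math.sqrt(variance)
    if std == 0:
        # All power sits in a single bin; higher moments are undefined, report 0.
        return mean, 0.0, 0.0, 0.0
    skew = sum(w * (i - mean) ** 3 for i, w in enumerate(weights)) / std ** 3
    kurt = sum(w * (i - mean) ** 4 for i, w in enumerate(weights)) / std ** 4 - 3.0
    return mean, std, skew, kurt


def spectrogram_features(power):
    """power[t][f] holds the power at time step t and frequency bin f."""
    time_profile = [sum(row) for row in power]        # sum over frequency
    freq_profile = [sum(col) for col in zip(*power)]  # sum over time
    features = {}
    for axis, profile in (("time", time_profile), ("freq", freq_profile)):
        mean, std, skew, kurt = profile_moments(profile)
        features[axis + "_mean"] = mean
        features[axis + "_std"] = std
        features[axis + "_skew"] = skew
        features[axis + "_kurt"] = kurt
    return features
```

A feature dictionary like this, computed per raw data file, is the kind of input a clustering algorithm could consume.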
Wide-band Signal Search
Ian Morrison, of Swinburne University, discussed other possible modes of E.T. communication. Instead of transmitting a signal within a narrow range of frequencies (narrow-band), a civilization could send communication signals across a wide range of frequencies. The reason intelligent beings may choose to do so is economic: it is far cheaper to transmit signals across a wide frequency range because they can be transmitted at significantly lower power. In addition, wide-band signals at the right frequencies would show very little dispersion as they travel across interstellar space. Even though these signals are very low in power, below the average amplitude of the noise, they may still be detected through the statistical properties they exhibit when a location in the sky is observed for a long enough time.
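The reason long observation helps can be shown with a generic averaging argument: the scatter of a mean over n independent noisy samples shrinks as sigma / sqrt(n), so a steady offset far below the noise level eventually stands out. The Python sketch below illustrates only that textbook effect under a Gaussian-noise assumption; it is not the wide-band detection method discussed at the workshop, and the function names and parameters are hypothetical.

```python
import math
import random


def samples_needed(signal, sigma, target_snr):
    """Number of independent samples to average so that
    signal / (sigma / sqrt(n)) reaches target_snr."""
    ratio = target_snr * sigma / signal
    return math.ceil(ratio * ratio)


def averaged_estimate(signal, sigma, n, seed=0):
    """Average n simulated samples of a constant signal buried in Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += rng.gauss(signal, sigma)
    return total / n
```

For example, a signal at one tenth of the noise amplitude is invisible in any single sample, but averaging 100,000 samples shrinks the noise on the mean to roughly 0.003 in the same units.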
Berthold Reinwald and his colleague Faraz Makari Manshadi, both from the IBM Almaden Research Lab in San Jose, gave us a great introduction to SystemML. SystemML is a declarative language for machine learning, built to abstract away the physical representation of data sources and automatically optimize machine learning algorithms. It is now an open-source Apache project and runs on both Hadoop and Spark.
Finally, we began discussing the use of the ATA in an attempt to eavesdrop on E.T. communications. The idea is to point the ATA at Kepler systems with multiple planets, especially those where the orbital plane of the planets is aligned such that radio transmissions between them would be directly in line with Earth (and the ATA).
Since the meeting in January, we’ve been formulating a plan to bring these data to the general public so that you can analyze data from the ATA with your own IBM Spark service. Stay tuned! (Get it?)