If you were, in mad scientist fashion, to create a set of experiences to build a person, someone deeply knowledgeable in the commercial exigencies of technology start-ups AND so expert in machine learning technology that they can define its future, you would be creating Nick Pentreath. Nick co-founded Graphflow, a big data and machine learning company focused on recommendations and customer intelligence. He worked as an analyst at Goldman Sachs—the finance world's equivalent of Navy SEALs training. He was a research scientist at Cognitive Match in London, and he led the Data Science and Analytics team at Mxit, Africa's largest social network.
In his spare time (I imagine him penning this at the Oyster Bar in Heathrow's Terminal 5), Nick authored Machine Learning with Spark, and made such substantial contributions to Apache Spark that the Apache Foundation elected him a Committer. Now in his first week as STC Principal Engineer, Nick tolerates my questions about bush babies and surfing, and tells us what he learned from founding companies, and his plans for Spark ML.
Where did you grow up in South Africa, and where did you go to school? Did you grow up playing in the veldt and hiding bush babies in your school uniform? What do you do now for fun? Do you surf?
I was born in Johannesburg, which is South Africa's largest city. I went to school in Johannesburg and Pretoria, and attended university in Cape Town. I spent eight years living in London, where I completed a Master's degree focused on machine learning at University College London. I returned to my home country a few years ago.
Growing up in “Joburg” was not too different from growing up in the suburbs of any medium-sized city. So no bush babies in my uniform, or other wild animals (though my dog often used to jump the walls of our house and follow me to school, to be found at break times running around on the sports fields). However, I did visit the “bush” (or veldt) every year for family vacations, to look at the lions, elephants, rhinos and other African wildlife. I hope there will still be some rhinos left in South Africa to show my daughter when she is a bit older (we have major problems with poaching).
I don’t surf much (I should do more), but Cape Town is an amazing place for trail running, mountain biking and hiking, which is what I try to do in my spare time.
Do you find the Apache Spark community is international — that it doesn’t matter where you live, you can work from the woods of Finland like Linus Torvalds (actually I think he’s in the woods of Beaverton, OR)? Is there a thriving open source community in South Africa, and is the conversation there different? How important are meetups, companies that support open source innovation, and universities physically located where you are and generating an ongoing intellectual exchange between academia, enterprise, and community to the kind of work you do—contributing to Apache Spark at a very high level?
The Spark community is definitely very international. There are contributors, not to mention Spark users, from all over the world, all collaborating. For example, some of the largest-scale Spark users are outside of the U.S., in China. In many ways, open source software has been among the earliest examples of “distributed teams”, a model that is now becoming more and more common in software development. I believe the open source model has proved how effectively diverse, distributed teams of developers can operate, despite the challenges involved.
There is a strong open-source community in South Africa, though it is smaller than in most other places. Meetups and corporate sponsorship are as important here as anywhere else. South Africa could benefit from more corporate sponsorship of open source software projects. I'd also love to see more university collaboration – one only has to look at Spark's origins to see the benefits of this. I'd love to encourage South African students and researchers to work on projects like Spark. This is the great thing about open source and cloud technologies – a student can sit in South Africa (or anywhere else) and contribute to cutting-edge big data technology research using computing resources located in the cloud. This would have been prohibitively expensive just a few years ago.
What did you learn from founding a company that you couldn’t have learned at, say, Google or IBM, or in academia? Does any of that knowledge come to bear in your work with Apache Spark? Does it inform the way you guide the community? How do you think it will affect the way you operate at the Spark Technology Center – which is a hybrid beast, backed by IBM but with a mandate to operate like a startup, in the service of community? Would you ever do a startup again?
Starting and trying to grow a company was one of the most stressful, yet exhilarating, experiences of my life. In a startup, you have to build a lot from scratch, quickly, virtually single-handedly. This leads, by necessity, to a fast-moving, “just build it” attitude. There is no point in trying to design the perfect product upfront, since the end product ends up being very different from what you initially thought you needed to make. The only real way to figure out what the end product actually is, is to build and release those first, imperfect, cobbled-together versions, and get feedback from real customers.
This type of approach fit well with the early days of Spark too. As the project has matured, it's now necessary to be more careful and rigorous about what goes into the core project. However, I think the attitude of a startup can absolutely be applied to the work that the STC does, in particular around demos, example applications and cutting-edge proof-of-concept projects built on Spark. Building these types of projects often leads to ideas for new products, or enhancements to Spark or the wider ecosystem, and allows these ideas to be iteratively refined until they're of really high quality. This approach is a big part of what attracted me to the STC and what I hope to do a lot of here.
As for doing another startup, never say never, but it’s not on the radar screen any time soon!
What do you think is the most interesting challenge to Spark as enterprises begin to adopt the technology, in greater numbers? What’s the most interesting use case you’ve seen, or would like to see – and why would you like to see it, because of how it would change the world in some way or because it would be an interesting technical challenge?
I think a major challenge for the Spark community is, and will continue to be, keeping up with the pace of development of the project and adapting to the diverse new use cases coming out of production deployments at enterprise scale and complexity. This is an area where the STC can bring significant resources to bear to help solve these challenges.
One of the most interesting uses of Spark for me is the neuroscience work of Jeremy Freeman's lab. He always presented the prettiest demos at Spark Summits! It would be fantastic to see Spark help us understand the inner workings of the human brain in the future.
What are you most excited about working on for Spark 2.0?
In the case of ML specifically, I will be excited to see the new ML pipelines API reach feature parity with MLlib, and the PySpark and R APIs move closer to parity with Scala. I also believe that being able to use models trained on Spark within other systems is critical to the long-term growth and success of machine learning with Spark, and so model persistence and export (via Spark native formats as well as PMML) is a key area of focus. These may not necessarily be “cool” things to work on, but they are vital to ensure widespread Spark adoption, especially among enterprise customers.
I'm also interested in driving “research-oriented” projects focusing on large-scale machine learning on Spark (for example, cutting-edge algorithms, deep learning and parameter server approaches). I hope these kinds of projects can help showcase the talents and skills of the growing STC team.
Any plans to mentor new developers from outside IBM, to get more people committing to Spark itself? There’s an opportunity for people with experience in different industries and verticals to contribute to the core of Spark.
Absolutely! The strength of any open source community is in its diversity. Spark started life as a university research project, and there is still that sense of “cutting edge” around Spark. While the project has matured significantly, I believe there is still a lot of scope to bring in expertise from other industries, to help keep Spark at the forefront of open source big data technologies, while continuing to grow adoption.
Will we continue to grow the pipeline API? What sorts of things are you really excited to bring over from scikit-learn into Spark? What kinds of things make sense in a distributed system vs. local, and what’s your thinking around pipelines?
It seems clear to me that the pipeline (or DataFrame) API will continue to be the main focus of Spark’s MLlib library. If anything, the API is likely to become more like that of scikit-learn in terms of semantics. Personally I think this is a good thing as I really like scikit-learn’s API.
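For readers unfamiliar with what “scikit-learn's semantics” means here, the shared idea is the Estimator/Transformer pattern: an estimator's fit() learns parameters and yields a model, a transformer's transform() applies them, and a pipeline chains the stages. The following is a toy, dependency-free sketch of that pattern; every class and method name is invented for illustration and is not the actual Spark ML or scikit-learn API.

```python
# Toy sketch of the Estimator/Transformer pipeline pattern shared by
# scikit-learn and Spark ML. All names invented for illustration.

class MeanCenterer:
    """Estimator: fit() learns state and returns a model (a Transformer)."""
    def fit(self, data):
        mean = sum(data) / len(data)
        return MeanCentererModel(mean)

class MeanCentererModel:
    """Transformer: transform() applies the learned state to new data."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class Pipeline:
    """Fits each stage on the output of the previous stage's model."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        models = []
        for stage in self.stages:
            model = stage.fit(data)
            data = model.transform(data)
            models.append(model)
        return PipelineModel(models)

class PipelineModel:
    """The fitted pipeline: transform() runs every fitted stage in order."""
    def __init__(self, models):
        self.models = models
    def transform(self, data):
        for model in self.models:
            data = model.transform(data)
        return data

model = Pipeline([MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # subtracts the training mean of 2.0
```

The key design point, in both libraries, is that fitting returns a new immutable model object rather than mutating the estimator, which is what makes whole pipelines easy to fit, reuse, and persist.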
In terms of local vs. distributed, the missing piece in Spark ML currently is being able to easily push pipelines and models to production. There is work underway to separate the linear algebra dependencies from the core, to move towards allowing models trained using Spark to be deployed in production environments without bringing in the entirety of Spark's dependencies. This will also need careful thought about how to integrate the “local” versions of pipelines (used in live serving environments) with the “distributed” versions (used to train them).
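The local-serving problem described above can be pictured with a toy example: export a trained model's parameters in a framework-neutral format, then score with a tiny, dependency-free local function. The JSON layout and function names below are invented for illustration; Spark's real answers to this are its native model persistence format and PMML export.

```python
# Toy sketch of exporting trained model parameters so a lightweight
# "local" scorer can serve predictions without the training system's
# dependencies. The JSON layout here is invented for illustration.
import json

# Pretend these linear-model weights came from a distributed training job.
trained = {"weights": [0.5, -1.2], "intercept": 0.3}

# "Export": persist the parameters in a framework-neutral format.
exported = json.dumps(trained)

# "Local serving": a scorer that needs only the exported blob, no Spark.
def predict(exported_model, features):
    model = json.loads(exported_model)
    score = model["intercept"]
    for w, x in zip(model["weights"], features):
        score += w * x
    return score

print(predict(exported, [1.0, 1.0]))  # 0.3 + 0.5 - 1.2, i.e. about -0.4
```

The design tension the answer points at is exactly this split: the training side needs Spark's distributed machinery, while the serving side wants a minimal artifact like the JSON blob above.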
Spark ML is growing in this pipeline direction. Do you have things that are going to be difficult to fit into this pipeline? Text algorithms, for example, can be difficult. Any thoughts on where they might fit?
Surprisingly many problems can be represented in the pipeline / DataFrame view of the world (even graphs, such as in the GraphFrame library).
In fact, while the user interface for Spark ML is based on DataFrames (soon to be Datasets), the underlying algorithms usually still use RDDs. So it will be interesting to see where performance gains or code simplification can be achieved by using Datasets instead of RDDs.
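The “even graphs” remark refers to representing a graph relationally, as the GraphFrame library does: one table of vertices and one table of edges, so that graph queries become ordinary DataFrame operations. A toy sketch, with plain Python lists of dicts standing in for DataFrames:

```python
# Toy illustration of the GraphFrame idea: a graph as two tables
# (plain lists of dicts stand in for DataFrames here).
vertices = [
    {"id": "a", "name": "Alice"},
    {"id": "b", "name": "Bob"},
    {"id": "c", "name": "Carol"},
]
edges = [
    {"src": "a", "dst": "b"},
    {"src": "b", "dst": "c"},
]

# A graph query becomes a relational one: out-degree per vertex
# is just a group-by/count over the edge table's "src" column.
out_degree = {}
for e in edges:
    out_degree[e["src"]] = out_degree.get(e["src"], 0) + 1

print(out_degree)  # {'a': 1, 'b': 1}
```

Once a graph is two tables, the same DataFrame machinery (joins, group-bys, and eventually Dataset optimizations) applies to graph workloads too, which is the point being made about the breadth of the DataFrame view.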
Do you have any thoughts on R integration and how it’s going to develop?
The SparkR API is very important for growing adoption of Spark, given the large R user base within the data science community. My background is mostly in Scala and Python, so I am not an expert on R. My current sense is that the focus for SparkR will be to ease the transition of R users coming to Spark, so perhaps we may see the API start to evolve differently from the Python or Scala APIs.
What do you think the greatest opportunity is for the Spark Technology Center to contribute to the open source community?
What attracted me to the STC is the opportunity to work in a team that combines the significant resources of IBM, a deep understanding of the enterprise user base, and an open, research-oriented, “startup” mindset. To remain relevant and grow adoption, while still keeping the “cutting edge” nature that characterizes the Spark community, will require a balance of all these elements. IBM clearly has a long history of success in this balancing act, and that is an important benefit that the STC can share with the community.
More? Follow Nick Pentreath on Twitter at @MLnick