As any data scientist knows, Python notebooks are a powerful tool for fast and flexible data analysis. But the learning curve is steep, and it's easy to get blank-page syndrome when you're starting from scratch. Thankfully, notebooks are easy to save and share. However, even for seasoned data scientists and developers, modifying an existing notebook can be daunting.
Data science notebooks were first popularized in academia, and there are some formalities to work through before you can get to your analysis. For example, in a Python interactive notebook, a mundane task like creating a simple chart or saving data into a persistence repository requires mastery of complex code like this matplotlib snippet:
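As an illustration, here's a minimal sketch of that kind of boilerplate, a simple bar chart saved to disk. The data and labels are made up for the example:

```python
# Representative matplotlib boilerplate for one simple bar chart
# (airport names and counts below are made-up example data)
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

airports = ["BOS", "LAS", "DEN"]
departures = [120, 95, 143]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(range(len(airports)), departures, color="steelblue")
ax.set_xticks(range(len(airports)))
ax.set_xticklabels(airports)
ax.set_xlabel("Airport")
ax.set_ylabel("Departures")
ax.set_title("Departures by airport")
fig.tight_layout()
fig.savefig("departures.png")  # persisting the result is yet more ceremony
```

That's a lot of ceremony for one bar chart, and every new chart type means learning another slice of the API.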
Once you do create a notebook that provides great data insights, it's hard to share with business users, who don’t want to slog through all that dry, hard-to-read code, much less tweak it and collaborate.
PixieDust to the rescue. To improve the notebook experience and ease collaboration, I created an open source Python helper library that works as an add-on to Jupyter notebooks. If you can’t wait any longer, go ahead and check out the code: https://github.com/ibm-cds-labs/pixiedust.
Friendlier Data Science Notebooks
When I watched data scientists and developers work with Python notebooks, I thought it shouldn't be so difficult. PixieDust fills feature gaps that made notebooks too challenging for certain users and scenarios.
PixieDust extends the usability of notebooks with the following features:
- packageManager lets you install Spark packages inside a Python notebook. This is something you can't do today on hosted Jupyter notebooks, which prevents developers from using the large catalog of Spark package add-ons.
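In a notebook cell, installing a package looks roughly like this. This is a sketch: the package coordinates are just an example, and it assumes a notebook connected to a Spark kernel:

```python
import pixiedust

# Install a Spark package by its Maven coordinates (example coordinates only),
# then restart the kernel so the package lands on the Spark classpath.
pixiedust.installPackage("graphframes:graphframes:0.1.0-spark1.6")
```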
- Visualizations. A single API called display() lets you visualize your Spark object in different ways: tables, charts, maps, and more. It's much easier than matplotlib (but you can still use matplotlib if you want). This module is designed to be extensible, providing an API that lets anyone easily contribute a new visualization plugin.
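In a notebook, the call is a one-liner. A sketch, where df stands for any Spark DataFrame you've already created:

```python
import pixiedust  # importing pixiedust makes display() available in the notebook

# df is assumed to be an existing Spark DataFrame; this one call opens an
# interactive view where you can switch between table, chart, and map renderings.
display(df)
```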
This sample visualization plugin uses d3 to show the different flight routes for each airport:
- Export. Share and save your data. Download it locally as .csv, .html, or .json, or save it to a variety of back-end data sources, like Cloudant, dashDB, and GraphDB.
- Scala Bridge. Use Scala directly in your Python notebook. Variables are automatically transferred from Python to Scala and vice versa.
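A sketch of the idea, assuming a notebook where PixieDust's Scala cell magic is available:

```python
# In one Python cell, define some variables:
year = 2016
message = "Hello from Python"

# In the next cell, the %%scala cell magic runs Scala, and the Python
# variables above are transferred automatically, e.g.:
#
#   %%scala
#   println(message)
#   println(year + 1)
```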
- Extensibility. Create your own visualizations using the PixieDust APIs. If you know HTML and CSS, you can write and deliver amazing graphics without requiring notebook users to type a single line of code.
- Apps. Allow nonprogrammers to actively use notebooks. Transform a hard-to-read notebook into a polished graphic app for business users. Check out these preliminary sample apps:
- An app can feature embedded forms and responses, like flightpredict, which lets users enter flight details to see the likelihood of landing on time.
- Or present a sophisticated workflow, like our Twitter demo, which delivers a real-time feed of tweets, trending hashtags, and aggregated sentiment charts powered by the Watson Tone Analyzer.
See for yourself. You can play with PixieDust right now online via IBM's Data Science Experience. To get a look at the features you just read about, follow these steps:
- Visit the IBM Data Science Experience and log in with your Bluemix account credentials, or sign up.
- Create or select your Spark instance. Data Science Experience may generate a Spark instance for you automatically; if not, you'll be prompted to instantiate your own. You'll need it to run your Python code.
- Create a new notebook. On the upper left of the screen, click the hamburger menu to reveal the left menu. Then click New > Notebook. Click From URL, enter a name, and in the Notebook URL field, enter
- Your new notebook opens. Run each cell in order to see a few PixieDust features.
If you get an error: PixieDust is preinstalled on Data Science Experience, but if cell 2 fails, insert a cell at the top of the notebook, then enter and run the following code:
!pip install --user --no-deps --upgrade pixiedust
Then restart the kernel and rerun the cells. If you hit other errors, restarting the kernel and trying again is always a good first step.
PixieDust is an open source project. Join the conversation and contribute. You'll find lots of guidance in our repo's wiki with more to come. Write your own app or visualization plugin. Pull requests welcome! Visit the PixieDust GitHub repo.