Classify programming languages


In this developer code pattern, we use Jupyter Notebooks in IBM Watson™ Studio to build a model that predicts the programming language of a piece of code from its text. The model is then evaluated using Watson Natural Language Classifier.


With IBM Watson Natural Language Classifier, a data scientist can build a model that looks at text documents and classifies them based on the categories used to build the model. We can use this tool to look at the contents of GitHub and classify code based on the programming language used. With a Jupyter Notebook running on Watson Studio, the data can be cleaned and manipulated, and the Watson Developer Cloud SDK for Python provides APIs to create and use a model in Watson Natural Language Classifier.

When you have completed this code pattern, you will understand how to:

  • Build a labeled data set.
  • Use Watson Natural Language Classifier to create a predictive model.
  • Build a predictive model within a Jupyter Notebook.
  • Configure and use Watson APIs.
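As an illustration of the first objective, the sketch below builds a small labeled data set as (text, label) rows and serializes it to CSV. It is not the notebook's actual code: the `EXTENSION_LABELS` mapping, the helper names, the whitespace collapsing, and the `MAX_CHARS` limit of 1024 characters per example are all assumptions made for this example, so check the service documentation for the real input limits.

```python
import csv
import io
import os

# Hypothetical mapping from file extension to language label (an assumption).
EXTENSION_LABELS = {
    ".py": "python",
    ".js": "javascript",
    ".java": "java",
    ".go": "go",
}

# Assumed per-example text limit; verify against the service documentation.
MAX_CHARS = 1024


def label_for(path):
    """Return the language label for a file path, or None if unknown."""
    _, ext = os.path.splitext(path)
    return EXTENSION_LABELS.get(ext.lower())


def build_rows(files):
    """Turn (path, source_text) pairs into (text, label) training rows."""
    rows = []
    for path, text in files:
        label = label_for(path)
        if label is None:
            continue  # skip files whose language we cannot label
        # Collapse whitespace so each example fits on one CSV line, then truncate.
        snippet = " ".join(text.split())[:MAX_CHARS]
        if snippet:
            rows.append((snippet, label))
    return rows


def to_csv(rows):
    """Serialize rows as CSV with the text first and the label second."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


sample_files = [
    ("app/main.py", "def main():\n    print('hi')\n"),
    ("web/index.js", "function main() {\n  console.log('hi');\n}\n"),
    ("README.md", "# Title\n"),
]
rows = build_rows(sample_files)
print(to_csv(rows))
```

Labeling by file extension is only one possible heuristic; files with unknown extensions are dropped here rather than guessed.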


[Figure: programming language classification chart]

  1. Create an IBM Watson Studio Workspace.
  2. Using Watson Studio, create a Jupyter Notebook and a Watson Natural Language Classifier instance.
  3. Create a new data set from GitHub, or use the existing one in this repo.
  4. Interact with the Jupyter Notebook to build a Naive Bayes classifier and a Natural Language Classifier instance using the Watson Developer Cloud SDK.
  5. The Python code uses the NLC APIs to create and use a classifier.
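The Naive Bayes step in the flow above can be sketched in plain Python. This is a minimal multinomial Naive Bayes with add-one smoothing, written with the standard library only. It is not the notebook's implementation, and the tokenizer, the class name `NaiveBayes`, and the four training snippets are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/keywords, plus each punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    """Split source text into word and punctuation tokens (digits are dropped)."""
    return TOKEN_RE.findall(text)


class NaiveBayes:
    """Multinomial Naive Bayes over token counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> number of examples
        self.vocab = set()

    def fit(self, rows):
        for text, label in rows:
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny illustrative training set (an assumption, not the repo's data).
train = [
    ("def add(a, b): return a + b", "python"),
    ("import os\nprint(os.getcwd())", "python"),
    ("function add(a, b) { return a + b; }", "javascript"),
    ("const x = require('fs'); console.log(x);", "javascript"),
]
model = NaiveBayes().fit(train)
print(model.predict("def greet(name): print(name)"))
print(model.predict("function greet(name) { console.log(name); }"))
```

Tokens that never appeared in training are ignored at prediction time, which keeps unseen identifiers from dominating the score. A local baseline like this gives a reference point when comparing against the hosted classifier.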


Please see the README for detailed instructions.