Language modeling is the task of assigning probabilities to sequences of words, and is one of the most important tasks in natural language processing. Given the context of one word or a sequence of words in the language that the language model was trained on, the model should provide the next most probable words or sequence of words that follows from the given sequence of words in the sentence.
Recurrent neural networks (RNN) are a class of neural networks that work well for modeling sequence data such as time series or natural language. Basically, an RNN uses a
for loop and performs multiple iterations over the timesteps of a sequence while maintaining an internal state that encodes information about the timesteps it has seen so far. RNNs can easily be constructed by using the Keras RNN API available within TensorFlow, an end-to-end open source machine learning platform that makes it easier to build and deploy machine learning models.
IBM Watson® Studio is a data science platform that provides all of the tools necessary to develop a data-centric solution on the cloud. It uses Apache Spark clusters to provide the computational power that is needed to develop complex machine learning models. You can choose to create assets in Python, Scala, and R, and use open source frameworks (such as TensorFlow) that are already installed on Watson Studio.
In this tutorial, learn how to perform language modeling on the Penn Treebank data set by creating an RNN using the long short-term memory (LSTM) unit and deploying it on IBM Watson Studio on IBM Cloud Pak® for Data as a Service.
In the tutorial, you import a Jupyter Notebook that is written in Python into IBM Watson Studio on IBM Cloud Pak for Data as a Service, then run through the Notebook. The Notebook creates an RNN model based on the LSTM unit to train and benchmark on the Penn Treebank data set. After running the Notebook, you should understand how TensorFlow builds and executes an RNN model for language modeling. You’ll learn how to:
- Run a Jupyter Notebook using Watson Studio on IBM Cloud Pak for Data as a Service
- Build an RNN model using the LTSM unit for language modeling
- Train the model and evaluate the model by performing validation and testing
The following prerequisites are required to follow the tutorial:
It should take you approximately 4 hours complete the tutorial. Most of this time will be spent training and evaluating the LSTM model. You can refer to reducing the Notebook execution time for methods to reduce the time required for execution.
- Set up IBM Cloud Pak for Data as a Service.
- Create a new project and import the Notebook.
- Read through the Notebook.
- Run the Notebook.
Set up IBM Cloud Pak for Data as a Service
Open a browser, and log in to IBM Cloud with your IBM Cloud credentials.
Watson Studioin the search bar at the top. If you already have an instance of Watson Studio, it should be visible. If so, click it. If not, click Watson Studio under Catalog Results to create a new service instance.
Select the type of plan to create if you are creating a new service instance. A Lite (free) plan should suffice for this tutorial. Click Create.
Click Get Started on the landing page for the service instance.
This takes you to the landing page for IBM Cloud Pak for Data as a Service.
Click your avatar in the upper right, then click Profile and settings under your name.
Switch to the Services tab. You should see the Watson Studio service instance listed under your Your Cloud Pak for Data services.
You can also associate other services such as Watson Knowledge Catalog and Watson Machine Learning with your IBM Cloud Pak for Data as a Service account. These are listed under Try our available services.
In the example shown here, a Watson Knowledge Catalog service instance exists in the IBM Cloud account, so it’s automatically associated with the IBM Cloud Pak for Data as a Service account. To add any other service (Watson Machine Learning in this example), click Add within the tile for the service under Try our available services.
Select the type of plan to create (a Lite plan should suffice), and click Create.
After the service instance is created, you are returned to the IBM Cloud Pak for Data as a Service instance. You should see that the service is now associated with Your IBM Cloud Pak for Data as a Service account.
Create a new project and import the Notebook
Navigate to the menu (☰) on the left, and choose View all projects. After the screen loads, click New + or New project + to create a new project.
Select Create an empty project.
Provide a name for the project. You must associate an IBM Cloud Object Storage instance with your project. If you already have an IBM Cloud Object Storage service instance in your IBM Cloud account, it should automatically be populated here. Otherwise, click Add.
Select the type of plan to create (a Lite plan should suffice for this tutorial), and click Create.
Click Refresh on the project creation page.
Click Create after you see the IBM Cloud Object Storage instance that you created displayed under Storage.
After the project is created, you can add the Notebook to the project. Click Add to project +, and select Notebook.
Switch to the From URL tab. Provide the name of the Notebook as
RecurrentNeuralNetworkUsingTensorFlowand the Notebook URL as
Under the Select runtime drop-down menu, select Default Python 3.7 S (4 vCPU 16 GB RAM). Click Create.
After the Jupyter Notebook is loaded and the kernel is ready, you can start running the cells in the Notebook.
Important: Make sure that you stop the kernel of your Notebooks when you are done to conserve memory resources.
Note: The Jupyter Notebook included in the project has been cleared of output. If you would like to see the Notebook that has already been completed with output, refer to the example Notebook.
Read through the Notebook
Spend some time looking through the sections of the Notebook to get an overview. A Notebook is composed of text (markdown or heading) cells and code cells. The markdown cells provide comments on what the code is designed to do.
You run cells individually by highlighting each cell, then either clicking Run at the top of the Notebook or using the keyboard shortcut to run the cell (Shift + Enter, but this can vary based on the platform). While the cell is running, an asterisk (
[*]) appears to the left of the cell. When that cell has finished running, a sequential number appears (for example,
Note: Some of the comments in the Notebook are directions for you to modify specific sections of the code. Perform any changes as indicated before running the cell.
The Notebook is divided into multiple sections.
- Section 1 gives an introduction to language modeling.
- Section 2 provides information about the Penn Treebank data set that is being used to train and validate the model being built in this tutorial.
- Section 3 gives an introduction to word embeddings.
- Section 4 contains the code to build the LSTM model for language modeling.
- Section 5 contains the code to train, validate, and test the model.
Run the Notebook
Run the code cells in the Notebook starting with the ones in section 4. The first few cells bring in the required modules such as TensorFlow, Numpy, reader, and the data set.
Note: The second code cell checks for the version of TensorFlow. The Notebook works only with TensorFlow version 2.2.0-rc0. Therefore, if an error is thrown here, you need to ensure that you have installed TensorFlow version 2.2.0-rc0 in the first code cell.
Note: If you have installed TensorFlow version 2.2.0-rc0 and still get the error, your changes are not being picked up and you must restart the kernel by clicking Kernel->Restart and Clear Output. Wait until all of the outputs disappear, and then your changes should be picked up.
The training, validation, and testing of the model does not happen until the last code cell. Due to the number of epochs and the sheer size of the data set, running this cell can take approximately three hours.
Reducing the Notebook execution time
There are a number of ways to reduce the time required for running the Notebook, that is, the time needed to train and validate the model. While all of these methods affect the performance of the model, some cause a drastic change in performance.
Fewer training epochs: The number of epochs set for training the model in the Notebook is 15. You can reduce the number of epochs by changing the value of
A lower number of epochs might not bring down the model perplexity, resulting in poor performance of the model. On the other hand, a higher number of epochs can cause overfitting, which also results in poor model performance.
Smaller data set: Reducing the size of the data set is another method to reduce the amount of time required for training. However, you should note that this can negatively impact model performance. The model is better trained when it is trained on a large amount of varied data.
Use GPUs to increase processing power: You can use the GPU environments available within Watson Studio to accelerate model training. With GPU environments, you can reduce the training time needed for compute-intensive machine learning models that you create in a Notebook. With more compute power, you can run more training iterations while fine-tuning your machine learning models.
Note: GPU environments are only available with paid plans and not with the Lite (Free) Watson Studio plan. See the Watson Studio pricing plans for more information.
EarlyStopping: You can also use EarlyStopping to improve training time. EarlyStopping stops training when a monitored metric has stopped improving. So, for example, you can specify that the training needs to stop if there is no improvement in perplexity for three consecutive epochs.
Refer the Notebook used in the Demand forecasting using deep learning code pattern for an example of EarlyStopping.
In this tutorial, you learned about language modeling. You learned how to run a Jupyter Notebook using Watson Studio on IBM Cloud Pak for Data as a Service and how to create an RNN model using TensorFlow based on the LSTM unit. Finally, you used this model to train and benchmark on the Penn Treebank data set.