Chinese Phonetic Similarity Estimator

Overview

The Chinese Phonetic Similarity Estimator provides a phonetic algorithm for indexing Chinese characters by sound. Given two Chinese words of the same length, the model computes the phonetic distance between them and returns a few candidate words that sound close to each given word. The code follows the phonetic principles of Mandarin Chinese, using the romanization defined in ISO 7098:2015. The model is based on the DimSim model.

Model Metadata

Domain | Application | Industry | Framework | Training Data | Input Data Format
NLP | Text Clustering/Phonetics | Social Media | Python | N/A | Chinese Text (UTF-8 encoded)

Licenses

Component | License | Link
Model GitHub repository | Apache 2.0 | LICENSE
Model Weights | N/A | N/A
Model Code (3rd party) | Apache 2.0 | LICENSE
Test assets | N/A | N/A

Options available for deploying this model

This model can be deployed using the following mechanisms:

  • Run locally as a library from PyPI: follow the instructions in the model README on GitHub

  • Deploy from Docker Hub:

    docker run -it -p 5000:5000 codait/max-chinese-phonetic-similarity-estimator
    
  • Deploy on Red Hat OpenShift:

    Follow the instructions for the OpenShift web console or the OpenShift Container Platform CLI in this tutorial and specify codait/max-chinese-phonetic-similarity-estimator as the image name.

  • Deploy on Kubernetes:

    kubectl apply -f https://raw.githubusercontent.com/IBM/MAX-Chinese-Phonetic-Similarity-Estimator/master/max-chinese-phonetic-similarity-estimator.yaml
    
  • Locally: follow the instructions in the model README on GitHub
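
Before sending requests to a container started with the docker run command above, it can help to confirm that port 5000 is accepting connections. The following is a minimal sketch using only the Python standard library. The helper name is_port_open and the default timeout are illustrative assumptions, not part of the model's tooling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError when the connection is refused
        # or does not complete before the timeout expires.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For a local deployment, calling is_port_open("localhost", 5000) should return True once the server has finished starting. A successful TCP connection only shows that something is listening on the port, not that the model itself is ready.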

Example Usage

You can test or use this model in any of the following ways.

Test the model using cURL

Once deployed, you can test the model from the command line. For example, if the model is running locally, run the following command in a terminal:

$ curl -X POST "http://localhost:5000/model/predict?first_word=%E5%A4%A7%E8%99%BE&second_word=%E5%A4%A7%E4%BE%A0&mode=simplified&theta=1" -H "accept: application/json"

You should see a JSON response like the one below:

{
  "status": "ok",
  "predictions": [
    {
      "distance": "0.0002380952380952381",
      "candidates": [
        [
          "打下",
          "大虾",
          "大侠"
        ],
        [
          "打下",
          "大虾",
          "大侠"
        ]
      ]
    }
  ]
}
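
The same request can be made from Python. The sketch below uses only the standard library and assumes the model is served at http://localhost:5000, as in the cURL example. The query parameters (first_word, second_word, mode, theta) and the response fields are taken from that example; the function names are hypothetical.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the model is served locally on port 5000, as in the cURL example.
PREDICT_URL = "http://localhost:5000/model/predict"


def build_predict_url(first_word: str, second_word: str,
                      mode: str = "simplified", theta: int = 1) -> str:
    """Build the request URL; urlencode percent-encodes the UTF-8 bytes."""
    query = urlencode({
        "first_word": first_word,
        "second_word": second_word,
        "mode": mode,
        "theta": theta,
    })
    return f"{PREDICT_URL}?{query}"


def parse_prediction(body: str) -> Tuple[float, List[List[str]]]:
    """Extract the distance (sent as a string) and the candidate lists."""
    payload = json.loads(body)
    if payload.get("status") != "ok":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    prediction = payload["predictions"][0]
    return float(prediction["distance"]), prediction["candidates"]


def predict(first_word: str, second_word: str) -> Tuple[float, List[List[str]]]:
    """POST to the endpoint and return (distance, candidates)."""
    request = Request(
        build_predict_url(first_word, second_word),
        method="POST",
        headers={"accept": "application/json"},
    )
    with urlopen(request) as response:
        return parse_prediction(response.read().decode("utf-8"))
```

With the server running, predict("大虾", "大侠") sends the same request as the cURL command. Note that the distance arrives as a JSON string, so the sketch converts it to a float before returning it.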

Test the model through Python

Open a Python shell from the terminal:

$ python

Then run the following commands in the Python shell to test the model:

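import dimsim

# Compare two words given as Chinese characters.
dist = dimsim.get_distance("大侠", "大虾")
print(dist)  # 0.0002380952380952381

dist = dimsim.get_distance("大侠", "大人")
print(dist)  # 25.001417183349876

# Compare the same words given as Pinyin syllables with tone numbers.
dist = dimsim.get_distance(['da4', 'xia2'], ['da4', 'xia1'], pinyin=True)
print(dist)  # 0.0002380952380952381

dist = dimsim.get_distance(['da4', 'xia2'], ['da4', 'ren2'], pinyin=True)
print(dist)  # 25.001417183349876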
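
A common use of these distances is to rank several words by how close they sound to a target word. The following is a minimal sketch, not part of the DimSim library: the helper rank_by_distance is hypothetical and takes the distance function as an argument, so it can be used with dimsim.get_distance or with any other callable that accepts two words and returns a number.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_distance(target: str, words: Sequence[str],
                     distance_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Return (word, distance) pairs sorted from closest to farthest."""
    # The model compares words of the same length, so skip the others.
    same_length = [w for w in words if len(w) == len(target)]
    scored = [(w, distance_fn(target, w)) for w in same_length]
    return sorted(scored, key=lambda pair: pair[1])
```

With DimSim installed, rank_by_distance("大侠", ["大人", "大虾"], dimsim.get_distance) would place 大虾 ahead of 大人, consistent with the distances shown above.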

Test the model in a serverless app

You can use this model in a serverless application by following the instructions in the "Leverage deep learning in IBM Cloud Functions" tutorial.