This blog post is re-posted from Medium.

The motivation behind this particular project comes from playing one of my favorite Android games: Rusted Warfare – one of the few real-time strategy games on the mobile marketplace, similar to Age of Empires and Command and Conquer. It’s a lot of fun, but its multiplayer games require effective communication for strategizing and teamwork. That tends to be difficult simply because most of the players speak different languages, as we can see in the gameplay screenshot below.

Our proposed solution is to let each player define a preferred language, then send and receive all messages in that language. In the scenario shown in the gameplay screenshot, I’d see each of the incoming messages in English, Gabriel would see his in Spanish, and so on.

This can be accomplished by re-implementing the game chat clients to send and receive messages via an MQTT message broker. As messages are received by the broker, they’re transcribed/translated by a series of serverless functions, and the results are broadcast to the subscribed clients. The structure of the MQTT channel each client publishes to describes which event has occurred and, in turn, which serverless functions will be called to handle it.

For example, an audio message from a device used by an English-speaking client can be published to an MQTT channel that specifies the message type and language, such as fromClient/voice/english.
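To make the channel scheme concrete, here’s a minimal sketch of a client publishing a voice message with the Python paho-mqtt library (the broker address and audio file below are placeholders, not the actual deployment’s values):

```python
import paho.mqtt.client as mqtt

# Placeholder broker address; the real system runs its own MQTT broker.
client = mqtt.Client()
client.connect("broker.example.com", 1883)

# Publish a recorded voice clip to the channel that encodes the
# message type ("voice") and the sender's language ("english").
with open("clip.wav", "rb") as f:
    client.publish("fromClient/voice/english", payload=f.read(), qos=1)
client.disconnect()
```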

A serverless action bound to the fromClient/voice/+ channel (+ is a single-level wildcard) will process each incoming message by identifying the input language if it isn’t specified in the topic, transcribing the audio, and translating the text to the other supported languages. Each translated result is then published to the corresponding toClients/<type>/<language> channel, which broadcasts it to the subscribed clients in the format and language of their choice.
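Sketched in Python with paho-mqtt, the handler for that wildcard channel might look roughly like this; transcribe() and translate() are hypothetical stand-ins for whatever speech-to-text and translation services the actions actually call:

```python
import paho.mqtt.client as mqtt

SUPPORTED_LANGUAGES = ["english", "spanish", "french"]  # illustrative subset

def transcribe(audio_bytes, language):
    # Hypothetical stand-in for a speech-to-text service call.
    return "transcribed text"

def translate(text, source, target):
    # Hypothetical stand-in for a translation service call.
    return "[%s] %s" % (target, text)

def on_connect(client, userdata, flags, rc):
    # "+" matches exactly one topic level, so this catches voice
    # messages in every language.
    client.subscribe("fromClient/voice/+")

def on_message(client, userdata, msg):
    _, msg_type, source = msg.topic.split("/")
    text = transcribe(msg.payload, source)
    for target in SUPPORTED_LANGUAGES:
        if target != source:
            # Broadcast each translation on its own per-format,
            # per-language channel.
            client.publish("toClients/text/" + target,
                           translate(text, source, target))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.com", 1883)
client.loop_forever()
```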

System Architecture

Workflow

  • A message is received from a client, which can be a web browser, CLI, OpenWhisk action, SMS text, etc.
  • If the message payload contains an audio file, it is transcribed to text.
  • The transcribed text is translated to the other supported languages.
  • If the message was sent via SMS, the sender’s phone number is added to an etcd key/value store. etcd maintains the list of subscribers’ phone numbers, as well as their respective languages, and an adjustable TTL (300 seconds here) removes a number from the store if the subscriber stops participating in the conversation; see the sketch after this list.
  • Translated messages/audio streams are published to various channels on the MQTT broker, which then distributes them to the subscribing clients.
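Here’s a minimal sketch of that TTL bookkeeping using the etcd3 Python client; the key layout and subscriber number are assumptions for illustration, not the project’s actual schema:

```python
import etcd3

etcd = etcd3.client(host="localhost", port=2379)

def register_sms_subscriber(number, language, ttl=300):
    # Attach the entry to a 300-second lease; etcd deletes the key
    # automatically when the lease expires.
    lease = etcd.lease(ttl)
    etcd.put("subscribers/" + number, language, lease=lease)

# Re-registering on every incoming message grants a fresh lease, so a
# number only drops out after 300 seconds of silence.
register_sms_subscriber("+15555550123", "english")
```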

By using the MQTT broker as a mediator, we are able to decouple our logical components, which allows us to move away from a more traditional “call stack” and toward a truly event-driven architecture. This decoupling makes the system’s logic more modular and independent, which is great for agile development: since components are independent and isolated, they can be updated in place without affecting the others or having to restage/push the entire system.

We also chose to extend the system to support SMS clients by using Twilio. Twilio lets us trigger webhooks whenever a call or text is made to a registered phone number. In this case, we have a Twilio “messaging” number that waits for incoming SMS messages and forwards the message information (sender number, city, body, etc.) to the OpenWhisk sequence.
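As a rough sketch of that handoff, a small Flask endpoint could receive Twilio’s webhook POST and invoke the OpenWhisk sequence over its REST API; the action name, namespace, and credentials below are placeholders:

```python
from flask import Flask, request
import requests

app = Flask(__name__)

# Placeholder OpenWhisk endpoint; an API key splits into user/password
# halves for HTTP basic auth.
OPENWHISK_ACTION = ("https://openwhisk.ng.bluemix.net/api/v1"
                    "/namespaces/_/actions/smsTranslateSequence")
OPENWHISK_AUTH = ("api-key-user", "api-key-password")

@app.route("/sms", methods=["POST"])
def incoming_sms():
    # Twilio posts the sender's number, city, and body as form fields.
    requests.post(OPENWHISK_ACTION, auth=OPENWHISK_AUTH, json={
        "number": request.form["From"],
        "city": request.form.get("FromCity", ""),
        "body": request.form["Body"],
    })
    # An empty 204 response tells Twilio not to send an automatic reply.
    return "", 204
```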

Demo

We’ve deployed a UI that can be accessed at https://translation-mqtt.mybluemix.net. Since the logic is maintained by serverless actions, the UI is not required for the services to work; it simply provides an accessible MQTT client and a way to capture/send/receive audio via WebSockets. Because we’re using WebSockets, voice input can be captured continuously, allowing for a more natural, free-flowing conversation.
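Because everything flows through the broker, any MQTT client can join the conversation without the UI. For example, a headless Python subscriber (broker address and port are placeholders) could listen for English text like this:

```python
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    client.subscribe("toClients/text/english")

def on_message(client, userdata, msg):
    print(msg.topic, msg.payload.decode("utf-8"))

# transport="websockets" mirrors how the browser UI connects; a plain
# TCP connection on the broker's MQTT port would work just as well.
client = mqtt.Client(transport="websockets")
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.com", 9001)
client.loop_forever()
```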

We’ve also registered a Twilio number to demonstrate the SMS integration. Send a text to (310) 340-2202 in any language, and your phone number will be added as a subscriber for 5 minutes.

Demonstration

Use cases and next steps

In addition to handling game/messaging chat clients, this system can be beneficial for live-streaming scenarios, such as sports broadcasts, political hearings, university classes, podcasts, etc.

We’re looking into adding VoIP capabilities from Twilio, which would enable the system to handle a multilingual conference call. We’ll also investigate Lyrebird, an API that can learn and mimic speech patterns by recording and analyzing a user’s voice. If Lyrebird can mimic a user’s speech and tone in different languages, that will make conversations flow more naturally.

If that’s possible, it’d be interesting to experiment with music – possibly feeding in an artist’s discography, acoustic recordings, and interviews – to train Lyrebird to better differentiate between the artist’s voice and the melody. This might open a world of possibilities, such as use at concerts or silent parties, where the same song could be processed in real time and broadcast in different languages. As an alternative to Lyrebird, we may also look into modifying the pitch of the speaker’s voice using the following project.
