IBM iX and Corporate team project members include: Brian Adams, Aaron Baughman, Karen Eickemeyer, Monica Ellingson, Stephen Hammer, Eythan Holladay, John Kent, William Padget, David Provan, Karl Schaffer, Andy Wismar

+ This content is part 3 of the 4-part series that describes AI Highlights at the Masters Golf Tournament.

Multimedia Preparation

Specific areas of the golf course never fail to deliver drama.  Golfers fear, while fans admire, the Amen Corner, which comprises holes 11, 12, and 13.  The Amen Corner typically separates the contenders from the rest of the players.  Next, the tournament contenders can win or lose the tournament based on play at holes 15 and 16.  Certain players and competition dynamics are combined into featured groups to encapsulate climactic moments.  Other coverage generated by media directors and simulcast to multiple channels can also produce candidate highlights.

The stretch of golf that includes the Amen Corner, hole 15, and hole 16, along with featured group and broadcast simulcast coverage, is streamed to the scene clipper and then to the AI Highlight Ranking system on the IBM Cloud.   The produced clips enter the system either through a Python Flask application or by being pulled from an IBM Cloudant First In, First Out (FIFO) datastore.

try:
  if self.read_job_id and not retrieved_work:
    job_files = data_utils.get_sorted_files(self.job_output, '*.json')
    if job_files is None:
      print("No work from API.")
    elif len(job_files) > 0:
      retrieved_work = True
      with open(os.path.join(self.job_output, job_files[0])) as fh:
        json_input = json.load(fh)
        job_id = json_input['jobid']
        os.makedirs(os.path.join(self.output_dir, job_id))
        os.rename(os.path.join(self.job_output, job_files[0]),
                  os.path.join(self.output_dir, job_id, 'job.json'))
    else:
      print("No job files.")
except Exception as e:
  print("Error getting job id: " + str(e))

if self.read_cloudant and not retrieved_work:
    doc = self.cloudant.update_document_with_query({"status": {"$exists": False}},'status','retrieved')
    if doc is not None:
      job_id = str(data_utils.random_digits(6))
      retrieved_work = True
      name_parts = doc["source-assets"]["storage-url"].split('/')
      name = str(job_id) + name_parts[-1]
      url = self.netstorage_path + doc["source-assets"]["storage-url"]
      urllib.urlretrieve(url, os.path.join(self.output_dir, name))
      statinfo = os.stat(os.path.join(self.output_dir, name))
      if statinfo.st_size > self.max_file_size_bytes:
        error_msg = ("File " + str(name) + " is too big (" + str(statinfo.st_size) + ")\n"
                     "File size threshold is " + str(self.max_file_size_bytes))
        try:
          utc_datetime = datetime.datetime.utcnow()
          utc_formatted_time = utc_datetime.strftime("%Y-%m-%d %H:%M:%S UTC")
        except Exception as e:
          print("Could not get the file name of error file.  Continuing")

In both cases, the video file, in the form of an mp4, is split into 1-frame-per-second jpg’s and 6-second wav files.  Each wav file needs to be the same duration in preparation for SoundNet feature extraction through Caffe and Lua.  If any of the wav files is less than 6 seconds long, a silent wav file covering the missing duration is generated by SoX and saved to disk.

outputfile = str(time.time())+'.wav'
silencefile = 'silence'+str(time.time())+'.wav'

generate_silence_length = 6 - sec_length
cmnd = ['sox','-n','-r','16000',os.path.join(dir_path,silencefile),'trim','0.0',str(generate_silence_length)]
p = subprocess.Popen(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

The silent file is appended to the end of the original wav file to guarantee that the file is 6 seconds in duration.  Every wav file undergoes this preprocessing.

cmnd = ['sox',os.path.join(dir_path,file),os.path.join(dir_path,silencefile),os.path.join(dir_path,outputfile)]
p = subprocess.Popen(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
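The upstream split of the mp4 into frames and audio chunks can be sketched with ffmpeg.  The exact flags used by the production system are not shown in the article, so the ones below are assumptions.

```python
import os

def build_split_commands(mp4_path, out_dir):
    """Build hypothetical ffmpeg commands that split an mp4 into
    1-frame-per-second JPEGs and 6-second, 16 kHz mono WAV chunks."""
    frame_cmd = ['ffmpeg', '-i', mp4_path, '-vf', 'fps=1',
                 os.path.join(out_dir, 'frame%05d.jpg')]
    audio_cmd = ['ffmpeg', '-i', mp4_path, '-vn', '-ar', '16000', '-ac', '1',
                 '-f', 'segment', '-segment_time', '6',
                 os.path.join(out_dir, 'audio%05d.wav')]
    return frame_cmd, audio_cmd

frame_cmd, audio_cmd = build_split_commands('clip.mp4', 'out')
```

Each command list would then be handed to subprocess.Popen, just like the sox invocations above.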

If the mp4 file is pulled from the Cloudant FIFO, the document-based record is updated so that other system dispatchers do not pick up work that has already been processed.  The update is executed immediately to minimize the likelihood of duplicate execution.

doc = self.cloudant.update_document_with_query({"status": {"$exists": False}},'status','retrieved')

def update_document_with_query(self, query, field, value):
  doc_result = None
  query = cloudQuery.query.Query(self._db, selector=query)
  for doc in query(limit=1)['docs']:
    doc_result = doc
  if doc_result is not None:
    doc = self._db[doc_result['_id']]
    # mark the document so other dispatchers skip it
    # (the write below was omitted from the original listing)
    doc[field] = value
    return doc_result
  return None

The Cloudant query finds all documents where a status field does not exist.  The status field holds the state of the piece of work, such as “retrieved”.

A job JSON file is written into the disk-based queue for the dispatcher.  The job JSON file references the location of the original mp4 file and the split jpeg’s and wav files.  The dispatcher iterates within a loop looking for work within its job queue and maintains the status of each job.  The dispatcher ensures that a job finishes within a specific time and writes error messages to a Cloudant sink.
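A job file of roughly the following shape would sit in the disk-based queue; the field names beyond jobid are assumptions based on the description above.

```json
{
  "jobid": "123456",
  "video": "/highlights/123456/clip.mp4",
  "frames": "/highlights/123456/frames/",
  "audio": "/highlights/123456/audio/"
}
```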


During each iteration, the dispatcher performs maintenance on each launched job.  A maximum number of concurrent jobs protects the AI Highlight Ranker from excessive CPU, GPU, and memory usage.  In parallel, 4 separate processes are executed through sockets to interpret sound, action, speech, and graphics.

for component_name in self.components:
  if component_name == 'soundnet' and self.soundnet_msg:
    result = self.send_request(job_id, component_name, audios)
  elif component_name == 'graphics' and self.graphics_msg:
    result = self.send_request(job_id, component_name, frames)
  elif component_name == 'speech2text' and self.speech2text_msg:
    result = self.send_request(job_id, component_name, audios)
  elif component_name == 'actions' and self.actions_msg:
    result = self.send_request(job_id, component_name, frames)

Each request is sent through sockets to minimize networking and protocol overhead.  The parallel processes use the SocketServer.TCPServer class to start each server.

server = TCPServer((HOST, PORT), SoundnetRequestHandler, componentHandler)
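A minimal, self-contained sketch of one such component server follows.  The one-JSON-line-per-request protocol is an assumption, and socketserver is the Python 3 name of the SocketServer module used above; the production handlers wrap real feature extractors.

```python
import json
import socketserver
import threading

class ComponentRequestHandler(socketserver.StreamRequestHandler):
    """Hypothetical handler: reads one JSON line, replies with a score."""
    def handle(self):
        request = json.loads(self.rfile.readline().decode('utf-8'))
        # A real component would run its extractor here; we echo a dummy score.
        response = {'job_id': request.get('job_id'), 'score': 0.0}
        self.wfile.write((json.dumps(response) + '\n').encode('utf-8'))

def start_component_server(host='127.0.0.1', port=0):
    # port=0 asks the OS for a free port
    server = socketserver.TCPServer((host, port), ComponentRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```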

Listening to Golf Play

Dramatic moments in golf are punctuated by sound from the fans, players, and commentators.  If the fans roar with excitement, that noise is a strong indicator of a very interesting highlight segment.  Additional cheering from the player and commentator adds to the magnitude of excitement.  Each wav file that encodes sound during golf play is converted into an mp3.  A numerical understanding of the sound is decoded by a deep 1-D convolutional neural network called SoundNet, which encodes representations of sounds learned from nearly 2 million videos.  Features from the conv-5 layer are exported for each input sound.  The exported feature vector of size 17,152 is sent to two different Support Vector Machine (SVM) models.

print('Loading network: ' .. opt.model)
local net = torch.load(opt.model)
local sound = audio.load(path)
-- keep only the first channel of stereo audio
if sound:size(2) > 1 then sound = sound:select(2,1):clone() end
sound = sound:view(1, 1, -1, 1)   -- reshape to the 4-D input SoundNet expects
sound = sound:cuda()              -- move the waveform to the GPU
net:forward(sound)                -- forward pass (reconstructed; required before reading layer output)
local feat = net.modules[opt.layer].output:float()
local result = feat:squeeze()

A cheer magnitude detector was trained using Python’s sklearn.  Feature vectors with known cheer magnitudes were labeled with a ground truth and used to train the optimal margins of the SVM.  The resulting output of the SVM measures the degree of cheering.   A second SVM measures the general excitement from the sound of speech during play.  A 17,152-element feature vector from SoundNet is input into the excitement model SVM.  The output of this SVM is another indicator of a high-quality highlight.  Both results from the SVMs are sent to the evidence fusion component.
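The training step itself is not listed in the article.  A minimal sklearn sketch under the stated setup follows; the synthetic features and kernel choice are assumptions standing in for the real 17,152-dimension SoundNet vectors and labeled data.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
n_features = 64  # stand-in for the real 17,152-dim SoundNet conv-5 vectors
quiet = rng.normal(0.0, 1.0, size=(50, n_features))   # low-cheer examples
cheers = rng.normal(2.0, 1.0, size=(50, n_features))  # high-cheer examples
X = np.vstack([quiet, cheers])
y = np.array([0] * 50 + [1] * 50)  # ground-truth cheer labels

model = svm.SVC(kernel='linear', probability=True)
model.fit(X, y)

# The probability of the "cheer" class serves as a cheer-magnitude score.
score = model.predict_proba(cheers[:1])[0][1]
```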

Comprehending Commentators

The commentary during golf provides domain-knowledge clues about the strategic importance of a stroke.  The commentator’s speech expresses the drama and significance of a play.  The general sentence structure and sentiment provide the base commentator excitement level.  For example, the output of Watson speech to text can be sent to the Python Natural Language Toolkit’s Vader polarity scorer.  The polarity of the transcribed text for each wav file is calculated and scaled.

if self.user_vader:
  polarity = self.vader.polarity_scores(' '.join(words))
  # rescale the compound score from the range [-1, 1] onto [1, 0]
  polarity_scale = ((polarity['compound']-(-1))*(0-1))/(1-(-1))+1

Next, specific spot words within the commentator’s dictation, such as “amazing”, “beautiful shot”, and “great”, add to the magnitude level of excitement.  Golf Subject Matter Experts (SMEs) have provided the relative weight and importance of spot words.  The weights of all spot words are summed to a maximum of 1.  The semantic score of the text is averaged with the spot word score.

if score > 1:
  score = 1
if len(positive_scores) > 0:
  average_positive = np.mean(np.array(positive_scores))
  weighted_score = score*0.5 + average_positive*0.5
else:
  weighted_score = score
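The spot-word contribution described above can be sketched as follows.  The particular words and weights are illustrative, not the SME-provided values.

```python
# Hypothetical spot-word weights of the kind the golf SMEs might supply.
SPOT_WORDS = {'amazing': 0.4, 'beautiful shot': 0.5, 'great': 0.3}

def spot_word_score(transcript):
    """Sum the weights of spot words found in the transcript, capped at 1."""
    text = transcript.lower()
    total = sum(w for phrase, w in SPOT_WORDS.items() if phrase in text)
    return min(total, 1.0)
```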

A Python adapter that encapsulates the logic and service calls for translating speech into text is supported through reflection.  The system loads the specific Python module, such as the Watson one, that translates the speech to text.  Every wav file is translated into a text file with a corresponding time.  The text is sent to the Python Natural Language Toolkit’s Vader model to retrieve a general polarity measure for the full text.

if (self.config['business_logic'].lower() == 'watson'):
  # the class is retrieved from the module by reflection; the second
  # getattr argument was truncated in the original listing and the class
  # name below is an assumption
  targetClass = getattr(SpeechComponentHandlerWatsonBusinessLogic,
                        'SpeechComponentHandlerWatsonBusinessLogic')
  self.business_logic = targetClass(self.config)

def speech_to_text_helper(self, job_id, input_dir, files, output_dir):
  try:
    return self.business_logic.speech_to_text(job_id, input_dir, files, output_dir)
  except Exception as e:
    # error handling reconstructed; the original listing was truncated here
    print("Error translating speech to text: " + str(e))

Watching Golf Action

Certain gestures before, during, or after a stroke show the occurrence of a pinnacle of action.  For example, a fist pump or both hands in the air generally indicates a significant outcome and a player celebration.  The animation of body motion is represented within the series of jpeg’s that have been split from the mp4.  A deep learning model called VGG-16 that was pre-trained on ImageNet was iteratively trained on custom data.  IBM Research iteratively trained the model on Masters videos from prior years, using 6,744 examples that did not contain significant gestures and 2,906 that did.  The classifier achieved 88% accuracy on a test set with 858 negative and 460 positive images containing important human actions.

self.net = caffe.Net(self.prototxt_path, self.model_path, caffe.TEST)
file = os.path.join(input_dir, dir_name, filename)
image = caffe.io.load_image(file)  # loading call reconstructed from the pycaffe API
results = self.predict_image(image)

def predict_image(self, img):
  num = self.net.blobs['data'].shape[0]  # batch size of the network input blob
  orig_img = self.preprocess_image(img)
  self.net.blobs['data'].data[...] = self.generate_augmentation(orig_img, num)
  out = self.net.forward()
  result = out['prob'][0][0]
  return result * 1000

Each image is preprocessed to create a standardized size, number of channels, and image background.  The resulting image is sent to a data augmentator to generate new images.  The image is randomly cropped and mirrored into a new set of images.  The set of images is input into the CNN for action recognition.  Each jpg’s score is written to file storage and a message is sent to the evidence fusion socket listener.
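A hedged sketch of the crop-and-mirror augmentation follows; the crop size and mirroring rule are assumptions (VGG-16 conventionally takes 224×224 inputs).

```python
import numpy as np

def generate_augmentation(img, num, crop_size=224):
    """Hypothetical augmentation sketch: random crops plus horizontal
    mirroring, producing `num` variants of one frame (H x W x C array)."""
    h, w = img.shape[:2]
    batch = []
    for i in range(num):
        top = np.random.randint(0, h - crop_size + 1)
        left = np.random.randint(0, w - crop_size + 1)
        patch = img[top:top + crop_size, left:left + crop_size]
        if i % 2 == 1:          # mirror every other crop
            patch = patch[:, ::-1]
        batch.append(patch)
    return np.stack(batch)
```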

Reading Broadcast Graphics

Very similar to the Masters logo and graphic detector in the clipper, we search for the same graphic to determine whether the candidate clips can be subdivided further.

If a graphic can be found, the region of interest is segmented from the frame for OCR processing.

img = cv2.imread(img_path, 3)
# guard against frames smaller than the graphic's expected location
if (len(img) < xy[0] or len(img) < xy2[0]):
  print("Graphic images are too small, continuing")
  return results
if (len(img[0]) < xy[1] or len(img[0]) < xy2[1]):
  print("Graphic images are too small, continuing")
  return results
px = img[xy[1], xy[0]]     # pixel at the first probe point (OpenCV stores BGR)
px2 = img[xy2[1], xy2[0]]  # pixel at the second probe point

dR = abs(px[0] - rgbc[2])   # OpenCV pixels are BGR while rgbc is RGB,
dG = abs(px[1] - rgbc[1])   # so the channel indices cross
dB = abs(px[2] - rgbc[0])
dR2 = abs(px2[0] - rgbc[2])
dG2 = abs(px2[1] - rgbc[1])
dB2 = abs(px2[2] - rgbc[0])
r = 0
# detected graphics
if dR < tol and dG < tol and dB < tol:
  r = 1
  if dR2 < tol and dG2 < tol and dB2 < tol:
    r = 2
if r > 0:
  for k in range(0, 4-r):
    # y-coordinate attribute reconstructed; the name was garbled in the original
    textRegion = img[self.yt[k]:self.yt[k] + h[k], self.xt[k]:self.xt[k] + w[k]]

The open-source Tesseract OCR engine is called from the command line with the following code.

p = subprocess.Popen(["tesseract", imgName2, "stdout", "-l", "eng", "-psm", "3"],stdout=subprocess.PIPE)

The output of the process is then parsed to find the golfer who was mentioned, along with the hole number and general text.  The result is sent to the evidence fusion socket listener.

Highlight Ranking

Each clip and its associated predictors from sound, speech, gesture, and graphics provide a basis for evidential ranking.  As each parallel feature extractor finishes analyzing a clip, the fusion socket determines whether all extractors have completed.

if not (job_id in self.result_dict):
  array = dict()
  array[sender] = result
  self.result_dict[job_id] = array
else:
  array = self.result_dict[job_id]
  array[sender] = result
  self.result_dict[job_id] = array
if len(self.result_dict[job_id]) == self.component_num:
  # run fusion

If every multimedia algorithm is complete, actions are extracted from the clip.  Any number of actions can be segmented from the original clip based on encapsulated golf business logic.  For example, multiple shots within one clip should be scored separately so that they do not directly influence each other’s model outputs.  Television graphics are used to segment the clip into different actions.  Each detected cheer time is iterated through to find the closest television graphic time, if any.  The start time is 5 seconds before the closest television graphic appearance time.  The end time is determined by detecting a scene change, found by examining the color histogram difference between neighboring frames.
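The scene-change test can be sketched as a per-channel histogram comparison between neighboring frames; the bin count and threshold below are assumptions.

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=32):
    """Compare per-channel color histograms of two neighboring frames
    (H x W x 3 uint8 arrays); larger values suggest a scene change."""
    diff = 0.0
    for c in range(3):
        ha, _ = np.histogram(frame_a[..., c], bins=bins, range=(0, 256))
        hb, _ = np.histogram(frame_b[..., c], bins=bins, range=(0, 256))
        diff += np.abs(ha - hb).sum() / float(frame_a[..., c].size)
    return diff

def is_scene_change(frame_a, frame_b, threshold=1.0):
    return histogram_difference(frame_a, frame_b) > threshold
```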

Every action segment’s timing is used to look up specific predictors for speech, sound, and action.  The following code shows an example.

if fAction:
  SNScoresSegsN, ActionScoresSegs = self.processAction(SNScoresSegs, fAction)
#processing speaker excitement
if fdetExc:
  SpExcScoresSegs = self.processSpExc(SNScoresSegsN, fdetExc)
#processing speech2text
if s2txtflag:
  speech_results = os.path.join(self.job_dir,job_id,'results')
  TextScoresSegs, Sp2TxtHash = self.processSp2Txt(SNScoresSegs, speech_results)

The gesture logic finds the maximum VGG-16 score within the action’s start and end times.  Similarly, the maximum speaker-excitement score from SoundNet is retrieved within the action’s time period.  The speech-to-text component averages the semantic meaning of every sentence from the transcript and enhances it with spot word detection.  Finally, the crowd excitement score between the scene transition and the appearance of a television graphic is returned.

Linearly fusing the multimedia evidence together ranks each of the actions within the clip. An overall excitement score stores the relative ranking to other highlights.

FinalScoresSegs[t] = float(self.SNWt)*SNScoresSegs[t] + float(self.ActionWt)*ActionScoresSegs[t] + float(self.SpExcWt)*SpExcScoresSegs[t] + float(self.TextWt)*TextScoresSegs[t]
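With illustrative weights (the real SNWt, ActionWt, SpExcWt, and TextWt are configuration values not given in the article), the fusion reduces to a weighted sum:

```python
# Illustrative component weights and per-segment scores; the production
# weights are configuration values, not the numbers shown here.
weights = {'SN': 0.4, 'Action': 0.2, 'SpExc': 0.2, 'Text': 0.2}
scores = {'SN': 0.8, 'Action': 0.5, 'SpExc': 0.6, 'Text': 0.3}

final_score = sum(weights[k] * scores[k] for k in weights)
```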

The resulting JSON file is sent to the Content Management System and uploaded to a Cloudant highlights database.

"cognitiveInsights": {
  "commentatorTone": 0,
  "overallMetric": 0.14845520432486656,
  "gestureMetric": 9.800137664347137e-7,
  "crowdMetric": 0.18737307692307692,
  "commentatorText": 0.26275,
  "scoringMetric": 0
}

AI Listens, Watches and Reads the Masters Golf Tournament Action

