IBM Developer Blog

Find the perfect image to illustrate a news story, documentary, or product

Thomas Smith is co-founder and CEO of Gado Images, a media and technology company based in the San Francisco Bay Area that digitizes, captures and shares the world’s visual history.

Gado Images is a media and technology company that works with photographers, archives, and collectors worldwide to help them digitize, annotate and monetize their photographs, illustrations, sheet music, and other visual materials. We work with everyone from large historical archives like Johns Hopkins University and the Afro-American Newspapers to small historical collectors. We also have a network of contemporary photographers worldwide, providing news photographs documenting technology, business, finance, and travel. Our images are routinely used in CNN, The New York Times, Entrepreneur, Vanity Fair, Forbes, Fortune, and many more.

A big part of our business is taking in imagery and quickly making sense of it – who is depicted, what they are doing, when the photo was taken or illustration created, etc. This kind of information and context is essential to helping our customers find the perfect image to illustrate their news stories, documentaries, products, etc. As you can imagine, this is no small task, especially when dealing with everything from a news photo shot today to a historical engraving that might be more than 200 years old.

To quickly process and make sense of our customers’ materials – and to make them more searchable and, thus, more valuable on the image licensing market – we use our Cognitive Metadata Platform (CMP), which automates many tasks related to understanding images. The CMP, in turn, draws heavily on the cognitive computing capabilities of IBM Watson™.

One of the first steps we perform when we ingest a new client image is to run the image through the Watson Visual Recognition service. This provides us with tags describing the content of the image, as well as confidence figures for each tag assigned. The tags are cross-referenced against our own controlled vocabulary, which draws on ImageNet and WordNet. We then run the image through our own recognition services, as well as several other automatic tagging platforms, again referencing against our controlled vocab to get a standard set of terms and confidence values.
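
A minimal sketch of this kind of cross-referencing, in Python, might look like the following. The vocabulary entries, the function name, and the sample tags are all hypothetical and made up for illustration; this is not the CMP's actual implementation, and keeping the highest confidence when taggers agree is just one plausible merge rule.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: maps raw tagger labels (lowercased)
# to one canonical term. Real entries would be far more numerous.
CONTROLLED_VOCAB = {
    "building": "building",
    "edifice": "building",
    "skyscraper": "building",
    "dog": "dog",
    "canine": "dog",
}


def normalize_tags(tagger_outputs):
    """Map raw (label, confidence) pairs onto canonical vocabulary terms.

    Labels missing from the vocabulary are dropped. When several taggers
    produce the same canonical term, the highest confidence is kept
    (an assumed merge rule, chosen for simplicity).
    """
    best = defaultdict(float)
    for tags in tagger_outputs:
        for label, confidence in tags:
            term = CONTROLLED_VOCAB.get(label.strip().lower())
            if term is not None:
                best[term] = max(best[term], confidence)
    return dict(best)


# Made-up outputs from two different tagging services.
service_a_tags = [("Edifice", 0.91), ("sky", 0.60)]
service_b_tags = [("skyscraper", 0.84), ("canine", 0.40)]

merged = normalize_tags([service_a_tags, service_b_tags])
print(merged)
```

Here "Edifice" and "skyscraper" both collapse to the single term "building", while "sky" is discarded because it is not in the sample vocabulary.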

Depending on the end goal for the image, we then handle this visual tagging information differently. For an image that depicts a simple object and will be used in a commercial setting, our system might accept any automatic tags that are above a certain confidence value. For a historical image that depicts a complex scene for a news story, we might surface the automatic tags in the interface of a human researcher as tag suggestions to be approved or removed. In some cases, Watson and the CMP allow us to fully automate the tagging process (though all images go through a QA step before being supplied to customers). In other cases, Watson Visual Recognition can at least provide an automatic background level of tagging – for example, telling us whether a documentary photo shows a person, architecture, an animal, etc., with crucial historical details filled in by a human researcher later.
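
The routing described above can be sketched roughly as follows. The use-case labels, the 0.85 threshold, and the choice to drop low-confidence tags for commercial images are assumptions made for this example, not details of the actual platform.

```python
def route_tags(tags, use_case, threshold=0.85):
    """Decide what happens to automatic tags for one image.

    tags: dict mapping a vocabulary term to its confidence (0.0 to 1.0).
    use_case: "commercial" or "editorial" (hypothetical labels).
    threshold: assumed cut-off for accepting a tag without human review.
    """
    if use_case == "commercial":
        # Simple subject, commercial use: accept confident tags outright.
        # Dropping the rest is an assumption of this sketch.
        accepted = sorted(t for t, c in tags.items() if c >= threshold)
        return {"accepted": accepted, "suggested": []}
    # Complex or historical image: nothing is accepted automatically;
    # every tag becomes a suggestion for a human researcher.
    return {"accepted": [], "suggested": sorted(tags)}


auto_tags = {"building": 0.91, "dog": 0.40}
print(route_tags(auto_tags, "commercial"))
print(route_tags(auto_tags, "editorial"))
```

Either way, the result would still pass through a QA step before reaching customers.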

During the next step in our process, complex images are researched by our professional research team. In some cases, we might also pull in metadata from a partner archive, or use OCR to extract text from the image itself, or from a handwritten or typed original caption. What this yields is a lot of unstructured text describing the image. The next challenge for our platform is to transform this unstructured text into structured data that matches the keywords and controlled vocabularies of each of our image licensing marketplaces. This is where the Watson Natural Language Understanding services are crucial.

At first glance, you would think we could just iterate through the raw text for each image and look for each keyword in our vocab as a substring. So if the image’s caption, research notes, or OCR text contained the word “building,” we would add the structured term “building” to the image. This works with simple concepts, but especially in dealing with historical imagery, it gets more complex very quickly. We represent, for example, the remarkable archives of the Afro-American Newspapers in Baltimore. Among the most valuable images in the collection are portraits of civil rights pioneer Dr. Martin Luther King, Jr.

Imagine if we did a simple substring search of our unstructured text, looking for the term “Martin Luther King Jr.” We would pick up and tag photos of Dr. King, our desired goal. But our collections also contain hundreds of images that do not depict Dr. King, but still contain his name – images of Martin Luther King Jr. High School, Martin Luther King Jr. Boulevard, etc. With a simple substring comparison, all of these images would be tagged as depicting Dr. King, which is not accurate and would make search challenging for our clients.
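
To make the problem concrete, here is a small Python sketch of the naive substring approach. The captions and the two-term vocabulary are invented for illustration.

```python
# Hypothetical slice of a controlled vocabulary.
VOCAB_TERMS = ["Martin Luther King Jr.", "building"]


def substring_tags(text, vocab=VOCAB_TERMS):
    """Naive tagging: return every vocabulary term found as a substring."""
    lowered = text.lower()
    return [term for term in vocab if term.lower() in lowered]


# Made-up captions: only the first one actually depicts Dr. King.
captions = [
    "Portrait of Dr. Martin Luther King Jr. speaking in Baltimore.",
    "Students outside Martin Luther King Jr. High School.",
    "Traffic on Martin Luther King Jr. Boulevard.",
]

for caption in captions:
    print(substring_tags(caption), "<-", caption)
```

All three captions receive the tag "Martin Luther King Jr.", even though the second and third describe a school and a road.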

This is where Watson NLU comes in. The service reads through the unstructured text from our images, and rather than doing a simple substring comparison, it looks at the actual context of the text (surrounding words, sentence structure, etc.) and pulls out a list of named entities, each labeled with an entity type. So for the portrait of Dr. King, NLU would return an entity of the type PERSON, while for the high school or road, NLU would return an entity of another type – ORGANIZATION or LOCATION, for example. This context-aware disambiguation allows us to quickly and accurately tag images with the correct entities from our controlled vocabulary, transforming unstructured text about our images into structured lists of keywords and entities. It also allows us to access information about those entities in a tree, tagging our images with related information. So if NLU found “Martin Luther King Jr. High School,” we might also tag the image “Education,” while if it found Dr. King himself, we might tag the image “Civil Rights.”
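
The sketch below shows how typed entities could drive tagging once they have been extracted. It does not call the NLU service: the entity dictionaries are hand-written stand-ins whose shape (a "text" and a "type" field) is assumed from the description above, and the mapping table is hypothetical.

```python
# Hypothetical lookup from (entity text, entity type) to vocabulary tags,
# including related terms such as "Civil Rights" or "Education".
ENTITY_TAGS = {
    ("Martin Luther King Jr.", "PERSON"): [
        "Martin Luther King Jr.",
        "Civil Rights",
    ],
    ("Martin Luther King Jr. High School", "ORGANIZATION"): [
        "Martin Luther King Jr. High School",
        "Education",
    ],
}


def tags_from_entities(entities):
    """Turn typed entities into an ordered, de-duplicated tag list."""
    tags = []
    for entity in entities:
        key = (entity["text"], entity["type"])
        for tag in ENTITY_TAGS.get(key, []):
            if tag not in tags:
                tags.append(tag)
    return tags


# Hand-written stand-ins for entity-extraction results on two captions.
portrait_entities = [{"text": "Martin Luther King Jr.", "type": "PERSON"}]
school_entities = [
    {"text": "Martin Luther King Jr. High School", "type": "ORGANIZATION"}
]

print(tags_from_entities(portrait_entities))
print(tags_from_entities(school_entities))
```

Because the lookup key includes the entity type, the school caption is never tagged as depicting Dr. King, which is the failure a plain substring match produces.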

In addition to entity parsing, we use NLU for myriad other functions. These include generating keyword lists from text, processing sentence structure to determine where in a caption we should insert information, automatically generating captions by parsing strings for parts of speech and other attributes, and much more. Taken together – and combined with other services and our own proprietary software – NLU is a core part of our cognitive computing offerings.

Beyond the tech itself, there are several reasons we chose IBM Watson. We like that Watson does not provide a one-size-fits-all solution but instead is designed to be customized to the use cases of any industry. More than a single solution, Watson provides a toolbox that a skilled dev team can apply to many challenges. We also like how Watson can be trained with custom data for a particular industry – we are currently working on our own visual recognition models to find attributes with relevance to stock photo licensing.

Another core reason for choosing Watson is that all our intellectual property continues to belong to us, including any custom models we build. This is a key aspect of Watson for us – especially dealing with historical imagery from world-class partners – and we’re very comfortable knowing that this is backed up by IBM, a large company with a 100-year history.

One of the biggest reasons we chose Watson, though, is that the Watson team has consistently made the effort to engage directly with our leadership and dev teams, and to really understand our industry. We appreciate IBM’s presence at the annual conference of our industry trade group, the Digital Media Licensing Association. During our 2018 conference, IBM presented on how Watson could address challenges in our industry, and I was able to meet one-on-one with IBM’s Seth Lytle, who learned about our company and connected us to resources (like new video processing tools) that could help provide solutions for our product lines.

In the future, we plan to continue to build our integration with Watson. We plan to continue using NLU and the Visual Recognition tools, to integrate more custom models, and to explore video recognition features. We also plan to integrate automatic transcription for processing oral histories and other audio files, and we look forward to hearing what new features the Watson team offers.