Computer Vision + Natural Language Processing = Common Sense AI

November 12, 2020

Hey, have you heard about GPT-3?  Duh, thought so.  What you probably haven’t heard is that, according to a recent article in the MIT Technology Review, that it is a mirage.  Say what?

Here is how you prove the point – if you ask GPT-3 to tell you the color of sheep, it will tell you “black” as often as “white” because its got the phrase “black sheep” in our vernacular programmed in its black box algorithm.

This highlights the fact that GPT-3 is an astoundingly good natural language processing model, but it doesn’t have “intelligence.”

The types of data used to train AI models
Well, it turns out that using computer vision along with natural language processing can help improve accuracy.

Researchers from the University of North Carolina, Chapel Hill designed an artificial intelligence technique called “vokenization” to help solve the black sheep problem.  This combines computer vision with NLP to get better results.

The easiest way to understand this is to look at the data that is used to build the various AI models. 

  • Wikipedia. This text-only English data set has 3 billion words.  This is used to train natural language models.
  • Microsoft Common Objects in Context, or MS COCO, contains only 7 million items consisting of image with a text caption describing the image. See, that 7 million is not nearly enough to train an AI model.

Enter Vokens and Vokenization
That’s where vokens come in.  “Tokens” are words used to train NLP models.  “Vokens” are the images associated with words in the caption of a visual-language model.

“Vokenization” gets around the problem of the small data sets that have images with captions like MS COCO.  It uses unsupervised learning methods to scale the tiny amount of data in MS COCO to the size of English Wikipedia. 

The Results
This results in a visual-language model that outperforms state-of-the-art models in some of the hardest tests used to evaluate AI language comprehension today.  So far, it’s nowhere near  100%, but that’s ok.  Progress is incremental.

The researchers in the study, Hao Tan and Mohit Bansal see their work as an important conceptual breakthrough in using unsupervised learning techniques in visual-language models.  Similar leaps in logic helped advanced NLP techniques to the state-of-the-art status of today, so this is a development worth following.

Easy peasy to share this story with your peeps

Level up your inbox with The Scroll

Get stories like this delivered to your inbox.

Business news focused on startups and tech. Get informed while being very mildly entertained.
No spam. No fluff. No nonsense. Ever.