Category Archives: Visually Speaking

  • Semantic Image Understanding

    In 1984, Apple unveiled the Macintosh computer, which unleashed a desktop-publishing (DTP) and word-processing revolution. Tools that had previously been used only by a small number of trained professionals were suddenly in the hands of nearly everyone, and soon became essential to many jobs and to general functioning in society. Mobile phones are doing the same thing with visual media.

    It’s hard to imagine, but it took 20 years from the start of the DTP revolution until full drive indexing came to your computer. (You know, that thing you take for granted, where you can type a bit of text, and every document on your computer with that text shows up in a list?) In the interim, there was no good way to file and find specific documents, other than file and folder names. It was clunky, time-consuming, and very easy to lose important stuff.

    We are at a similar point in the development of photographic speech. We’re experiencing a flood of new files to manage, but the tools to store, tag and find them are lagging far behind. In large part, this is because we don’t have a good notion of the semantics of images.


    Semantics is loosely defined as the study of meaning in a language. As we think about speaking the language of imagery, it will be essential to develop a more formalized notion of content, context and meaning. This notion needs to factor in elements such as the following:

    Denotative elements – This is the who, what, when, where, and why of an image’s subject matter. Many of the mature metadata tools have focused on this, starting with IPTC long before the digital photo revolution. The stock photography industry has also pushed this forward, since there was an economic reason to develop better ways to tag and search vast image collections for sales and licensing. AI tools are now driving this forward. 

    Object graph – In a language spoken with objects, the path, proliferation, and connections of an object become a deeply important part of understanding the meaning and importance of the image.

    Creator knowledge and intent – It is often essential to know the intent of the photographer in order to completely understand the meaning or importance of an image. Was an image captured (and shared) in order to show a specific thing?  Was this a good thing or a bad thing? Visual media can hold a lot of information, and it can be really helpful to know which part the creator intended you to pay attention to. 

    Viewer perspective – You can’t determine meaning without factoring in the relationship between the image and the person viewing it. The denotative information and the object graph help to determine whether an object has meaning to me. And that meaning may be different from the meaning to others, depending on my personal graph or my cultural perspective.


    Image semantics falls under the rubric of Informatics: the study of the interaction between people and information systems. Ultimately, we need a way to parse through images in order to find the ones that suit our needs. Sometimes this will be easy. As your needs become more complex, as your collection grows larger, and as you seek to use visual media from other collections, the semantics problem becomes harder and more important. 

    There are several structural methods to approach the discovery issue: 

    Simple search and filtering – The familiar tools we have to search our own collections will continue to be important. If you know the date taken, a simple filter may be the easiest way to find the right image. Search and filter will clearly be improved by computational tagging services, which will help as collections expand. 

    Searching within identity-aware services – When you search with Google, the search is assisted by what Google knows about you. This might be the location you’re in, which helps to find locally-relevant results. Siri and Google know a lot more about you, and can, for instance, make a guess as to whether you mean “horses” or “cars” when you search for “racing.” 

    Intelligent local agents – It’s possible that we will also see some type of intelligent search capability that runs locally in private collections and allows the library owner to know about the person searching, rather than keeping all the information locked away in a social media platform or a giant web service.
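    The first of these methods is easy to make concrete. Here is a minimal sketch (all file names, fields and data hypothetical) of search-and-filter over a small tagged collection, combining a tag match with a date-range filter:

    ```python
    from datetime import date

    # A tiny in-memory photo collection; in practice these records would
    # come from a catalog database or embedded metadata.
    photos = [
        {"file": "IMG_0101.jpg", "taken": date(2017, 6, 14), "tags": {"beach", "family"}},
        {"file": "IMG_0102.jpg", "taken": date(2017, 6, 15), "tags": {"beach", "sunset"}},
        {"file": "IMG_0201.jpg", "taken": date(2018, 1, 3),  "tags": {"snow", "dog"}},
    ]

    def find(photos, tags=None, start=None, end=None):
        """Return photos carrying every requested tag, within a date range."""
        results = []
        for p in photos:
            if tags and not set(tags) <= p["tags"]:
                continue
            if start and p["taken"] < start:
                continue
            if end and p["taken"] > end:
                continue
            results.append(p)
        return results

    hits = find(photos, tags=["beach"], start=date(2017, 6, 15))
    print([p["file"] for p in hits])  # → ['IMG_0102.jpg']
    ```

    Computational tagging slots straight into a scheme like this: every machine-generated keyword simply widens the tag sets the filter can match against.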

    Image semantics is a young field with a lot of growing to do. While the exact path is uncertain, it’s certain to grow because the problem–and the value of a solution–is growing. Making use of new tools for visual semantics will require the collection, preservation and accessibility of the media.

    Next week, we’ll take a look at the media library ecosystem – what your tools need to accomplish and how to evaluate them. 

  • Programmatic Tagging

    In today’s post, I outline some of the types of tagging that can be done automatically. 

    Let’s face it: (almost) no one wants to spend lots of time tagging images, and part of the appeal of photographic communication is to avoid tapping out written descriptions of stuff. As image collections’ growth rates accelerate, Artificial Intelligence tools are becoming more important in classifying images. Taken together, these new tools will be an essential part of creating the semantics of imagery.

    Programmatic capabilities

    Let’s take a look at some of the capabilities that fall under Artificial Intelligence and computational tagging. Some of these are bundled with each other in a service, and some are freestanding capabilities.

    Machine learning – Computers can be trained to do all kinds of visual recognition tasks—from identifying species and reading handwriting, to looking for defects in manufactured items. Machine learning is the broad category encompassing any trainable system. Some systems rely on centralized servers and databases, and some can be run locally on your own computer. 

    Machine learning tags typically come with a confidence rating. Sometimes these ratings feel a bit overconfident.

    Facial Recognition – One of the primary machine learning capabilities is facial recognition. It’s an obvious need in many different situations, from law enforcement to social media to personal image management. Some services can recognize notable people. Others are designed to be trained to recognize specific people.

    Object recognition – There are dozens of commercial services that can look at images and identify what is being pictured. These may be generalized services, able to recognize many types of objects, or they may be very specialized machine learning algorithms trained for specific tasks.  

    Situational analysis – Many of the services that can recognize objects can also make some guesses about the situation shown. This is typically a description of the activity, such as swimming, or of the type of environment, such as an airport.

    Aesthetic ranking – Computer vision can do some evaluation of image quality. It can find faces, blinks and smiles. It can also check for color, exposure and composition and make some programmatic ranking assessments.

    Emotional analysis – Images can be analyzed to determine if people’s expressions are happy, sad, mad, etc. Some services may also be able to assign an emotion tag to images based upon subject matter, such as adding the keyword “sad” to a photo of a funeral. 

    Optical character recognition – OCR refers to the process of reading any letters or numbers that are shown in an image. Of course, this can be quite useful for determining subject matter and content. 

    Image matching services – Image matching as a technology is pretty mature, but the services built on image matching are just beginning. Used on the open web, for instance, image matching can tell you about the spread of an idea or meme. It can also help you find duplicate or similar images within your own system, company or library. 
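    The core of image matching can be sketched with a perceptual "difference hash" (dHash): reduce each image to a tiny grayscale grid, record whether each pixel is brighter than its right-hand neighbor, and compare the resulting bit strings by Hamming distance. A minimal, dependency-free sketch, where the "images" are just hypothetical grids of brightness values:

    ```python
    def dhash(pixels):
        """Difference hash: one bit per horizontally adjacent pixel pair,
        set when the left pixel is brighter than the right one."""
        bits = []
        for row in pixels:
            for left, right in zip(row, row[1:]):
                bits.append(1 if left > right else 0)
        return bits

    def hamming(a, b):
        """Count differing bits; a small distance means similar images."""
        return sum(x != y for x, y in zip(a, b))

    # Two near-duplicate "images" (3x3 brightness grids) and one very different one.
    img_a = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
    img_b = [[12, 21, 29], [41, 49, 61], [69, 82, 88]]  # img_a plus slight noise
    img_c = [[90, 80, 70], [60, 50, 40], [30, 20, 10]]  # reversed gradient

    print(hamming(dhash(img_a), dhash(img_b)))  # → 0 (near-duplicate)
    print(hamming(dhash(img_a), dhash(img_c)))  # → 6 (different image)
    ```

    Because the hash survives small edits, recompression and resizing, it can flag duplicates and near-duplicates across a library without comparing full-resolution pixels.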

    Linked data – As described earlier, there is an unlimited body of knowledge about the people, places and events shown in an image collection—far more than could ever be stuffed into a database. Linking media objects to data stacks will be a key tool to understanding the subject matter of the photo in a programmatic context. 

    Data exhaust – I use this term to mean the personal data that you create as you move through the world, which could be used to help understand the meaning and context of an image. Your calendar entries, texts or emails all contain information that is useful for automatically tagging images. There are lots of difficult privacy issues related to this, but it’s the most promising way to automatically attach knowledge specific to the creator to the object.
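    As a sketch of the data-exhaust idea (all names, times and entries hypothetical): given a photo’s capture time and the owner’s calendar, any event that overlaps the timestamp can be attached to the photo as a tag automatically:

    ```python
    from datetime import datetime

    # Hypothetical calendar entries: (start, end, title).
    calendar = [
        (datetime(2019, 5, 4, 14, 0), datetime(2019, 5, 4, 17, 0), "Maya's birthday party"),
        (datetime(2019, 5, 6, 9, 0),  datetime(2019, 5, 6, 10, 0), "Product review meeting"),
    ]

    def tags_from_calendar(taken, calendar):
        """Return titles of calendar events that overlap a photo's capture time."""
        return [title for start, end, title in calendar if start <= taken <= end]

    photo_taken = datetime(2019, 5, 4, 15, 30)
    print(tags_from_calendar(photo_taken, calendar))  # → ["Maya's birthday party"]
    ```

    The same join works against location history, messages or email, which is exactly why the privacy questions are so difficult: the tags are valuable precisely because the source data is personal.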

    Natural Language Processing – NLP is the science of decoding language as humans actually use it, rather than by strict dictionary definitions. NLP allows for slang, poor grammar, metaphors and more. It’s what allows you to enter normal human syntax into a Google search and get the right result. It’s what allows a search for “cool dog photo” to bring back a stylish dog rather than just any dog in the snow.

    Language translation – We’re probably all familiar with the ability to use Google Translate to change a phrase from one language to another. Building language translation into image semantics helps to make it a truly transcultural communication system. 

    All of the categories of tagging listed above are available in some form as AI services, which can be used to tag a great number of images very quickly and cheaply. Some of these tags may even be helpful. (Unfortunately, at the moment, a lot of them are either wrong or unhelpful.) There can be quite a bit of slop here.

    Machine learning services are attempting to filter out the slop with confidence ratings. Tags can be filtered according to the algorithm’s confidence in each result. While this can be helpful, in my opinion it’s not addressing the more important challenge, which is the integration of human curation with machine learning tools. As you might imagine, this is an issue we’re looking at closely, and we have some promising approaches.
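    One way to combine the two is a triage step rather than a single cutoff. A minimal sketch (thresholds and tag data hypothetical): auto-accept high-confidence machine tags, reject the clearly wrong ones, and route the uncertain middle band to a human review queue instead of discarding it:

    ```python
    # Machine-generated tags with confidence scores (hypothetical values).
    machine_tags = [
        ("dog", 0.97), ("snow", 0.91), ("wolf", 0.62), ("sled", 0.55), ("cat", 0.20),
    ]

    def triage(tags, accept=0.90, review=0.50):
        """Split tags into auto-accepted, human-review, and rejected buckets."""
        accepted = [t for t, c in tags if c >= accept]
        needs_review = [t for t, c in tags if review <= c < accept]
        rejected = [t for t, c in tags if c < review]
        return accepted, needs_review, rejected

    accepted, needs_review, rejected = triage(machine_tags)
    print(accepted)      # → ['dog', 'snow']
    print(needs_review)  # → ['wolf', 'sled']
    print(rejected)      # → ['cat']
    ```

    The human decisions on the middle band are the valuable part: confirmed and corrected tags can feed back into curation, so the machine output improves where it was weakest.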

    In the next post, we’ll look at the way that all these tagging tools can be brought together to create a more comprehensive way to understand image content programmatically. 

  • Mobile>Digital

    When the digital revolution hit the practice of image-making in the early 2000s, it seemed to media professionals like the world had been turned upside down. Cameras looked the same from the front, but everything about shooting, processing and delivering photos and videos changed. The mobile revolution has made the digital revolution look like a small speed bump compared to the seismic changes now happening.

    Mobile makes everyone a photographer/videographer. Digital may have taken film and tapes out of the equation, but cameras were still expensive, and digital workflow was hard and time-consuming. Mobile photography and videography are now ever-present, with cameras in nearly everyone’s pocket and sharing services a click away. With mobile, still and moving images are cheap, plentiful, easy to make, and easy to share.

    We consume imagery differently. We are now more likely to read a web page on a mobile device than on a computer, and each of these is more likely than reading a printed publication. The small screen encourages the use of images instead of text because still and moving images are much more efficient for communication and engagement.

    The mobile ecosystem creates and leverages data richness. Now that we are creating and reading on mobile, it’s easier to attach information to images and to make use of that information. This encourages visual communication that includes a data component, further accelerating the evolution of our photographic language.

    Mobile removes latency. Because mobile images are “born connected,” the time between shooting and sharing has been reduced to a matter of seconds. This has increased engagement, as photos and videos can be shared in real time. Image and video processing apps have reduced the latency between shooting and interpreting the imagery in a personalized or artistic fashion, allowing for a more organic connection between shooting and processing.


    Human beings have been able to “read” images since the days of our prehistoric ancestors’ parietal art. In more modern times, we have come to understand the meaning of images in a certain way because photographic artists, photojournalists, documentarians and filmmakers have developed the tropes of visual storytelling. In the film era, this was particularly hard to accomplish, as each frame of film cost money to purchase and process. Moreover, knowing that you had the story captured on your unprocessed film was a hard-won skill unto itself.

    In the mobile era, the incremental media cost of taking a photo or shooting a video is basically $0, and the instant feedback that reveals success or failure can take the guesswork out of shooting. In the mobile era, taking a picture or filming a video with proper focus and exposure is within reach of nearly everyone—for free—with equipment that is typically close at hand.


    Mobile photography is increasingly blurring the lines between images, video, text and data. While you can certainly find purist enclaves in mobile communities, the trend is heading inexorably to images as multimedia objects. These can include text overlays, stickers, geodata, audio, image sequences, depth information, augmented reality elements, and more. The smartphone is an inherently data-rich environment, and all that extra stuff adds to the ability to communicate.

    People who build and manage media collections must take the changes wrought by mobile into account as they build and select systems. Mobile changes the velocity of communication, creates the need to collect imagery from vastly more people for vastly more use cases, and erases the boundaries between media types and linkable data. It’s a tough challenge, but we are far enough into the mobile revolution to make some good guesses about what’s going to be important.

  • Images as Language

    We are in a new era of visual communication. The smartphone has enabled visual communication at every level – personal, professional, and institutional. This has had a profound impact on every aspect of still and moving images – from cameras and software, to the legal landscape, use cases, and business models. By understanding imagery as a language, we can make sense of these changes more organically.

    The rise of smartphone-based visual communication does not diminish the value of “traditional” photography and videography. As more people learn to speak these visual languages, they become more useful and important, and a new vernacular emerges—delighting some, horrifying others, and befuddling those who are not open to change.

    Think of how the development of the printing press made books so much more ubiquitous. It allowed more people to join in the creation—wresting control away from those few people who had mastered the old skills. Rather than diminishing the value of books, this popularization led to an increase in reading and knowledge dissemination; it opened the conversation to all. And as more people were capable of publishing, new voices emerged and new uses and needs were created for the medium. The same is happening today with still and moving imagery.


    Our personal images and videos are our diaries, the expression of our identities, and our memories. Even when they have no monetary value, they are often priceless.

    In a corporate environment, digital imagery is essential to expressing your brand, your history, your products, and the people behind your institution. Photos and videos have also become essential business documents; they can serve as records, and they can document (or protect against) liability. An organization’s digital media are truly its digital assets—with very real value attached. (Consider the cost to produce a photo or video shoot, or your annual stock photography and/or clip-licensing budget.)

    Photos and videos are replacing written observation at an astounding pace, but in most cases, the tools and practices to work with these media are lagging behind the need. The problems that confronted photographers and videographers at the start of the digital age are now spreading to society at large. Yes, there are sharing services that can be effective channels for distribution. But most of these are going to be poor long-term repositories for your asset collection.

    As we examine Digital Asset Management practices, we’ll need to integrate the traditional practices with the increasing role of imagery in all forms of communication. Seeing these forms as part of a new language should help you make sense of what’s happening and prepare for the future.


    DAM services and cloud libraries have traditionally focused their features on communication from marketing and production departments. While those are still the primary drivers of DAM services, the balance is changing, and we need to provide services for these new, much larger use cases. We need to be able to manage the collection, classification and rights-tagging of crowd-sourced media.

    This is a big departure for most DAM systems – in many cases the need has simply been ignored. We are putting this challenge front and center in our product redesign. We feel that the same tools that can make professional visual communication more effective can also be used to power a much broader use case that includes employees, customers, and stakeholders in addition to the traditional constituencies of DAM services.