In the final post on AI metadata, we link to a presentation I gave at RootsTech, outlining some of the practical capabilities of AI and ML in image collections. There is a lot of contextual information here, as well as some discussion about the limits of what AI can accomplish.
Most importantly, I show how you can try this out for yourself with a very modest investment of time and money. It’s possible to use Adobe Lightroom Classic, along with the Any Vision plugin, to send tens of thousands of images through Google Cloud Vision’s tagging service. This makes it much easier to tell whether ML is going to help you out or waste your time.
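If you want to test the waters without any plugin at all, the tagging service itself is easy to reach from a short script. Below is a minimal sketch, assuming the google-cloud-vision Python package is installed and a Google Cloud service-account key is configured; the file name and confidence threshold are just illustrative.

```python
# Minimal sketch: send one image to Google Cloud Vision for label tagging.
# Assumes the google-cloud-vision package is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
from google.cloud import vision

def label_image(path, min_confidence=0.70):
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    # Keep only labels above a confidence threshold you trust.
    return [(label.description, round(label.score, 2))
            for label in response.label_annotations
            if label.score >= min_confidence]

print(label_image("sample.jpg"))  # "sample.jpg" is a placeholder
```

Running a few hundred representative images through something like this is usually enough to tell whether the tags are worth keeping.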
Computational tagging can also be done by means of linked or linkable data. Linked data means that there is some “key” that can connect one set of data to another. This key could be a file name, a GPS tag, the name of a person pictured or some other characteristic of a media file.
The value of linked data
Linked data is growing in importance for a few reasons.
APIs make data linking a much easier process than it used to be. And more services built on API linking are becoming available every day.
There is more information about a person, place or thing than will ever be possible to stuff into a file’s metadata, or even into a database record. A linked data approach gets around that limitation.
Sometimes, the data is housed in a place where it can’t be fully extracted. The information contained in a social media feed like Instagram, for example, can never be pulled out in its entirety.
Sometimes, the linked data is in a state of change (e.g., someone’s employment record, or the businesses located in a particular district). The only way for this to be accurate is through a live link.
Sometimes, multiple databases are needed, and a single database won’t suffice. You may have a system for archiving legal documents that is separate from your media archive. When you need to connect an image with the contract or model release, it may best be done through object linking.
Capturing linked data
Sometimes, it is possible and desirable to flow linked data back into your database. We saw earlier how GPS tags can be linked to databases to discover place names, and these can be added back to the image database as IPTC place names. Here are a couple of other possibilities.
Images published to social media may allow for comments and likes to flow back into the catalog. This provides important information and enrichment, even if it does not capture the full social graph of the object.
Image matching services can be used to mine the internet for valuable information. The initial applications of web-scraping technology have been used to find copyright violations, but the use cases can go far beyond that. There is a lot of information that can be gleaned from discovering everywhere an image appears, and capturing nearby text.
A fast-moving space
DAM applications are only beginning to scratch the surface of linked data capabilities. It’s a big shift in system architecture. More importantly, it’s a big shift in a vendor’s understanding of the value and purposes of a DAM.
It’s an approach we like. Stay tuned for more developments as we begin to roll out new capabilities.
Machine Learning and other AI services can add some useful information to a visual library, but they can only tag for things they “understand”. Some subjects are relatively easy to train a computer to do. Some are very hard, and some are nearly impossible.
The list of tagging capabilities will keep growing, and will be determined in large part by the willingness of people and companies to pay for these services. But, as of this writing, the following categories are becoming pretty common:
Objects shown – This was one of the first goals of AI services, and has come a long way. Most computational tagging services can identify common objects, landscapes and other generically identifiable elements.
People and activities shown – AI services can usually identify if a person appears in a photo. They typically won’t know who the person is unless it’s a celebrity, or unless the service has been trained for that particular person. Many activities can now be recognized by AI services, running the gamut from sports to work to leisure.
Species shown – Not long ago, it was hard for Artificial Intelligence to tell the difference between a cat and a dog. Now, it’s common for services to be able to tell you which breed of cat or dog (as well as many other animals and plants). This is a natural fit for a machine learning project, since plants and animals are a well-categorized training set, and there are a lot of apparent use cases.
Place shown – Even when no GPS data is included, some services can identify a location by the visual appearance of a famous building or other landmark.
Adult content – Many computational tagging services can identify adult content, which is quite useful for automatic filtering. Of course, notions of what constitutes adult content vary greatly by culture.
Readable text – Optical Character Recognition has been a staple of AI services since the very beginning. This is now being extended to handwriting recognition. And once information has been turned into text, it’s possible to translate the text into multiple languages.
Natural Language Processing – It’s one thing to be able to read text, it’s another thing to understand its meaning. Natural Language Processing (NLP) is the study of the way that we use language. NLP allows us to understand slang and metaphors in addition to strict literal meaning. For example, we can understand the phrase “how much did those shoes set you back?” NLP is important in tagging, but more important in the search process.
Sentiment analysis – Tagging systems may be able to add some tags that describe sentiments. One example: it’s getting common for services to categorize facial expressions as being happy, sad or angry. Whether they are correct is another story.
Situational analysis – One of the next great leaps in computational tagging will be true machine learning capability for situational analysis. Some of this is straightforward (e.g., “This is a soccer game”). Some is more difficult (e.g., “This is a dangerous situation”). At the moment, a lot of situational analysis is actually rule based (e.g., “Add the keyword ‘vacation’ when you see a photo of a beach”).
Celebrities – There is a big market for celebrity photos, and there are excellent training sets. A number of services do this quite well.
Trademarks and products – Trademarks are also easy to identify, and there is a ready market for trademark identification. For example, “alert me whenever our trademark shows up in someone’s Instagram feed.”
Graphic elements – ML services can evaluate images according to nearly any graphic component. This includes shapes and colors in an image. These can be used to find similar images across a single collection or on the web at large. This was an early capability of rule-based AI services, and remains an important goal for both Machine Learning and Deep Learning services.
Captions – Some services can create captions from the analysis they make. Currently, these tend to be a bit comical. But as all the capabilities above get better, expect the captions to improve along with them.
Most of the tagging listed above can be incorporated into a generic AI tagging service. But some people will want a tagging tool that can identify very specific items. If you want to identify specific people who are not celebrities, you’ll need to train the system to recognize them. This is also required for most product identification services. In these cases, you’ll need an AI system that allows you to provide a set of training images, and allows you to provide feedback for accuracy. These services could make use of rule-based recognition, Machine Learning or Deep Learning, depending on the requirements.
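To make the training-set idea concrete, here is a rough sketch of how a custom classifier might be trained on your own images using transfer learning with PyTorch and torchvision. This is one possible approach, not a description of any particular commercial service; the folder layout, class names, and training settings are all hypothetical.

```python
# Rough sketch: train a custom classifier from your own training set using
# transfer learning. Folder layout is hypothetical:
#   training_images/<person_or_product_name>/*.jpg
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("training_images", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# Start from a model pre-trained on generic images, then retrain only the
# final layer to recognize your specific people or products.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))

optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The feedback loop the post describes is the important part: as people correct the machine’s mistakes, those corrections become new training images.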
In this post, we examine the ways that computers can be used to automatically tag image files. There are some important differences in the approaches you may use. We will start with a discussion of the two main ways these services are delivered.
Let’s examine the difference between services that create a set of static tags and those that operate as black box services. There are advantages and disadvantages to each.
As of this writing, there are dozens of computational tagging services that can automatically create tags for your images. These services can analyze an image, and return a list of likely tags that describe the visual content. This can include a description of objects, activities, people and other situational characteristics. Each of these tags is typically accompanied by a confidence score that indicates the certainty for any particular tag.
Computational tags can be written into your image database as static metadata, meaning that it won’t change unless someone tells it to. You should be able to see these tags, map them to appropriate fields, and decide to accept or reject them, just like metadata that is added by a person.
Static tags are usually provided by means of an Application Programming Interface (API). An API allows one service (e.g., a DAM application) to talk to another (e.g., a computational tagging service). The DAM can send photos for analysis, and the tagging service sends back a list of tags, usually in the form of a JSON file. The DAM application is then responsible for adding the tags to the database for each image.
The figure below shows what this JSON file looks like.
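To give a sense of the round trip, here is a small sketch of how a DAM-side script might split a tagging service’s JSON response into accepted tags and tags held for human review. The field names and threshold are illustrative; every service formats its response a little differently.

```python
import json

# Illustrative response from a tagging service; real field names vary by vendor.
response_text = """
{
  "tags": [
    {"tag": "beach", "confidence": 0.97},
    {"tag": "vacation", "confidence": 0.81},
    {"tag": "dog", "confidence": 0.43}
  ]
}
"""

ACCEPT_THRESHOLD = 0.75  # tags below this go to a review queue instead

def split_tags(text, threshold=ACCEPT_THRESHOLD):
    tags = json.loads(text)["tags"]
    accepted = [t["tag"] for t in tags if t["confidence"] >= threshold]
    review = [t["tag"] for t in tags if t["confidence"] < threshold]
    return accepted, review

accepted, review = split_tags(response_text)
print("write to keywords:", accepted)       # ['beach', 'vacation']
print("hold for human review:", review)     # ['dog']
```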
Local-based or cloud-based
Most tagging services are based in the cloud. These services leverage massive and ever-improving databases along with high-powered cloud computing. They are able to improve rapidly because they see millions of images, and may have many users providing feedback.
Some people will not want to send their images out to external services for analysis. The images may be highly confidential, or perhaps the collection manager is just uncomfortable with letting lots of images run through external services.
There are also a number of tagging services that can run on your own computer, without having to go out to the cloud. Lightroom Classic, for instance, does its face tagging on your local computer and does not send images to Adobe’s cloud. Imagga is a commercial service that can also run on your own computer.
In a black box service, the computational analysis is not a one-time operation. Instead, the images are continually reprocessed as the service gains new capabilities, or as it gains a better understanding of you and your collection. As the service learns, the search results should continue to improve. These services may never show you all the tags they currently store for an image since they expect to make a better set of tags at some point in the future.
An important part of black box functionality is the search capabilities inside the box. Conventional metadata is generally used in a filter operation (e.g., hide all images that don’t have the tag “Kensington, Maryland”). Black boxes can function more like Google, where misspellings, synonyms and related terms can produce results even when there is not an exact match.
You don’t own or control the data
When using a black box service, the tags and other information typically reside within the service. You don’t own them. Instead, you lease access to them. This is a structural problem that is going to be hard to avoid, at least for the foreseeable future.
The best black boxes don’t just include a set of tags. They have deep semantic graphs of what a tag may mean. This is not something they can export to you, should you decide to leave the service. Likewise, the data they have about you (your search history, what you like, where you go, and so on) is probably not actionable, even if you could get a copy.
And the semantic processing they do is also going to stay within the service. (Does “ship sinks” indicate a maritime disaster, or plumbing fixture retailing?)
For some people, particularly in the consumer realm, this lack of control may be fine. For many institutions, this can be a deal-breaker.
Good for language localization
Working with multiple languages is an inherent advantage of some black box tagging services. In many cases, the semantic understanding of an image is not tied to a particular language. Google knows that “car” in French is “voiture” so it can provide similar results. (Google also knows that someone searching on “voiture” is interested in a French-based search of cars, and may be more likely to want a Citroën than a Ford.)
As black box tagging services continue to improve, we’ll probably see them become particularly popular for collections that need to serve multilingual audiences.
Most black boxes ignore your tags
Most of the current efforts to build great black box tagging largely ignore any data that the user bothers to put on the photo. (The main exception seems to be person tagging, which uses your tags to help learn who individuals are). This means that they are often ignoring the most important data in favor of more trivial information.
Most of the examples I’ve seen seem to expect that, given enough horsepower, the machine will learn everything useful, and be able to replace the human. But there is often a lot of context or backstory that is unknowable to the machine. (Why was the picture taken? Why was it uploaded?)
I think that the problem of integrating machine learning and human tagging/curation is being underestimated. (And, yes, that’s one of the things we are working hard on.)
Which way to go?
Eventually, we’re likely to get a really useful hybrid of static tags, black boxes, crowdsourcing, and human curation. But it does not really exist right now. So what is the best course of action? Here are my thoughts.
Black boxes are great for consumers. They are less likely to make their own tags, and more likely to get a big boost from some basic machine learning optimization.
Static tagging services are probably better for organizations. Given the early stage of computational tagging, it’s likely that services and strategy are going to evolve relatively quickly. So I don’t think it’s time to commit to any single service for the long term. That means that “owning” the tags is important. Additionally, static tagging services allow the collection manager to monitor the service, and see when new capabilities rise to the level of usefulness. Static tags also tend to integrate better with the human tagging and curation that most collections depend on.
This week, we’re going to take a look at the ways that computers are used to make sense of a media collection. Let’s start by putting a finer point on a number of terms you hear in this space.
What is the difference between Computational Tagging, Artificial Intelligence, Machine Learning, and Deep Learning?
While the definitions of these processes have a lot of overlap, we can draw some useful distinctions.
Computational Tagging refers to any system of automated tagging that is done by a computer. This includes the metadata added by your camera. It also includes information like a Wikipedia page that could be added by simple linking.
Artificial Intelligence (AI) encompasses any computer technology that appears to emulate human reasoning. AI could be as simple as a set of rules that can create an intelligent-looking behavior (e.g., a self-driving car could be taught the “rule” that you don’t want to cross a double yellow line). And AI can include some cutting-edge services, as outlined below.
Machine Learning (ML) is a subset of AI that is more complex. Instead of just following a set of rules created by a programmer, in an ML environment, the system can be trained to discover the rules. An ML system for identifying species, for instance, uses a training set of tagged images to figure out what a Labrador Retriever looks like.
Deep Learning is a specific type of ML built on multi-layered (“deep”) neural networks, an architecture loosely modeled on the way the brain works. In deep learning, the system does not just learn from a fixed set of labeled results; it builds an internal predictive model and uses that model to keep refining itself.
In our next post, we’ll look at two system configurations for making use of AI/ML tools.
Unlike the IPTC location fields, which provide incomplete and sometimes subjective information about a location, GPS data can provide an objectively precise position. The IPTC fields may only be able to say, for example, that you took a photo in the Tasmanian State Forest, Tasmania, Australia, but GPS data can pinpoint a location of 41° 13’ 58” S, 147° 59’ 12” E, which is a precise spot within the state forest.
Not only does GPS data enable more precision, it can stand the test of time. Place names change: countries rise and fall, people buy and sell property, open land is turned into streets, and buildings are built and torn down. With map coordinates and a timestamp, you’ll never have to doubt where a photo was made. And the location tag, in conjunction with a timestamp and IPTC location names, can become an important record of the history of a place.
GPS coordinates also allow for integration in ways that are not equally possible with the IPTC fields.
GPS coordinates are not language-dependent.
GPS coordinates can be accurately placed on a map.
GPS data can be more effectively integrated with databases that provide information about the location.
GPS tags can be used to generate IPTC place names automatically.
GPS coordinates let you record the exact sublocation more precisely than the IPTC field does.
GPS tags, along with a time stamp, can be used to automatically identify an event such as a football game or a disease outbreak.
About GPS Tags
GPS tags for photos can record latitude, longitude, altitude and a timestamp. GPS tags can also include information about the rate and direction of travel. GPS tags are supported by both the EXIF and IPTC standards, and we’ve come to expect that almost all smartphone photos will have a location tag.
GPS coordinates can be written in one of two common ways. Latitude and longitude can be noted either in degrees, minutes, and seconds, or in decimal degrees. These look similar, but they produce different results and should not be confused. The figure below shows both methods of notation. EXIF stores coordinates as degrees, minutes, and seconds, while GPX track logs use decimal degrees.
The coordinates embedded in a photo live in the EXIF GPS fields, while track logs are usually recorded in GPS Exchange Format (GPX), an open standard XML schema for recording position and direction information. A GPX file includes latitude, longitude, elevation, timestamps, and a bunch of other stuff that’s relevant to a larger travel record (not the location of a particular photo).
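Converting between the two notations is simple arithmetic. Here is a minimal sketch, using the Tasmanian coordinates from the example above.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

def decimal_to_dms(value):
    """Convert signed decimal degrees back to (degrees, minutes, seconds)."""
    magnitude = abs(value)
    degrees = int(magnitude)
    minutes = int((magnitude - degrees) * 60)
    seconds = (magnitude - degrees - minutes / 60) * 3600
    return degrees, minutes, round(seconds, 2)

# The Tasmanian example from above: 41 deg 13' 58" S, 147 deg 59' 12" E
lat = dms_to_decimal(41, 13, 58, "S")   # -41.232778
lon = dms_to_decimal(147, 59, 12, "E")  # 147.986667
print(lat, lon)
print(decimal_to_dms(lat))              # (41, 13, 58.0)
```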
It’s now common for GPS tags to be used to add place names to IPTC metadata by looking up the GPS data in a map database. As with other place name workflows, it gets harder to apply the correct term as you move to the lower nodes of the hierarchy. Country is usually easy, as is state or province. But the city level may often be represented by several different options. For instance, my “city” is described in different databases as Kensington, Silver Spring, Montgomery County or Homewood. And the bottom node is even more tricky, as discussed earlier.
Most reverse geocoding does require some human-powered quality control for the bottom two layers. Some reverse geocoding apps will let you choose among different databases for geotagging. And others may allow you to add your own custom names for particular location regions.
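As a concrete illustration, here is a rough sketch of a reverse geocoding lookup using the geopy library and OpenStreetMap’s Nominatim service. These are just one possible toolset, not a recommendation of a particular database, and the coordinates (roughly Kensington, Maryland) are illustrative.

```python
# Rough sketch of reverse geocoding with geopy and OpenStreetMap's Nominatim
# service (one of several possible databases; not named in the post).
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my-photo-archive")  # identify your app
location = geolocator.reverse((39.0257, -77.0764), exactly_one=True)
address = location.raw.get("address", {})

# Map what comes back onto IPTC-style place fields; the lower levels
# (city, sublocation) are the ones most likely to need human review.
print("Country:", address.get("country"))
print("State:  ", address.get("state"))
print("City:   ", address.get("city") or address.get("town") or address.get("suburb"))
```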
Smartphones and GPS devices can usually keep a tracklog, which saves a set of waypoint coordinates along with timestamps. This can be merged with an image’s timestamp to apply the geotag at a later date. This is a good technique to use if your camera does not have onboard GPS capability. There are quite a few applications that can tag images from a tracklog, and the basic matching logic is sketched below. The figure below shows how a waypoint is saved in a tracklog.
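The matching logic behind those applications is conceptually simple: find the waypoint whose timestamp is closest to the image’s capture time. Here is a minimal sketch, assuming the waypoints have already been parsed out of a GPX file; all values are made up for illustration.

```python
from datetime import datetime, timedelta

# Waypoints as (timestamp, latitude, longitude), e.g. parsed from a GPX tracklog.
track = [
    (datetime(2023, 6, 14, 10, 0, 0), 39.0257, -77.0764),
    (datetime(2023, 6, 14, 10, 5, 0), 39.0301, -77.0712),
    (datetime(2023, 6, 14, 10, 10, 0), 39.0344, -77.0660),
]

def geotag_for(image_time, track, max_gap=timedelta(minutes=10)):
    """Return the track point closest in time to the image, or None if the
    nearest waypoint is too far away in time to be trusted."""
    nearest = min(track, key=lambda point: abs(point[0] - image_time))
    if abs(nearest[0] - image_time) > max_gap:
        return None
    return nearest[1], nearest[2]

print(geotag_for(datetime(2023, 6, 14, 10, 7, 0), track))  # (39.0301, -77.0712)
```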
One of the most useful bits of information you can attach to an image is the location where a photo or video was taken. This information can be used in many different ways. It helps us find images, it tells us something about the subject matter, and location tags can be a key to lots of other information. There are two primary ways to tag an image location:
IPTC metadata can include human-readable place name tags.
GPS information provides machine-readable mapping coordinates.
IPTC Location Tags
The original IPTC location tags were added to the specification in 2004 and include Country, State or Province, City and Sublocation. These human-readable, informative and hierarchical tags are an extremely useful tool in making sense of an image collection. In many cases, they are easy to classify and fill out, but there are some exceptions which can add a few wrinkles.
Ambiguity in sub-location
The top three fields are often pretty easy to fill in, but the exact location of the photo – the sub-location – often requires a judgement call. You might want to use the neighborhood, building address, or a more descriptive term like “Silas Pickering’s apartment.” In general, I suggest that you use a sub-location that is most useful to you and the other stakeholders of the collection. A street address, for many people, is less useful than the name of the building, business, or person who lives there.
There are times when your location does not fit inside this neat hierarchy. You might be at sea in international waters, or on a river that divides two states. Or perhaps you are in a national park that crosses state lines, and you are not sure which state you are in. In cases like these, it’s probably best to pick a general practice and stick to it. And in these instances, GPS information may really be the best solution.
Camera location does not match location shown
It’s also possible that you might shoot from a location that is entirely different from the one shown (e.g., shoot a picture of New York while standing in New Jersey). In general, it’s best to tag for what’s shown. The photo at the top of this post is an example. It was shot from Zambia, but mostly shows Zimbabwe.
The IPTC Extension includes location namespaces which can be used to specify both the camera location and the location shown. In creating the new fields, the IPTC organization has designated the old location fields as Legacy.
While this could be really useful for some people, it does not appear to be widely adopted. Worse, it means there are now three places where the information can be written, with the most widely used fields listed as “Legacy” and the new fields only sparsely supported. This looks like a case where the solution to the edge-case problems has made the overall problem harder to solve.
In the next post, we will look at GPS location data and see how it can help with precision, tagging automation and better workflow.
The curation you do in your media library database, or the work you do in project software, will also generate some type of metadata, but this may be hard to access or make use of in other applications. Ideally, your library would keep a record of all InDesign documents in which a photo appears or all Premiere Pro documents that use a particular video file. At the moment, this is only available in a handful of enterprise creative environments. Most people will have to use collection information that is created manually to keep a record of which files get placed into projects.
Let’s start by loosely describing what I mean by “curation.” I’m using it to mean the selection process: where files are sorted, chosen and sequenced. This could be for use in a project of some kind. It could also be part of a general editing process to craft photos and video clips into a story of some sort. At its heart, it’s about the selections done to tell a story or highlight the best media.
Curation metadata in a collection typically expresses some kind of relationship between objects. “These 20 photos tell the story of an event,” or “these are all the best photos of our facilities.” Curation may simply involve adding photos to a collection. It could also include sequencing multiple types of media.
Most of the standard metadata we’ve been looking at is used to describe an individual file. None of these schemas describe relationships between multiple independent objects such as individual photos or video clips. Nor do these schemas provide a durable way to record usage information. In most cases, this is a database function, and there is limited ability to export or embed the curation work into a file’s metadata.
Making curation durable and portable
There are several approaches to making your curation work portable.
Your application may create application-specific metadata to record curation, which may be migrated to other applications through scripting.
You may be able to “hijack” another field to write the curation metadata (e.g., all images in a collection get a particular keyword); see the sketch after this list.
An application may be able to pass curation information along by means of an API.
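As an example of the second approach, here is a small sketch that stamps every image in a folder of selects with a collection-specific keyword by calling ExifTool. It assumes ExifTool is installed and on the PATH, and the folder and keyword names are hypothetical.

```python
# Minimal sketch of "hijacking" the keyword field to record curation:
# every image in a collection gets a collection-specific keyword, written
# with ExifTool (assumed to be installed and on the PATH).
import subprocess
from pathlib import Path

COLLECTION_KEYWORD = "Collection: Facilities - Best Of"  # hypothetical name

def tag_collection(folder):
    for image in Path(folder).glob("*.jpg"):
        subprocess.run(
            ["exiftool", f"-keywords+={COLLECTION_KEYWORD}",
             "-overwrite_original", str(image)],
            check=True,
        )

tag_collection("selects/")  # hypothetical folder of curated images
```

Because the keyword travels with the file, the collection membership survives a move to another catalog, even if nothing else about the curation does.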
Transferring your master curation work from one program to another is both difficult and likely to be incomplete, so it’s best to do as much of it in a single application as possible. This argues for choosing library software that can support robust curation processes, and for staying with library software for long periods of time.
XML Project files
Some project software can save edit decisions to XML for portability. Camtasia is a program I use for screencast video editing, and its project file is an XML document. Both Final Cut and Premiere Pro offer the capability of sharing their timeline information with the other by means of a “standardized” XML file. While these files are not typically scraped into a catalog to record usage, they could be. In the figure below, you can see how a clip is identified.
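As a sketch of what that scraping could look like, the snippet below walks a much-simplified project XML and records which media files it references. The element names are hypothetical stand-ins; real Final Cut and Premiere Pro interchange files are far more elaborate.

```python
# Rough sketch of scraping clip usage out of a project XML file.
# The element names below are a simplified, hypothetical structure.
import xml.etree.ElementTree as ET

project_xml = """
<project name="spring-campaign">
  <clip name="drone-pass-01" src="file:///Volumes/media/drone-pass-01.mov"/>
  <clip name="interview-a" src="file:///Volumes/media/interview-a.mov"/>
</project>
"""

root = ET.fromstring(project_xml)
# Record which media files this project uses, e.g. to write back into a catalog.
usage = [(root.get("name"), clip.get("name"), clip.get("src"))
         for clip in root.iter("clip")]
for project, clip, src in usage:
    print(f"{src} is used as '{clip}' in project '{project}'")
```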
So far, most of the schemas I’ve been describing are ones that are not application specific. There are, however, plenty of schemas out there that are specific to a particular application. There are also plenty of tools that allow you to make your own metadata schema that is specific to your personal or institutional needs. Let’s take a look.
Some applications support their own metadata schemas. This may be done to facilitate functions that no other metadata schema can support. Adobe Camera Raw Settings are an example, providing a way to note each rendering control in the program. Capture One does a similar thing with its rendering settings.
In order to be portable, this application-specific metadata needs to be exportable in some way, usually as XML, JSON or XMP. In most cases, the best way to do this is to embed the information in the file as XMP-formatted tags, or in a sidecar text file. In the figure below, we see how Photo Supreme writes the category metadata.
Over the years, Adobe has created its own metadata schemas, which started by supporting the Dublin Core fields. As with Dublin Core, some of the Adobe schema has merged into the IPTC mothership, while some of it is specifically for Adobe software. Adobe-created fields include Ratings, Hierarchical Keywords and the Mark as Copyrighted fields.
Let’s take a look at the way that hierarchical keywords work, since it’s a different type of tag structure than Star Ratings or Mark as Copyrighted. The Hierarchical Keywords field actually functions as metadata about keywords (metametadata?). Let’s see how this can work.
There are four people in the image below, and each is tagged with a keyword that lives in a hierarchy. This hierarchy helps to define exactly what the keyword means.
When you write this metadata out to standard “flat” IPTC keywords, the relationship between keywords is not recorded. Writing the hierarchical keywords to the file helps to preserve the relationship between the various keywords. The figure below shows the relationship between these two fields.
Hierarchical keywords are really valuable to help you describe what your keywords mean. And they also help you make keyword lists that are much easier to navigate since it’s not one giant list. But it’s easy for the dc:Subject and the lr:HierarchicalSubject fields to get out of sync as files are moved between collections.
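A small sketch makes the relationship clearer. Lightroom-style hierarchical keywords separate levels with a pipe character, and the flat keyword list can be derived from them. The keyword names here are made up, and the final check illustrates the kind of drift between the two fields mentioned above.

```python
# Small sketch of the relationship between hierarchical and flat keywords.
# Lightroom-style hierarchical keywords use "|" to separate levels.
hierarchical = [
    "People|Family|Anna",          # hypothetical keyword hierarchies
    "Places|Maryland|Kensington",
]

def flatten(hierarchical_keywords):
    """Derive flat keywords from every level of the hierarchies."""
    flat = set()
    for path in hierarchical_keywords:
        flat.update(path.split("|"))
    return sorted(flat)

flat_keywords = flatten(hierarchical)
print(flat_keywords)
# ['Anna', 'Family', 'Kensington', 'Maryland', 'People', 'Places']

# A quick sync check between the two fields, since they can drift apart
# as files move between collections.
existing_flat = {"Anna", "Kensington", "Beach"}
print("missing from flat field:", set(flat_keywords) - existing_flat)
```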
Camera Raw settings
Adobe also has a robust metadata schema for Camera Raw Settings (CRS). All of the settings in Camera Raw can be written to the metadata of the file. This allows the settings to be persistent and portable even though the changes are not baked into the file.
While it’s theoretically possible for other applications to use the Adobe CRS settings, doing so is only marginally valuable. The way that all raw file converters work is unique to that piece of software.
Camera Raw Settings can be written to metadata. This table shows some of the different settings.
Saving and migrating these settings
Application-specific metadata introduces the possibility of trapping some of your work inside a specific program. Sometimes data may be trapped because the application does not offer a way to export the data for a collection. This is bad practice and should be avoided whenever possible. Of course, you may not realize that no export path exists until you have used an application for a long time. It’s important to gain this “pre-nup” understanding when you are deciding which applications to use.
It’s also possible that work can be trapped in an application because no other program can make use of the data. This is typically the case with rendering settings. In these cases, the decision to stop using a program may not present any good options.
In the next post, we will address the term Custom Metadata, which can have several meanings.
The IPTC schema was originally designed by the International Press Telecommunications Council for newspapers to use when transmitting images electronically. IPTC is now the standard schema used by image editing and cataloging software to describe the content and ownership of the pictures.
While the IPTC schema continues to be among the most widely supported metadata standards, it has become so large that it’s not fully supported by very many applications. In practice, this means that some fields are ignored if the software maker does not feel it is useful to their customers. Nevertheless, the IPTC schema remains the bedrock of image metadata in use today. Let’s look at the evolution of the IPTC standard.
The original IPTC schema, referred to as the Information Interchange Model (IIM), was created in 1991. It defined some useful fields for tagging images and provided a way to write that information into the header of the file. If you open the Metadata panel in Adobe Bridge, you can see the IIM fields broken out, as shown below. File formats that can support this type of metadata include TIFF, PSD, JPEG, DNG, and many proprietary raw formats.
Here are some of the IPTC IIM fields as shown in Adobe Bridge. You’ll see that there is duplication between these fields and some of the ones below. I have only included a portion of the IIM fields to save space.
IPTC Core / IPTC4XMP
Unfortunately, the IIM specification had its limits. The file header space where the data is written is size-limited, and it soon became apparent that more fields were needed to properly describe images.
In 2004, the specification was revised in two important ways. First, additional fields were added to more fully describe an image, including the ownership and credit information. Plus, the method for embedding the data in the file was changed to make use of Adobe’s XMP technology. The XMP space in a file is elastic, allowing data of almost any size to be written there.
In the following screenshots, we show the IPTC fields as they appear in Photoshop CC. Each field includes a notation of its intended use. Note that some fields, such as Photographer, appear in more than one panel. In these cases, both panels point to the same underlying IPTC field.
The IPTC organization continues to add new fields to the IPTC schema. These fields address omissions and ambiguities in the original fields.
As you can see, the extensions of the IPTC schema include a lot of connectivity to other databases through the use of unique identifiers. There is a lot of flexibility here, but that also comes with some significant difficulty of implementation. Support for these fields is pretty sporadic.
The IPTC standard continues to evolve, with small yearly changes. These changes frequently include the addition of namespaces from other schemas to the photo metadata standard, such as the 2017 addition of Star Ratings from the Adobe schema.