Sam Goree

In this post, I examine a series of diagrams from a patent for a computational Bokeh system and ask where photographs end and “AI-generated” images begin. It is an adaptation of a “code critique” I posted to the Critical Code Studies Working Group forum in February.

Introduction

Recently, discussions surrounding AI image generation and computational creativity have come into the mainstream, often accompanied by claims of theft and exploitation (Jiang et al. 2023). These debates tend to assume a clear category boundary between images generated by AI and those created by humans. While such a boundary arguably exists in the realm of digital art, that boundary is fuzzier in the realm of photography.

In this code critique, I’d like to look at the boundary between human and computational authorship through some of the technologies smartphones use to produce photographic images. While these technologies have so far avoided questions about authorship and copyright (authorship is attributed to the human behind the camera), many of the same issues regarding theft, competition, and hegemony may be arising more quietly.

Ideally, this sort of inquiry would turn to the source code for smartphone image processing applications. Unfortunately, the methods used by these programs are central to competition between smartphone manufacturers; camera photo quality is an essential selling point, and the methods used to take “better” photos are kept under tight wraps. The only information we have regarding these algorithms comes from patent applications. So this code critique will look at five technical drawings from US Patent 11,094,041 B2, “Generation of Bokeh Images Using Adaptive Focus Range and Layered Scattering,” issued in August 2021 to George Q. Chen, an engineer at Samsung. These drawings illustrate a technical method for a “bokeh,” or background blur, effect which might be used now or in the near future in Samsung smartphones. Since I am a computer vision researcher and not a lawyer or patent expert, I will interpret these diagrams as a representation of a computer program which we can critique in much the same way as source code.

Figure 1

a technical drawing with boxes like "flash" "sensor" "processor" "memory" "display" and "network"

“Figure 1 illustrates an example network configuration including an electronic device in accordance with this disclosure.” In other words, it is a diagram of a smartphone system the technology could be deployed on.

Colloquially, we talk about smartphone “cameras” as if they were standalone camera devices, similar to the digital cameras of the 20th century. From the user’s perspective, they are very similar: both use an image sensor to capture light on command, then apply image processing to recover a human-viewable image, which is stored digitally. However, smartphones are internet-connected computers with processors faster than the desktop computers of a few decades ago. That means the image processing step can be highly complex, including anything from traditional signal processing to deep neural networks, and some steps may even be completed off-device if the user has consented. These steps are part of the photographic process, in between the press of the shutter button and the appearance of an image, not a later editing step.

Figures 2 and 3

two figures. The top describes a process using two cameras with steps including "rectification" "disparity detection" and "computational bokeh." The bottom shows another process with steps "determine focus range" "determine CoC curve" "Determine layers" "scatter various layers" and "blend various layers."

“Figure 2 illustrates an example process for creating a Bokeh image using an electronic device in accordance with this disclosure. Figure 3 illustrates an example process for applying computational Bokeh to an image in accordance with this disclosure.” In other words, Figure 2 shows the whole process and Figure 3 shows the computational step.

The word “Bokeh” comes from the Japanese verb 暈ける (literally “to blur”) and refers to the aesthetic quality of the out-of-focus regions of an image. Its use in computational image processing follows its use in photographic method books (such as Kopelow 1998). Skilled photographers can achieve this effect using film-based cameras by varying the focal length (the distance between the film and the lens) and the aperture diameter (the size of the opening in the lens). Smartphone cameras do not have these moving parts and thus cannot achieve the same effect optically, so they simulate it using image processing.

two photos of a metal figure in front of a book. The left shows the book "10 print chr" clearly while the right shows it blurred out.

Two photos of a bookshelf, one with a computational bokeh effect applied to blur the (hopefully familiar) book in the background.

The aesthetic appeal of blur in images highlights a tension between art and engineering. Classically, cameras and image processing algorithms were evaluated based on visual accuracy — how perfectly they captured and reproduced a visual signal. But humans take photos for a variety of reasons and often have aesthetic goals which prioritize visual indeterminacy and abstraction over accuracy.

Ironically, though, simulating the blurry indeterminacy of Bokeh requires more information about the three-dimensional scene than the original two-dimensional image contains. Specifically, it requires the depth of each pixel in order to distinguish the foreground from the layers of background. This information can be recovered from two images taken at the same time by cameras at slightly different positions, like the “wide” and “ultra-wide” cameras described here, by playing a “spot the difference” game and measuring the disparities between the two images. Because of the parallax effect (think of how fast the trees move relative to the mountains when you look out a car window), the disparity map contains that depth information.
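To make the idea concrete, here is a minimal sketch of disparity-based depth recovery using OpenCV’s block matching, a standard textbook method. The patent does not disclose which stereo algorithm Samsung actually uses, and the file names and camera parameters below are placeholders I made up.

```python
# A minimal sketch of disparity estimation from a rectified stereo pair
# using OpenCV block matching. The patent does not disclose Samsung's
# actual method; file names and parameters here are illustrative only.
import cv2
import numpy as np

# Load the two images (assumed already rectified, i.e. rows aligned).
left = cv2.imread("wide.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("ultrawide.png", cv2.IMREAD_GRAYSCALE)

# Block matching plays the "spot the difference" game: for each patch in
# the left image, it searches along the same row of the right image and
# records how far the patch has shifted (its disparity).
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Because of parallax, disparity is inversely proportional to depth:
# depth = focal_length_px * baseline_m / disparity.
focal_length_px = 1000.0   # assumed focal length in pixels
baseline_m = 0.01          # assumed distance between the two lenses
depth = focal_length_px * baseline_m / np.maximum(disparity, 1e-6)
```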

A line plot of CoC vs. Depth with a clear minimum point labeled "Focus position"

After finding the disparity map, this algorithm computes a circle of confusion curve, like the one above, which relates the depth of each pixel to the radius of the circle it will produce when out of focus. You can see this in your own vision if you hold your finger close to your face and focus on it with one eye closed: small points in the background grow into larger circles the closer your finger is to your face, and the further away an object is, the larger its circle of confusion becomes. By blurring with the right radius at each layer of the image, the algorithm produces a computational Bokeh effect which can be nearly indistinguishable from the optical version.
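As a rough illustration (not the patent’s actual curve or scattering method), the classic thin-lens model gives a circle of confusion diameter for each depth, and the layered step can be approximated by blurring each depth layer by its own radius and blending. All camera parameters here are invented for the sake of the example.

```python
# A sketch of the layered-blur idea using a thin-lens circle-of-confusion
# model. The patent's actual CoC curve and scattering method are not
# public; the camera parameters below are illustrative assumptions.
import numpy as np
import cv2

def coc_diameter(depth_m, focus_m, focal_len_m=0.05, aperture_m=0.025):
    """Thin-lens CoC diameter (on the sensor, in meters) for a point at
    depth_m when the lens is focused at focus_m."""
    return (aperture_m * focal_len_m * np.abs(depth_m - focus_m)
            / (depth_m * (focus_m - focal_len_m)))

def layered_bokeh(image, depth, focus_m, n_layers=8, px_per_m=20000):
    """Split an HxWx3 uint8 image into depth layers, blur each layer by
    its CoC radius, and blend the layers back together."""
    result = np.zeros_like(image, dtype=np.float32)
    weight = np.zeros(depth.shape, dtype=np.float32)
    edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = ((depth >= lo) & (depth <= hi)).astype(np.float32)
        # Blur radius in pixels for this layer's representative depth.
        radius = coc_diameter((lo + hi) / 2, focus_m) * px_per_m
        sigma = max(radius / 2, 1e-3)
        blurred = cv2.GaussianBlur(image.astype(np.float32), (0, 0), sigma)
        blurred_mask = cv2.GaussianBlur(mask, (0, 0), sigma)
        result += blurred * blurred_mask[..., None]
        weight += blurred_mask
    return (result / np.maximum(weight, 1e-6)[..., None]).astype(np.uint8)
```

The patent’s “layered scattering” presumably spreads each pixel’s light outward according to its circle of confusion rather than applying one uniform blur per layer, but the key idea, a blur radius that depends on depth, is the same.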

This bokeh effect was historically part of the specialized knowledge of a professional photographer. Understanding of the photographic process and familiarity with a camera allowed photographers to expressively produce these effects in images. Now, amateur photographers can produce the same effect by making use of the specialized knowledge that has been built into the camera software itself. These effects deskill photography, allowing amateurs to produce “professional-looking” photos.

Figures 4 and 5

Two technical diagrams. The top shows a smartphone camera interface with two sliders. The bottom shows a flow chart with boxes "retrieve focus position from user touch point" "determine object class" "initialize focus range based on average thickness of object class" "generate Bokeh preview" and "User accepts?"

“Figure 4 illustrates an example user interface for receiving touch inputs that define a depth of focus range in accordance with this disclosure. Figure 5 illustrates an example method of determining an adaptive focus range in accordance with this disclosure.” In other words, Figure 4 is a drawing of an interface where the user can tap to specify the depth they want and Figure 5 shows the algorithm behind it.

Rather than setting the depth of field with a physical aperture and focal length, smartphone users can specify this parameter with a more intuitive tap on the object they want in focus. Again, though, an easier user experience requires more information from another source. In this case, the source is an object recognition algorithm, which identifies the class of object the user tapped on. This step is essential for determining the range of depths the user is interested in. From the patent: “a person object class may define the average thickness of a human to be about 0.55 meters. Various object classes can be defined based on a number of different objects, such as people, cars, trees, animals, food, and other objects.” These objects are classified using deep neural networks, likely trained on millions of photographs taken by human photographers. While Samsung’s specific dataset is not described in the patent, it likely resembles the COCO dataset, which is often used for training object detection models.
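Figure 5’s focus-range step might look something like the following sketch. Only the 0.55-meter person thickness comes from the patent; the other class thicknesses, the detector interface, and the function names are hypothetical.

```python
# A hypothetical sketch of the adaptive focus range step from Figure 5.
# Only the 0.55 m "person" thickness comes from the patent; the other
# classes, values, and the detector interface are illustrative assumptions.

# Assumed average front-to-back thickness (meters) for each object class.
CLASS_THICKNESS_M = {
    "person": 0.55,   # value quoted in the patent
    "car": 4.5,       # assumed
    "tree": 3.0,      # assumed
    "food": 0.3,      # assumed
}

def initial_focus_range(depth_map, detections, tap_xy):
    """Return (near, far) depth bounds around the object the user tapped.

    depth_map:  HxW numpy array of per-pixel depth in meters
    detections: list of dicts like {"class": "person", "box": (x0, y0, x1, y1)}
                from some object detector (e.g. one trained on COCO-style data)
    tap_xy:     (x, y) pixel coordinate the user touched
    """
    x, y = tap_xy
    focus_depth = depth_map[y, x]

    # Find which detected object (if any) the tap landed on.
    for det in detections:
        x0, y0, x1, y1 = det["box"]
        if x0 <= x <= x1 and y0 <= y <= y1:
            thickness = CLASS_THICKNESS_M.get(det["class"], 0.5)
            # Center the focus range on the tapped depth, spanning the
            # object's assumed thickness, as in the patent's Figure 5.
            return focus_depth - thickness / 2, focus_depth + thickness / 2

    # No recognized object: fall back to a small default range.
    return focus_depth - 0.25, focus_depth + 0.25
```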

Even though the user can manipulate the sliders to manipulate the inferred depth of field, the inference step imposes normative assumptions upon the photograph. When you capture a person, the depth of field should be large enough to capture their entire body, not just part. Your object of interest should be from a small family of common object classes with known thickness values. This system draws on the visual language of millions of previous photographers to determine how to process a new image. Even though the new photo does not come out of a deep neural network, its manipulation has been shaped by the contents of a large dataset.

Going through this algorithm, we see several parallels to the debates over AI art. Advances in hardware and software technology create a new way to generate digital images from signal data by making use of past works. Innovations in user interfaces make these image-generating technologies easy for amateurs to use, directly competing with the professionals whose work was used as training data. But the simplicity of these interfaces requires imposing normative assumptions about what the user is trying to do, making “normal” usage easier and “abnormal” usage harder, and subtly influencing the visual aesthetics of countless future images to match the aesthetic standard imagined by the developers. There are also significant differences: these algorithms still require real light hitting a sensor, rather than a textual prompt from a user, and it is harder to anthropomorphize this sort of algorithm, with its clear deterministic processing steps, than a stochastic black-box text-to-image generative model.

Bibliography

Chen, George Q. Generation of Bokeh Images Using Adaptive Focus Range and Layered Scattering. US Patent 11,094,041 B2, United States Patent and Trademark Office, 17 Aug. 2021. USPTO Patent Center, https://patentcenter.uspto.gov/applications/16699371

Jiang, Harry H., et al. “AI Art and its Impact on Artists.” Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. 2023.

Kopelow, Gerry. How to Photograph Buildings and Interiors. New York: Princeton Architectural Press, 1998, pp. 118-119. Accessed via the Internet Archive.

Salvaggio, Eryk. “Seeing Like a Dataset: Notes on AI Photography.” Interactions 30.3 (2023): 34-37.