The Blind Pursuit of Vision
Although we seem to be teetering on the edge of market collapse, with the world holding its collective breath as we prepare for the worst, there is one buzzword in the deafening roar of buzzwords that I predict will survive whatever winter is coming our way. In fact, I predict that as buzzwords go this will be one of the biggest we’ve ever experienced, rivalling the likes of “internet”, “mobile” and “social media”.
“Spatial computing” sounds boring. It doesn’t have any particular ring to it, and many of you may not have heard of it, but it encompasses one of the greatest technical challenges (and opportunities) that the technology sector has ever faced. Whole worlds, literally or figuratively (depending on your ontology), hinge on the problems of spatial computing being solved.
The goal is to be able to write software that responds to the location of an electronic device. Sounds simple, right? Your phone already has GPS that can place you within a few meters of your exact position, so what’s the problem?
GPS has two fatal flaws that will, with absolute certainty, prevent it from being the positioning system of the future. I mean that. It is absolutely certain that GPS will need to be replaced as the primary positioning protocol of our devices. Not only will this happen in your lifetime; there’s a chance it may happen this decade, and almost every major tech corporation is devoting tens of millions of dollars today to come out on top of this almost unnoticed arms race.
Low Resolution Flat Earth Technology
We’ve touched on the first flaw already: GPS can only position a device to within a few meters of its actual position, and it requires line of sight between the device and several positioning satellites. GPS doesn’t work indoors, it works poorly in confined urban environments, and its resolution is so low that several devices cannot agree on a shared position precisely enough to allow for things like shared AR experiences.
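To put a rough number on the consensus problem, assume (purely for illustration) that each device's fix has independent Gaussian error with a standard deviation of σ meters along each horizontal axis. Two devices standing on exactly the same spot then report positions that differ, on average, by σ√π meters. The sketch below computes that closed form and checks it with a small simulation; the error model and the 3 m figure are assumptions, not measured GPS statistics.

```python
import math
import random

def expected_disagreement(sigma_m: float) -> float:
    """Mean distance between two independent fixes of the same spot.

    Assumes each fix has isotropic 2D Gaussian error with standard deviation
    sigma_m per axis. The difference of two fixes then has sigma_m * sqrt(2)
    per axis, its length is Rayleigh-distributed, and its mean is sigma_m * sqrt(pi).
    """
    return sigma_m * math.sqrt(math.pi)

def simulated_disagreement(sigma_m: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check of the closed form above."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        dx = rng.gauss(0.0, sigma_m) - rng.gauss(0.0, sigma_m)
        dy = rng.gauss(0.0, sigma_m) - rng.gauss(0.0, sigma_m)
        total += math.hypot(dx, dy)
    return total / trials

print(f"{expected_disagreement(3.0):.1f} m")  # 5.3 m
```

Under these assumptions, two devices with 3 m fixes typically disagree by more than five meters about where the same virtual object should be.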
If we wish to imagine a world where persistent AR layers augment the world we see through our futuristic glasses, lenses or neural interfaces, we must also imagine a technology that allows every device to agree, with millimetre precision, on where things are in the world.
Companies like OVR or SuperWorld, which already sell the vision of a GPS-based AR metaverse where you can own valuable virtual real estate, might as well be selling you real estate on Mars. I predict that the fallout of their almost inevitable collapse will be an embarrassing reminder for us all that dreams have to be rooted in reality. How valuable can virtual real estate be if it only works outdoors and can only place items to within several meters of their intended position?
GPS is to augmented reality positioning what the fax machine is to messaging on Facebook. The charming visionaries excitedly selling you virtual real estate today are either almost inexcusably naive and ignorant, or chillingly cynical. In the spirit of charity I will invoke the great sci-fi writer Robert A. Heinlein and remind myself that I should never attribute to malice what could be attributed to ignorance.
“You have attributed conditions to villainy that simply result from stupidity.” (Robert A. Heinlein)
The second problem eating away at GPS is that it is made to represent a two-dimensional world. Longitude and latitude only work in two dimensions, and if you go too far above or below this imagined flat surface the coordinate system breaks down. A future with autonomous aerial drones delivering coffee to your 151st-floor window requires a three-dimensional map of the world, and so does any meaningful AR metaverse. There is no reason to believe that the trend of urbanisation will reverse, that buildings will stop growing taller, or that we will refrain from digging new space for ourselves underground. Future cities will be understood by volume, population density will be measured in cubic units, and GPS will be relegated to an obscure backup protocol.
By The Dawn’s Early Light
So at the dawn of the brave new world, how are the tech corporations of today tackling the problem of spatial computing? You need to look no further than Snap’s Local Lenses, or Hexagon’s acquisition of Immersal, or Apple’s mysterious LiDAR scanning cars to understand where the industry is placing its bets: computer vision and digital twin technology.
The concept is simple: create a high-fidelity 3D replica of the world and use advanced computer vision to position the device relative to recognisable landmarks. Today Snap can recognise Big Ben; tomorrow it will be able to recognise the coffee machine in your kitchen. At least, that is what we have to believe if we want to embrace this approach to spatial computing. After rather long throat-clearing, we’ve arrived at the point I alluded to in the title of this post:
Computer vision and digital twins are not the future of spatial computing, and they will not be the foundation of the metaverse. To borrow and misuse a word coined by David Gelernter, let us call this approach to spatial computing the “mirrorworld” approach. I hope to demonstrate in this article that it will take many years to build something that still won’t be good enough for our needs, and that the inexorable tide of market forces and the limitations of the technology itself will ensure that these massive investments in the mirrorworld become, at best, a curious footnote in the history of computing. What follows is a quick list of gripes.
Crowd-sourced Privacy Violations
For the mirrorworld approach to work we will have to create a high resolution 3D copy of our world, complete with our indoor spaces. How do we feel about the massive privacy concerns of having our homes and workplaces mapped and stored on a tech giant’s cloud? Do we imagine we will consent to the crowd-sourced mapping of the world, or is the idea that we will let our benevolent tech overlords into our homes to do the scanning for us? Or will we be satisfied only accessing the mirrorworld in public spaces?
An Argument About Semantics
Niantic uses “semantic segmentation”, the computer vision technique of labelling every pixel in an image with the class of object it belongs to, to distinguish buildings from trees. It’s a very impressive technological feat that allows their Lightship AR cloud to render the virtual world intelligently, with an awareness of the real world it is augmenting. Semantic segmentation is incredibly valuable for rendering augmented reality convincingly, but using it for mirrorworld positioning quickly gets difficult.
If we imagine that your workspace has already been scanned and faithfully replicated, what happens when you move a chair or a table? Can a person walking in front of the camera throw off your device’s ability to orient itself? We are probably many years away from machine learning advanced enough to semantically segment and reason about the world so well that moving your office chair and monitor won’t throw off positioning in a mirrorworld. The device will need to make informed decisions about which objects in its visual field have positions it can trust as references, and which objects should be ignored entirely when orienting.
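The kind of decision described above can be sketched as a filter over a segmentation model's output: each detected object gets a trust score from a per-class prior on how likely that class is to stay put, multiplied by the model's confidence. Everything here is hypothetical, including the `STABILITY_PRIOR` table, its values, the `Detection` type and the threshold; it only illustrates why the device needs semantic knowledge, not how any real system weighs landmarks.

```python
from dataclasses import dataclass

# Hypothetical per-class stability priors: how likely an object of this class is
# to still be where the map says it is. Values are illustrative assumptions only.
STABILITY_PRIOR = {
    "wall": 0.99,
    "window": 0.95,
    "door_frame": 0.95,
    "table": 0.5,
    "monitor": 0.3,
    "chair": 0.1,
    "person": 0.0,
}

@dataclass
class Detection:
    label: str         # semantic class from a segmentation model
    confidence: float  # model confidence in [0, 1]

def trusted_landmarks(detections, threshold=0.6):
    """Keep detections whose prior stability times model confidence clears the threshold.

    Classes missing from the prior table get 0.0, so they are never trusted.
    """
    return [
        d for d in detections
        if STABILITY_PRIOR.get(d.label, 0.0) * d.confidence >= threshold
    ]

frame = [
    Detection("wall", 0.90),
    Detection("chair", 0.99),
    Detection("person", 0.97),
    Detection("window", 0.50),
]
print([d.label for d in trusted_landmarks(frame)])  # ['wall']
```

Even this toy version shows the difficulty: the priors have to be right for every space, and a confidently detected chair is still useless as a reference.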
Got The Whole World in His Hands
Aside from the monumental, if crowd-sourced, task of creating a mirrorworld, storing this digital copy of the world is no small undertaking in itself. Storing it on the device is out of the question: something this grand could only live in the cloud, which means devices connecting to the mirrorworld will have to intelligently retrieve only the parts of the world they need to orient themselves where they are. Perhaps a GPS geoposition will serve as a first filter for what to pull from this AR cloud, but what happens when GPS is not available?
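One common way to shard a world-scale dataset is by map tile. A sketch, assuming (hypothetically) that mirrorworld data were stored in standard Web Mercator "slippy map" tiles: the coarse position picks a tile, and the device also fetches the surrounding ring of tiles to cover the uncertainty of the fix. The zoom level, ring size and `(zoom, x, y)` shard keys are illustrative assumptions, not how any particular AR cloud works.

```python
import math

def tile_for(lat, lon, zoom):
    """Web Mercator tile (x, y) containing a WGS84 lat/lon at the given zoom."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp edge cases such as lon == 180 or latitudes beyond the Mercator limit.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def shards_to_fetch(lat, lon, zoom=17, ring=1):
    """Hypothetical shard keys (zoom, x, y): the tile under the coarse fix plus
    `ring` tiles in every direction, wrapping x across the antimeridian."""
    n = 2 ** zoom
    cx, cy = tile_for(lat, lon, zoom)
    keys = set()
    for dx in range(-ring, ring + 1):
        for dy in range(-ring, ring + 1):
            y = cy + dy
            if 0 <= y < n:  # no wrapping over the poles
                keys.add((zoom, (cx + dx) % n, y))
    return sorted(keys)

# A coarse fix near Big Ben selects a 3x3 block of tiles to download.
print(shards_to_fetch(51.5007, -0.1246))
```

The catch the paragraph points to is visible in the inputs: without a reasonable coarse fix, there is no way to choose `lat` and `lon`, or a sensible `ring`.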
Perhaps you imagine wifi triangulation will serve as a decent fallback, but with a resolution of several meters it too can fall short and make the device’s camera unable to tell your table at the restaurant from the next. And how much time does it take to download the relevant shard of the mirrorworld to allow orientation? Will the end user settle for the metaverse needing that much time to load?
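The loading question is back-of-the-envelope arithmetic: download time is shard size divided by link throughput. The figures below (a 50 MB shard over a 20 Mbit/s mobile link) are hypothetical, chosen only to show the calculation, not measurements of any real AR cloud.

```latex
t_{\text{load}} = \frac{8 \cdot S_{\text{MB}}}{B_{\text{Mbit/s}}}
\qquad\text{e.g.}\qquad
t_{\text{load}} = \frac{8 \cdot 50}{20} = 20\ \text{s}
```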
The costs of storing and maintaining a mirrorworld are not negligible, and making it work seamlessly also demands a lot from the receiving device. Massive amounts of data will have to be stored, updated, downloaded, uploaded… and the software just to manage this aspect of the mirrorworld requires a tech team worthy of your typical decacorn.
Consider The Platypus
The founding fathers of spatial computing are relying on visual cues to allow devices to position themselves in the world. This is how we orient ourselves, after all, so why shouldn’t our machines do the same?
In his famous philosophical paper “What Is It Like to Be a Bat?”, Thomas Nagel explores how our available modes of perception limit our ability to imagine the world perceived any other way. Nagel asks us to imagine what it is like to be a bat, navigating by echolocation. Surely bats too have some kind of spatial awareness of their surroundings?
Many animals have novel and even mysterious ways of navigating the world. Birds are thought to have a wide range of instruments at their disposal, including tiny magnets in their ears and an uncannily keen sense of smell. We don’t know for sure how birds navigate, but it seems absolutely certain that they have a sense of where they are in the world that is not based on vision alone.
Consider the platypus, sharks, and rays that can orient themselves in muddy waters using electrolocation. Not only can these animals locate prey and mates in their immediate vicinity, but some animals are even thought to use the Earth’s electromagnetic field to navigate across great distances. Eels, for example, are thought to use the minute disturbances in the Earth’s electric field caused by the orbiting moon to find their way to the Sargasso Sea all the way from as far north as Norway.
What senses do our electronic devices have that might be better suited for navigation and positioning than computer vision?
Triangulating a Solution
The indoor positioning industry has tried many novel approaches that don’t depend on satellites, which are unavailable indoors. Some of the more innovative ones, like Trusted Positioning, map the geomagnetic fields in a space to allow for positioning with 1–3 meter accuracy. Although this approach is incredibly innovative and has many fantastic use cases, that level of accuracy simply won’t do for an AR metaverse.
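Geomagnetic positioning is usually described as fingerprinting: survey the field at known points, then match a live reading against that survey. A minimal k-nearest-neighbour sketch with a made-up survey; this is a generic technique shown for illustration, not Trusted Positioning's actual method, and the `SURVEY` values and `locate` helper are hypothetical.

```python
import math

# Hypothetical magnetic survey of a corridor: grid position (x, y) in metres ->
# magnetic field vector (Bx, By, Bz) in microtesla. Values are made up.
SURVEY = {
    (0.0, 0.0): (22.1, -4.3, 41.0),
    (1.0, 0.0): (25.7, -2.9, 39.4),
    (2.0, 0.0): (31.2, -6.8, 44.5),
    (3.0, 0.0): (19.4, -1.2, 37.9),
}

def locate(reading, survey, k=2):
    """k-nearest-neighbour fingerprint match, averaged with inverse-distance weights."""
    ranked = sorted((math.dist(reading, vec), pos) for pos, vec in survey.items())[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in ranked]  # epsilon avoids division by zero
    total = sum(weights)
    x = sum(w * pos[0] for w, (_, pos) in zip(weights, ranked)) / total
    y = sum(w * pos[1] for w, (_, pos) in zip(weights, ranked)) / total
    return x, y

print(locate((24.0, -3.5, 40.1), SURVEY))
```

A single three-axis reading is easily ambiguous and also depends on how the phone is held, so a practical system would need to account for orientation and would typically match sequences of readings as the user walks.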
If you’ve ever wondered why your phone says you’ll get a better GPS position when you turn on your wifi, you may have come across triangulation, a common trope in crime shows and action movies, where a device’s location is calculated from its distances to other known points.
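Strictly speaking, computing a position from distances to known points is trilateration (triangulation uses angles), but the idea is the same. A minimal 2D sketch, assuming the distances to three or more anchors at known coordinates are already available: subtracting the first circle equation from the others leaves a linear system, solved here by least squares through the 2×2 normal equations. The `trilaterate` name and the flat x/y frame in metres are illustrative choices.

```python
def trilaterate(anchors, distances):
    """Least-squares 2D position from distances to at least three known anchors.

    Each anchor i gives (x - xi)^2 + (y - yi)^2 = ri^2. Subtracting anchor 0's
    equation from anchor i's removes the quadratic terms:
        2(xi - x0) x + 2(yi - y0) y = r0^2 - ri^2 + xi^2 + yi^2 - x0^2 - y0^2
    """
    (x0, y0), r0 = anchors[0], distances[0]
    # Accumulate the normal equations (A^T A) p = A^T b.
    s_xx = s_xy = s_yy = t_x = t_y = 0.0
    for (xi, yi), ri in zip(anchors[1:], distances[1:]):
        ax, ay = 2 * (xi - x0), 2 * (yi - y0)
        b = r0**2 - ri**2 + xi**2 + yi**2 - x0**2 - y0**2
        s_xx += ax * ax
        s_xy += ax * ay
        s_yy += ay * ay
        t_x += ax * b
        t_y += ay * b
    det = s_xx * s_yy - s_xy * s_xy
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is ambiguous")
    # Solve the 2x2 system with Cramer's rule.
    x = (s_yy * t_x - s_xy * t_y) / det
    y = (s_xx * t_y - s_xy * t_x) / det
    return x, y

# Three routers at known spots; the device is actually at (3, 4).
print(trilaterate([(0, 0), (10, 0), (0, 10)], [5.0, 65 ** 0.5, 45 ** 0.5]))
```

With noisy distances and more than three anchors, the same least-squares formulation simply averages out some of the error rather than requiring the circles to meet at one point.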
When you turn on your wifi you don’t actually get a better GPS signal; rather, your phone triangulates against a database of known wifi router locations that you yourself have inadvertently helped populate simply by moving around in the world with your phone on. Companies like Skyhook, Apple and Google maintain proprietary databases of wifi routers that they can reference when your device can see several of them at once.
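A lookup of this kind can be sketched as a weighted centroid: find each router the phone can hear in the database, then average their stored locations, weighting stronger signals more heavily. The database contents, the BSSIDs and the weighting scheme below are hypothetical; the commercial databases mentioned above are proprietary and certainly more sophisticated.

```python
# Hypothetical database: router MAC address (BSSID) -> surveyed (lat, lon).
KNOWN_ROUTERS = {
    "aa:bb:cc:00:00:01": (59.91390, 10.75220),
    "aa:bb:cc:00:00:02": (59.91395, 10.75260),
    "aa:bb:cc:00:00:03": (59.91370, 10.75240),
}

def weighted_centroid(scan, known):
    """Estimate (lat, lon) from a wifi scan {bssid: rssi_dbm}.

    Routers missing from the database are skipped. Averaging raw lat/lon is
    only reasonable over small areas, such as the range of a single scan.
    Returns None if no scanned router is in the database.
    """
    total_w = lat_acc = lon_acc = 0.0
    for bssid, rssi_dbm in scan.items():
        if bssid not in known:
            continue
        w = 10 ** (rssi_dbm / 10)  # dBm -> milliwatts, so stronger signals weigh more
        lat, lon = known[bssid]
        total_w += w
        lat_acc += w * lat
        lon_acc += w * lon
    if total_w == 0.0:
        return None
    return lat_acc / total_w, lon_acc / total_w

scan = {"aa:bb:cc:00:00:01": -48, "aa:bb:cc:00:00:02": -70, "de:ad:be:ef:00:00": -55}
print(weighted_centroid(scan, KNOWN_ROUTERS))
```

The estimate can never be better than the spacing and survey accuracy of the routers themselves, which is one reason this approach lands in the several-meter range.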
Wifi and Bluetooth triangulation obviously require some hardware setup, but in today’s urban environments they are increasingly likely to work with minimal effort. However, both methods suffer from the same problem of precision, because of the relatively narrow bandwidths of the channels they communicate on. You simply can’t get the level of precision required for an AR metaverse from wifi or Bluetooth signals.
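The precision limit comes down largely to bandwidth. A common rule of thumb for time-of-flight ranging is that two arrival times can only be told apart if they differ by about 1/B, where B is the signal bandwidth, which gives a range resolution of roughly c/(2B) for a round trip. The sketch applies that rule to typical channel widths; it is a back-of-the-envelope bound, and real systems can do better or worse depending on signal processing and multipath.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def range_resolution_m(bandwidth_hz: float) -> float:
    """Rule-of-thumb round-trip range resolution c / (2B), in metres."""
    return SPEED_OF_LIGHT_M_PER_S / (2.0 * bandwidth_hz)

channels = [
    ("Bluetooth LE, ~2 MHz channel", 2e6),
    ("Wi-Fi, 20 MHz channel", 20e6),
    ("Wi-Fi, 80 MHz channel", 80e6),
    ("UWB, ~500 MHz channel", 500e6),
]
for name, bandwidth_hz in channels:
    print(f"{name}: ~{range_resolution_m(bandwidth_hz):.2f} m")
```

The name "ultra-wideband" points at exactly this lever: a channel roughly 25 times wider than a 20 MHz wifi channel gives roughly 25 times finer timing.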
A Challenger Appears
The new UWB (ultra-wideband) chip in modern phones might be the undiscovered underdog hero of the spatial computing arms race. Intended for relative positioning between two devices, UWB transmits extremely short pulses spread across a very wide band of frequencies, which lets two devices determine their relative position with an accuracy of a few centimeters. Apple uses it in its AirTag and its Nearby Interaction framework, and companies like Pozyx, Zebra and infsoft provide tagging solutions using UWB as well.
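One common way UWB radios measure distance is two-way ranging: device A sends a poll, device B waits a known reply delay and answers, and A times the whole round trip. A minimal sketch of the single-sided variant, assuming both durations are already available in seconds (real chips report timestamps in device-specific clock ticks); the `twr_distance` helper is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def twr_distance(t_round_s: float, t_reply_s: float) -> float:
    """Single-sided two-way ranging.

    t_round_s: time at device A from sending the poll to receiving the reply.
    t_reply_s: processing delay at device B between receiving the poll and replying.
    The signal crosses the gap twice, so one-way time of flight is half the difference.
    """
    time_of_flight_s = (t_round_s - t_reply_s) / 2.0
    return SPEED_OF_LIGHT_M_PER_S * time_of_flight_s

# Devices 10 m apart, with a 200 microsecond reply delay.
t_reply = 200e-6
t_round = t_reply + 2 * 10.0 / SPEED_OF_LIGHT_M_PER_S
print(f"{twr_distance(t_round, t_reply):.3f} m")  # 10.000 m
```

Because the reply delay is much longer than the flight time, small differences between the two devices' clocks distort the result; double-sided two-way ranging, which adds a second exchange, is the usual remedy.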
The limitation of UWB is its range, just like with wifi and Bluetooth. Today Apple recommends keeping devices within 9 meters of each other for optimal results, but UWB as a technology can actually outrange wifi, with a total range of up to 200 meters. Its intended use is relative positioning, but the passive triangulation approach we discussed earlier in this article suggests the technology may have deeper implications. Can UWB triangulation be the silver bullet that solves spatial computing, a new geospatial system to replace GPS in urban and indoor environments? Can it be made accurate enough to serve as the foundation for an AR metaverse?
At Auki Labs we think the answer is yes, and we’ve started building the world’s first multipeer continuous calibration system with probabilistic consensus algorithms to allow for passive and continuous positioning in the metaverse — even for blind devices without a camera. Ephemeral mesh networks of relative positions will reference a Skyhook-like database of known stationary UWB devices, and turn relative positions into absolute geopositions orders of magnitude more precise than GPS.
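The consensus algorithm itself is not spelled out here, so the following is only a generic stand-in for two of the steps the paragraph mentions: turning a UWB-measured offset from a stationary anchor with a known, surveyed geoposition into an absolute latitude and longitude, and combining several such estimates by inverse-variance weighting. The anchor coordinates, offsets and uncertainties are hypothetical, and a real system would have to handle correlated errors and outliers that this sketch ignores.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation

def offset_to_geo(anchor_lat, anchor_lon, east_m, north_m):
    """Turn a small east/north offset (metres) from a surveyed anchor into lat/lon.

    Equirectangular approximation: adequate over tens of metres, not near the poles.
    """
    lat = anchor_lat + math.degrees(north_m / EARTH_RADIUS_M)
    lon = anchor_lon + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(anchor_lat)))
    )
    return lat, lon

def fuse(estimates):
    """Inverse-variance weighted mean of independent position estimates.

    estimates: list of ((lat, lon), sigma_m). Returns ((lat, lon), fused_sigma_m).
    """
    w_sum = lat_acc = lon_acc = 0.0
    for (lat, lon), sigma_m in estimates:
        w = 1.0 / sigma_m ** 2  # tighter estimates count for more
        w_sum += w
        lat_acc += w * lat
        lon_acc += w * lon
    return (lat_acc / w_sum, lon_acc / w_sum), math.sqrt(1.0 / w_sum)

# Hypothetical anchors with surveyed positions, and UWB-derived offsets to one device.
estimates = [
    (offset_to_geo(59.9139, 10.7522, 3.0, 4.0), 0.10),
    (offset_to_geo(59.9140, 10.7521, -2.6, -7.2), 0.10),
    (offset_to_geo(59.9138, 10.7524, -8.1, 15.1), 0.30),
]
position, sigma = fuse(estimates)
print(position, f"±{sigma:.3f} m")
```

Inverse-variance weighting is the simplest probabilistic way to combine independent estimates: the fused uncertainty is always smaller than the best single input, which is the intuition behind letting many nearby devices share what they measure.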
This approach will be cheaper to host, because it does not need to store (or even scan) the entire world’s digital twin, cheaper to run, because it does not need expensive cameras and on-device computation for computer vision, and more accessible to IoT devices.
If we succeed, we believe it changes everything. Auki Labs is on a mission to help every person and device find their place in the world. Literally.