Thursday, August 15, 2024

My Thoughts on Amazon/Audible Virtual Voice Audiobooks as an Author and Publisher

As many of you are likely already aware, I have been participating in Amazon's Virtual Voice audiobook beta for the past several months.

Thus far, I've published three of my books via the system, and intend to publish several more over the next six to twelve months.

I have also recorded, produced, and published a traditional, self-narrated audiobook via Amazon's ACX platform, a task I completed shortly before being invited to participate in the beta. I therefore feel reasonably confident making direct comparisons between the two experiences, which I believe will be useful in articulating and supporting some of the points I'd like to make here.


My first Virtual Voice audiobook, "309," as it appears on Audible's website.


To address the elephant in the room: "Is Amazon's Virtual Voice technology AI?" In my professional opinion, as someone who worked as a full-stack software engineer for over twenty years, I would say the answer to that question is a definitive "No." That said, I do understand why many people might make assumptions or have concerns about the product and its mechanisms given the current state of affairs in the technology industry at large, so I will do my best to expand on that "no" a bit in a way that I hope will be understandable and relatable to those of you out there reading this who may lack the technical expertise to make such determinations.

First and foremost, an important clarification: In recent years, the term "AI" has been increasingly misused and misinterpreted by a variety of entities as a sort of catch-all for algorithmic operations that can and should be more accurately described as advanced automation and data-driven prediction. A big part of this has been perpetuated by the technology industry's (in my opinion) foolish and misguided attempt to turn a technical term and theoretical premise into a marketing buzzword. The simple fact of the matter is that the fundamental technologies at the core of everything from chat bots, to automated image and video manipulation and generation, to "spontaneous" code creation, documentation, and modification have existed in various forms for decades. Yes, the ways those building blocks are being arranged and leveraged have advanced and changed over the years, as has the computational capacity of the systems running such code, but the vast majority of what we're seeing pitched and reported on as "AI" these days is only a natural evolution of software development practices that have been "in play" for ages.

Another important distinction that is often overlooked, and which I believe is of particular relevance with regard to Amazon's Virtual Voice tech, is the amount of human intervention required for such "AI" tools to produce useful and acceptable results. I suspect that a big part of what allows people to make the leap of calling something "AI" is the lopsided ratio of input versus output that often occurs when using such tools.

Historically, and indeed until fairly recently, it was necessary to provide a computer with an amount of time, effort, and data at least comparable to, and more often than not far in excess of, the end result or "finished product" one was attempting to have the computer help produce. Programmers might spend days, or weeks, or months, or even years writing code to generate specialized results that might well only ever be used in a single context (i.e., to solve a single problem), and creatives might spend as much time using such software to produce content, be it writing a book with a word processor, or producing a video with a non-linear video editor, or recording a song with a digital audio workstation. Now, systems exist that can produce at least somewhat convincing text, or audio, or even video as a result of a simple, one-line prompt delivered as conversational language (i.e., how humans typically speak to each other - not code). It's no wonder that many people, particularly those who don't understand the fundamental technologies driving such systems, tend to view such things as "magical" or ascribe "intelligence" to them.

Of course, any reputable computer scientist can spot the difference between such "tricks" and true artificial intelligence (i.e., sentience) from a mile away, but most people aren't computer scientists, and therein lies the problem. If something appears to be intelligent and the people using it lack the ability to conclusively determine otherwise, it's very easy for the vast majority of users to simply shrug and say, "I guess we have AI now."

Take it from someone who knows. We do not. What the tech industry calls "AI" in 2024 is merely a result of the mind-boggling amounts of data and computing power that humanity has amassed over the past fifty or so years since computers became ubiquitous. All of these things could have been done in the 1970s, or at the very least the 1990s (an era that saw some important developments in programming principles such as object-oriented design and multithreaded parallel processing), if we'd had the data sets and CPUs/GPUs we do today.

And of course, that's not to mention all the outright theft and unauthorized use of human-generated content and intellectual property that's been so often "leveraged" to "create" the essential data sets for most if not all of the "AI" systems currently being touted as "revolutionary innovations" by the industry.

But I digress...

My point is that Amazon's Virtual Voice tech, which to the best of my knowledge is not marketed as AI but is often assumed to be, is not a small-input/massive-output system akin to something like a chat bot. It's a tool wherein the person using it has to make a considerable effort to "teach" the system how to effectively read the individual book being produced. For each of the three books I've released with Virtual Voice, I had to spend dozens of hours listening to them repeatedly, tweaking the timing, pronunciation, and reading speed of hundreds of individual words, phrases, and paragraph breaks to produce the results a customer hears when listening to the completed audiobook. This is largely because the system (good and impressive as it is) is far from perfect and indeed contains shortcomings that I was unable to work around completely and had to "live with" in the same way I would when using any other creative tool. That said, I'm fairly confident that the tool will improve over time, and I do firmly believe that the results it can produce (even now) are at the very least acceptable if sufficient effort is made to make them sound as good as possible, which is what I have done and why I feel good about using Virtual Voice to publish audio editions of my works.

The other (I imagine) obvious question many of you might have is: "Why are you using this tool when you have the ability to produce your own audio?" There are several answers to that:

  1. Having worked in the technology industry for a long time, I'm very aware of the profound impacts that new tech can have on other industries. I'm therefore always keen to study and assess such things as quickly as possible to determine whether they might prove a help or hindrance to the things I'm attempting to do as a creator. That made participating in a beta such as this a no-brainer for me, but there's admittedly a bit more to it.
  2. Having completed the work on the audiobook for my first novel, "The Big Men," just prior to the invite, I was very aware of how much time and effort such a tool might save me, and the possibilities it might open up, such as having a female voice read "309," a book featuring a female main character. That was something I wanted to experiment with and an opportunity I couldn't in good conscience let pass.
  3. The bottom line is that, had I not been invited into the Virtual Voice beta, I would have simply gone on recording and producing my own audiobooks with my own voice, as I did my first (admittedly at a much slower rate), so I felt completely comfortable exercising that option, knowing that the only person I'd be taking any work from would be myself.

All that said, I wouldn't begrudge anyone else with other options choosing to make use of a tool like Virtual Voice for any reason. If you are happy with the results you're able to produce with it and you firmly believe that your readers will be happy (or at least content) with those results as well, then, by all means, do what you think is best with your work. I certainly don't have the budget to pay for anyone else to narrate my books, so if I were unable or unwilling to do so myself, a tech like this could absolutely be a godsend. Even if I could afford traditional narration, it would still be my right as the creator of the works to put them out into the world in whatever way, and using whatever tool(s), I see fit.

To wrap things up, I've been largely happy with my Virtual Voice audiobook publishing experience so far, and I see a lot of potential for things to improve considerably over the next few years as the technology and its capabilities mature and grow. There have definitely been some rough spots in terms of technical issues, negative customer reactions, and instances where I've wished for the system to be a bit more capable than it currently is, but all things considered, I'd say it's absolutely worth a look for any author or publisher pondering audiobook production options.

My current audiobooks:

  • The Big Men (self-narrated)
  • 309 (female Virtual Voice)
  • Shards (male Virtual Voice)
  • The Nemesis Effect (male Virtual Voice)