My name is Kimberly Rashell, and I have been a broadcast captioner for the past 13 years. I have worked for two of the largest captioning firms in the U.S., and I have also worked as an independent contractor for multiple smaller firms. In my career, I have captioned anything and everything from local news to live Olympic events.
The court reporting and captioning college that trained me was very cognizant of the needs of Deaf and hard of hearing viewers. From the very first day, it was drilled into my head that comprehension and readability trump verbatim every day. We have one – and only one – chance to make sure that the captions we are providing make sense to someone who cannot hear what was said.
The Shift to ASR
Recently, it has been announced by multiple media groups in the U.S. and Canada that they are moving away from using skilled captioners to provide captions on live television broadcasts.
Instead, they have decided that Automatic Speech Recognition (ASR) has developed to the point that it can be used. This is nothing but a cost-saving measure on the part of the media groups, who would rather buy software once than pay a skilled person each time they need captioning.
This roll-out of ASR into multiple media outlets in the U.S. and Canada has raised some serious red flags for those who depend on captions for equal access to audio information. While ASR can do a fine job of translating spoken words into written ones in an absolutely perfect situation (no background noise, no accented speech, and only one person talking at a time), it fails miserably in most real-world scenarios.
Verbatim Isn’t Enough
One of the repeated claims about ASR’s superiority to a skilled captioner is that it is verbatim. Since the roll-out of ASR, we have seen examples proving that ASR is most definitely NOT verbatim, but even more than that, in the context of the captioning environment, being verbatim is not enough.
The words spoken are only part of the story. ASR is missing the tone and the context that lead to a deeper understanding of the meaning behind those spoken words.
The Human Element
I strive every day to write the exact words I hear. However, because I am human, I also have the mental flexibility to recognize that I don’t know how to spell the last name of Debbie, the witness, especially if it is unique or has an unexpected spelling. So, until I can see her name on the screen, if I ever do, she will just be “Debbie” or “the witness” or “the neighbour.”
The same thing is true for the suspect who was identified five minutes ago with the foreign name that can be commonly spelled five different ways.
Also, because I am human, I can understand the nuances of language. When an anchor is trying to be funny and comes back from a story about a cat playing the piano and says, “Wasn’t that just purr-fect?” I can spell it that way so the person who can’t hear the sarcasm in the anchor’s voice and the rolled “R” can be in on the joke.
People make up words all the time for comedic effect or to aggrandize something. There are also local and regional names belonging to parks, buildings, streets, storms, and people that have unique spellings or pronunciations. When an ASR system hears an unfamiliar word, it will do its best to decipher what it thinks it hears and break it up into multiple smaller words that have absolutely nothing to do with the context.
ASR is not listening to context or trying to make the words actually make sense to a reader. It is simply hearing syllables and trying to make a word match each syllable as it is being spoken, whether it makes sense or not.
Because I am human, I can discern background noises and tell the difference between multiple voices. I can identify who is talking, and I can pick up on subtle shifts in tone that indicate we’ve moved on to a new topic. In the captions, those indications look like “>>” for a new speaker (if I don’t know for sure who is speaking), “>> Nancy” (if I know for sure who is talking), and “>>>” for a new topic.
ASR is unable to insert these into the captions. Anchors and reporters on your local news regularly speak over 200 words per minute. The captions are only two or three lines long. This means you have only a split second to see the words on the screen before they are gone forever.
Imagine trying to speed read a book that has no paragraphs, chapters, or quotation marks. Without those reading cues to help you find your place in the story, it will be exhausting trying to figure out who is talking about what and when.
Because I am human and have 40-plus years of grammar and punctuation under my belt, I know when to add a period, comma, or dash to help a reader understand context.
Since ASR is not listening for context, it does not care whether grammar or punctuation is correct. So, again, imagine reading that same novel but with no correct punctuation. It’s just page after page of words scrunched together in a block in the middle of a page. It would be absolutely exhausting to decipher any meaning from that at all.
Now, put that on a screen two lines at a time and give yourself barely two seconds to read each line. The mental gymnastics required to make heads or tails of the meaning of those words, even if they are verbatim, will exhaust even the best of us.
Because I am human, I can also give the reader information about what they may be missing in the background that is essential to the story being shown, like [Gunshots], [Crying], [Applause], [Doorbell], [Phone Ringing].
All those audio cues, which sometimes happen without a hearing person even consciously registering them, help you understand what you see on the screen.
Why did that reporter suddenly jump and turn her body away from the camera? Oh, because there were gunshots behind her.
ASR will not be adding these audio cues.
Because I am human, I can drop the stutters, false starts, okays, ums, and throw-away words. Leaving them out slows the reading speed enough to allow a reader to actually comprehend what is being said.
ASR does not have this ability. ASR is trying to make a word for every utterance it hears. When people are already speaking 200-300 words per minute, the “um” and “ah” in the middle of the text can distract from meaning and comprehension.
A Captioner’s Role – Better Than Verbatim
As a captioner, I am not creating a legal record that will stand forever. My words are here and gone. I have just one chance to get it right, one chance to make sure the content is delivered coherently.
So, no, even though I strive to be a verbatim writer every day, sometimes I have to be better than verbatim. My job is not just to deliver words. It’s to deliver content. For the reasons I’ve outlined above, ASR is not up to that task, not by a long shot.
Consumers Must Demand Quality
Anyone who relies on captioning should speak up and demand equal access. Contact the local stations that are putting out these subpar and unreadable captions. Let them know you rely on correct and complete information.
Don’t let these media groups make a buck on your backs by replacing skilled human captioners with computers. They do not care whether you understand the news. They care about how much money they are saving by using a computer instead of a person.
Complain when you see captions you can’t make sense of. It’s your right to have equal access to the same information available to the hearing public. Until the stations understand how many people rely on quality captions, they will do everything they can to save a dollar, including removing your access to information.