If a 10-second message is sliced into 10 overlapping 2-second pieces, then 10 different humans across the globe have to parse the few words out of each piece and send it back. The software (running on an in-house computer) can then process the overlapping transcriptions and put together the full text, even applying automatic spell- and grammar-checkers to smooth out the result.
What a terrible solution. Don't you see the problems with it? If you split a sentence into shorter parts, you lose the context, and with it the meaning of larger units (idioms, compound words and adjectives, phrasal and prepositional verbs). You will also have trouble with homophones and other similar-sounding words (the input is audio, after all). Take part of your own first sentence as an example: imagine you have to transcribe just "...the globe have...". Without the "across", you wouldn't know whether to write "glove" or "globe".
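To make the objection concrete, here is a rough sketch of the kind of stitching step the proposal relies on. It is only an illustration under my own assumptions, not anyone's actual implementation: the function name `merge_overlapping` is hypothetical, the transcriptions are assumed to arrive as word lists, and the merge is a naive longest suffix/prefix match on exact words.

```python
def merge_overlapping(pieces):
    """Stitch word lists together by dropping the longest exact
    suffix/prefix overlap between consecutive pieces (naive sketch)."""
    if not pieces:
        return []
    merged = list(pieces[0])
    for piece in pieces[1:]:
        overlap = 0
        # Try the longest possible overlap first, then shrink.
        for k in range(min(len(merged), len(piece)), 0, -1):
            if merged[-k:] == piece[:k]:
                overlap = k
                break
        merged.extend(piece[overlap:])
    return merged


# Both listeners heard "globe": the overlap is found and removed.
agree = [
    "humans across the globe".split(),
    "the globe have to".split(),
]

# The second listener, lacking "across", wrote "glove": no overlap matches.
disagree = [
    "humans across the globe".split(),
    "the glove have to".split(),
]

print(" ".join(merge_overlapping(agree)))
print(" ".join(merge_overlapping(disagree)))
```

With exact matching like this, a single misheard word in the overlap is enough to leave both versions in the output, and a fuzzier merge would still have to guess which listener was right. A spell-checker won't help either, since "glove" is a perfectly valid word.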