
Raghad Alsulami
further. Chapters are self-contained, allowing readers to start at any point
without disrupting their overall reading experience. However, I personally
recommend that readers interested in learning how to build their own MT
systems start with Chapter 2, which focuses on data, then proceed to Chapter 4
for insights into the inner workings of NMT, and later explore the more
advanced topics in Chapters 6, 7, and 10. Following this sequence will, I hope,
develop an incremental and comprehensive understanding of building, training,
and optimizing MT systems. As the book shows throughout, the role of
translators is more crucial than ever in this data-driven AI age.
The authors thus point to some open issues in MT in the afterword, where
human input is not just supplementary but rather indispensable.
In Chapter 1, the authors begin by taking us on a seventy-odd-year
journey through MT history, starting from Warren Weaver’s 1949
memorandum, moving to rule-based and statistical MT, and leading to the
more recent developments of NMT and Large Language Models (LLMs). The
authors touch on the shortcomings of previous MT paradigms and remind us
at the end of the chapter that, even with recent breakthroughs in AI, MT is still
far from being a solved problem, contrary to what some commentators claim. Both
NMT and LLMs are data-driven and in need of large-scale high-quality
resources, something that is not equally guaranteed for the roughly 7,000
languages spoken today. The paucity of resources is one reason for the authors’
belief that AI advances will not replace translators, a stance they firmly
maintain through the entire book.
Having introduced MT in the first chapter, the authors shift their focus in
Chapter 2 to data, the cornerstone of MT system development. In very
accessible language, the authors present various sources of data that can be
used for building MT systems, including translation memories, open-access
repositories, data harvested from the web, and synthetic data. Data-related
issues such as alignment, toxicity, bias, ownership, and data insufficiency are
also discussed, along with a few data postprocessing steps. Dedicating an entire
chapter to data is a commendable aspect of the book, as it addresses key
questions that readers are likely to be pondering. Where can one find reliable data?
How much data is sufficient for building a well-performing MT system? Who
holds ownership of the data? And what options are available when large
volumes of high-quality data are simply out of reach? By tackling these
foundational points upfront, the book paves the way for a smoother transition
into the more advanced chapters on building MT systems.