Digital methods have revolutionized many aspects of the study of pre-modern Chinese literature, from the simple but transformative ability to perform full-text searches and automated concordancing, through to the application of sophisticated statistical techniques that would be entirely impractical without the aid of a computer.
While the methods themselves have evolved significantly – and continue to do so – one of the most fundamental prerequisites for almost all digital studies of Chinese literature remains access to reliable digital editions of the texts themselves.
Volunteers located around the world correct mistakes and add modern punctuation to the texts as time allows and according to their own interests – typically hundreds of corrections are made each day.
As libraries have increasingly come to recognize the value of digitizing historical works in their holdings, many institutions with significant collections of Chinese materials have committed themselves to large-scale scanning projects, often making the resulting images freely available over the internet.
While this is an enormously positive development in itself, for many scholarly use cases it represents only the first step towards adequate digitization of these works.
Scanned images of the pages of a book make its contents accessible in seconds rather than requiring a time-consuming visit to a physical library, but without a machine-readable transcription of the contents of each page, the reader must still navigate through the material one page at a time – finding a particular word or phrase in the work, for example, remains a time-consuming task.
While Optical Character Recognition (OCR) – the process of automatically transforming an image containing text into digitally manipulable characters – can produce results of sufficient accuracy to be useful for full-text search, OCR inevitably introduces a significant number of transcription errors that can only be corrected by manual effort, particularly when applied to historical materials, which may be handwritten, damaged, or faded.
Proofreading the entire body of material potentially available – likely amounting to hundreds of millions of pages – would be prohibitively expensive, but omitting the proofreading step limits the utility of the data.
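The trade-off between uncorrected OCR and full-text search can be made concrete with a small sketch. The Python below is illustrative only and is not drawn from any particular system: the function names (`edit_distance`, `character_error_rate`, `pages_containing`) and the sample strings are hypothetical. It assumes one plausible kind of OCR error, the visually similar pair “曰” and “日” being confused, and shows that a transcription with a low character error rate can still cause an exact-match search to miss a phrase.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the proofread reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


def pages_containing(pages: dict[int, str], query: str) -> list[int]:
    """Page numbers whose transcription contains the query verbatim."""
    return [number for number, text in pages.items() if query in text]


# Hypothetical sample: a proofread line and an OCR reading of the same line
# in which 曰 has been misrecognized as 日 (an assumed error, for illustration).
reference = "子曰學而時習之不亦說乎"
ocr_output = "子日學而時習之不亦說乎"

pages = {1: ocr_output}
cer = character_error_rate(reference, ocr_output)
found_exact = pages_containing(pages, "子曰")        # affected by the error
found_other = pages_containing(pages, "學而時習之")  # unaffected by the error
```

Here a single wrong character out of eleven is enough for a search for “子曰” to return no pages, while a search for a phrase elsewhere in the line still succeeds. Uncorrected OCR is therefore useful for search but not fully reliable.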
[Figure: Variation in instances of the character “書” in texts from the Siku Quanshu.]