Digitized text is any kind of text that can be read on a computer, but that text can take many forms. The treasure of digital history is an incredible boon to historians, offering possibilities for online research and teaching that would have been unimaginable just a few years ago. Capturing more information and sampling more frequently make digitizing more expensive: it takes longer to gather and transmit more complete information, and it costs more to store it. The simplest format is a “page image” produced by scanning a printed page or a roll of microfilm. These digital photocopies have three major advantages and an equal number of serious drawbacks. The first advantage is that you can create them quickly and easily. The second is that good page images closely represent the original. The page image of the WPA life interview mentioned earlier not only shows you the editor’s handwritten insert but also indicates precisely where he inserted it. The third is that page images give a “feel” for the original.
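The point about cost can be made concrete with a rough estimate of how large an uncompressed page image is. The sketch below is only illustrative: the function name, the letter-size page (8.5 by 11 inches), and the resolutions and bit depths are assumptions chosen for the example, and real image formats usually compress the data, so actual files are smaller.

```python
def uncompressed_scan_bytes(width_in: float, height_in: float,
                            dpi: int, bits_per_pixel: int) -> int:
    """Estimate the raw size in bytes of a scanned page (hypothetical helper).

    width_in, height_in: page size in inches
    dpi: sampling resolution in dots per inch
    bits_per_pixel: 1 for black and white, 24 for full color
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed example: one letter-size page scanned three different ways.
for dpi, depth in [(300, 1), (300, 24), (600, 24)]:
    size = uncompressed_scan_bytes(8.5, 11, dpi, depth)
    print(f"{dpi} dpi, {depth}-bit: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the number of pixels, which is why sampling more finely raises storage and transmission costs so quickly.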
Document mark-up predates the Internet and even computers. Traditionally, copy editors manually marked up manuscripts for typesetters, telling them, for example, to set the chapter title in “24 Times Roman.” Computerized typesetting meant that mark-up had to be computer readable, but the specific codes differed depending on the software program. IBM, the dominant computer company of the 1960s and 1970s, made extensive use of its Generalized Markup Language (GML), the basis for the Standard Generalized Markup Language (SGML), which emerged in 1986 after a long process as the international standard.
After the rapid development of data processing technologies in the United States in the first half of the 20th century, it became apparent that a standard code covering all of the characters used in English was needed for the interchange of data. The result was ASCII, the American Standard Code for Information Interchange, which was designed to achieve compatibility among the various types of data processing equipment. Computers can only work with numbers, so ASCII assigns a numerical code to each character, such as the letter “a”, and to each control action, such as moving to the next line. The ASCII character set was limited, so other character code sets had to be created for other languages. As these code sets multiplied, problems arose when people tried to use them together, and no single encoding system contained enough characters. Toward the end of the 1980s, two independent groups began work on a single unified character set: the International Organization for Standardization (ISO) and the Unicode Project, which was organized by a consortium of multilingual software manufacturers, mainly from the United States. Their work produced Unicode, which aims to cover the characters of all the world’s writing systems, including symbols, punctuation marks, and the other characters used in writing text. Sometimes new signs have to be added to the Unicode table. When the € symbol was introduced, most existing code tables were more or less full, and Unicode was the character set that could most easily accommodate it. Unicode allocated a free cell to the new character and published the character code of the Euro symbol.
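The idea that every character is stored as a number can be checked directly. The following is a minimal Python sketch, not part of the original discussion: the helper names are made up for illustration, and it relies on the Euro sign having the Unicode code point U+20AC, which lies outside the 128 codes that ASCII defines.

```python
def code_point(ch: str) -> int:
    """Return the number assigned to a single character."""
    return ord(ch)


def is_ascii(ch: str) -> bool:
    """ASCII defines only the codes 0 through 127."""
    return ord(ch) < 128


EURO = "\u20ac"  # the Euro sign, written as an escape for its code point

letter_a = code_point("a")    # a printable character
newline = code_point("\n")    # a control action: move to the next line
euro = code_point(EURO)

# UTF-8 is one common way of storing Unicode code points as bytes.
euro_utf8 = EURO.encode("utf-8")

# ASCII has no cell for the Euro sign, so encoding it as ASCII fails.
try:
    EURO.encode("ascii")
    euro_fits_ascii = True
except UnicodeEncodeError:
    euro_fits_ascii = False

print(letter_a, newline, hex(euro), euro_utf8, euro_fits_ascii)
# prints: 97 10 0x20ac b'\xe2\x82\xac' False
```

The letter “a” and the newline both fit in ASCII, while the Euro sign needs a larger table, which is the gap that Unicode was created to fill.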