We now zoom in on the direct pathway between visual word inputs (orthography) and verbal speech output (phonology), using a much larger set of words comprising most of the monosyllabic words in English (nearly 3,000 words). By learning on such a large selection of words, sampled according to their frequency of occurrence in English, the network has a chance to extract the "rules" that govern the mapping between spelling and sound in English (such as they are), and thus be able to successfully pronounce nonwords.
English is a particularly difficult language from a pronunciation perspective, as anyone knows who has tried to acquire it as a second language. There are very few (if any) absolute rules. Everything is more of a partial, context-dependent regularity, which is also called a subregularity. For example, compare the pronunciation of the letter i in mint and hint (short i sound) to that in mind and find (long I sound). The final consonant (t vs. d) determines the pronunciation, and of course there are always exceptions such as pint (long I sound).
One way to classify the strength of a regularity is to count how many other letters the pronunciation depends upon. A complete exception like pint or yacht depends on all the letters in the word, while mint vs. mind depends on one other letter in the word (the final t or d). There are many silent letter examples, such as the final e in many words. A nice subregularity is the letter pair mn, where one of the two letters goes silent: the n in damn and column, and the m in mnemonic. Many other consonants can be silent with varying degrees of subregularity, including b (debt), d (handsome), h (honest), l (halve), p (coup), r (iron), s (aisle), t (castle), w (sword), and z (rendezvous).
Another factor that determines how much context is required to pronounce a given letter is the preponderance of multi-letter groups like th (think), which have a particular regular pronunciation that differs from that of the individual letters taken separately. Other examples include sch (school), tch (batch), gh (ghost), ght (right), kn (knock), ph (photo), and wh (what). One of the most context-sensitive letter groups is ough, as in though, tough, cough, plough, through, and nought, where the pronunciation varies widely.
So English is a mess. The constructed word ghoti is a famous example of how crazy it can get. It is pronounced "fish", where the gh is an f sound as in tough, o is an i sound as in women, and ti is a sh sound as in nation.
For any system to have a chance of producing correct English pronunciation, it must be able to take into account a range of context around a given letter in a word, all the way up to the entire word itself. An influential early approach to simulating spelling to sound in a neural network (Seidenberg & McClelland, 1989) used a so-called Wickelfeature representation (named after Wayne Wickelgren), in which the written letters were encoded in groups of three adjacent letters. For example, the word "think" would be encoded as thi, hin, and ink. This is good for capturing context, but it is a bit rigid, and it does not exploit the considerable amount of regularity in individual letters themselves (most of the time, an m is just an m). As a result, this model did not generalize very well to nonwords, where letters showed up in different company than in the real words used in training. A subsequent model by Plaut, McClelland, Seidenberg, and Patterson (1996) (hereafter PMSP) achieved good nonword generalization by representing input words through a hand-coded combination of individual letter units and useful multi-letter contexts (e.g., a th unit).
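The letter-triple idea can be sketched in a few lines of Python. This is only an illustration of the simplified description given here (overlapping groups of three adjacent letters); the function name `letter_triples` and the tiny word lists are hypothetical, and details a full model might use, such as word-boundary markers or distributed coding over units, are left out.

```python
def letter_triples(word: str) -> list[str]:
    """Return every run of three adjacent letters in `word`, left to right."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


print(letter_triples("think"))  # ['thi', 'hin', 'ink']

# Toy "training set" (hypothetical): collect every triple the system has seen.
seen = {t for w in ["think", "thin", "sink"] for t in letter_triples(w)}

# A nonword built from familiar letters can still consist of unfamiliar triples.
novel = [t for t in letter_triples("thunk") if t not in seen]
print(novel)  # ['thu', 'hun', 'unk']
```

Every letter of thunk occurs in the toy training words, yet none of its triples does, which is one way to see why a purely triple-based code can generalize poorly to nonwords.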
Figure \(9.8\): Word reading as a process of spatially invariant object recognition. Words show up in different locations in the input, and the next level up, equivalent to the V4 level in the object recognition model, extracts more complex combinations of letters, while also developing more invariant representations that integrate individual letters or multi-letter features over multiple different locations. The IT level representation then has a fully spatially invariant representation of the word (as a distributed representation integrating over individual letters and letter groups), which then provides a nice mapping to the phonological output.
We take a different approach in our spelling-to-sound model (Figure 9.8), leveraging ideas from the object recognition model that was explored in the Perception Chapter. Specifically, we saw that the object recognition model could learn to build up increasingly complex combinations of features, while also developing spatial invariance, over multiple levels of processing in the hierarchy from V1 through IT. In the context of word recognition, these complex features could include combinations of letters, while spatial invariance allows the system to recognize that an m in any location is the same as any other m (most of the time).
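As a rough illustration of what spatial invariance buys, the following sketch contrasts a position-specific code with a hand-built position-invariant one. It is not the network's learned representation: the functions `slot_code` and `invariant_code`, and the choice to pool single letters and adjacent letter pairs, are assumptions made purely for this example.

```python
from collections import Counter


def slot_code(word, offset):
    """Position-specific code: the set of (slot index, letter) pairs."""
    return frozenset((offset + i, ch) for i, ch in enumerate(word))


def invariant_code(slots):
    """Pool single letters and adjacent letter pairs over all slot positions."""
    by_pos = dict(slots)  # slot index -> letter
    letters = Counter(by_pos.values())
    pairs = Counter(by_pos[p] + by_pos[p + 1] for p in by_pos if p + 1 in by_pos)
    return letters, pairs


left = slot_code("mint", offset=0)
right = slot_code("mint", offset=3)
print(left == right)                                  # False: different slots
print(invariant_code(left) == invariant_code(right))  # True: same word anywhere

# Single letters alone cannot tell anagrams apart; letter pairs can.
tip = invariant_code(slot_code("tip", 0))
pit = invariant_code(slot_code("pit", 0))
print(tip[0] == pit[0])  # True: same letters
print(tip[1] == pit[1])  # False: pairs ti, ip vs. pi, it
```

The anagram case shows why the pooled features need to include letter combinations as well as individual letters.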
One compelling demonstration of the importance of spatial invariance in reading comes from this example, which made the rounds in email a few years ago:
- I cnduo't bvleiee taht I culod aulaclty uesdtannrd waht I was rdnaieg. Unisg the icndeblire pweor of the hmuan mnid, aocdcrnig to rseecrah at Cmabrigde Uinervtisy, it dseno't mttaer in waht oderr the lterets in a wrod are, the olny irpoamtnt tihng is taht the frsit and lsat ltteer be in the rhgit pclae. The rset can be a taotl mses and you can sitll raed it whoutit a pboerlm. Tihs is bucseae the huamn mnid deos not raed ervey ltteer by istlef, but the wrod as a wlohe. Aaznmig, huh? Yaeh and I awlyas tghhuot slelinpg was ipmorantt! See if yuor fdreins can raed tihs too.
Clearly this is more effortful than properly spelled text, but the ability to read it at all indicates that just extracting individual letters in an invariant manner goes a long way.
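Scrambled text of this kind is easy to generate, which makes it convenient to try on new sentences. The sketch below is a hypothetical helper (`scramble_interior` is not part of the model or its software) that shuffles the interior letters of each word while keeping the first and last letters in place. It treats each run of letters as a word, so unlike the email example it never moves an apostrophe.

```python
import random
import re


def scramble_interior(text, seed=0):
    """Shuffle the interior letters of every word, keeping first and last fixed."""
    rng = random.Random(seed)  # seeded so the result is reproducible

    def shuffle_word(match):
        word = match.group(0)
        if len(word) < 4:  # fewer than two interior letters: nothing to shuffle
            return word
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    # Only runs of letters are touched; spaces and punctuation are left alone.
    return re.sub(r"[A-Za-z]+", shuffle_word, text)


print(scramble_interior("reading scrambled words is surprisingly easy"))
```

Each output word contains exactly the same letters as its source word, so a fully position-invariant letter code would be unchanged by the scrambling.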
| Nonword Set | ss Model | PMSP | People |
|---|---|---|---|
| Glushko exceptions raw | 79.0 | 72.1 | 78.3 |
| Glushko exceptions alt OK | 97.6 | 100.0 | 95.9 |
| McCann & Besner ctrls | 85.9 | 85.0 | 88.6 |
| McCann & Besner homoph | 92.3 | N/A | 94.3 |
| Taraban & McClelland | 97.9 | N/A | 94.3 |
Table \(9.1\): Comparison of nonword reading performance for our spelling-to-sound model (ss Model), the PMSP model, and data from people, across a range of different nonword datasets as described in the text. Our model performs comparably to people, after learning on nearly 3,000 English monosyllabic words.
To test the performance of this object-recognition-based approach, we ran the model on several standard sets of nonwords, some of which were also used to test the PMSP model. The results are shown in Table \(9.1\). The nonword sets are as follows:
- Glushko regulars -- nonwords constructed to match strong regularities, for example nust, which is completely regular (e.g., must, bust, trust, etc).
- Glushko exceptions -- nonwords that have similar English exceptions and conflicting regularities, such as bint (could be like mint, but also could be like pint). We score these items either according to the predominant regularity, or also including close exceptional cases (alt OK in the table).
- McCann & Besner -- pseudo-homophones, which sound like actual words but are spelled in a novel way (for example choyce, pronounced like choice), together with matched controls (for example phoyce). These appear as homoph and ctrls in the table.
- Taraban & McClelland -- frequency-matched regular and exception nonwords, for example poes (like the high-frequency words goes or does) and mose (like the lower-frequency words pose or lose).
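The difference between the raw and alt OK scores for the Glushko exceptions can be made concrete with a small scoring sketch. Everything here is hypothetical: the function `percent_correct`, the informal pronunciation strings, and the two-item toy data are invented for illustration and are not the model's actual output or scoring code.

```python
def percent_correct(responses, regular, alternatives=None):
    """Percent of items whose response matches an accepted pronunciation.

    With `alternatives` omitted, only the predominant regular pronunciation
    counts (raw scoring). With it, close exceptional pronunciations are also
    accepted (alt OK scoring).
    """
    n_ok = 0
    for item, response in responses.items():
        accepted = {regular[item]}
        if alternatives is not None:
            accepted |= alternatives.get(item, set())
        n_ok += response in accepted
    return 100.0 * n_ok / len(responses)


# Made-up toy data: informal pronunciation strings, not real model output.
regular = {"bint": "b-ih-n-t", "nust": "n-uh-s-t"}
alternatives = {"bint": {"b-ay-n-t"}}  # like pint
responses = {"bint": "b-ay-n-t", "nust": "n-uh-s-t"}

raw = percent_correct(responses, regular)
alt_ok = percent_correct(responses, regular, alternatives)
print(raw, alt_ok)  # 50.0 100.0
```

In this toy case, pronouncing bint to rhyme with pint counts as an error under raw scoring but as correct under alt OK scoring.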
The results indicate that the model does a remarkably good job of capturing people's performance on these nonword reading sets. This suggests that the model is capable of learning the appropriate regularities and subregularities that are present in the statistics of English pronunciation.
- Open Spelling to Sound to explore the spelling-to-sound model, and test its performance on both word and nonword stimuli.