How Many Words Do You Know
Introduction
Researchers dealing with quantitative aspects of language are often asked how many words a typical native speaker knows. The question is raised not only by lay people but also by colleagues from various disciplines related to language processing, development, acquisition, and education. The answer usually starts with a deep sigh, followed by the explanation that the number depends on how a word is defined. As a consequence, in the literature one finds estimates going from less than 10 thousand to over 200 thousand (see below). In this paper, we try to give practical answers for American English depending on the definition of a word, the language input an individual is exposed to, and the age of the individual.
Terminology
In this text, we will need a number of terms related to words. For the readers' ease we summarize them here.
Word Types vs. Word Tokens
Word types refer to different word forms observed in a corpus; tokens refer to the total number of words in a corpus. If the corpus consists of the sentence "The cat on the roof meowed helplessly: meow meeooow mee-ee-ooow," then it has nine word types (the, cat, on, roof, meowed, helplessly, meow, meeooow, and mee-ee-ooow) and ten word tokens (given that the word type "the" was observed twice). Somewhat surprisingly, in some word counts the words "The" and "the" are considered as two different word types because of the capital letter in the first token of "The." If such practice is followed, the number of word types reported nearly doubles.
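The distinction can be made concrete with a few lines of Python. This is only a minimal sketch: the tokenizing pattern (runs of letters, optionally joined by hyphens) is our own assumption and not a rule taken from any of the corpora discussed here.

```python
import re

sentence = "The cat on the roof meowed helplessly: meow meeooow mee-ee-ooow"

# Tokens: every running word. Hyphens are kept inside a word so that
# "mee-ee-ooow" counts as a single token (an assumption of this sketch).
tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence)

# Types when capitalization is ignored ("The" and "the" are one type).
types_folded = {t.lower() for t in tokens}

# Types when capitalization is treated as distinctive ("The" != "the").
types_case_sensitive = set(tokens)

print(len(tokens), len(types_folded), len(types_case_sensitive))
```

For the example sentence this gives ten tokens, nine types when case is ignored, and ten types when "The" and "the" are kept apart.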
Alphabetical Word Type
This is a word type consisting only of letters. In the example above, mee-ee-ooow would be deleted. The cleaning to get to alphabetical word types in addition involves eliminating the distinction between capital letters and lowercase letters. So, the words "GREAT" and "great" are the same alphabetical type.
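A minimal sketch of this cleaning step is given below. The function name and the small token list are illustrative only; the rule itself (lowercase, then keep only strings made of the letters a to z) follows the definition above.

```python
import re

def alphabetical_types(tokens):
    """Lowercase each token and keep only those made of the letters a-z."""
    lowered = (t.lower() for t in tokens)
    return {t for t in lowered if re.fullmatch(r"[a-z]+", t)}

# Illustrative input: the example corpus plus the GREAT/great pair.
tokens = ["The", "cat", "on", "the", "roof", "meowed", "helplessly",
          "meow", "meeooow", "mee-ee-ooow", "GREAT", "great"]

alpha = alphabetical_types(tokens)
print(sorted(alpha))
```

The hyphenated form is dropped, and "GREAT" and "great" collapse into one alphabetical type.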
Lemma
Uninflected word from which all inflected words are derived. In most analyses, it is limited to alphabetical word types that are seen by the English community as existing words (e.g., they are mentioned in a dictionary or a group of people on the web use them with a consistent meaning). In general, lemmas exclude proper nouns (names of people, places, …). Lemmatization also involves correcting spelling errors and standardizing spelling variants. In the small corpus example we are using, there are six lemmas (the, cat, on, roof, meow, and helplessly).
Word Family
A group of lemmas that are morphologically related form a word family. The various members are nearly always derivations of a base lemma or compounds made with base lemmas. In our small example corpus the lemmas "the, cat, on, roof, and meow" are all base lemmas of different families, but the lemma "helplessly" can be simplified to "help."
Below we will see what the various definitions of "words known" mean for vocabulary size estimates. First, we discuss how many words there are to learn (in English).
In Theory, the Number of Word Types in a Language is Infinite
Kornai (2002) argued that the number of word types in a language is boundless because language users constantly coin new words.1 This is linked to the observation that the number of word types increases as a function of the corpus size. All else equal, the number of word types will be smaller in a small corpus than in a large corpus, as new types add up the more words a person (or machine) processes. When the very first words of a corpus are processed, each word is a new type. Very quickly, however, word types start to repeat (e.g., the word "the" occurs in nearly every sentence) and the increase in word types slows down. The more words processed already (i.e., the larger the corpus size), the less likely the next word will be a new type, because most word types have already been encountered.
Herdan (1964) and Heaps (1978) argued that the function linking the number of word types to the corpus size has the shape of a power function with an exponent less than 1 (i.e., it will be a concave function). This function is shown in the upper part of Figure 1. It is known as Herdan's law or Heaps' law (Herdan described the function first, but Heaps' book had more impact). We will call the function Herdan's law in the remainder of the text.
             
          
FIGURE 1. Figures illustrating Herdan's or Heaps' law. The (Top) figure shows how the number of word types would increase if the law were a power law with exponent 0.5 (i.e., square root). The (Bottom) figure shows the same information when the axes are log10 transformed. Then the concave function becomes a linear function, which is easier to work with.
Mathematicians prefer to present power functions in coordinates with logarithmically transformed axes, because this changes the concave function into a linear function, which is easier to work with, as shown in the bottom part of Figure 1. Kornai's (2002) insight was that if the number of word types is limited, then at a certain point Herdan's law will break down, because the pool of possible word types has been exhausted. This will be visible in the curve becoming flat from a certain corpus size on.
Kornai (2002) verified Herdan's law for corpora up to 50 million word tokens and failed to find any flattening of the predicted linear curve, indicating that the pool of possible word types was still far from exhausted. Since Kornai's (2002) analysis, corpora of vastly greater size have been released and when Brants and Franz (2006) made the first 1.025 trillion word corpus available based on the English internet webpages at that time, they verified that Herdan's law still applied for a corpus of this size. Brants and Franz (2006) counted 13.6 million word types in their corpus, with no indication of a stop to the growth.
A look at the types in Brants and Franz's (2006) corpus reveals that a great deal of them consist of alphabetical characters combined with non-alphabetical signs, most of which no native speaker would accept as constituting a word in English (similar to the word "mee-ee-ooow" in the example above). In order to confirm that the growth in types is not the result of these arbitrary combinations of characters, some cleaning is required. One such cleaning is to lowercase the corpus types and limit them to sequences involving the letters a–z only (Gerlach and Altmann, 2013). As indicated above, we call the resulting entries 'alphabetical word types.' Figure 2 shows the increase in the number of alphabetical word types (N) as a function of corpus size (M)2 (Gerlach and Altmann, 2013). The data are based on the Google Books corpus of over 400 billion words (Michel et al., 2011).
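The effect of the log transformation can be checked numerically. The sketch below assumes the illustrative power law of Figure 1 (exponent 0.5) with a scaling constant of 1; both the constant and the function name are our own choices for the example, not estimates for any real corpus.

```python
import math

def herdan_types(tokens, k=1.0, beta=0.5):
    """Herdan's law: number of types N = k * M**beta, with beta < 1."""
    return k * tokens ** beta

# Corpus sizes from one hundred to one hundred million tokens.
corpus_sizes = [10 ** e for e in range(2, 9)]

# The same points on log10-transformed axes.
points = [(math.log10(m), math.log10(herdan_types(m))) for m in corpus_sizes]

# On log-log axes the slope between consecutive points is constant
# and equal to the exponent beta, i.e., the curve is a straight line.
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(points, points[1:])]
print(slopes)
```

If the pool of word types were exhausted at some corpus size, the slopes computed in this way would drop toward zero from that point on, which is the flattening Kornai (2002) looked for.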
             
          
FIGURE 2. Herdan's or Heaps' law applied to the Google Books corpus of over 400 billion words. It shows the number of alphabetical unigrams (N) as a function of the corpus size (M). The individual data points represent the various years in the corpus. The black line is the cumulative corpus over the years. The red line is the growth in N on the basis of the equations derived by Gerlach and Altmann (2013). According to Herdan's or Heaps' law, the function between M and N should be a straight line (as both scales are logarithmic; base 10). However, the curve shows a discontinuity at N = 7,873 (M ≈ 3,500,000; see also Petersen et al., 2012). As a result, different equations were proposed for the two parts of the curve. Source: Gerlach and Altmann (2013).
As can be seen, the curve shows an unexpected change at N = 7,873 (M ≈ 3,500,000) and then continues with a steady increase up to the full corpus size. Gerlach and Altmann (2013) explained the change at N = 7,873 as the point at which the core vocabulary of English is incorporated. As long as this vocabulary is not exhausted, the increase in new word types per thousand words in the corpus is higher than after this point. A factor contributing to the larger increase of word types in the beginning may be the existence of function words. Function words form a closed class of some 300 words, needed for the syntax of the sentence. Most of them are high frequency words (i.e., likely to be encountered early in a corpus).
The equation describing Herdan's law above the inflection point describes the extended vocabulary and is the one we will use for our estimates of the total number of alphabetical types that can potentially be encountered. It is defined as follows:3
This equation predicts that there are about 9.6 million alphabetical word types in Brants and Franz's (2006) corpus of 1.025 trillion words. To arrive at 15 million alphabetical types, a corpus of 2.24 trillion words would be required.
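The equation itself is not reproduced in this copy of the text. As a hedged reconstruction, the sketch below assumes the form N = (0.0922/41) × 3,500,000^(1 − 1/1.77) × M^(1/1.77). The three constants (0.0922/41, the exponent 1.77, and the use of M ≈ 3,500,000 as the break point) are assumptions on our part; they were chosen because they reproduce the figures quoted in this section and should be checked against Gerlach and Altmann (2013) before being relied on.

```python
# Assumed constants (not stated in the surrounding text); see the lead-in.
GAMMA = 1.77                # assumed: N grows as M ** (1 / GAMMA) above the break
BREAK_TOKENS = 3_500_000    # corpus size M at the break point
SCALE = 0.0922 / 41         # assumed prefactor

# Constant part of the relation, computed once.
PREFACTOR = SCALE * BREAK_TOKENS ** (1 - 1 / GAMMA)

def alphabetical_types(tokens):
    """Predicted number of alphabetical types N for a corpus of M tokens."""
    return PREFACTOR * tokens ** (1 / GAMMA)

def tokens_needed(types):
    """Inverse relation: corpus size M needed to reach N alphabetical types."""
    return (types / PREFACTOR) ** GAMMA

print(f"{alphabetical_types(1.025e12):,.0f} types in 1.025 trillion tokens")
print(f"{tokens_needed(15e6):,.0f} tokens needed for 15 million types")
```

Under these assumptions the first call returns a value between 9.6 and 9.7 million and the second a value close to 2.24 trillion, in line with the numbers given above.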
Toward a Pragmatic Answer 1: How Many Alphabetical Types are People Likely to Have Encountered?
Although it is correct that a language can contain an unlimited number of alphabetical types, this does not address the question how many word types people are likely to know, or, in other words, how large their vocabulary is. For this, we need practical answers to the following two questions: 'How many alphabetical types are people likely to have encountered in their life?' and 'How many of these alphabetical types do we consider as words of a language?' We will start with the first question.
The number of alphabetical types that people can encounter is limited by the speed with which they process language. As we will show, the existence of corpora of hundreds of billions of word tokens should not mislead us into thinking that such exposure is possible in a lifetime. In addition, it is important to consider individual differences: Not everybody consumes language at the same speed. We will distinguish between three theoretical cases: (a) a person who gets all language input from social interactions, (b) a person who gets all language input from television programs, and (c) a person who gets all language input from constant reading. As we will see, this distinction gives rise to major differences in the number of words encountered. In addition, we must consider age: All else equal, older people will have encountered more words.
Mehl et al. (2007) asked a group of male and female students to wear microphones during the day and recorded 30 s of sound every 13 min (the duration was limited to assure the participants' privacy). On the basis of the data gathered, Mehl et al. (2007) estimated that their participants spoke about 16,000 word tokens per day, with no significant difference between men and women. Assuming that participants listened to the same amount of word tokens, the total input from social interactions would be equal to 16,000 × 2 × 365.25 = 11.688 million word tokens per year. For a 20-year-old, this adds up to about 234 million word tokens (assuming full input from day one)4; for a person of 60 years, it grows to slightly more than 700 million word tokens. Applying Gerlach and Altmann's (2013) equation, this would predict that a 20-year-old has encountered 84,000 alphabetical word types while a 60-year-old has encountered 157,000 alphabetical word types. In all likelihood, the corresponding vocabularies would be smaller than the number of encountered types. One reason is that word diversity is considerably larger in written materials (on which Gerlach and Altmann's estimate is based) than in spoken language (Hayes, 1988; Cunningham and Stanovich, 2001; Kuperman and Van Dyke, 2013). Another reason is that one cannot assume all word types to be memorized at their first encounter.
Van Heuven et al. (2014) sampled all subtitles from BBC1, the British public broadcaster's most popular TV channel. On the basis of this corpus, it can be estimated that the yearly input for someone who constantly watches the channel is 27.26 million word tokens per year (i.e., more than twice the input from social interactions). This results in a total input of 545 million word tokens for a 20-year-old and 1.64 billion word tokens for a 60-year-old. The numbers are likely to be overestimates, as broadcasts go on for some 20 h per day.
Reading rate is estimated between 220 and 300 word tokens per minute, with large individual differences and differences due to text difficulty and reading purpose (Carver, 1989; Lewandowski et al., 2003). If we take the upper limit and assume that a record reader reads 16 h per day, this gives a total of 300 ∗ 60 ∗ 16 ∗ 365.25 = 105 million word tokens per year. For a 20-year-old this adds up to 2.1 billion word tokens, and a 60-year-old has seen 6.3 billion word tokens. Applying Gerlach and Altmann's (2013) equation, this would imply that these two people have encountered 292,000 and 543,000 alphabetical word types, respectively.
In summary, based on our assumptions about the amount of tokens encountered in different modalities and on the relationship between word tokens and word types, we can roughly estimate the number of alphabetical word types one has likely encountered: A 20-year-old exposed exclusively to social interaction will have encountered around 84,000 alphabetical types, while a 20-year-old exposed non-stop to text will have encountered around 292,000 different alphabetical types. For a 60-year-old, the corresponding estimates are 157,000 and 543,000 alphabetical types, respectively. As we will see in the next sections, we would not normally consider all these alphabetical types as words of a language.
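The exposure arithmetic for the three theoretical cases can be laid out in a short script. The yearly token inputs come directly from the text. The token-to-type relation is an assumed form of Gerlach and Altmann's (2013) equation: its constants (0.0922/41, 3,500,000, and 1.77) are not given in this section and are used here only because they reproduce the type counts reported above.

```python
DAYS_PER_YEAR = 365.25

# Yearly token input per modality, following the assumptions in the text.
yearly_tokens = {
    "social interaction": 16_000 * 2 * DAYS_PER_YEAR,  # words spoken + heard
    "television": 27.26e6,                             # BBC1 subtitle estimate
    "reading": 300 * 60 * 16 * DAYS_PER_YEAR,          # 300 wpm, 16 h per day
}

def types_encountered(tokens):
    """Assumed token-to-type relation (constants are assumptions, see above)."""
    gamma = 1.77
    return (0.0922 / 41) * 3_500_000 ** (1 - 1 / gamma) * tokens ** (1 / gamma)

for modality, per_year in yearly_tokens.items():
    for age in (20, 60):
        total = per_year * age  # assumes full input from day one
        print(f"{modality:>18}, age {age}: "
              f"{total / 1e6:7.0f} million tokens, "
              f"{types_encountered(total) / 1e3:4.0f} thousand types")
```

The script also prints type counts for the television case, which the text does not report; these follow from the same assumed relation and should be read with the same caution.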
Toward a Pragmatic Answer 2: From Alphabetical Types to Lemmas
Table 1, which shows an arbitrary extract of the corpus types in the Google Books Corpus (Michel et al., 2011), makes two things clear. First, a large number of alphabetical types encountered in large corpora are proper nouns (names). Unlike common nouns, verbs, adjectives, or adverbs, most proper nouns are understood to mean the same in different languages. In other words, knowledge of proper nouns is independent of knowledge of a particular language. As the notion of a vocabulary usually refers to the body of words used in a particular language, it makes sense to exclude proper nouns from the counts of words known in a language.
             
          
TABLE 1. Excerpt from the word list of Google Books Ngram viewer.
The second thing we see when we look at the list of alphabetical types from a large corpus is that it contains many alphabetical types reflecting regional differences (e.g., British English vs. American English), typographical errors, spelling mistakes, and words from other languages. As we do not consider these to be part of the target language, it is clear that they must be excluded from the word counts as well.
A further reduction is possible by excluding all regular inflections. In general, verbs in English have four forms ('display, displays, displayed, and displaying') and nouns have two forms ('divination and divinations'). Some adjectives have different forms for the positive, the comparative and the superlative ('gentle,' 'gentler,' 'gentlest'). The base form ('display,' 'divination,' 'gentle') is called the lemma.5 When a list of vocabulary types in English (excluding names) is lemmatized, the number of lemmas is about 60% of the original list.
A straightforward technique to estimate the number of lemmas in a language is to analyze dictionaries. Goulden et al. (1990), for instance, analyzed Webster's Third New International Dictionary (1961), chosen because it was the largest non-historical dictionary of English. The preface of the dictionary said it had a vocabulary of over 450,000 words.6 When the inflections were excluded, the number decreased to 267,000 lemmas. Of these, 19,000 were names and 40,000 were homographs. The latter are distinct meanings of words already in the list. For instance, words like "bat, bill, can, content, fan, fine, lead, light, long, spring" have more than one entry in the dictionary because they have more than one unrelated meaning. Subtracting the names and the homographs leaves an estimate of 208,000 "distinct" lemma words in the Webster's Third New International Dictionary (1961). These include multiword expressions, such as phrasal verbs (give in, give up), multiword prepositions (along with), and fixed expressions with a meaning that cannot be derived from the constituting words (kick the bucket).
Another way to estimate the number of lemmas in English in the absence of proper nouns, spelling variants, spelling errors, and unaccepted intrusions from other languages is to make use of lists designed by people who have a particular interest in compiling a more or less exhaustive list of English words: Scrabble players. The Collins Official Scrabble Words (2014) guide contains 267,751 entries, going from 'aa, aah, aahed, …' to '…, zyzzyvas, zzz, zzzs.' The Official Tournament and Club Word (OTCW) List from the North American Scrabble Players Association includes only 178,691 entries going from 'aa, aah, aahed, …' to '…, zyzzyva, zyzzyvas, zzz.'7 Since the lists are tailored to Scrabble users, they do not contain words longer than 15 letters or shorter than two letters (the minimum number of letters required on the first move in Scrabble). The lists include inflections, however. These can be pruned with an automatic lemmatizer (Manning et al., 2008). Then the number of lemmas in the Collins list reduces to some 160,000.
Toward a Pragmatic Answer 3: From Lemmas to Word Families
When one looks at lists of lemmas, it quickly becomes clear that they still contain a lot of redundancy, as shown in Table 2. Words form families based on derivation and compounding. Knowledge of one word from a family helps to understand the meaning of the other members (although it may not be enough to produce the word) and to learn these words.
             
          
TABLE 2. Extract from a lemma list showing the existence of word families.
The power of morphological families has been investigated most extensively in second language education (Goulden et al., 1990), where it was observed that participants were often able to understand the meaning of non-taught members of the family on the basis of those taught. If you know what 'diazotize' means, you also know what 'diazotization, diazotizable, and diazotizability' stand for, and you have a pretty good idea of what is referred to with the words 'misdiazotize, undiazotizable, or rediazotization.' Similarly, if you know the adjective 'effortless,' you will understand the adverb 'effortlessly' (you can even produce it). You will also quickly notice that the adjective 'effortless' consists of the noun 'effort' and the suffix '-less.' And if you know the words 'flower' and 'pot,' you understand the meaning of 'flowerpot.'
Psycholinguistic research (Schreuder and Baayen, 1997; Bertram et al., 2000) has also pointed to the importance of word family size in word processing. Words that are part of a large family ('graph, man, work, and fish') are recognized faster than words from small families ('wren, vase, tumor, and squid').
Goulden et al. (1990) analyzed the 208,000 distinct lemma entries of Webster's Third New International Dictionary (1961) described above. Of these, they defined 54,000 as base words (referring to word families), 64,000 as derived words, and 67,000 as compound words (there were also 22,000 lemmas that could not be classified). The person who has spent most energy on defining word families is Nation (e.g., Nation, 2006)8. His current list has pruned (and augmented) Goulden et al.'s (1990) list of 54,000 base words to some 28,000 word families, of which 10,000 suffice for most language understanding.
As is the case with all natural categories, the boundary between base words and derived words is not clear (see Bauer and Nation, 1993, for guidelines). Although most of the time the categorization is straightforward, there are a few thousand borderline instances. For instance, what to do with the trio 'abbey, abbot, and abbess'? They are not morphologically related according to present-day morphological rules, but they are historically related and the similarity in meaning and orthography helps to understand and learn the three of them. A similar question can be asked about 'move and motion': Are they part of the same family or not, given their small orthographic and phonological similarity? Authors disagree which of these are base words and which not, leading some scholars to conclude that the transition from lemmas to word families creates more confusion than clarity (Schmitt, 2010). Therefore, it is interesting to do the counts for both definitions.
How Many Lemmas and Word Families are Known According to the Literature?
When we look at the various attempts to estimate the vocabulary size of an adult (typically an undergraduate student), we clearly see the impact of the various definitions given to "words" (Table 3). Nearly all estimates limit words to lemmas, as defined above. In addition, most make some further reduction by using various definitions of "word families." As a result, the estimates range from less than 10 thousand words known to over 200 thousand words mastered.
             
          
TABLE 3. Various estimates of the number of English words known by adults (typically first-year university students), together with the way in which "words" were defined and the task used.
Unsurprisingly, the highest number comes from a study (Hartmann, 1946) in which no pruning of lemmas (or inflections) was done. Hartmann (1946) selected 50 words (one from the same relative position on every fortieth page) from the unabridged Merriam Webster's New International Dictionary (2nd edition) and administered the words to 106 undergraduate students. The participants were asked to supply the meanings of the words without time limit. Given that the dictionary contained 400 thousand words and the students on average were able to give definitions for 26.9 of the 50 words, Hartmann concluded that they had a vocabulary of 215,000 words. As we saw above, this estimate includes proper nouns, inflected forms, derived words and compounds.
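The logic behind such dictionary-sampling estimates is a simple proportional extrapolation, sketched below with the numbers reported for Hartmann (1946). The function name is ours; the calculation assumes that the sampled words are representative of the whole dictionary.

```python
def extrapolate_vocabulary(mean_known, sample_size, dictionary_size):
    """Scale the proportion of sampled words known up to the whole dictionary."""
    return mean_known / sample_size * dictionary_size

# Hartmann (1946): 26.9 of 50 sampled words known, dictionary of 400,000 words.
hartmann_estimate = extrapolate_vocabulary(26.9, 50, 400_000)
print(round(hartmann_estimate))
```

This yields about 215,200 words, which matches the rounded figure of 215,000. Because the estimate scales linearly with the dictionary size, any decision about what counts as a dictionary "word" carries over directly into the vocabulary estimate.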
Goulden et al. (1990) also started from (a later edition of) Webster's International Dictionary but reduced the number of "interesting words" to the 54,000 base words discussed above. They selected 250 of these words and presented lists of 50 to a total of 20 university graduates, who were native speakers of English and over the age of 22. Students only had to indicate whether they knew the word. The estimated number of basic words known ranged from 13,200 to 20,700 with an average of 17,200.
Milton and Treffers-Daller (2013) used the same words as Goulden et al. (1990) but asked participants to give a synonym or explanation for each word they knew. Participants were mainly first year university students. With this procedure and population Milton and Treffers-Daller (2013) obtained an estimate of 9,800 word families known.
How Many Lemmas and Word Families are Known? A New Study
To supplement the existing estimates, we ran a new study on a much larger scale, both in terms of words tested and in terms of people tested.
Stimulus List
As the authors before us, we quickly came to the conclusion that not all words in the Webster dictionary and the Collins Scrabble lists are of interest, because they include many names of plants, insects, and chemical substances, which come close to proper nouns for non-specialists. In addition, creating a stimulus list on the basis of copyright protected sources creates a problem for free distribution of the materials, which is needed if one wants science to be cumulative.
To tackle the above issues, we decided to build a new list of lemmas 'worthwhile to be used in psycholinguistic experiments,' based on word frequency lists (in particular the SUBTLEX lists from Brysbaert et al., 2012 and Van Heuven et al., 2014) and free lists of words for spell checkers. The main criterion for inclusion was the likely interest of non-specialists in knowing the word. Compared to the Collins Scrabble list, the list does not include the lemmas 'aa, aah, aal, allii, aargh, aarrgh, aarrghh, aarti, aasvogel, ab, aba, abac, abacterial, abactinal, abactor, …, zymolysis, zymome, zymometer, zymosan, zymosimeter, zymosis, zymotechnic, zymotechnical, zymotechnics, zymotic, zymotically, zymotics, zythum, zyzzyva, and zzz.' It does, however, include the words 'a, aardvark, aardwolf, abaca, aback, abacus, …, zymogen, zymology, and zymurgy.' All in all, the list contains 61,800 entries and is attached as Supplementary Materials to the present article. Although the list was compiled for research purposes, we are fairly confident it contains the vast majority of reasonably known English words (in American English spelling)9, but we agree that it would be possible to add about the same number of more specialist words.
To see where we would end up and to aim for the maximum difference between the lemma list and the word family list, we decided to interpret the family size of our 61,800 lemma list maximally. That is, all words that could reasonably be derived from a base word were considered to be part of the base word's family. Compound words were split into their constituent families, unless the meaning of the compound could not be derived from the constituents (as in 'honeymoon, huggermugger, and jerkwater'). This resulted in 18,269 word families (see the Supplementary Materials). With less strict criteria, the list could easily be enhanced to 20,000 families or even 25,000.10 On the other hand, while reading the list of families, one is constantly tempted to prune even further (so that a list of 18,000 may be achievable as well). The basic finding, however, is that English words boil down to a list of building blocks not much larger than 20,000 words, when names and acronyms (which are often names as well) are excluded.11 The rather small number of word families is testimony to the tremendous productivity of language.
Participants and the Vocabulary Test Used
To see how many of our list of 61,800 lemmas are known, we presented them to participants with a test similar to the one used by Goulden et al. (1990). Because of the massive internet availability nowadays, it was possible to present the entire list of words to a much larger and heterogeneous group of people (Keuleers et al., 2015).
The test we ran12 follows the design described in Keuleers et al. (2015). First, participants were asked a few questions, the most important of which for the present discussion were about their age, whether English was their native language, and whether they took higher education. After answering the questions, the participants were shown a random list of 67 words and 33 non-words and asked to indicate which words they knew. The non-words were generated with the Wuggy algorithm (Keuleers and Brysbaert, 2010). Participants were warned about the non-words and they were told they would be penalized for yes-responses to these letter strings. To correct the scores for guessing, the proportion of yes-responses to the non-words was subtracted from the proportion of yes-responses to the words. So, someone who selected 50/67 words and said yes to 1/33 non-words obtained a score of 0.746 - 0.030 = 0.716, indicating that they knew 71.6% of the words. Across participants, data were collected for the full list of 61,800 lemmas and over 100,000 non-words.
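The correction for guessing can be written as a one-line function. The sketch below is our own illustration of the subtraction described above (proportion of yes-responses to words minus proportion of yes-responses to non-words); the function and argument names are not taken from the test software.

```python
def corrected_score(word_yes, n_words, nonword_yes, n_nonwords):
    """Proportion of words endorsed minus proportion of non-words endorsed."""
    hit_rate = word_yes / n_words
    false_alarm_rate = nonword_yes / n_nonwords
    return hit_rate - false_alarm_rate

# The worked example: 50 of 67 words and 1 of 33 non-words endorsed.
score = corrected_score(50, 67, 1, 33)
print(f"{score:.3f}")
```

With 50/67 ≈ 0.746 and 1/33 ≈ 0.030, the corrected score is about 0.716. A participant who said yes to every item would obtain 1 - 1 = 0, so indiscriminate yes-responding is not rewarded.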
Participants taking part in the vocabulary test consented to their data being used for scientific analyses at the word level. At no point were they asked to identify themselves, so that information gathering and data use were not linked to individuals. Participation was voluntary, did not put strain on the participants, and could be stopped at any time. This procedure is in line with the General Ethical Protocol followed at the Faculty of Psychology and Educational Sciences of Ghent University.
All in all, we tested 221,268 individuals who returned 265,346 sessions. The number of sessions is higher than the number of participants because quite a few participants took the vocabulary test more than once (each session had a different sample of words and non-words). In order not to give undue weight to individuals (some participants took the test more than 100 times), we limited the analyses to the first three sessions if participants completed multiple sessions.
Results
Figure 3 shows the percentage of lemmas known to native speakers of American English as a function of age and education level (see Keuleers et al., 2015, for similar findings in Dutch). As can be seen, the knowledge of words increases with age and education. As a matter of fact, the effect of age very strongly resembles the power function that would be expected on the basis of Herdan's law (Figure 1): Because older people have come across more words than younger people, they also know more words. This suggests that people rarely forget words, even when they don't know the (exact) meaning any more. Indeed, a test in which persons are asked to spot a word among non-words is thought to be a good estimate of premorbid intelligence in elderly people and patients with dementia (Baddeley et al., 1993; Vanderploeg and Schinka, 2004; but also see Cuetos et al., 2015 for evidence that dementia patients are deficient at spotting words among non-words, in particular words that are low in frequency, low in imageability, and acquired later in life).
             
          
FIGURE 3. Percentage of lemmas known as a function of age and educational level. The solid black line shows the median percentage known as a function of age. It shows a steady increase up to the age of 70 (for the people who took part in the study). The gray zone indicates the range of percentages between percentile 5 and percentile 95. The impact of education level is shown in the lines representing the medians of the various groups.
The median score of 20-year-olds is 68.0% or 42,000 lemmas; that of 60-year-olds 78.0% or 48,200 lemmas. This corresponds to a difference of 6,200 lemmas in 40 years' time (or about one new lemma every two days). The difference between education levels is also substantial and likely illustrates the impact of reading and studying on vocabulary knowledge. Indeed most of the difference between the education levels seems to originate during the years of study.
Figure 4 shows the same information for the base words (the word families).13 An interesting observation here is that the overall levels of word knowledge are lower. A 20-year-old knows 60.8% of the base words (for a total of 11,100 base words), and a 60-year-old knows 73.2% (or 13,400 base words). The lower percentage of base word knowledge is to be expected, because well-known lemmas tend to come from large word families. So, the 325 lemmas including 'man' (craftsmanship, repairman, congressman, …) are better known than the four lemmas including 'foramen' (foramen, foraminiferous, foraminifera, and foraminifer). The former add much more weight to the tally of lemmas known (325 vs. 4) than to the tally of base words known (1 vs. 1). As a result, quite a lot of lemmas can be known based on the mastery of a limited number of prolific base words.
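The conversion from median percentages to numbers of lemmas and base words known is a multiplication by the size of the corresponding list. The sketch below uses only figures given in the text (61,800 lemmas, 18,269 word families, and the median percentages for 20- and 60-year-olds); the variable names are ours.

```python
LEMMAS_IN_LIST = 61_800      # size of the lemma list
FAMILIES_IN_LIST = 18_269    # size of the word family (base word) list

def items_known(proportion, list_size):
    """Convert a proportion known into a number of list items known."""
    return proportion * list_size

lemmas_20 = items_known(0.680, LEMMAS_IN_LIST)
lemmas_60 = items_known(0.780, LEMMAS_IN_LIST)
families_20 = items_known(0.608, FAMILIES_IN_LIST)
families_60 = items_known(0.732, FAMILIES_IN_LIST)

# Average learning rate between age 20 and age 60.
days_per_new_lemma = 40 * 365.25 / (lemmas_60 - lemmas_20)

print(round(lemmas_20), round(lemmas_60))
print(round(families_20), round(families_60))
print(round(days_per_new_lemma, 1))
```

The results round to the values reported above: roughly 42,000 and 48,200 lemmas, and roughly 11,100 and 13,400 base words. The learning rate comes out at a little over two days per new lemma.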
             
          
Figure 4. Percentage of base words (word families) known as a function of age and educational level. The solid black line shows the median percentage known as a function of age. It shows a steady increase up to the age of 70 (for the people who took part in the study). The gray zone indicates the range of percentages between percentile 5 and percentile 95. The impact of education level is shown in the lines representing the medians of the various groups.
The Estimates
The findings so far are summarized in Table 4 and translated into reasonable estimates of words known. They show that the estimates depend on (a) the definition of word known, (b) the age of the person, and (c) the amount of language input sought by the person. For the number of alphabetical types encountered, the low end is defined as a person who only gets input from social interactions; the high end is a person who constantly reads at a pace of 300 words per minute. For the number of lemmas and base words known, the low end is defined as percentile 5 of the sample we tested, the median as percentile 50, and the high end as percentile 95.
             
          
TABLE 4. Estimates of the words known by 20-year-olds and 60-year-olds at the low end and the high end.
The number of lemmas known arguably is what most people spontaneously associate with the answer to the question 'how many words are known.' The number of base words (word families) mastered indicates that these lemmas come from a considerably smaller stock of building blocks that are used in productive ways. Notice that the average number of words known by a 22-year-old (17,200), as estimated by Goulden et al. (1990), lies in-between our estimated number of lemmas known (42,000) and our estimated number of base words known (11,100), in line with the observation that Goulden et al. (1990) used less invasive criteria to define their base words.
Multiplying the number of lemmas by 1.7 gives a rough estimate of the total number of word types people understand in American English when inflections are included.14 The difference between alphabetical types encountered and the number of inflected forms known gives an estimate of names and unknown words people come across in their lives. As Nation (2006) argued, it is easier to understand how a 20-year-old could have learned 11,100 base words (which amounts to 1.7 new words per day if learning commences at the age of 2) than to understand how they could have acquired 1.7 × 42,000 = 71,400 new word forms (inflections included), which amounts to nearly 11 new words per day (see Nagy and Anderson, 1984, for such an estimate).
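The two daily learning rates can be reproduced with the following sketch. It assumes, as the text does, that learning starts at the age of 2; the 365-day year, the constant name, and the helper function are our own illustrative choices.

```python
# Rough check of the per-day learning rates discussed above.
# Assumptions of this sketch: learning starts at age 2, a year has 365 days,
# and the names used here are ours.

INFLECTION_FACTOR = 1.7   # word forms per lemma, as used in the text

def words_per_day(n_words, age, start_age=2, days_per_year=365):
    """Average number of new words per day between start_age and age."""
    return n_words / ((age - start_age) * days_per_year)

base_words = 11_100
word_forms = INFLECTION_FACTOR * 42_000     # about 71,400 word forms

print(f"base words per day: {words_per_day(base_words, 20):.1f}")
print(f"word forms per day: {words_per_day(word_forms, 20):.1f}")
```

With these inputs the sketch gives roughly 1.7 base words per day against nearly 11 word forms per day, which is the contrast the argument rests on.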
The estimates of Table 4 are for receptive vocabulary (understanding a word when it is presented to you). Productive word knowledge (being able to use the word yourself) is more limited and estimated to be less than half of receptive knowledge. The difference between receptive and productive word knowledge increases as the words get less frequent/familiar (Laufer and Goldstein, 2004; Schmitt, 2008; Shin et al., 2011).
Limitations
It is unavoidable that the estimates from Table 4 are approximations, dependent on the choices made. All we can do is be transparent about the ways in which we derived the figures and make our lists publicly available, so that other researchers can adapt them if they feel a need to do so.
A first limitation is the list of 61,800 lemmas we used. Although we are reasonably certain the list contains the vast majority of words people are likely to know, there are ample opportunities to increase the list. As indicated above, the Collins scrabble list could be used to more than double the number of entries. We are fairly confident, however, that such an increase will not change much in the words known by the participants (see also Goulden et al., 1990). The words we are most likely to have missed are regionally used common words and recently introduced words. On the other hand, a recent study by Segbers and Schroeder (2016) in German illustrates the importance of the word list started from. These authors worked with a list of 117,000 lemmas and on the basis of a procedure similar to Goulden et al. (1990) concluded that native German adults know 73,000 lemmas of which 33,000 are monomorphemic (which comes closest to our definition of base lemmas). A comparison of both lists by a German–English researcher would be informative to see why the estimates are larger in German than in English. German has more single-word compounds than English (which partly explains the higher number of lemmas known in German than in English), but this should not affect the number of monomorphemic lemmas known.
A second limitation is that our list does not include meaningful multiword expressions. Expressions such as 'have to, give in, washing machine, salad spinner, kick the bucket, at all, …' were excluded from our lemma list. Such sequences are particularly important when the meaning of the expression is not clear from the individual words. Martinez and Schmitt (2012) reported 505 such non-transparent expressions among the 5,000 most frequent words in English, suggesting that the estimates of Table 4 should be increased by 10% to include multiword expressions that cannot be derived from the constituent words.
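The size of this correction can be sketched as follows. The 505 and 5,000 figures are those cited above; extending the same ratio to a full lemma vocabulary is the simplifying step taken in the text, and applying it to the median of 42,000 lemmas for 20-year-olds is our illustration. The function name is hypothetical.

```python
# Sketch of the multiword-expression correction.
# 505 and 5,000 are the figures attributed above to Martinez and Schmitt (2012).
# Applying the same ratio to a whole lemma vocabulary is a simplification,
# and the function name is ours.

def multiword_estimate(n_lemmas, n_expressions=505, n_frequent_words=5_000):
    """Return (share of multiword expressions, extra entries for n_lemmas)."""
    share = n_expressions / n_frequent_words     # 0.101
    extra = n_lemmas * round(share, 1)           # rounded to 10% as in the text
    return share, round(extra)

share, extra = multiword_estimate(42_000)
print(f"share = {share:.1%}, extra expressions = {extra}")
```

For a vocabulary of 42,000 lemmas this yields about 4,200 additional multiword expressions.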
A third limitation is that our definition of words does not take into account the fact that some words have multiple senses and sometimes even meanings. Words like 'mind, miss, sack, second' have several meanings that are unrelated to each other. Goulden et al. (1990) estimated that 15% of the entries in Webster's Third were homographs (words with multiple meanings), of which two thirds did not differ essentially from the previous entry (i.e., the difference was a difference in sense rather than meaning). So, a reasonable guess would be that 5% of the words are ambiguous in their meaning and our measure has no way to assess whether the participants were familiar with the various meanings. Even worse, our assessment says almost nothing about how well the participants know the various words. They were only asked to select the words they knew. Estimates of vocabulary size are smaller if more demanding tests are used, as can be seen in Table 3 [e.g., compare the estimate of Zechmeister et al. (1995) to that of D'Anna et al. (1991) and the estimate of Milton and Treffers-Daller (2013) to that of Goulden et al. (1990)]. Indeed, understanding the meaning of words is not a binary, all-or-nothing phenomenon but a continuum involving multiple aspects (Christ, 2011). As such, our estimates should be considered as upper estimates of word knowledge, going for width rather than depth. The estimates also deal with receptive word knowledge (understanding words other people use). As indicated above, productive knowledge is thought to be roughly half of receptive word knowledge.
Finally, our list excludes names (and acronyms). An interesting question is how many names people are likely to know. In principle, these could run into the hundreds of thousands (certainly for older people reading a lot, as shown in Table 4). On the basis of our experiences, however, we believe that the number is more likely to be in the tens of thousands or even thousands (depending on the person). For instance, when we probed a large segment of the Dutch-speaking population about their knowledge of fiction authors with a test similar to the vocabulary test described above, we saw that few people know more than 500 author names (out of a total of 15,000 collected from a local library).15 Hidalgo and colleagues set up a website about world famous people.16 Thus far, the number includes some 11,000 names. It would be interesting to examine how many of these are effectively known by various people. The smaller estimates agree with the observation of Roberts et al. (2009) that the number of social contacts people have is limited to some 150, although of course people know many more names (of humans and places) than the individuals they have personal relationships with.
Conclusion
Based on an analysis of the literature and a large-scale crowdsourcing experiment, we estimate that an average 20-year-old student (the typical participant in psychology experiments) knows 42,000 lemmas and 4,200 multiword expressions, derived from 11,100 word families. This knowledge can be as shallow as knowing that the word exists. Because people learn new words throughout their lives, the numbers are higher for 60-year-olds. The numbers also depend on whether or not the person reads and watches media with verbal content.
Author Contributions
All authors listed have made substantial, direct and intellectual contribution to the work, and approved it for publication.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/article/10.3389/fpsyg.2016.01116
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Footnotes
1. At the same time, the number of word types in a language is combinatorially limited by phonotactic constraints and practical limits on word length. As a conservative example, a language with 10 consonants and five vowels would have 550 possible CV and CVC syllables. Combining those syllables would result in more than 50 trillion possible word types up to five syllables long. Our use of the term boundless should be considered in this light.
2. The Google Books Ngram database has a lower limit of 41 occurrences before a word is included in the database. This roughly corresponds to a frequency of 1 per 10 billion words and decreases the number of unigrams drastically, given that nearly half of the unigrams observed in a corpus have a frequency of 1 (Baayen, 2001). A similar threshold was used by Brants and Franz (2006).
3. Notice that 1/1.77 = 0.565, so that the increase in N as a function of M above N = 7,873 is slightly higher than the square root of M.
4. Notice that this number is an upper limit, given that people do not have full input from day one and given that they are unlikely to produce meaningful new words when they speak themselves. Readers are free to calculate alternative estimates if they do not feel comfortable with the definitions we use.
5. Not all verb forms are inflections, as some are used frequently as adjective (appalled) or noun (cleansing). Also noun plurals can have a different meaning and, therefore, lemma status (aliens, spectacles, and minutes). For more information, see Brysbaert et al. (2012).
6. This number is higher still in later editions of the dictionary and it is not hard to find claims of even more than twice this number. According to a company that monitors English websites, Global Language Monitor, the English language has nearly 1.050 million words, with 14.7 new words created each day (all having a minimum of 25,000 citations from various geographical regions; http://www.languagemonitor.com/number-of-words/number-of-words-in-the-english-language-1008879/, August 13, 2015). Needless to say, the vast majority of the entries are names and transparent derivations, compounds and multiword expressions.
7. The Collins list also includes slang words, such as devo, lolz, obvs, ridic, shizzle, and inflections of derivations, such as abjectnesses and abnormalisms.
8. http://www.victoria.ac.nz/lals/about/staff/paul-nation
9. If readers know of interesting words not included in the list, please send them to MB.
10. Nation (2006) reduced the word list of the British National Corpus (100 million words) to 82,000 lemmas and 28,000 word families, keeping words like 'nominee, non-entity, norovirus, northbound, and northern' as separate word families rather than assigning them to the families 'nominate, entity, virus, and north.'
11. There are only five words of 16 letters in the list of basic words (caryophyllaceous, chryselephantine, prestidigitation, tintinnabulation, and verticillastrate). All others are in the scrabble list.
12. http://vocabulary.ugent.be/
13. For the calculation, knowledge of word families consisting of bound morphemes (zymo-, -blast, …) was estimated on the basis of the full word with the highest recognition rate (zymology, blastocyst, etc).
14. If 60% of the English word forms are lemmas and 40% inflected forms, the total number of words can be derived from a lemma list by multiplying the number of lemmas by 1/0.60 = 1.7.
15. In hindsight, we could have expected this number. Given that few people read more than one book per week, the total input for a 20-year-old is only 1040 books (if they started reading from day one). In addition, several of these books will come from the same authors.
16. http://pantheon.media.mit.edu/
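The syllable arithmetic in the first footnote (10 consonants, five vowels, words up to five syllables long) can be verified with a short sketch. The inventory sizes come from the footnote; the function names are ours, and the count treats every sequence of syllables as a possible word type.

```python
# Check of the syllable arithmetic in the first footnote.
# The toy inventory sizes (10 consonants, 5 vowels) come from the footnote;
# the function names are assumptions of this sketch.

def syllable_count(n_consonants=10, n_vowels=5):
    """Number of distinct CV and CVC syllables."""
    cv = n_consonants * n_vowels                      # 50
    cvc = n_consonants * n_vowels * n_consonants      # 500
    return cv + cvc

def word_type_count(n_syllables, max_length=5):
    """Number of syllable strings of length 1 to max_length."""
    return sum(n_syllables ** k for k in range(1, max_length + 1))

syllables = syllable_count()
print(syllables)                    # 550
print(word_type_count(syllables))   # 50420110428050, just over 50 trillion
```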
References
Anderson, R. C., and Nagy, W. E. (1993). The Vocabulary Conundrum. Technical Report No 570 from the Center for the Study of Reading. Urbana, IL: University of Illinois.
Baayen, R. H. (2001). Word Frequency Distributions. Berlin: Springer Science and Business Media. doi: 10.1007/978-94-010-0844-0
Baddeley, A., Emslie, H., and Nimmo-Smith, I. (1993). The spot-the-word test: a robust estimate of verbal intelligence based on lexical decision. Br. J. Clin. Psychol. 32, 55–65. doi: 10.1111/j.2044-8260.1993.tb01027.x
Bertram, R., Baayen, R. H., and Schreuder, R. (2000). Effects of family size for complex words. J. Mem. Lang. 42, 390–405. doi: 10.1006/jmla.1999.2681
Brants, T., and Franz, A. (2006). Web 1T 5-gram Version 1 LDC2006T13. DVD. Philadelphia, PA: Linguistic Data Consortium.
Brysbaert, M., New, B., and Keuleers, E. (2012). Adding Part-of-Speech information to the SUBTLEX-US word frequencies. Behav. Res. Methods 44, 991–997. doi: 10.3758/s13428-012-0190-4
Christ, T. (2011). Moving past "right" or "wrong": toward a continuum of young children's semantic knowledge. J. Lit. Res. 43, 130–158. doi: 10.1177/1086296X11403267
Cuetos, F., Arce, N., Martínez, C., and Ellis, A. W. (2015). Word recognition in Alzheimer's disease: effects of semantic degeneration. J. Neuropsychol. doi: 10.1111/jnp.12077 [Epub ahead of print].
Cunningham, A. E., and Stanovich, K. E. (2001). What reading does for the mind. J. Direct Instr. 1, 137–149.
D'Anna, C. A., Zechmeister, E. B., and Hall, J. W. (1991). Toward a meaningful definition of vocabulary size. J. Lit. Res. 23, 109–122. doi: 10.1080/10862969109547729
Gerlach, M., and Altmann, E. G. (2013). Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3:021006. doi: 10.1103/PhysRevX.3.021006
Goulden, R., Nation, I. S. P., and Read, J. (1990). How large can a receptive vocabulary be? Appl. Linguist. 11, 341–363. doi: 10.1093/applin/11.4.341
Hartmann, G. W. (1946). Further evidence on the unexpected large size of recognition vocabularies among college students. J. Educ. Psychol. 37, 436–439. doi: 10.1037/h0056310
Hayes, D. P. (1988). Speaking and writing: distinct patterns of word choice. J. Mem. Lang. 27, 572–585. doi: 10.1016/0749-596X(88)90027-7
Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects. San Diego, CA: Academic Press.
Herdan, G. (1964). Quantitative Linguistics. London: Butterworths.
Keuleers, E., and Brysbaert, M. (2010). Wuggy: a multilingual pseudoword generator. Behav. Res. Methods 42, 627–633. doi: 10.3758/BRM.42.3.627
Keuleers, E., Stevens, M., Mandera, P., and Brysbaert, M. (2015). Word knowledge in the crowd: measuring vocabulary size and word prevalence in a massive online experiment. Q. J. Exp. Psychol. 68, 1665–1692. doi: 10.1080/17470218.2015.1022560
Kornai, A. (2002). How many words are there? Glottometrics 4, 61–86.
Kuperman, V., and Van Dyke, J. A. (2013). Reassessing word frequency as a determinant of word recognition for skilled and unskilled readers. J. Exp. Psychol. Hum. Percept. Perform. 39, 802–823. doi: 10.1037/a0030859
Laufer, B., and Goldstein, Z. (2004). Testing vocabulary knowledge: size, strength, and computer adaptiveness. Lang. Learn. 54, 399–436. doi: 10.1111/j.0023-8333.2004.00260.x
Lewandowski, L. J., Codding, R. S., Kleinmann, A. E., and Tucker, K. L. (2003). Assessment of reading rate in postsecondary students. J. Psychoeduc. Assess. 21, 134–144. doi: 10.1177/073428290302100202
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge, MA: Cambridge University Press. doi: 10.1017/CBO9780511809071
Mehl, M. R., Vazire, S., Ramírez-Esparza, N., Slatcher, R. B., and Pennebaker, J. W. (2007). Are women really more talkative than men? Science 317:82. doi: 10.1126/science.1139940
Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science 331, 176–182. doi: 10.1126/science.1199644
Milton, J., and Treffers-Daller, J. (2013). Vocabulary size revisited: the link between vocabulary size and academic achievement. Appl. Linguist. Rev. 4, 151–172. doi: 10.1515/applirev-2013-0007
Nagy, W. E., and Anderson, R. C. (1984). How many words are there in printed school English? Read. Res. Q. 19, 304–330. doi: 10.2307/747823
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? Can. Mod. Lang. Rev. 63, 59–82. doi: 10.3138/cmlr.63.1.59
Nusbaum, H. C., Pisoni, D. B., and Davis, C. K. (1984). Sizing up the Hoosier Mental Lexicon: Measuring the Familiarity of 20,000 Words. Speech Research Laboratory Progress Report. Bloomington, IN: Indiana University.
Petersen, A. M., Tenenbaum, J. N., Havlin, S., Stanley, H. E., and Perc, M. (2012). Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci. Rep. 2:943. doi: 10.1038/srep00943
Roberts, S. G., Dunbar, R. I., Pollet, T. V., and Kuppens, T. (2009). Exploring variation in active network size: constraints and ego characteristics. Soc. Networks 31, 138–146. doi: 10.1016/j.socnet.2008.12.002
Schmitt, N. (2008). Review article: instructed second language vocabulary learning. Lang. Teach. Res. 12, 329–363. doi: 10.1177/1362168808089921
Schmitt, N. (2010). Researching Vocabulary: A Vocabulary Research Manual. Basingstoke: Palgrave Macmillan. doi: 10.1057/9780230293977
Schreuder, R., and Baayen, R. (1997). How complex simplex words can be. J. Mem. Lang. 37, 118–139. doi: 10.1006/jmla.1997.2510
Segbers, J., and Schroeder, S. (2016). How many words do children know? A corpus-based estimation of children's total vocabulary size. Lang. Test. doi: 10.1177/0265532216641152
Shin, D., Chon, Y. V., and Kim, H. (2011). Receptive and productive vocabulary sizes of high school learners: what next for the basic word list? Engl. Teach. 66, 127–152.
Vanderploeg, R. D., and Schinka, J. A. (2004). "Estimation of premorbid cognitive abilities: problems and approaches," in Differential Diagnosis in Adult Neuropsychological Assessment, ed. J. H. Ricker (New York, NY: Springer Publishing), 27–65.
Van Heuven, W. J. B., Mandera, P., Keuleers, E., and Brysbaert, M. (2014). Subtlex-UK: a new and improved word frequency database for British English. Q. J. Exp. Psychol. 67, 1176–1190. doi: 10.1080/17470218.2013.850521
Webster's Third New International Dictionary (1961). Webster's Third New International Dictionary. Springfield, MA: Merriam-Webster Inc.
Zechmeister, E. B., Chronis, A. M., Cull, W. L., D'Anna, C. A., and Healy, N. A. (1995). Growth of a functionally important lexicon. J. Lit. Res. 27, 201–212.
Source: https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01116/full