Semantic Variation in Idiolect and Sociolect: Corpus Linguistic Evidence from Literary Texts
Author(s): Max M. Louwerse
Source: Computers and the Humanities, Vol. 38, No. 2 (May, 2004), pp. 207-221
Published by: Springer
Stable URL: http://www.jstor.org/stable/30204935
Accessed: 27/05/2013 10:34
Springer is collaborating with JSTOR to digitize, preserve and extend access to Computers and the Humanities.
Computers and the Humanities 38: 207-221, 2004. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.
Semantic Variation in Idiolect and Sociolect: Corpus Linguistic Evidence from Literary Texts
MAX M. LOUWERSE
Department of Psychology, Institute for Intelligent Systems, University of Memphis, 202 Psychology Building, Memphis, TN 38152, USA
E-mail:
[email protected]
Abstract. Idiolects are person-dependent similarities in language use. They imply that texts by one author show more similarities in language use than texts between authors. Sociolects, on the other hand, are group-dependent similarities in language use. They imply that texts by a group of authors, for instance in terms of gender or time period, share more similarities within a group than between groups. Although idiolects and sociolects are commonly used in the humanities, they have not been investigated a great deal from corpus and computational linguistic points of view. To test several idiolect and sociolect hypotheses a factorial combination was used of time period (Modernism, Realism), gender of author (male, female) and author (Eliot, Dickens, Woolf, Joyce) totaling 16 corresponding literary texts. In a series of corpus linguistic studies using Boolean and vector models, no conclusive evidence was found for the selected idiolect and sociolect hypotheses. In final analyses testing the semantics within each literary text, this lack of evidence was explained by the low homogeneity within a literary text.
Key words: author identification, coherence, computational linguistics, content analysis, corpus linguistics, idiolect, latent semantic analysis, literary period, sociolect
1. Introduction
Writers implicitly leave their signature in the document they write; groups of writers do the same. Idiolects are similarities in the language use of an individual, sociolects similarities in the language use of a community of individuals. Although various theoretical studies have discussed the notion of idiolects and sociolects (Eco, 1977; Lotman, 1977; Fokkema and Ibsch, 1987; Jakobson, 1987) and those theories are widely accepted in fields like literary criticism (Fokkema and Ibsch, 1987), semiotics (Eco, 1977; Sebeok, 1991) and sociolinguistics (Wardhaugh, 1998), hypotheses derived from those theories have not often been empirically tested. The present study will test some of these hypotheses, using different computational corpus linguistic methods.
2. Idiolects, Sociolects and Literary Periods
Both idiolect and sociolect depend on the linguistic code the writer uses. On top of this linguistic code other codes (e.g. narrative structures) can be built
(Eco, 1977; Lotman, 1977; Jakobson, 1987). These complementary linguistic codes allow for texts to be culturalized. The best examples of these culturalized texts are artistic texts. These texts are thus secondary modeling systems made accessible by the primary (linguistic) modeling system. What is so special about aesthetic texts is that the author will try to deviate from currently accepted codes. By deviating from the norm texts become aesthetic. This way the deviation gradually becomes the norm of a group, and new aesthetic texts will in turn deviate from this established norm (Martindale, 1990).
In practice it is very difficult to determine these multiple encodings. On the one hand, to determine the idiolect or sociolect from a literary text, one has to look at the complementary language codes. On the other, however, the product of the multiple modeling systems is just one linguistic system. Fokkema and Ibsch (1987) argue that although the text usually doesn't yield data about complementary language codes, we are likely to find differences in the language code by comparing texts with different complementary codes (e.g. the time period). In other words, on the one hand a top-down approach could analyze those texts that share certain aspects (e.g. time of first publication) and report their similarities. On the other hand, a bottom-up approach could compare linguistic codes of different texts, and report predictions about the idiolects and sociolects. The current study will use both.
We start with the top-down approach, following Fokkema and Ibsch's (1987) theory of Modernist conjectures. According to Fokkema and Ibsch historical developments change the way we think and hence will likely have an impact on the cultural system. For instance, historical events around WWI led to principal political changes and psychological and scientific depression. Similarly, WWII created another break in world history and in our thinking. It is therefore not surprising that Fokkema and Ibsch distinguish two literary periods on the basis of these historical breaks.
The first ranges from approximately 1850 to 1910 and is called Realism. The second ranges from approximately 1910 to 1940 and is called Modernism (see also Wellek and Warren, 1963). By analyzing a number of literary texts written during this 30-year time frame, Fokkema and Ibsch are able to define a Modernist code. This code is a selection of the syntactic, pragmatic, and semantic components of the linguistic and literary options the author has available. The semantic component receives by far most attention in their study. The Modernist semantic code consists of three central semantic fields: awareness, detachment and observation. These fields can be visualized as concentric circles that form a first semantic zone. The field awareness consists of words like awareness and consciousness. The semantic field of observation consists of words like observation, perception and window. Finally, detachment consists of words like depersonalization and departure. In addition to this first zone of semantic fields a second zone can be distinguished. This zone contains neutral semantic
fields related to the idiolect of the author. A third zone, finally, contains semantic fields that are at the bottom of the Modernist semantic hierarchy, including economy, industry, nature, religion, agriculture. In addition, fields like criminality, psychology, science, sexuality and technology that were already present in pre-Modernist literature are expanded in Modernist texts. Throughout their study Fokkema and Ibsch show that literary texts written by authors in the period 1910-1940 share the pragmatic, syntactic and semantic components of the Modernist code.
The notion of Modernist code has various implications. First of all, it assumes that those texts written within the Modernist time frame (e.g. 1910-1940) share particular language features, including a prominent role for the selected semantic fields. Secondly, the notion of Modernist code implies that those literary texts written within a certain time frame share particular language features (period code). Thirdly, groups of authors share language features (what we earlier called sociolect) that could be defined in different ways: chronologically as Fokkema and Ibsch did, but other ways are also possible. For instance, we could group authors by gender. Finally, if groups of authors share language features, texts written by an individual author must share language features (what we earlier called idiolect).
Accordingly we can formulate four hypotheses: (1) an idiolect hypothesis that predicts that linguistic features in texts by one author should not significantly differ from each other, whereas those from texts by different authors should; (2) a sociolect-gender hypothesis that predicts that linguistic features of texts written by male authors should not significantly differ, but they should differ from texts written by female authors; (3) a sociolect-time hypothesis predicting that texts written within a particular time frame should not differ, but texts between time-frames should; (4) a Modernist-code hypothesis that predicts that Modernist texts should not only show homogeneity and differ from Realist texts, but they should also show a higher frequency of certain semantic fields. It needs to be kept in mind though that these hypotheses are stated according to a stringent criterion. For instance, it is of course possible for one author to shift in style between different periods (Watson, 1994). In the first experiment, the four hypotheses are tested using the frequency of semantic fields occurring in a series of literary texts.
3. Study 1: Semantic Field Comparisons Using a Boolean Model
Fokkema and Ibsch (1987) suggest a word frequency analysis to test the Modernist-code hypothesis. In our first study this generally accepted corpus linguistic method is used, by taking word frequency as a measure of semantic
distinction. Such a method can be identified as a Boolean model (Baeza-Yates and Ribeiro-Neto, 1999). This model has very precise semantics using a binary decision criterion. It is the most commonly used method in content analysis and has been extensively used in corpus linguistics in general (Biber, 1988), in social psychology (Pennebaker, 2002) and in literary studies in particular (see Louwerse and Van Peer, 2002). The four hypotheses outlined in the previous section (the Modernist-code hypothesis, the sociolect-gender hypothesis, the sociolect-time hypothesis and the idiolect-hypothesis) will be tested using the frequency of words in each of the semantic fields identified by Fokkema and Ibsch (1987).
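The binary decision criterion of such a Boolean model can be illustrated with a minimal sketch. The two miniature semantic fields and the sample sentence below are hypothetical stand-ins chosen for illustration; they are not the word lists or texts used in the study, and the tokenizer is an assumption rather than the procedure actually followed.

```python
import re
from collections import Counter

# Hypothetical miniature semantic fields (illustration only); the word
# lists actually used in the study are not reproduced here.
SEMANTIC_FIELDS = {
    "consciousness": {"aware", "awareness", "conscious", "consciousness"},
    "observation": {"observe", "observation", "perception", "window"},
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def field_counts(text, fields=SEMANTIC_FIELDS):
    """Count exact word-form matches per semantic field."""
    counts = Counter({name: 0 for name in fields})
    for token in tokenize(text):
        for name, forms in fields.items():
            # Binary criterion: a token either equals a listed form or not.
            if token in forms:
                counts[name] += 1
    return counts

sample = "She was aware of the window; her awareness of the room grew."
print(field_counts(sample))
```

Under this criterion a near-synonym that is not in a field's word list contributes nothing to that field's count, however close it is in meaning.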
3.1. MATERIALS
A total of sixteen texts were selected for the analysis following a 2 (literary period) x 2 (gender) x 4 (texts per author) design. The selection of authors followed Fokkema and Ibsch (1987). At the same time the choice of authors and texts was constrained by the availability of electronic versions of these texts (hence the focus on English texts only) and the preferred design (four corresponding texts from one author in each cell). Fokkema and Ibsch (1987, pp. 192, 203) consider George Eliot and Charles Dickens as representatives for Realist authors. For the literary period Modernism Virginia Woolf and James Joyce were selected (Fokkema and Ibsch, 1987, p. 10). Table I gives an overview of the sixteen texts classified by period and gender, indicating year of publication and number of words. Despite the various text archive initiatives (e.g. Project Gutenberg, The Oxford Text Archive, The Online Books Page) finding electronic versions of texts from authors discussed in Fokkema and Ibsch (1987) and finding four texts from each author remains a daunting task. Rather than being seen as the final complete set of corpora, the sixteen selected texts should be considered as a representative sample to study the relevant research questions.
Table I. Overview of 16 corpora used

Period      Gender   Author            Text                 Year of publication   Number of words
Realism     Female   George Eliot      Silas Marner         1861                   75,632
Realism     Female   George Eliot      Brother Jacob        1860                   20,863
Realism     Female   George Eliot      Middlemarch          1872                  322,594
Realism     Female   George Eliot      Mill on the Floss    1860                  214,441
Realism     Male     Charles Dickens   Oliver Twist         1838                  162,025
Realism     Male     Charles Dickens   Tale of Two Cities   1859                  140,389
Realism     Male     Charles Dickens   David Copperfield    1850                  363,323
Realism     Male     Charles Dickens   Pickwick Papers      1836                  304,907
Modernism   Female   Virginia Woolf    Mrs. Dalloway        1925                   81,550
Modernism   Female   Virginia Woolf    The Waves            1931                   80,236
Modernism   Female   Virginia Woolf    Orlando              1928                   83,562
Modernism   Female   Virginia Woolf    To the Lighthouse    1927                   73,300
Modernism   Male     James Joyce       Exiles               1918                   31,067
Modernism   Male     James Joyce       Dubliners            1914                   71,790
Modernism   Male     James Joyce       Portrait*            1916                   90,086
Modernism   Male     James Joyce       Ulysses              1922                  271,722

* Portrait is used as an abbreviation for Portrait of the Artist as a Young Man.

3.2. SEMANTIC FIELDS
All thirteen semantic fields Fokkema and Ibsch identify as characteristic for Modernist texts were used in this study: consciousness, observation, detachment, agriculture, criminality, economy, industry, nature, psychology, religion, science, sexuality and technology. Two graduate students in cognitive psychology populated the thirteen semantic fields with lemmata. A total of 592 lemmata were created from two sources. Roget's thesaurus was the source for the majority of the lemmata (59%). By selecting each of the semantic fields as a keyword in the thesaurus, large numbers of semantically related words were found. A second source was the WordNet database (41% of the lemmata), a large semantic network of nouns and verbs (Fellbaum, 1998). By using the label of the semantic field as a hypernym in WordNet, all related hyponyms were selected.
Obviously, in a Boolean model where precise semantics is crucial the actual word form is essential and lemmata alone do not suffice. Therefore, for each of the 592 lemmata corresponding derivations and inflections were generated, resulting in a total of 1461 word forms.
3.3. RESULTS AND DISCUSSION
To account for different text sizes, a normalization procedure transformed the raw frequency to a basis per 1000 words of a text (Biber, 1988). The four hypotheses (idiolect, sociolect-gender, sociolect-time, and Modernist-code) were then tested on the frequency of the semantic fields in each of the sixteen texts. First, it needs to be established whether there are differences between all sixteen texts. If there are not, aggregating across authors, gender or time period would be futile. As predicted differences were found between the texts
(H = 234.951, df = 15, p < 0.001, N = 23,376). To test the idiolect-hypothesis, between-groups comparisons (texts between authors) as well as within-groups comparisons (texts by one author) were made. As predicted by the idiolect-hypothesis, the frequency of semantic fields indeed differed between authors (H = 18.49, df = 3, p < 0.001, N = 23,376). A Mann-Whitney U pairwise analysis, however, showed that this difference was due to comparing Dickens with Eliot, Woolf or Joyce (U > 0.01, Z = -3.361, p < 0.001, N = 11,688), whereas the idiolect hypothesis predicts that differences would occur between all authors. No significant differences were found between Eliot and Joyce, Eliot and Woolf or Woolf and Joyce. Identical results were obtained for those semantic fields limited to the first zone (awareness, observation, detachment). The predicted between-groups differences should be accompanied by a lack of within-group differences. However, within-groups comparison showed that texts by Eliot, Woolf and Joyce differed in frequency of semantic fields (all Hs > 34.47, df = 5,844, p < 0.001). Only the texts by Dickens confirmed the idiolect hypothesis with no differences in the frequency of the semantic fields, resulting in only very limited support for the idiolect-hypothesis.
Contrary to what was predicted by the sociolect-gender hypothesis, no differences were found between the female authors (Eliot, Woolf) and the male authors (Dickens, Joyce). Moreover, significant effects were found both within the male authors and female authors (H > 107.389, df = 7, p < 0.01, N = 11,688), suggesting a lack of homogeneity in gender and falsifying the sociolect-gender hypothesis. When the analysis only took into account the first zone semantic fields, an effect between the gender of authors was found (U = 0.01, Z = -2.550, p = 0.011, N = 9,184). Although this would support the sociolect-gender hypothesis, no homogeneity within male authors and female authors was found (H > 32.118, df = 7, p < 0.01, N = 4,592).
For the sociolect-time hypothesis a difference was found between the Realist texts and the Modernist texts (U = 0.01, Z = -4.076, p < 0.001, N = 23,376).
Nevertheless, as with the sociolect-gender hypothesis, this support for the sociolect-hypothesis would only be meaningful if there is homogeneity in the frequency of the semantic fields between the texts within a period. But in both the Realist texts and the Modernist texts, differences between texts were found (H > 79.351, df = 7, p < 0.001, N = 11,688). The lack of homogeneity in Realist texts on the one hand and in Modernist texts on the other falsifies the sociolect-time hypothesis.
Consequently, no conclusive support is found for the Modernist-code hypothesis, despite the fact that the significant difference between Realist texts and Modernist texts does show the predicted pattern. A higher frequency of the semantic fields is found in Modernist texts (Mean = 0.0023, SE = 0.001) than in Realist texts (Mean = 0.00247, SE = 0.001) but this pattern is supported by the Dickens and Woolf texts only and not by the other texts. In fact, the frequency of semantic fields in Eliot is almost as high as the frequency of fields in Woolf. Similarly, frequency of fields in Dickens is almost as high as the frequency of fields in Joyce.
Although Fokkema and Ibsch's hypotheses have strictly been followed, one could argue that the choice of the texts and contents of the semantic fields distorts the picture. To account for this possibility each of the sixteen texts was split into two halves and the two halves were compared using a Wilcoxon Signed Ranks test. No significant difference was found for any of the texts, except for Eliot's Middlemarch (z = -4.47, p < 0.001; N = 1461), Dickens' Copperfield (z = -7.014, p < 0.001; N = 1461) and Joyce's Ulysses
(z = -6.853, p < 0.001; N = 1461). There is the option of removing these texts from the analysis. However, given the importance of these texts for their respective categories, the importance of equal cell sizes and the difficulties in finding electronic versions of the required texts, we have to run the risk of making a Type II error in this study.
What can be concluded so far? Should all four hypotheses be abandoned because of a lack of evidence from the semantic field frequencies between the corpora? One problem in this study is the method. One of the obvious drawbacks of a Boolean model is its precise semantics (see Baeza-Yates and Ribeiro-Neto, 1999). The binary decision criterion means that if a word form is not found in the exact format as specified it will return a null result. It is feasible however that the semantic field is generally present in a paragraph rather than in the form of an exact string-match. The paragraph would then semantically approach the semantic field without a specific word matching the keyword. Similarly, a field might be present in the text but only by numbers of words loosely associated with the keywords used for the population of the semantic field. In other words, some kind of semantic grading scale is desirable. This is what is investigated in the second study.
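The two computational steps behind the Study 1 comparisons, normalizing raw frequencies per 1000 words and comparing groups on ranks, can be sketched as follows. The text does not name the test behind the reported H values; the sketch assumes a Kruskal-Wallis test, which is consistent with the H statistics and degrees of freedom reported, and it omits the tie correction a statistics package would apply. All function names are illustrative.

```python
def per_thousand(raw_count, text_length):
    """Normalize a raw frequency to a rate per 1,000 words of text."""
    return 1000.0 * raw_count / text_length

def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of samples (assumed test; no tie correction)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
```

With k groups the statistic is referred to a chi-square distribution with k - 1 degrees of freedom, which matches df = 15 for sixteen texts and df = 3 for four authors.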
4. Study 2: Semantic Field Comparisons Using a Vector Model
To overcome the limitations of binary decision making (Boolean model), degrees of similarities between the selected semantic fields and texts were measured using a vector model. One of the vector models commonly used in computational linguistics is latent semantic indexing (LSI), also called latent semantic analysis (LSA). LSA is a statistical, corpus-based technique for representing world knowledge. It takes quantitative information about co-occurrences of words in paragraphs and sentences and translates this into an N-dimensional space. Generally, the term 'document' is used for these LSA units (paragraphs or sentences), but to avoid confusing terminology, we will use 'text units' here. Thus, the
input of LSA is a large co-occurrence matrix that specifies the frequency of each word in a text unit. LSA maps each text unit and word into a lower dimensional space by using singular value decomposition. This way, the initially extremely large co-occurrence matrix is typically reduced to about 300 dimensions. Each word now becomes a weighted vector on K dimensions. The semantic relationship between words can be estimated by taking the cosine (the normalized dot product) between two vectors. What is so special about LSA is that the semantic relatedness is not (only) determined by the relation between words, but also by the words that accompany a word (see Landauer and Dumais, 1997). In other words, words like consciousness and mind will have a high cosine value (are semantically highly related) not because they occur in the same text units together, but because words that co-occur with one equally often co-occur with the other (see Landauer and Dumais, 1997; Landauer et al., 1998; Baeza-Yates and Ribeiro-Neto, 1999).
The method of statistically representing knowledge has proven to be useful in a range of studies. It has been used as an automated essay grader, comparing student essays with ideal essays (Landauer et al., 1998). Similarly, it has been used in intelligent tutoring systems, comparing student answers with ideal answers in tutorials (Graesser et al., 2000). LSA can measure the coherence between successive sentences (Foltz et al., 1998). It performs as well as students on TOEFL (test of English as a foreign language) tests (Landauer and Dumais, 1997) and can even be used for understanding metaphors (Kintsch, 2000). In this second study we therefore used the populated semantic fields and compared them not to the texts as in study 1, but to the semantic LSA spaces of those texts.
4.1. MATERIALS
The same sixteen texts from the authors Eliot, Dickens, Woolf and Joyce were used. For each text a semantic space was created using the default of 300 dimensions (see Graesser et al., 1999; for a most recent view see Hu et al., 2003). The weighting for the index was kept to the default log entropy. Similarly, the default feature of disregarding common words like functional items was used. The size of the text units was generally kept at paragraphs, except in the case of dialogs, when lines were chosen as text unit size, with the size of each semantic space ranging from 600 text units to 1700 text units per text.
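Two ingredients of this procedure, log-entropy weighting of the co-occurrence matrix and the cosine between vectors, can be sketched in a few lines. The sketch assumes the common formulation of log-entropy weighting (local weight log(1 + tf), global weight one plus the normalized entropy of the term over text units); the exact variant implemented in the software used for the study is not stated in the text. The singular value decomposition step is deliberately left out, so the cosine here would in practice be applied to the reduced vectors.

```python
import math

def log_entropy_weights(counts):
    """Apply log-entropy weighting to a term-by-unit count matrix.

    counts[i][j] is the raw frequency of term i in text unit j.
    """
    n_units = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy += p * math.log(p)
        # Global weight: 1 for a term confined to one unit, 0 for a term
        # spread evenly over all units.
        global_weight = 1.0 + entropy / math.log(n_units) if n_units > 1 else 1.0
        weighted.append([global_weight * math.log(1.0 + tf) for tf in row])
    return weighted

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

The weighting downplays words that occur evenly across all text units, which complements the removal of common function words mentioned above.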
4.2. SEMANTIC FIELDS
The same thirteen semantic fields were used as in the first study with the same population of lemmata (N= 592) and word forms (N= 1,461).
4.3. RESULTS AND DISCUSSION
After the LSA spaces per text were created, the 1,461 word forms for the thirteen semantic fields were compared with the LSA space, resulting in a cosine value between 0 and 1. The very large number of data points (i.e., number of word forms x the number of text units within each text) called for a more manageable analysis. Therefore, a sample of 65,000 data points per LSA output file was randomly selected using a simple random sampling technique.
The idiolect hypothesis predicted significant differences between texts from different authors, but no significant differences between the texts of one author. As predicted, between-author groups differed from each other (F(1, 1040000) = 31.82, p < 0.001), with cosine values highest for Eliot (Mean Cosine = 0.040, SD = 0.059) and Dickens (Mean Cosine = 0.040, SD = 0.044), lowest for Woolf (Mean Cosine = 0.012, SD = 0.056), with Joyce in between (Mean Cosine = 0.039, SD = 0.06). However, contrary to this prediction, texts written by Eliot showed significant differences between them (F(3, 260000) = 4.72, p < 0.003), as did texts by Dickens (F(3, 260000) = 10.16, p < 0.01) and Joyce (F(1, 260000) = 11.49, p < 0.001). Only the texts by Woolf seemed to be more homogeneous (F(1, 260000) = 2.61, p = 0.05).
The sociolect-gender hypothesis predicted that texts by male authors would differ from those by female authors, whereas no differences were predicted between texts within each of these two groups. Indeed, a difference was found between these two author groups (F(1, 1040000) = 5.15, p = 0.023), with higher cosine values for female authors (Mean Cosine = 0.039, SD = 0.050) than for male authors (Mean Cosine = 0.029, SD = 0.059). However, between the texts within each of the groups significant differences were also found (Male: F(1, 520000) = 62.28, p < 0.001; Female: F(1, 520000) = 31.23, p < 0.001).
The sociolect-period hypothesis predicted that no differences would be found between the texts within a period. Cosine values between texts of the Realist authors indeed did not show a difference (p = 0.5), but contrary to what was expected values between Modernist texts did show significant differences (F(1, 520000) = 8.67, p = 0.003). In addition, as predicted, differences between the two time periods were found (F(1, 1040000) = 89.16, p < 0.001). However, whereas the Modernist-code hypothesis predicted that the values for the semantic fields would be higher in the Modernist texts than in the Realist texts, an opposite effect is found with higher cosine values for the Realist texts (Mean Cosine = 0.040, SD = 0.051) than the Modernist texts (Mean Cosine = 0.026, SD = 0.059). In fact, this effect can be found for all possible interactions between the Realist and Modernist texts. Similar to the findings in the previous study, identical results were found for the core set of three semantic fields (consciousness, observation, detachment) as for the overall set of semantic fields.
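The sampling and comparison steps reported above can be sketched as follows: draw a simple random sample of cosine values per text, then compare groups with a one-way analysis of variance. This is a minimal sketch under stated assumptions; the function names, the fixed seed and the plain one-way design are illustrative and are not taken from the study, which does not detail its ANOVA set-up.

```python
import random

def sample_cosines(values, k, seed=0):
    """Draw a simple random sample of k values without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return rng.sample(values, k)

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

With samples this large, very small differences in mean cosine can reach significance, so the means and standard deviations reported alongside each F value carry much of the interpretive weight.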
In this second study similar results were found as in study 1. Comparing the semantic fields to the LSA spaces of the texts rather than to the texts themselves allowed for a degree of similarity, but again results showed both between- as well as within-group differences. The only exception to this was the Realist texts not showing differences within the group itself. Although it is difficult to draw conclusions about the Modernist-code hypothesis without confirmation of the idiolect and sociolect hypotheses, the Modernist-code effect that was found did not match the prediction, with a higher average cosine value for Realist texts than for Modernist texts. In any case, no unambiguous evidence was found for any of the four hypotheses. This would suggest a lack of empirical support for the claims made by Fokkema and Ibsch (1987).
However, it is still possible that the idiolect and sociolect hypotheses hold and that only the Modernist-code hypothesis should be rejected, because of the selected semantic fields. In other words, we might still be able to find semantic similarities between groups of texts (idiolect, sociolect) but these similarities might not be contingent on the semantic fields. The idiolect and sociolect hypotheses may then be falsified by a particular selection of semantic fields, but not by the full semantic space of the texts. This option is what is explored in a third study.
5. Study 3: Between-Text Comparisons Using a Vector Model
Instead of comparing a predefined list of semantic fields to words in each of the corpora (study 1) or the semantic spaces of those corpora (study 2), LSA spaces of each text were compared with each other. In other words, each text unit (paragraph or sentence) in each text was compared with each text unit (paragraph or sentence) of another text, resulting in a cosine value for each comparison. The higher the cosine value, the more similar the text units are (ranging from 0 to 1).
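The unit-by-unit comparison between two texts can be sketched as below. The short vectors are hypothetical placeholders for the 300-dimensional LSA vectors of text units, and averaging the cosines into a single score per pair of texts is an assumption made for illustration; the study analyzed the sets of cosines themselves.

```python
import math
from itertools import product

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def between_text_cosines(units_a, units_b):
    """Cosine for every pairing of a unit from text A with a unit from text B."""
    return [cosine(u, v) for u, v in product(units_a, units_b)]

def mean_between_text_cosine(units_a, units_b):
    """Average the pairwise cosines into one similarity score for two texts."""
    sims = between_text_cosines(units_a, units_b)
    return sum(sims) / len(sims)

# Hypothetical two-dimensional unit vectors, for illustration only.
text_a = [[1.0, 0.0], [1.0, 1.0]]
text_b = [[1.0, 0.0]]
print(mean_between_text_cosine(text_a, text_b))
```

Applying this to every ordered pair of texts yields one set of cosines per pair, which is how a grid of text-by-text comparisons arises.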
According to the idiolect hypothesis the semantic universes of texts by one author do not show differences, whereas the semantic universes of texts between authors do. Similarly, within-gender or within-time texts are expected not to differ, but between-gender or between-time texts are. Similarities in cosine values between texts indicate homogeneity of the content. In addition, high cosine values are indicators of semantic similarities.
5.1. MATERIALS
The same LSA spaces of the sixteen texts were used as those created for the second study.
5.2. RESULTS AND DISCUSSION
In this study (LSA spaces of) texts were compared to other texts instead of to a word list as in the first studies, resulting in 256 (16 x 16) sets of cosines representing the semantic relationship between texts. A comparison of the author-matching texts with the author-non-matching texts showed differences in all four cases (All Fs (1, 6000000) ≥ 25,1250, p < 0.001). When the cosine values (indicating similarity in content) were compared per idiolect, texts by Dickens differed more between themselves than between texts from other authors. The same is true for texts by Joyce. In other words, only for half of the authors (Eliot and Woolf) the author-matching texts had higher average cosine values than the author-non-matching texts.
In order to test the sociolect-gender hypothesis, texts by male authors were compared with the texts by female authors. Differences between groups were found, suggesting evidence for the sociolect-gender hypothesis (F(1, 3640000) = 392989.6, p < 0.001). However, as with the unpredicted results in the idiolect hypothesis, significant differences were also found within each gender group (All Fs (1, 1820000) ≥ 31747.8, p < 0.001). Overall, texts by female authors had a higher average cosine value, suggesting a resemblance in content, than texts by male authors (female: Mean Cosine = 0.135, SD = 0.121; male: Mean Cosine = 0.058, SD = 0.103).
As predicted by the sociolect-period hypothesis, significant differences were found between Realist-matching texts versus Modernist-matching texts (F(1, 3640000) = 12246.763, p < 0.001). But again, unexpected differences were also found within each period (All Fs (1, 1820000) ≥ 1579.459, p < 0.001). Interestingly, average cosine values were higher for Realist texts than for Modernist texts (Realist: Mean Cosine = 0.107, SD = 0.128; Modernist: Mean Cosine = 0.093, SD = 0.110). This suggests that despite the fact that there are differences between the Realist texts, they are semantically more similar to each other than Modernist texts are.
In sum, for some authors (Eliot and Woolf) similarity in content can be found, supporting an idiolect hypothesis. For other authors (Dickens and Joyce) texts differ within one author.
This finding is even more interesting when we look at the sociolect-gender hypothesis. Texts by female authors show more similarities than texts by male authors. Similarities in content were also found in the sociolect-time hypothesis: In both Realist and Modernist texts more similarities were found between the texts within a period than between periods. Furthermore, Modernist texts show a greater diversity when compared to each other than Realist texts, suggested by the lower cosine values for the former compared to the latter.
6. Study 4: Within-Text Comparisons Using a Vector Model
Up to now, we have found no conclusive evidence for the idiolect-hypothesis or either of the sociolect-hypotheses. Should we therefore abandon all four hypotheses? So far we have assumed that there is homogeneity in the semantics within a text. This has largely been supported by the within-text analysis when the two text halves of each text were compared. The question however is to what extent this assumption is correct. It might be the case that within a text there are differences between the semantics. If that is the case, the lack of evidence for the idiolect and sociolect hypotheses might be explained by the heterogeneity within a text. At the same time, the hypotheses can be tested by comparing homogeneity values by author, gender and period. For this purpose an LSA analysis was carried out comparing each text unit (i.e. paragraph or sentence) to every other text unit (paragraph or sentence) within a text. If texts are generally semantically consistent (the content of the text units in the text is similar), higher cosine values will be found. Texts that differ in the semantics, and are therefore semantically inconsistent, will have lower cosine values.
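A within-text homogeneity score of this kind can be sketched by taking the cosine for every unordered pair of text units in one text. The short vectors are hypothetical placeholders for LSA unit vectors, and summarizing the pairs by their mean is an assumption made for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def within_text_homogeneity(units):
    """Mean cosine over all unordered pairs of text units in one text."""
    pairs = list(combinations(units, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical unit vectors: two units share content, the third does not.
units = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(within_text_homogeneity(units))
```

A text with n units yields n(n - 1)/2 pairs, which is why the number of cosine values grows quickly with text length and some form of sampling becomes practical.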
6.1. MATERIALS

The same LSA spaces of the literary texts from the second and third studies were used.
6.2. RESULTS AND DISCUSSION

As in the previous study, the number of data points was reduced by randomly selecting 65,000 cosine values per text using a simple random sampling technique. An ANOVA comparing the idiolects showed a significant difference between the four authors (F(1, 1040000) = 2650.68, p < 0.001). Contrary to what was predicted, differences were also found between the texts of each author (Eliot: F(3, 260000) = 1305.17, p < 0.001; Mean Cosine = 0.034, SD = 0.07; Dickens: F(3, 260000) = 645.62, p < 0.001; Mean Cosine = 0.020, SD = 0.06; Woolf: F(3, 260000) = 167.80, p < 0.001; Mean Cosine = 0.019, SD = 0.045; Joyce: F(3, 260000) = 2899.20, p < 0.001; Mean Cosine = 0.021, SD = 0.073). This again suggests no support for the idiolect hypothesis. In the LSA comparison between the texts of one author described in the previous study, most homogeneity was found in the texts by Eliot. This effect was replicated in the internal homogeneity analysis, as suggested by the highest LSA cosine values.

For the sociolect-gender hypothesis a significant effect of gender was found (F(1, 1040000) = 1699.02, p < 0.001), with texts written by female authors having higher cosine values (Mean Cosine = 0.026, SD = 0.058) than those written by male authors (Mean Cosine = 0.020, SD = 0.068). In addition, differences were found for within-gender texts (female: F(1, 520000) = 1929.72, p < 0.001; male: F(1, 520000) = 1660.43, p < 0.001).
As for the sociolect-period hypothesis, differences were found between periods (F(1, 1040000) = 2563.28, p < 0.001), but also between the authors within a period (Realist: F(1, 520000) = 4788.1, p < 0.001; Modernist: F(1, 520000) = 61.08, p < 0.001). In the previous analysis we saw that Realist texts share more semantic concepts between them. Similarly, for the semantics between parts of the Realist texts, cosine values are higher than for Modernist texts (Realist: Mean Cosine = 0.027, SD = 0.067; Modernist: Mean Cosine = 0.020, SD = 0.060).
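The comparisons reported in this section rest on two steps: drawing a fixed-size simple random sample of cosine values per text, and a one-way ANOVA across groups. The sketch below illustrates both under stated assumptions. The study does not say which software was used; the helper names `sample_values` and `one_way_anova_f` are hypothetical, and only the F statistic and its degrees of freedom are computed, not the p value.

```python
import random
from statistics import mean


def sample_values(values, k, seed=0):
    """Simple random sample of k values without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(values, k)


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers.

    Assumes at least two groups and non-zero within-group variance.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy numbers (invented for illustration, not cosine values from the study).
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
f_value, df_between, df_within = one_way_anova_f([group_a, group_b])

# Sampling step: in the study 65,000 values were drawn per text; here 10 of 100.
subset = sample_values(list(range(100)), 10)
```

With samples as large as those reported here, even very small differences in mean cosine produce large F values, so the means and standard deviations are worth reading alongside the test statistics.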
An explanation for the results we have found in the previous studies might indeed lie in the internal semantic homogeneity. This analysis replicated the finding in Study 3 that Modernist texts seem to be more diverse than Realist texts. This is an important finding for corpus linguistic analyses of modern literary texts in general, but also for the validity of the Modernist-code hypothesis. If it is true that Modernist authors experiment more with their literary products (see Fokkema and Ibsch, 1987), then it is still possible to maintain a Modernist hypothesis: certain semantic fields might still be more prominent in these texts. However, their overall frequency is low because Modernist texts lack the homogeneity that Realist texts have.
7. Conclusion

We tested hypotheses initially brought forward by Fokkema and Ibsch (1987), who argued that selected authors use selected semantic fields. The word frequency of the contents of these fields would predict frequency patterns in idiolect and literary period. We tested idiolect, sociolect-gender, sociolect-time and Modernist-code hypotheses derived from this study. A total of 16 literary texts were used, balanced across author (Eliot, Dickens, Woolf, Joyce), gender (female, male) and literary period (Realism, Modernism). Two models were used to test these hypotheses, a binary Boolean model and a scaling vector model. Both methods are very common in the field of corpus linguistics (see Louwerse and Van Peer, 2002 for an overview).

Initial Boolean analyses suggested no evidence for any one of the four hypotheses, possibly because of the semantic fields that were selected and the Boolean method that was used. Results were replicated in a vector model using the semantic fields. A vector analysis, comparing the general content between the groups of texts and comparing the various parts within each text, showed that the semantic homogeneity in literary texts is an important confounding variable. Because of this, drawing conclusions from a literary text as a whole, rather than from its parts, might be problematic. A vector model can partly solve this problem by taking into account every part of the text. But drawing conclusions from semantic similarities within an author can be
equally problematic, because authors tend to change their style and semantic space between texts. Similarly, semantic similarities within a literary period are difficult to determine because of the overall variations. As pointed out in the beginning of this study, the lack of internal homogeneity within one text, between texts and between authors can be explained by the (semantic) deviation from the norm that the author tries to establish. These variations are exactly what make the idiolect and sociolect of literary texts unique, and are in fact what make those texts literary.

Acknowledgements
This research was partially supported by the National Science Foundation (SBR 9720314, REC 0106965, REC 0126265, ITR 0325428) and the Institute of Education Sciences (IES) (R3056020018-02). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the funding agencies.

Note

1 Whereas the Modernist-code hypothesis, sociolect-time and idiolect hypotheses are directly derived from Fokkema and Ibsch (1987), the sociolect-gender hypothesis is not. However, given the theory of a group code, a sociolect-gender hypothesis seems justified.
References

Baeza-Yates R., Ribeiro-Neto B. (eds.) (1999) Modern Information Retrieval. ACM Press, New York, 513 p.
Biber D. (1988) Variation Across Speech and Writing. Cambridge University Press, Cambridge, UK, 315 p.
Eco U. (1977) A Theory of Semiotics. Indiana University Press, Bloomington, 368 p.
Fellbaum C. (ed.) (1998) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 500 p.
Fokkema D., Ibsch E. (1987) Modernist Conjectures. A Mainstream in European Literature 1910-1940. Hurst, London, 330 p.
Foltz P.W., Kintsch W., Landauer T.K. (1998) The Measurement of Textual Coherence with Latent Semantic Analysis. Discourse Processes, 25, pp. 285-307.
Graesser A., Wiemer-Hastings P., Wiemer-Hastings K., Harter D., Person N., and the Tutoring Research Group (2000) Using Latent Semantic Analysis to Evaluate the Contributions of Students in Autotutor. Interactive Learning Environments, 8, pp. 149-169.
Hu X., Cai Z., Franceschetti D., Penumatsa P., Graesser A.C., Louwerse M.M., McNamara D.S. and the Tutoring Research Group (2003) LSA: The First Dimension and Dimensional Weighting. Proceedings of the 25th Annual Conference of the Cognitive Science Society. Erlbaum, Mahwah, NJ.
Jakobson R. (1987) Linguistics and Poetics. In Jakobson R. (ed.), Language in Literature. Harvard University Press, Cambridge, MA, pp. 62-94.
Kintsch W. (2000) Metaphor Comprehension: A Computational Theory. Psychonomic Bulletin and Review, 7, pp. 257-266.
Landauer T.K., Dumais S.T. (1997) A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, pp. 211-240.
Landauer T.K., Foltz P.W., Laham D. (1998) Introduction to Latent Semantic Analysis. Discourse Processes, 25, pp. 259-284.
Lotman J. (1977) The Structure of the Artistic Text. University of Michigan, Ann Arbor, 300 p.
Louwerse M.M., Van Peer W. (eds.) (2002) Thematics: Interdisciplinary Studies. John Benjamins, Amsterdam/Philadelphia, 430 p.
Martindale C. (1990) The Clockwork Muse. Basic Books, New York, 411 p.
Pennebaker J.W. (2002) What Our Words Can Say About Us: Towards a Broader Language Psychology. Psychological Science Agenda, 15, pp. 8-9.
Project Gutenberg, [http://www.ibiblio.org/gutenberg].
Sebeok T.A. (1991) A Sign Is Just a Sign. Indiana University Press, Bloomington, 178 p.
The Online Books Page, [http://onlinebooks.library.upenn.edu].
The Oxford Text Archive, [http://ota.ahds.ac.uk].
Wardhaugh R. (1998) An Introduction to Sociolinguistics. Blackwell, Oxford, 464 p.
Watson G. (1994) A Multidimensional Analysis of Style in Mudrooroo Nyoongah's Prose Works. Text, 14, pp. 239-285.
Wellek R., Warren A. (1963) Theory of Literature. Cape, London, 382 p.