Monday, July 31, 2017

Some notes on Corpus Linguistics and their Criticism

My friend Paige Morgan, Digital Humanities Librarian at the University of Miami, recently suggested Laurence Anthony's tool AntConc for corpus linguistic analysis. The tool is well supported, and the analysis is straightforward to accomplish thanks to online tutorials and YouTube videos, some of them even by the maintainer himself.

As so often, from that seed one can spiral out into similar corpus-investigating tools, such as the online Voyant Toolkit, which makes statistical properties of texts visible and accepts cut-and-pasted text for quick analysis. I am not clear yet how to use some of the tools, such as the Cirrus-display or the term-berry, but my example was rather short and had few repeating words.

Because corpus linguistics has been going on for a while now, there are a couple of good articles and even books (available on Amazon) on how to construct corpora. Most of them target linguists, with a few exceptions.

There has also been criticism in the community, distinguishing corpus linguistics from discourse analysis. The recent exercise that I went through for the purposes of this blog, on Ernest Gellner's book Plough, Sword and Book, effectively plays off the discourse analytic stance against the corpus linguistic one, as I now suspect. The most recent book I was able to obtain that splits the difference comes from the analysis of academic writing.

There have been more principled attacks against corpus linguistics, for example from Noam Chomsky (discussed here), which see in corpus linguistics the mistaken assumption that data will eventually induce itself into a theory. (That sounds familiar ....)

There are also people who simply try to explain how corpus linguistics came about and developed, e.g. Karen Fort at INIST.fr. Fort reminds her readers of the incident reported in Hill 1962, when Chomsky overplayed his hand by denying that perform could be used with mass nouns in English, citing himself, as a native speaker, as the authority. The British National Corpus revealed, however, that perform magic is indeed just such a construction. (I would argue that Chomsky fell into a recognition/recall trap here, overestimating the latter based on his excellence at the former. Still---bad for him.)

A more detailed review with 20th century references from the University of Lancaster can be found here. It cites British linguists like H. Widdowson, who in 2000 wrote an article entitled "The Limitations of Linguistics Applied", published in Applied Linguistics 21/1, pp.3-25.
For, obviously enough, the computer can only cope with the material products of what people do when they use language. It can only analyse the textual traces of the processes whereby meaning is achieved: it cannot account for the complex interplay of linguistic and contextual factors whereby discourse is enacted. ... In reference to Hymes' components of communicative competence (Hymes 1972), we can say that corpus analysis deals with the textually attested, but not with the encoded possible, or the contextually appropriate.  (no page number provided in extract)
Though Widdowson is talking about language learning, he makes a point that bears repeating in a larger context:
... the textual product that is subjected to quantitative analysis is itself a static abstraction. The texts which are collected in a corpus have a reflected reality: they are only real because of the presupposed reality of the discourses of which they are a trace.  (no page number provided in extract)
That seems to me to be the entry point for the importance of discourse analysis. In fact, Widdowson later summarizes his point as
... corpus linguistics provides us with the description of text, not discourse. (no page number provided in extract)
The document then provides a rejoinder by M. Stubbs from 2001, entitled "Texts, corpora and problems of interpretation: A response to Widdowson", published in Applied Linguistics 22/2, pp.149-172.
Corpus linguistics therefore investigates relations between frequency and typicality, and instance and norm. It aims at a theory of the typical, on the grounds that this has to be the basis of interpreting what is attested but unusual. (no page number provided in extract)
And more fully later on:
Frequency is not necessarily the same as interpretative significance: an occurrence might be significant in a text precisely because it is rare in a corpus. But unexpectedness is recognizable only against the norm. (no page number provided in extract)
Stubbs notes that this insight is especially important for recognizing conventionality:
... [A] major finding of corpus linguistics is that pragmatic meanings, including evaluative connotations, are more frequently conventionally encoded than is often realized (Kay 1995; Moon 1998; Channell 2000). (no page number provided in extract)
Concepts of convention and norm raise problems in the not infrequent cases when interpretations diverge. (no page number provided in extract)
Stubbs cites the case of cronies in corpus-based dictionaries, but emphasizes that the analysis is made possible by the empirical aspect of corpus studies.
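
Stubbs's claim that unexpectedness is recognizable only against the norm can be illustrated with a small sketch. It compares how often a word occurs in a single text with how often it occurs in a reference corpus. The function names, the smoothing constant and the toy sentences are my own assumptions for illustration; they are not taken from Stubbs or from any particular corpus tool.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def unexpected_words(text, reference, smoothing=1e-6):
    """Rank the words of `text` by how much more frequent they are than in `reference`."""
    text_freq = relative_frequencies(tokenize(text))
    ref_freq = relative_frequencies(tokenize(reference))
    # The smoothing term (an assumption) avoids division by zero for unseen words.
    ratios = {w: f / (ref_freq.get(w, 0.0) + smoothing) for w, f in text_freq.items()}
    return sorted(ratios, key=ratios.get, reverse=True)

# Toy data, invented for illustration only.
reference = "the minister met his friends and the minister met his colleagues " * 50
text = "the minister met his cronies"
print(unexpected_words(text, reference)[:1])  # ['cronies']
```

The ranking only says that a word stands out against the norm. Whether it carries an evaluative connotation, as cronies does, remains a matter of interpretation.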

In a 2016 paper in Dialogic Pedagogy (which works with a somewhat static distinction of objective and subjective language systems), Richards and Pilcher quote Ädel (2010), p.48, who worries about "the inevitable focus on surface forms in corpus work" as well as "the risk of focusing exclusively on the word and the phrase level when using computer-assisted methods" (p.49).
If it was, in contrast [to the previously stated, RCK] accepted that the usage and the meaning of the language was creative [IS2], individual [IS1], and only represented the inert hardened crust of the language [IS4] then the linguist would be unable to analyse it isolated from the context in which it was used. (A128)
Except, what choice do the historians have? Voloshinov (or Mikhail Bakhtin, however the debate around Morris 1984 comes out, cf. A123) criticized the departure of linguistics from the antiquarian concerns, where
the ancient written monument [is considered] ... the ultimate realium (Voloshinov, 1973, p.73, cited in Richards & Pilcher, A123).
Pilcher and Richards cite Bakhtin in observing:
Fundamental to the meaning of language in such a view ... are dialogue and context. The importance of dialogue (Bakhtin 1981, 1986) means that language consists of a stream of unfinished utterances that is continually evolving and is never completed. (A129)
Following Bakhtin, context is referred to as a linked chain of previous utterances (A129). Pilcher and Richards mostly focus on spoken language here (e.g. the contribution of intonation to interpreting the word `well`), but some of their concerns hold in larger settings as well. Citing Fecho, they write
... to expect that just because you and I are using the same term or phrase that we have a consensus understanding of its meanings is to deny that context and experience have anything to do with our understandings (Fecho, 2011, p.19; cited A130)
Corpus linguistics then generates frequency lists of
... decontextualized signifiers, which in turn are only evidence of past thoughts. (A130)
There was finally also a paper by Nelya Koteyko trying to distinguish the different forms of discourse used in science and their applicability to corpus linguistics, but I did not quite catch the main drift.
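
To make the worry about decontextualized signifiers concrete, here is a minimal keyword-in-context (KWIC) sketch of the kind of view that concordancers offer. The function `kwic`, its `width` parameter and the sample sentence are hypothetical and invented for illustration; this is not how AntConc or Voyant are implemented.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, keyword, right) triples with `width` words of context."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Invented sample; a frequency list would simply report "well: 3".
sample = "Well, that went well. The well in the village ran dry."
for left, word, right in kwic(sample, "well"):
    print(f"{left:>20} | {word} | {right}")
```

The concordance lines restore a few neighbouring words, which is enough to separate the noun from the adverb, but nothing in them recovers intonation or the chain of previous utterances that Pilcher and Richards have in mind.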

Reading with Context in Mind: Plough, Sword and Book (Part 1)

The following discussion analyses a few pages of material from Ernest Gellner's Plough, Sword and Book: The Structure of Human History, London (Collins Harvill) 1988. The idea behind the exercise is to distill out, in exemplary fashion, the contextual form of book contents that makes some of the strategies of statistical or pattern-based NLP less helpful than one might hope.

In addressing the problem of the role that primitive man plays in modern political thought (the chapter is entitled, "Which way will the Stone Age Vote swing?"), Gellner analyzes the way in which some philosophers talk about previous social states and their impact on morality.
In between the two extremes---candid fictional reconstructionists and paid-up professional anthropologists---there are others who, while not professional specialists in the area of early man, nevertheless intend their affirmations about him [i.e. early man] to be realistic, not mere fictions, but who wish them, all the same, to point a moral for the conduct of our own social life. (p.26)
The fact that this is Gellner's analysis means that this paragraph is his own, and he owns the words in it and the thoughts as well. Thus, if Gellner, to concoct an example, were to deny ever using the word "realistic", one would be justified in pointing to this paragraph and contradicting that assertion.
For instance, one of the profoundest and most influential of prophets of modern economic and other liberalism is F. A. Hayek. Hayek's analysis of the options and perils of modern society do in fact dovetail with a sharply delineated vision of the primitive social order and ethos. On his view, the strong social morality of early man and its survival in contemporary society constitute a positive danger to us: (p.26)
This paragraph is a potential mixture; it starts out as something that Gellner says about Hayek, but toward the end it becomes possible that Gellner uses diction that is more properly considered Hayek's than Gellner's. This is because Gellner is trying to reconstruct Hayek's thought, and when people do that, they often use the words that the author to be reconstructed employs. For example, if the expression "positive danger" strikes readers as interesting, this paragraph would be ill-suited to determine if Hayek or Gellner would use that expression.

Gellner then quotes Hayek, using a block-quote with reduced font-size to indicate that this is so (p.26):
There is ... so far as present society is concerned, no "natural goodness", because with his innate instincts man could never have built up the civilisation on which the numbers of present mankind depend for their lives. To be able to do so, he had to shed many sentiments that were good for the small band, and to submit to the sacrifices which the discipline of freedom demands but which he hates. The abstract society rests on learnt rules and not on pursuing perceived desirable common objects; and wanting to do good to known people will not achieve the most for the community, but only the observation of its abstract and seemingly purposeless rules.^(Fn6) (p.26)
Notice that this paragraph needs to be attributed to Hayek---as footnote 6 tells us, here indicated by ^(Fn6), the passage is from page 20 of Hayek's work The Three Sources of Human Values, published by the London School of Economics and Political Science in London in 1978. Thus, Hayek is the one who talks about submitting to sacrifices, and the discipline of freedom, not Gellner. (Gellner may of course share that view, but we have so far not seen any textual material to support such a supposition.)

Gellner then prepares to quote another passage from Hayek, which he relates to the first passage via an editorial comment on the stance that Hayek has taken.
And, should anyone not understand what this implies in practice, let it be spelt out: (p.26)
So even though Gellner is pointing out a relationship in Hayek's thought, Gellner is at least to some extent using his own words to express that relationship and thereby give an interpretive bias to the reading of the Hayek passage.

The following quote is again distinguished by a smaller font and a blockquote presentation, marking it as a direct quote from a writing by Hayek (minus the editorial [The] that Gellner felt should be inserted).
... the long submerged innate instincts have again surged to the top. [The] demand for a just distribution in which organised power is to be used to allocate to each what he deserves is thus strictly an atavism, based on primordial emotions. And it is these || widely prevalent feelings to which prophets, moral philosophers and constructivists appeal by their plans for the deliberate creation of a new type of society.^(Fn7) (pp.26f)
The passage is from the same work, as we learn from Fn 7, but two pages earlier, p.18. Notice that this casts an odd light on Gellner's transitional phrasing, because having the implication of a thesis precede the thesis is an odd way to present an argument, and cannot really be called a spelling-out, given that most people read from the low to high page numbers.

Gellner now tries to give Hayek a strong interpretation, one that makes plausible why someone would agree to Hayek's claims. That means, we should expect to encounter Hayek's thinking presented in Gellner's words with Hayek's phrases sprinkled throughout.
The picture is striking and suggestive. Men must have lived in something like "bands", groups too small to be capable of imposing abstract and impersonal rules, for a long time---during the overwhelming majority of generations since the inception of humanity, however that inception may be dated. So, on this view, throughout most of our history our situation instilled in us an ethic which is directly opposed to all that is innovative, creative, progressive in human civilization. Hence, civilization is based on the overcoming, not so much of our lowest instincts, but on the contrary of all that had usually been held to be moral: the social impulses of mankind, the tendency to cooperate with fellows in the pursuit of shared aims. Respect for abstract and incomprehensible rules must replace love of fellow men and community and a sense of shared purpose, if civilization is to emerge and survive. (p.27)
We can see immediately that this is a paraphrase of Hayek's thinking; so far, Hayek has only spoken of "abstract and seemingly purposeless rules", which under Gellner's reconstruction morph into "abstract and impersonal rules" or "abstract and incomprehensible rules", neither of which is far away from Hayek's meaning, but clearly not synonymous with it either.

After all, rules that are seemingly without purpose may still be personal and comprehensible---for example, the rule that after each 1000-point drop in the Dow-Jones Industrial Average, the CEOs of the Fortune 500 have to apologize on public TV and do 20 push-ups. That rule is both personal and comprehensible, just not very conducive to any purpose that we associate with economics.

It remains possible that Gellner was citing phrases from other sections of the Hayek paper that he did not quote directly; we lack the information to determine this. But we can already see that the complex interplay of commentary ("The picture is striking and suggestive"), expansion of ideas (the speculation on the small groups, with the word `bands`, a term that Hayek uses without quotation marks, now in double-quotes) and summaries of arguments ("the tendency to cooperate with fellows in the pursuit of shared aims") makes it very difficult to attribute phrases and ideas to either Hayek or Gellner directly. At the same time, I cannot shake the feeling that, if someone cited the sentence
Respect for abstract and incomprehensible rules must replace love of fellow men and community and a sense of shared purpose, if civilization is to emerge and survive. (p.27)
as Gellner's point of view, Gellner would probably protest, calling such use a case of being quoted out of context, and seeing himself as primarily elucidating Hayek---even if Hayek should object to the rules being labeled "incomprehensible".

It is important to note that Gellner strengthening Hayek's stance by paraphrasing and working to give it additional plausibility is not a fault at all, but one of Gellner's qualities as a good writer, as someone who tries to compact Hayek's writing for an audience unfamiliar with Hayek's views and takes the arguments of Hayek seriously.

Not very surprisingly, Gellner continues to unfold Hayek's thought, formally repeating the methods of the previous paragraph, elucidating and bringing in new notions from Hayek's extensive oeuvre.
In Hayek's vision, an unplanned, unintended culture, which was the fruit neither of conscious reason nor of animal instinct, had somehow arisen, and it alone made possible that automatic mechanism of response to need, that sustained innovative improvement, which is manifested in and fostered by the market. ... || ... Seeing how recent and precarious our liberation from over-socialization is, it is surprising that we are not even more thoroughly in thrall to atavistic sociability than Hayek fears. It is a social ethic and cohesiveness, not their absence, which are our greatest threat. (pp.27f)
Even though Gellner explicitly labels the exposition as Hayek's (his "vision"), he is now bringing in terms that are not licensed by the quotes and that presumably come from the other parts of Hayek's oeuvre, key among which are the notion of the market and the notion of natural selection. (Gellner errs in not providing footnotes for the source of these concepts in Hayek's writing, but increasingly, footnotes have come to be viewed as pedantic and no longer in need of precision.)

At the same time, the paraphrasing with its subtle shifts of meaning continues; Hayek, in the quotes at least, never spoke of "animal instinct", but of "innate instincts". For example, with Steven Pinker we could claim that language belongs to the innate instincts, but it clearly is not a good example of an animal instinct, given how few (or none, if one follows Pinker) animals share our form of language.

Since the point of the exercise is not to understand either Hayek's or Gellner's argument, but only how the thoughts and the linguistic presentations of these thoughts co-occur in a specific text, we will not delve into the elided middle section of that paragraph of Gellner's (see the ellipsis in the quotation from pp.27f above) and continue straight to the next paragraph of Gellner's exposition, on page 28.
Hayek's way of presenting our general condition differs from what might be called the simplest or classical formulation of laissez faire liberalism, in his conscious stress on the cultural preconditions of an open or market society. The classical formulation suggests that its only important condition is a political one: a just, effective and unrapacious state must be present, a political authority which uses its power to keep the peace and uphold the rules, and does not use it simply to despoil civil society. Hayek's new way of defining the problem makes him insist that mere political order is not enough, that a certain kind of abstract culture is also required, the emergence of a sense of and respect for abstract rules, and a detachment from communal, cooperative ends. The Hidden Hand can operate only in a suitable cultural milieu, amongst men who are not too sociable, men who respect rules rather than social aims. (p.28)
At this point, Gellner is contrasting two stances simultaneously, what he calls the classical formulation of liberalism and Hayek's form. These two stances are distinguished, according to Gellner, in their emphasis, the one putting the accent on the political and the other on the cultural preconditions.  Because the paragraph has the potential for losing even the interested reader, Gellner has begun to use emphasis (italics) in the typography to distinguish the key terms that identify the stances. Such an emphasis would have been even more effective if he had not been forced, by convention, to also italicize the French (i.e. non-English) phrase "laissez faire" in the same paragraph.

We observe that Gellner's use of italics for emphasis is in general ambiguous as to its precise meaning; on p.27, he used it to mark the introduction of the important economic term of the market. That market's regulatory force, the Hidden Hand, however, receives no such emphasis and is capitalized as an agentive force instead.

Our little excursion is almost completed at this point, as we turn to the last paragraph in Gellner's consideration of early man. Gellner turns to the social theory of Hayek's one-time colleague at the London School of Economics and former compatriot, Karl Popper.
A very similar sense of the struggle, not with destructive animal instincts, but on the contrary with an oppressively social morality and the deep feelings which underlie it, is also found in the social thought of Karl Popper.^(Fn8) This Hayek/Popper vision might well be called the Viennese Theory. One may well wonder whether it was not inspired by the fact that, in the nineteenth century, the individualistic, atomized, cultivated bourgeoisie of the Habsburg capital had to contend with the influx of swarms of kin-bound, collectivistic, rule-ignoring migrants from the eastern marches of the Empire, from the Balkans and Galicia. Cosmopolitan liberals had to contend, in the political sphere, with the emerging breed of national socialists. This "Viennese" vision is an inversion, a denial all at once of romanticism---it elevates || Gesellschaft (society) over Gemeinschaft (community)---and of Marxism. Marx anticipated the restoration, rather than the overcoming, of the alleged social proclivities of early man. (pp.28f)
The literature referenced for Karl Popper in Footnote 8 is his classic The Open Society and Its Enemies, London 1945. We observe that we again encounter the use of italics to highlight terms, e.g. national socialists such as the Austro-Fascists as compared to economic socialists in the sense of the followers of Karl Marx, as well as non-English terminology, such as the German words Gesellschaft or Gemeinschaft.

In this paragraph, the reader has to pay careful attention to attribute stances properly. Not only is Gellner discussing Popper, but he is aligning Popper with Hayek and even combining them into a single theory or "vision" (cf. the Viennese Theory versus the "Viennese" vision). At the same moment, for the first time in the selection that we have been analyzing, Gellner actually engages in historiography (we will get to this problem soon), so he is reporting background information to justify his postulate of a Viennese Theory. Thus, because he is not describing the stance of the theory, but its genesis, Gellner owns this paragraph and its terminology.

To summarize: In roughly three pages, Gellner has given us an exposition of at least two intellectual stances that he may or may not share, namely those of F.A. Hayek and of the classical formulation of laissez-faire liberalism; Popper really received one sentence only and was not given a single quote. Gellner has presented these positions in quotes, which are typographically distinct, and in paraphrases, some of which make inexact use of the terminology found in the reconstructed texts. Gellner has interleaved these reconstructions with his comments and, in the case of the Viennese Theory, even with an origins story that draws additional background information into the text. Notice that the brevity of the sample did not allow some other occurrences one should expect in a book the size of Gellner's, such as direct quotes of other writers' interpretations of Hayek or Popper, or paraphrases of others' reconstructions of the texts under consideration.

It is perhaps not surprising then that the typical techniques of corpus linguistics cannot readily succeed in such a setting, even though there is a sense in which Gellner's book is a corpus of conceptual stances. Perhaps the reading is not distant enough.
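
The attribution problem traced above can be sketched in a few lines. The spans are short excerpts from the passages quoted earlier, and the voice labels are the ones assigned by hand in the close reading; the `Span` class and the two counting functions are hypothetical names of my own, meant only to show what a whole-book count loses.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Span:
    voice: str  # who "owns" the wording; label assigned by hand, not detected
    text: str

spans = [
    Span("Hayek (quoted)", "the observation of its abstract and seemingly purposeless rules"),
    Span("Gellner (paraphrasing Hayek)", "groups too small to be capable of imposing abstract and impersonal rules"),
    Span("Gellner (paraphrasing Hayek)", "Respect for abstract and incomprehensible rules must replace love of fellow men"),
    Span("Gellner (own voice)", "This Hayek/Popper vision might well be called the Viennese Theory"),
]

def count_naive(spans, phrase):
    """Count a phrase across the whole sample, ignoring who is speaking."""
    return sum(s.text.lower().count(phrase) for s in spans)

def count_by_voice(spans, phrase):
    """Count the same phrase separately for each hand-assigned voice."""
    counts = Counter()
    for s in spans:
        counts[s.voice] += s.text.lower().count(phrase)
    return {voice: n for voice, n in counts.items() if n}

print(count_naive(spans, "abstract and"))     # 3
print(count_by_voice(spans, "abstract and"))
```

The naive count credits the book, and thereby its author, with three uses of the phrase. The split by voice shows that none of them falls into a span that Gellner owns outright. The hard part, deciding which voice a span belongs to, is exactly the reading work that the sketch takes as given.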

Sunday, July 30, 2017

Contextual Interpretation

I am trying to put together lists of examples that show why munging together large amounts of textual data can often run afoul of tricks and traps that do not assist in historical analysis, or perhaps other digital humanities as well.

  • Shifts in the Meaning of Words 
    • "mother-in-law" in Pride and Prejudice actually means the stepmother (Jack Goody, Production and Reproduction, p.53)
    • "making love" in Victorian England means for a man to be talking with an unmarried woman with the intent of espousing her (e.g. Ginger Susan Frost, Promises Broken: Courtship, Class, and Gender in Victorian England, 1995, p.69)


These specific cases are instances of the discourse not being identified properly, e.g. in its mode or in its temporal delineation. But there are more detailed comments we can make about the discursive nature and the context of statements, given a suitable example.

The problem of the range of the Discourse

During a discussion of couples' interactions in the New Yorker, the author reminded the reader that the range of a discourse in presidential politics and the presidential White House extends to the previous occupants and their actions as well.
On Tuesday, after Melania [Trump] appeared again to reject the President [Donald Trump], this time on the tarmac in Rome with a slick “down low, too slow” move, Pete Souza, President Obama’s official photographer, posted a photo to his Instagram account of Barack and Michelle tenderly holding hands in Selma, Alabama, a gesture that needed no interpretation.
This is an example of the kind of interaction that is difficult to track or detect without establishing the precise discourse that the item belongs to. Here models of layers of discourse that need to be attended to are crucial.

Eventually, the Washington Post made it clear at a description level, by linking to these (and other) clips and photos, providing the interpretation for those that had missed the discourse contributions. So the hope of large scale ingesting of documents for interpretation is that discourse contributions that are clever in the way that Souza's was will eventually have the kind schoolmaster who spells out what the others suspected. (In some sense, the historians often end up in that role.)

Of course, not every hand-holding couple posted that day is a commentary on the Trumps' situation, but most likely, the George W. Bushes' holding hands would have been, within a specific window of time, of course.

Appendix