As so often, from that seed one can spiral out into similar corpus-investigating tools, such as the online Voyant Toolkit, which makes statistical properties of texts visible and accepts cut-and-pasted text for quick analysis. I am not yet clear how to use some of the tools, such as the Cirrus display or the term-berry, but my example was rather short and had few repeated words.
Because corpus linguistics has been going on for a while now, there are a couple of good articles and even books on Amazon about how to construct corpora. Most of them target linguists, with a few exceptions.
There has also been criticism in the community, distinguishing corpus linguistics from discourse analysis. The recent exercise that I went through for the purposes of this blog, on Ernest Gellner's book Plow, Sword and Book, effectively plays off the discourse analytic stance against the corpus linguistic one, as I now suspect. The most recent book I was able to obtain that splits the difference comes from the analysis of academic writing.
There have been more principled attacks against corpus linguistics, for example from Noam Chomsky (discussed here), which see in corpus linguistics the mistaken assumption that data will eventually induce itself into a theory. (That sounds familiar ....)
There are also people who simply try to explain how corpus linguistics came about and developed, e.g. Karen Fort at INIST.fr. Fort reminds her readers of the incident reported in Hill 1962, when Chomsky overplayed his hand by denying that perform could be used with mass nouns in English, citing himself, as a native speaker, as an authority. The British National Corpus revealed, however, that perform magic is indeed just such a construction. (I would argue that Chomsky fell into a recognition/recall trap here, overestimating the latter based on his excellence at the former. Still, bad for him.)
A more detailed review with 20th century references from the University of Lancaster can be found here. It cites British linguists like H. Widdowson, who in 2000 wrote an article entitled "The Limitations of Linguistics Applied", published in Applied Linguistics 21/1, pp. 3-25.
For, obviously enough, the computer can only cope with the material products of what people do when they use language. It can only analyse the textual traces of the processes whereby meaning is achieved: it cannot account for the complex interplay of linguistic and contextual factors whereby discourse is enacted. ... In reference to Hymes' components of communicative competence (Hymes 1972), we can say that corpus analysis deals with the textually attested, but not with the encoded possible, or the contextually appropriate. (no page number provided in extract)
Though Widdowson is talking about language learning, he makes a point that bears repeating in a larger context:
... the textual product that is subjected to quantitative analysis is itself a static abstraction. The texts which are collected in a corpus have a reflected reality: they are only real because of the presupposed reality of the discourses of which they are a trace. (no page number provided in extract)
That seems to me to be the entry point for the importance of discourse analysis. In fact, Widdowson later summarizes his point as
... corpus linguistics provides us with the description of text, not discourse. (no page number provided in extract)
The document then provides a rejoinder by M. Stubbs, from 2001, entitled "Texts, corpora and problems of interpretation: A response to Widdowson", published in Applied Linguistics 22/2, pp. 149-172.
Corpus linguistics therefore investigates relations between frequency and typicality, and instance and norm. It aims at a theory of the typical, on the grounds that this has to be the basis of interpreting what is attested but unusual. (no page number provided in extract)
And more fully later on:
Frequency is not necessarily the same as interpretative significance: an occurrence might be significant in a text precisely because it is rare in a corpus. But unexpectedness is recognizable only against the norm. (no page number provided in extract)
Stubbs notes that this insight is especially important for recognizing conventionality:
... [A] major finding of corpus linguistics is that pragmatic meanings, including evaluative connotations, are more frequently conventionally encoded than is often realized (Kay 1995; Moon 1998; Channell 2000). (no page number provided in extract)
Concepts of convention and norm raise problems in the not infrequent cases when interpretations diverge. (no page number provided in extract)
Stubbs cites the case of cronies in corpus-based dictionaries, but emphasizes that the analysis is made possible by the empirical aspect of corpus studies.
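Stubbs's point that unexpectedness is recognizable only against a norm can be made concrete with a toy calculation. The following is only a minimal sketch, not anything taken from Stubbs or from an actual corpus tool: the two strings are invented, the function names are hypothetical, and the `floor` value for words missing from the reference is an arbitrary assumption. It compares each word's share of a short text with its share of a reference "corpus".

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_freq(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def unexpectedness(text, reference, floor=1e-6):
    """Ratio of a word's share in the text to its share in the reference.

    Words absent from the reference get the assumed `floor` share,
    so they come out as very unexpected rather than dividing by zero.
    """
    text_freq = relative_freq(tokenize(text))
    ref_freq = relative_freq(tokenize(reference))
    return {w: f / ref_freq.get(w, floor) for w, f in text_freq.items()}


# Toy data, invented for illustration only.
reference = "the minister met the press and the minister met the public"
text = "the minister met his cronies"

scores = unexpectedness(text, reference)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:14.2f}")
```

Words that are common in the reference score low, while a word the reference never contains floats to the top. Whether that word is interpretatively significant is, of course, exactly the question the ratio cannot answer on its own.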
In a 2016 paper in Dialogic Pedagogy, Richards and Pilcher (who work with a somewhat static distinction between objective and subjective language systems) quote Ädel (2010, p. 48), who worries about "the inevitable focus on surface forms in corpus work" as well as "the risk of focusing exclusively on the word and the phrase level when using computer-assisted methods" (p. 49).
If it was, in contrast [to the previously stated, RCK] accepted that the usage and the meaning of the language was creative [IS2], individual [IS1], and only represented the inert hardened crust of the language [IS4] then the linguist would be unable to analyse it isolated from the context in which it was used. (A128)
Except, what choice do the historians have? Voloshinov (or Mikhail Bakhtin, however the debate around Morris 1984 comes out, cf. A123) criticized the departure of linguistics from the antiquarian concerns, where
"the ancient written monument [is considered] ... the ultimate realium" (Voloshinov, 1973, p.73, cited in Richards & Pilcher, A123).
Richards and Pilcher cite Bakhtin in observing:
Fundamental to the meaning of language in such a view ... are dialogue and context. The importance of dialogue (Bakhtin 1981, 1986) means that language consists of a stream of unfinished utterances that is continually evolving and is never completed. (A129)
Context is, with Bakhtin, referred to as a linked chain of previous utterances (A129). Richards and Pilcher mostly focus on spoken language here (e.g. the contribution of intonation to interpreting the word `well`), but some of their concerns also hold in larger settings. Citing Fecho, they write
... to expect that just because you and I are using the same term or phrase that we have a consensus understanding of its meanings is to deny that context and experience have anything to do with our understandings (Fecho, 2011, p.19; cited A130)
Corpus linguistics then generates frequency lists of
... decontextualized signifiers, which in turn are only evidence of past thoughts. (A130)
Finally, there was also a paper by Nelya Koteyko that tries to distinguish the different forms of discourse used in science and their applicability to corpus linguistics, but I did not quite catch the main drift.