Corpora and Historical Texts

Corpus linguistics has always been concerned with studying language by gathering evidence of language use. Historical corpus linguistics concerns itself with studying the language of some speech community at a specific point in the past (sometimes called synchronic analysis) or studying language change through time (sometimes called diachronic analysis). One reason to gather evidence of historical language usage is to better understand historical texts.

In the introductory chapter of their textbook, “The Routledge Handbook of Corpus Linguistics,” Anne O’Keeffe and Michael McCarthy situate the beginning of corpus linguistics in the work of those who constructed the first concordances of sacred texts. (A concordance is an index of every occurrence of a word in a given text.) One of these was Cardinal Hugo of St. Caro, who in 1230 CE, with the aid of a team of 500 Dominican monks, created a concordance of every word that occurs in the Vulgate Bible. Another was Isaac Nathan b. Kalonymus, who constructed the first concordance of the Hebrew Bible (c. 1437-45). Similar concordances of important texts, both sacred and profane, would follow.

Today the use of electronic linguistic corpora bears little resemblance to the painstaking, by-hand construction of a concordance. But the decision to situate the beginning of corpus linguistics in the work of early concordance makers is instructive. Both the concordance maker and the corpus linguist are concerned with supporting claims about language and meaning with evidence of usage.

Linguists rely on historical corpora to gather evidence of historical usage and to monitor language change. Language is in a constant state of change, but it does not change at a predictable rate or in a predictable way. At least some information about our own language use may not be available to us via introspection. And most of us cannot claim to have reliable instincts about the way words were used before we were born. In order to make claims about the meaning of words in a historical text, it is helpful to have evidence of the way those words were used in context, in a similar timeframe, by a similar speech community.

To evaluate claims about the way language was used in the Founding Era, we need to have some evidence of that language use. Until recently, there was no corpus of the Founding Era. The Corpus of Historical American English (“COHA”) was the best available corpus of historical American English usage, but its texts extended back only to 1810. The recent advent of the Corpus of Founding Era American English (“COFEA”) makes it possible to evaluate claims about the way words were used in the Founding Era with usage evidence from that era. (The COFEA does, however, have a number of acknowledged limitations in terms of its representativeness. It privileges the writing of those associated with the drafting of the Constitution and, like any historical corpus, privileges the writing of those whose writing was preserved. Users should be mindful of these limitations when drawing conclusions from language evidence from the COFEA.)

Language does not exist in a vacuum, and, in my view, it would be difficult to evaluate claims about historical usage and meaning without some understanding of the historical setting in which a given text was drafted and some understanding of the specific circumstances of its drafting. In the context of constitutional interpretation, Professor Lawrence Solum has proposed attempting to triangulate the meaning of constitutional provisions with usage evidence from corpora, immersion into the “linguistic and conceptual world of the authors and readers of the constitutional provision being studied,” and “studying the record [of] framing, ratification, and implementation” of the Constitution. This sort of triangulation may involve collaboration among lawyers, historians, and linguists, similar to collaborations among historians and linguists that are already taking place.

But what about precedent? Isn’t it possible for settled caselaw to place the meaning of a given provision of the Constitution beyond the realm of debate? Maybe. In a recent article, Professor Will Baude discusses the notion of Constitutional Liquidation, which he characterizes as circumstances where post-Founding practice has settled a constitutional dispute. Liquidation, in this framework, requires: (1) textual indeterminacy, (2) a course of deliberate practice, and (3) settlement through acquiescence of dissenters and public ratification.

Historical corpora may be able to provide language evidence to help evaluate claims about the textual determinacy of a given provision. And it may not be difficult to generate a list of constitutional interpretive disputes that cannot be said to have been settled in any meaningful way. For example, until very recently, the scope of the Foreign Emoluments Clause and the meaning of the word “emoluments” in Article I, Section 9, Clause 8, had not been tested in a major case.

However, even assuming that it is possible for the meaning of a constitutional provision to be settled for all time by precedent, language evidence from linguistic corpora of the period may still have immense value. This is simply because the Constitution is an enormously important historical text. Understanding how its language was used by its drafters, ratifiers, and the public that it governed is inherently valuable.

But what does it mean to use evidence from a linguistic corpus in the context of constitutional interpretation? Are advocates of law and corpus linguistics treating language as a fact about the world, and merely engaged in empirical analysis, or are they searching for justification of their prior normative commitments? In her article, Legal Corpus Linguistics and the Half-Empirical Attitude, Professor Anya Bernstein contends that corpus linguistics in linguistics is empirical because “the corpus it studies and the patterns it uncovers represent the language it analyzes.” By contrast, Professor Bernstein argues that legal corpus linguistics “rests on normative claims: that the corpus it studies represents the language that should guide our understanding of the law, and that the patterns it uncovers should influence our interpretation of legal texts.” I disagree that law and corpus linguistics necessarily rests on normative claims in the way that Professor Bernstein argues. (Though I cannot possibly adequately respond to this thought-provoking article in a few short paragraphs here, I do want to respond to a few points that I think are relevant to the subject of this colloquium.)

Speaking for myself only (but as an advocate of using corpus evidence in legal interpretation), only two commitments guide the decision to turn to corpus evidence in legal interpretation: First, that claims about language should be evaluated with language evidence. And second, that linguistic corpora are an effective way to gather language evidence. No other commitment is necessary to find valuable information in a corpus, and, as far as I know, the law and corpus linguistics “movement” doesn’t have any other such commitments. Specifically, nothing in the idea of using corpora in legal interpretation forecloses consideration of legislative history, or statutory purpose, or pragmatic considerations, or statutory dynamism. Using corpus linguistics does not require a commitment to original public meaning, original methods originalism, or living constitutionalism.

Law and corpus linguistics (or legal corpus linguistics) is simply a commitment to evaluate language claims with language evidence. In my view, what distinguishes academic corpus linguistics at large from law and corpus linguistics (or what should distinguish them) is (1) the types of questions being asked, and (2) how the answers to those questions are used.

With respect to the types of questions being asked, when judges invoke the ordinary meaning of a legal text, they implicate a fact about the world that can be investigated with evidence. Why do proponents of legal corpus linguistics talk so much about ordinary meaning, plain meaning, and original meaning? Because that is the way that judges phrase their claims about language.

Often, ordinary meaning is framed with reference to how ordinary people would use the words or phrases in question. A common objection is that the legal text in question was drafted, enacted, and codified not by “ordinary people” but by a specialized, lawyerly class.

I think it is fair to characterize the choice of whose linguistic conventions should govern an interpretive task as a normative decision. In making this decision, there may be some good reasons to choose ordinary meaning. In his recent text, Interpreting Law, Professor William N. Eskridge Jr. suggests a number of such justifications for judges’ appeals to ordinary meaning, including the notion that “[a] polity governed by the rule of law aspires to have legal directives that are known to the citizenry, that are predictable in their application, and that officials can neutrally and consistently apply based upon objective criteria.” Maybe that is right. But it is the question of whether appealing to ordinary meaning is a good idea that is normative. The attempt to determine how words are ordinarily used based on evidence from a corpus is not.

How people use (or used) language is (sometimes) a verifiable fact. The choice of relevant speech communities may be a normative, jurisprudential one, but once that choice is made, questions are posed that may be answered with evidence from a corpus. And these questions may be answered using the same corpus methods that are used in other, non-legal contexts.

Accepting the judge’s (tacit or explicit) choice of speech community, I think, is part of what Professor Bernstein means by the “half-empirical attitude.” Shouldn’t we be questioning the judge’s characterization of the proper speech community? Yes. I think we should be. This is why my co-author and I acknowledge here that “there may be circumstances where language evidence from different regions, industries, social backgrounds, or racial, ethnic, or community identities may appropriately be brought to bear on an interpretive question. And we emphasize that this concern is not a weakness but a strength of corpus tools, which allow the interpreter to construct a corpus of any appropriate speech community and analyze the language of that community.”

This use for linguistic corpora in legal interpretation was anticipated by Professor Laurence Solan, who observed: “When the legal system decides to rely on the ordinary meaning of a word, it must also determine which interpretive community’s understanding it wishes to adopt. This choice is made tacitly in legal analysis, but becomes overt when the analysis involves linguistic corpora . . . .”

Thus, I agree with Professor Bernstein completely to the extent she suggests that one of the major potential contributions of corpus linguistics to legal interpretation could be helping courts make better-informed choices about which speech communities’ linguistic conventions are relevant to a particular interpretive task and about the ways in which those conventions differ. Congress can command individuals (as with federal criminal laws) and it can command federal agencies. There is no reason to assume that the relevant speech communities here would have the same linguistic conventions. Comparative corpora might be able to both illuminate their differences and inform judges’ decisions about how to interpret these differing legal texts.

But where a judge has made a choice (whether tacitly or explicitly) about whose linguistic conventions should govern, and then made some claim about language and meaning, I simply don’t see it as a failing of a commitment to empiricism to say that a claim has been made about meaning, and that claim should be evaluated with evidence.

With respect to how the answers are used, when a chemist, a hydrologist, or an engineer is called to serve as an expert witness in a court, they are (or should be) still doing chemistry, hydrology, and engineering. This is true even though their work will be used not for the pursuit of pure knowledge, but to win a case. It is obviously possible for any of them to act in bad faith, or to be clouded by confirmation bias (or money) and to proceed in a sloppy or dishonest fashion. But the chemist is not not doing chemistry merely because the results will be used to persuade.

Professor Bernstein argues that “a lot of academic corpus research asks questions that go way beyond the meanings of words.” Certainly. But as the author of a corpus-based dictionary, I am fairly confident that examining and describing the meaning of words has always formed an important part of how corpora are used. Corpus linguistics has transformed lexicography. At least since the publication of the corpus-based Collins Cobuild Dictionary of the English Language, every major dictionary publishing house has an in-house corpus. (The Oxford University Press, for example, produces the Oxford Dictionary of English with usage evidence from the Oxford English Corpus.) Several of the contributors to the Routledge Handbook of Corpus Linguistics list corpus lexicography as an area of academic interest (and the text dedicates a chapter to this subject). Corpus lexicography is the subject of extensive scholarship. Routledge has a dedicated series of corpus-based dictionaries in English, Spanish, French, Mandarin, Russian, etc. These volumes are written by academic linguists. It may be true that there are few academic articles that address the meaning of a particular word. But this would likely be the case only because, outside of the context of a legal dispute, such a question would not be all that interesting. That does not mean that academic corpus linguists are not concerned with the meaning of individual words.

A common objection to law and corpus linguistics is that it ignores the syntactic relationship among words, fails to take into account pragmatic context, or phrase- or sentence-level meaning, and that most of the language in a corpus is the language of ordinary speech, and not the commanding language of the law. Some of the work in law and corpus linguistics already attempts to take into account the relationships among the words in a legal text, and there is no reason that corpus evidence cannot be brought to bear on questions of syntax and pragmatics. And it is certainly the case that there is more work to be done on this front. But it is also possible to overstate these limitations to the utility of corpus evidence. In their recent article, The Meaning of Sex: Dynamic Words, Novel Applications, and Original Public Meaning, William N. Eskridge Jr., Brian G. Slocum, and Stefan Th. Gries observe that meaning is typically compositional, that is, “the meaning of a complex linguistic expression is built up from the meanings of its composite parts in a rule-governed fashion.” But where a word or phrase is proposed to have “some conventional meaning that differs from the compositional public meaning,” some reason must be given for why that is the case.

In addition, a word- or phrase-level search in the corpus’s concordance function will reveal the word’s use in a wide variety of contexts. The corpus user may search at the word or phrase level, but the evidence is often presented at the sentence level. Nothing forecloses the consideration of syntax or pragmatic context merely because the search was performed on a smaller linguistic unit. Context is, of course, very important to the interpretation of a text. But word- or phrase-level searches can reveal a significant amount of contextual usage evidence.
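For readers unfamiliar with what a concordance search returns, the following is a minimal sketch in Python of a keyword-in-context display: the search runs on a phrase, but each hit comes back with the text surrounding it. It is an illustration only, not the search interface of COFEA or any other corpus. The function name kwic, the width parameter, and the sample sentences are all hypothetical and were invented for this example.

```python
import re


def kwic(text, term, width=40):
    """Return (left context, hit, right context) for each occurrence of `term`.

    Hypothetical helper written for illustration; real corpus tools offer
    far richer search and display options.
    """
    # Match the term as a whole word or phrase, ignoring case.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


# Invented sample sentences; these are NOT drawn from any real corpus.
sample = (
    "The militia were ordered to bear arms against the enemy. "
    "He refused to bear arms in the service of the king. "
    "The hunter would bear his arms into the woods."
)

# Print each hit with its surrounding context, aligned on the search phrase.
for left, hit, right in kwic(sample, "bear arms"):
    print(f"{left:>40} [{hit}] {right}")
```

Even in this toy version, the phrase-level search surfaces who is bearing arms and against whom or for whom, which is the kind of contextual evidence an interpreter would then have to read and code by hand.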

Finally, Professor Kevin Tobia imagines a world in which corpus-based arguments function in the same thrust-and-parry style as Karl Llewellyn’s canons of interpretation, a world in which corpus-based arguments can be brought to bear on the opposite side of any dispute. I regret to concur that such a world is possible. However, I think that it is worth noting that most of the participants in this conference, after examining the contemporary usage evidence, agree that some of the claims about the meaning of “bear arms” that were used to support the United States Supreme Court’s decision in Heller are contradicted by usage evidence from the Founding Era. The parry to this thrust is that this conclusion has not convinced every participant that Heller was wrongly decided. Still, I think the apparent consensus is remarkable and demonstrates the potential for settling some thorny interpretive disputes with usage evidence.