Data Collections

Here we provide an overview of existing data collections and corpora of legal language in various sizes, languages, and text types. If a corpus is missing or you have additional information for this page, please contact us.

Overview

English Corpora

1.1 American Law Corpus
1.2 British Law Report Corpus
1.3 CAL² Corpus of European Law
1.4 Corpus of Historical English Law Reports
1.5 House of Lords Judgments Corpus
1.6 Old Bailey Corpus
1.7 Cambridge Corpus of Legal English
1.8 CADIS Corpus of Academic English
1.9 USCC Corpus
1.10 Case Law Corpus
1.11 Corpus of Founding Era American English
1.12 BYU-Corpus of Early Modern English
1.13 Corpus of Supreme Court Opinions of the United States
1.14 Corpus of State Conventions on the Adoption of the Constitution
1.15 Corpus of the Records of the Constitutional Convention
1.16 Corpus of Early Statutes at Large
1.17 Corpus of US Caselaw
1.18 Corpus of US Supreme Court Opinions

German Corpora

2.1 CAL² Corpus of European Law / Juristisches Referenzkorpus
2.2 DS21 Corpus
2.3 Swiss Legislation Corpus

Multilingual / Comparable Corpora

3.1 Bononia Legal Corpus (IT/EN)
3.2 DS21 Corpus (DE/FR/IT)
3.3 JRC-Acquis (23 languages)
3.4 JUD-GENTT (EN/ES/DE/FR)
3.5 IULA Technical Corpus (ES/EN)
3.6 CLUVI Corpus (GL/ES)
3.7 GENTEXT-N (EN/ES)
3.8 CADIS Corpus (EN/IT)
3.9 DGT-TM (276 language pairs)
3.10 DGT-Acquis (253 language combinations)
3.11 EUROPARL (21 European languages)
3.12 EUCLCORP (European languages)
3.13 COSPE (ES/IT/EN)

Other Languages

4.1 LEGA -//-
4.2 GARALEX -//-
4.3 CORIS/CODIS (IT)
4.4 Perugia Corpus (IT)
4.5 Testi Amministrativi Chiari e Semplici (IT)
4.6 Polish Law Corpus (PL)
4.7 Corpus de Procesos Penales (ES)

1. English Corpora

1.1 American Law Corpus (ALC)

Academic journals, textbooks, briefs, contracts, legislation, and opinions.

  • Content: Directives and judgements of the European Community from 1968 until 1995
  • Size: 20 million words | Access: open access
  • Editor: University of Bologna, Italy | Link: n.a.

 

1.2 British Law Report Corpus (BLaRC)

n.a.

  • Content: Law reports from Northern Ireland, Scotland, England, and Wales from 2008 until 2010
  • Size: 8.85 million words, 1,228 texts | Access: open access

  • Editor: University of Murcia, Spain | Link: Homepage

 

1.3 CAL² Corpus of European Law

„A collection of all relevant text types of German Law, which covers the following three main domains: all statutes of national law (legislation, recorded at one time); decisions and opinions of all federal courts and of a selection of courts at different instances (case law); commentaries, legal papers and articles of academic legal discourse, published in the most important and high ranked law journals.“ (Self-description, CAL²)

  • Content: Statutes, academic texts, decisions, and opinions, most from ca. 1980 until today
  • Size: 1 billion words | Access: n.a.
  • Editor: International Research Group Computer Assisted Legal Linguistics | Link: Homepage

 

1.4 Corpus of Historical English Law Reports (CHELAR)

„CHELAR is a specialised corpus consisting of law reports dating from the period 1535-1999. Law reports are records of judicial decisions which are “cited by lawyers and judges for their use as precedent in subsequent cases” (Encyclopædia Britannica Online s.v. law report); they typically contain an account of all the facts of the case, the arguments of the judge, his reasoning, the judgment he arrives at and the kind of authority and evidence he uses.“ (Self-description, CHELAR)

  • Content: Law reports, from 1535 until 1999.
  • Size: ca. 0.5 million words | Access: open access
  • Editor: Universidad de Santiago de Compostela | Link: Homepage

 

1.5 House of Lords Judgments Corpus (HOLJ)

„This page lists HTML versions of all House of Lords judgments delivered from 14 November 1996 to 30 July 2009. Print versions of judgments since 2005 are available in PDF format from the top right hand side of individual judgment pages. Information about judgments prior to 1996 can be found on the judgements page.“ (Self-description, HOLJ)

  • Content: Judgments by the House of Lords from 2001–2003
  • Size: 2.8 million words | Access: open access
  • Editor: University of Edinburgh | Link: Homepage

1.6 Old Bailey Corpus

„Old Bailey Corpus is a sociolinguistically, pragmatically and textually annotated corpus based on the Proceedings of the Old Bailey. These speech-related texts document Late Modern English as used in London’s Central Criminal Court. The Proceedings of the Old Bailey were published from 1674 to 1913 and constitute a large body of Late Modern English texts.“ (Self-description, Old Bailey Corpus)

  • Content: Proceedings of London’s Central Criminal Court from 1674–1913
  • Size: ca. 200,000 | Access: open access
  • Editor: Justus-Liebig-Universität Giessen | Link: Homepage

1.7 Cambridge Corpus of Legal English

„The Cambridge Corpus of Legal English is a collection of books, journals, newspaper articles relating to the law and legal processes. It includes documents in both British and American English.“ (Self-description, Cambridge Corpus of Legal English)

  • Content: Collection of books, journals, newspaper articles relating to the law and legal processes
  • Size: 20 million words | Access: purchasable
  • Editor: Cambridge University Press | Link: Homepage

1.8 CADIS Corpus of Academic English

„The corpus lies at the heart of a scientific project aimed at analyzing identity traits in academic discourse (Gotti 2010). It is composed of a major English subcorpus and a smaller one in Italian for comparative purposes. CADIS represents four main disciplinary areas: Applied Linguistics (AL), Economics (E), Law (L) and Medicine (M). For each disciplinary area, four different textual genres have been considered: abstracts (A), book reviews (B), editorials (E), research articles (RA).“ (Wiley Online Library)

  • Content: Applied Linguistics (AL), Economics (E), Law (L) and Medicine (M)
  • Size: 12 million words, 2,761 academic texts | Access: no open access
  • Editor: Università degli Studi di Bergamo, M. Gotti | Link: n.a.

1.9 USCC Corpus

A corpus built by Davide Mazzi (University of Modena and Reggio Emilia). It is made up of 67 opinions (658,154 words) delivered by the US Supreme Court, with the primary aim of studying judicial argumentation.

  • Content: 67 opinions delivered by the US Supreme Court
  • Size: 658,154 words | Access: n.a.
  • Editor: University of Modena and Reggio Emilia, Davide Mazzi | Link: n.a.

1.10 Case Law Corpus

Developed in the Centre for Computers and Law (Erasmus University, Rotterdam) by van Noortwijk and De Mulder. The Case Law Corpus is a monolingual corpus gathering 3,073 judicial decisions (16.5 million words) delivered both by civil and criminal UK jurisdictions and courts.

  • Content: 3,073 judicial decisions delivered by civil and criminal UK jurisdictions and courts
  • Size: 16.5 million words | Access: n.a.
  • Editor: Centre for Computers and Law (Erasmus University, Rotterdam), van Noortwijk and De Mulder | Link: n.a.

1.11 Corpus of Founding Era American English (COFEA)

„The Corpus of Founding Era American English covers the time period starting with the reign of King George III, and ending with the death of George Washington (1760-1799). COFEA contains documents from ordinary people of the day, the Founders, and legal sources, including letters, diaries, newspapers, non-fiction books, fiction, sermons, speeches, debates, legal cases, and other legal materials. The majority of texts have been pulled from the following six sources: the National Archive Founders Online; William S. Hein & Co., HeinOnline; Text Creation Partnership (TCP) Evans Bibliography (University of Michigan); Elliot’s Debates; Farrand’s Records; and the U.S. Statutes-at-Large from the first five Congresses.“ (BYU Law & Corpus Linguistics)

  • Content: n.a.
  • Size: 119,801 texts, 133 million words | Access: n.a.
  • Editor: James Phillips | Link: Homepage

1.12 BYU-Corpus of Early Modern English (COEME)

„The BYU-Corpus of Early Modern English covers texts from 1475 – 1800 that were included in the Evans Bibliography, the Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO) corrected by the Text Creation Partnership (TCP) Evans Bibliography (University of Michigan).“ (BYU Law & Corpus Linguistics)

  • Content: n.a.
  • Size: 40,300 texts, 1,107,365,393 words | Access: n.a.
  • Editor: n.a. | Link: n.a.

1.13 Corpus of Supreme Court Opinions of the United States (COSCO-US)

„The Corpus of Supreme Court Opinions of the United States includes all opinions in the United States Reports and opinions published by the Supreme Court through the 2017 term.“ (BYU Law & Corpus Linguistics)

  • Content: n.a.

  • Size: 60,545 texts, 94,156,760 words | Access: n.a.

  • Editor: Supreme Court (US) | Link: n.a.

     

1.14 Corpus of State Conventions on the Adoption of the Constitution (COSCAC)

„The Corpus of State Conventions on the Adoption of the Constitution consists of five volumes of The Debates in the Several State Conventions on the Adoption of the Federal Constitution. According to the Library of Congress, they “remain the best source for materials about the national government’s transitional period between the closing of the Constitutional Convention in September 1787 and the opening of the First Federal Congress in March 1789.”“ (BYU Law & Corpus Linguistics)

  • Content: n.a.

  • Size: 652 texts, 1,479,149 words | Access: n.a.

  • Editor: n.a. | Link: n.a.

1.15 Corpus of the Records of the Constitutional Convention (CORCC)

„The Corpus of the Records of the Constitutional Convention covers three of the four volumes of The Records of the Federal Convention of 1787. Published in 1911, Farrand’s work attempted to represent the documentary records of the Constitutional Convention.“ (BYU Law & Corpus Linguistics)

  • Content: n.a.

  • Size: 847 texts, 689,755 words | Access: open

  • Editor: Max Farrand | Link: Homepage

1.16 Corpus of Early Statutes at Large (CESAL)

„The Corpus of Early Statutes at Large includes laws passed by the United States Congress in chronological order. The first set published covers the first five Congresses and a small part of the sixth.“ (BYU Law & Corpus Linguistics)

  • Content: n.a.

  • Size: 481 texts, 471,260 words | Access: n.a.

  • Editor: n.a. | Link: n.a.

1.17 Corpus of US Caselaw (CUSC)

„The Caselaw Access Project (“CAP”) expands public access to U.S. law. Its goal is to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library. The first installment is state cases from 1760 to 1799.“ (BYU Law & Corpus Linguistics)

  • Content: n.a.

  • Size: 8,466 texts, 4,204,790 words | Access: open

  • Editor: Harvard Law School | Link: Homepage

1.18 Corpus of US Supreme Court Opinions (SCOTUS)

„This corpus contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the current time. This corpus was released in March 2017. In Sep 2018, a similar corpus (covering the same period) was released by the BYU Law School.“ (Self-description, SCOTUS)

  • Content: Supreme Court decisions, 1790s to present

  • Size: 130 million words, 32,000 texts | Access: open access

  • Editor: Supreme Court of the United States | Link: Homepage

 

2. German Corpora

2.1 CAL² Corpus of European Law / Juristisches Referenzkorpus

„A collection of all relevant text types of German Law, which covers the following three main domains: all statutes of national law (legislation, recorded at one time); decisions and opinions of all federal courts and of a selection of courts at different instances (case law); commentaries, legal papers and articles of academic legal discourse, published in the most important and high ranked law journals.“ (Self-description, CAL²)

  • Content: Statutes, academic texts, decisions, and opinions, most from ca. 1980 until today

  • Size: 1 billion words | Access: n.a.

  • Editor: International Research Group Computer Assisted Legal Linguistics | Link: Homepage

 

2.2 DS21 Corpus

n.a.

  • Content: Swiss legal texts from the early Middle Ages to 1798

  • Size: 4 million words | Access: n.a.

  • Editor: n.a. | Link: n.a.

 

2.3 Swiss Legislation Corpus (SLC)

„We describe the construction of two corpora in the domain of Swiss legal texts: The DS21 corpus is based on the Collection of Swiss Law Sources and contains historical legal texts from the early Middle Ages up to 1798; the Swiss Legislation Corpus (SLC) is based on the Classified Compilation of Swiss Federal Legislation and contains all current Swiss federal laws. The paper summarizes the key properties of both corpora, discusses issues encountered while building them, and outlines some applications.“ (Self-description, SLC)

 

  • Content: Legislative writings of the Swiss Confederation
  • Size: 5,745 texts | Access: open access
  • Editor: Universität Zürich | Link: Homepage

3. Multilingual / Comparable Corpora

3.1 Bononia Legal Corpus (BoLC)

„The Bononia Legal Corpus – BoLC – is the result of an on-going research project. It is aimed at the construction and analysis of a multilingual comparable legal corpus. It is being developed at the University of Bologna. It has been coordinated by Rema Rossini Favretti and Fabio Tamburini. John Sinclair played a crucial role as consultant. We wish to thank Adriano Di Pietro for his contribution during the corpus design and implementation.“ (Self-description, BoLC)

  • Content: Directives and judgements of the European Community from 1968 until 1995
  • Size: 20 million words | Access: open access

  • Editor: University of Bologna, Italy | Link: Homepage

3.2 DS21 Corpus

n.a.

  • Content: Swiss legal texts from the early Middle Ages to 1798

  • Size: 4 million words | Access: n.a.

  • Editor: n.a. | Link: n.a.

 

3.3 JRC-Acquis

„The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. This collection of legislative text changes continuously and currently comprises selected texts written between the 1950s and now. As of the beginning of the year 2007, the EU had 27 Member States and 23 official languages. The Acquis Communautaire texts exist in these languages, although Irish translations are not currently available. The Acquis Communautaire thus is a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian and Swedish.“ (europa.eu)

  • Content: Texts from EU legislation

  • Size: 463,792 texts | Access: open access

  • Editor: EU Science Hub | Link: Homepage

3.4 JUD-GENTT

„JUD-GENTT is an ongoing research project developed within the GENTT project (Textual Genres for Translation), that aims at building a multilingual (EN, ES, DE, FR) comparable corpus of textual genres (Law, Medicine and other technical fields) to provide a sort of encyclopedia of specialised texts for translations.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: Different kinds of texts produced as part of the criminal proceedings in England, Spain, Germany and France

  • Size: n.a. | Access: no open access

  • Editor: Universidad Jaume I (Castellón), A. Borja Albi (coord.) | Link: n.a.

3.5 IULA Spanish-English Technical Corpus

„The corpus consists of a number of specialized texts (Law, Economics, Medicine, Environment and Computer Science domains) available in both Spanish and English languages. This LSP corpus has been compiled with articles from specialized publications, PhD theses, etc. It contains a total of about 2.1 M words in 127 documents in each language.“ (Self-description, IULA)

  • Content: Different domains: Law, Economics, Environment, Medicine, Computer Science

  • Size: 2.1 million words | Access: no open access

  • Editor: Institute for Applied Linguistics of the University Pompeu Fabra of Barcelona (IULA) | Link: Homepage

3.6 CLUVI Corpus

„The Linguistic Corpus of the University of Vigo (CLUVI) is a parallel open corpus of specialised registers (fiction, computing, journalism, legal and administrative fields, etc.), totaling more than 27 million words of running texts. Two of its eight subcorpora are entirely dedicated to legal language, namely LEGA and LEGE-BI.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: Specialised registers (fiction, computing, journalism, legal and administrative fields, etc.)

  • Size: 27 million words | Access: open access

  • Editor: Universidade de Vigo, G. X. Gómez, A. Simões | Link: Homepage

3.7 GENTEXT-N

„The University of Valencia has built up the GENTEXT-N corpus, within the research group Gender, Language and Sexual (In)Equality. It is a bilingual (ES-EN) comparable corpus of almost 35 million words extracted from press articles (The Times, The Guardian, El País, El Mundo) dealing with legal actions to cope with sexual (in)equality in Spain and Great Britain.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: Newspapers, magazines
  • Size: 35 million words | Access: n.a.
  • Editor: University of Valencia | Link: n.a.

3.8 CADIS Corpus of Academic English

„The corpus lies at the heart of a scientific project aimed at analyzing identity traits in academic discourse (Gotti 2010). It is composed of a major English subcorpus and a smaller one in Italian for comparative purposes. CADIS represents four main disciplinary areas: Applied Linguistics (AL), Economics (E), Law (L) and Medicine (M). For each disciplinary area, four different textual genres have been considered: abstracts (A), book reviews (B), editorials (E), research articles (RA).“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: Applied Linguistics (AL), Economics (E), Law (L) and Medicine (M)
  • Size: 12 million words, 2,761 academic texts | Access: no open access
  • Editor: Università degli Studi di Bergamo, M. Gotti | Link: n.a.

3.9 DGT Multilingual Translation Memory of the Acquis Communautaire (DGT-TM)

„Since November 2007 the European Commission’s Directorate-General for Translation has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to foster the European Commission’s general effort to support multilingualism, language diversity and the re-use of Commission information.
This page, which is meant for technical users, provides a description of this unique linguistic resource as well as instructions on where to download it and how to produce bilingual aligned corpora for any of the 276 language pairs or 552 language pair directions. Here is an example of one sentence translated into 22 languages.“ (Self-description, DGT-TM)

  • Content: n.a.
  • Size: 6,226,855 translation units | Access: n.a.
  • Editor: European Commission, Directorate-General for Translation | Link: Homepage
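
The pair and direction counts quoted in the description follow from simple combinatorics: n languages yield n(n-1)/2 unordered language pairs and n(n-1) ordered translation directions. The following is a minimal sketch of that arithmetic. It assumes that every language is paired with every other one; the language count of 24 is inferred from the figure 276 and is not stated in the description, and the helper name pair_counts is purely illustrative.

```python
def pair_counts(n_languages: int) -> tuple[int, int]:
    """Return (unordered pairs, ordered directions) for n languages.

    Assumes a fully connected set of languages: each language is
    paired with every other language exactly once.
    """
    if n_languages < 0:
        raise ValueError("n_languages must be non-negative")
    directions = n_languages * (n_languages - 1)  # ordered: A->B and B->A
    pairs = directions // 2                       # unordered: {A, B}
    return pairs, directions


if __name__ == "__main__":
    # 24 languages (an inferred value) reproduce the quoted DGT-TM figures.
    print(pair_counts(24))
    # 23 languages give the 253 combinations listed for DGT-Acquis.
    print(pair_counts(23))
```

Under the same assumption, 23 languages give 253 pairs, which matches the number of language combinations listed for DGT-Acquis in the overview.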

3.10 DGT-Acquis

„The DGT-Acquis is a family of several multilingual parallel corpora extracted from the Official Journal of the European Union (OJ) in Formex 4 (XML) format, consisting of documents from the middle of 2004 to the end of 2011 in up to 23 languages.“ (Self-description, DGT-Acquis)

  • Content: n.a.
  • Size: 3.53 million files | Access: n.a.
  • Editor: n.a. | Link: Homepage

3.11 European Parliament Proceedings Parallel Corpus (EUROPARL)

„The European Parliament Proceedings Parallel Corpus 1996-2011 (EUROPARL) is a multilingual parallel corpus containing more than 60 million words per language based on the EP proceedings.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: n.a.
  • Size: over 60 million words per language | Access: open download
  • Editor: n.a. | Link: Homepage

3.12 EUCLCORP

„The EUCLCORP project aimed to address the gap in resources available for analysing EU case law by providing a resource that allows users of law to investigate in a systematic way: the history of the meaning(s) of a particular legal term; in the case of an ambiguous term – the sense in which it is most frequently used; the influence of national legal languages on EU case law (and vice versa); the impact of translation on the development of EU case law“ (Wiley Online Library)

  • Content: n.a.
  • Size: n.a. | Access: n.a.
  • Editor: University of Birmingham | Link: Homepage

3.13 Corpus de Sentencias Penales (COSPE)

Criminal judgments from 2005 to 2012.

  • Content: Criminal judgments, 2005 to 2012
  • Size: 6 million tokens, 782 texts | Access: n.a.
  • Editor: Gianluca Pontrandolfo (University of Trieste) | Link: n.a.

 

4. Other Languages

4.1 LEGA

n.a.

  • Content: Legal‐administrative texts
  • Size: 120 texts, 1,000+ print pages | Access: n.a.
  • Editor: n.a. | Link: n.a.

4.2 GARALEX

„GARALEX is a web platform for the study of legal language, developed following a corpus-based methodology.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

 

  • Content: n.a.
  • Size: 98,9343 words | Access: n.a.
  • Editor: University of the Basque Country | Link: Homepage

4.3 CORIS/CODIS

„The Corpus di Riferimento dell’Italiano Scritto (CORIS) and the Corpus Dinamico dell’Italiano Scritto (CODIS) are two different structures of the same reference corpus developed at the University of Bologna by Rossini Favretti’s team. The project started in 1998 with the purpose of creating a representative and sizeable general reference corpus of written Italian – following the Brown Corpus model (see Xiao 2008: 395-397) – which would be easily accessible and user-friendly. Compared with CORIS (100 million words, plus 30 million words of monitor corpus), CODIS (100 million words) has a dynamic structure allowing researchers to exclude or include different subcorpora for specific analyses (Rossini Favretti et al. 2002). It has a subcorpus of legal language, totaling 10 million words.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: General reference corpus of written Italian
  • Size: CORIS: 130 million words; CODIS: 100 million words | Access: open access
  • Editor: University of Bologna, Rossini Favretti | Link: Homepage

4.4 Perugia Corpus (PEC)

„A reference corpus of contemporary Italian which gathers both oral and written texts (25 million words) distributed among 10 textual genres. It contains a legal subcorpus (1.1 million words) made up of administrative texts (laws, regulations, European legislation). Another corpus developed by the same University is the Academic Italian Corpus (AIC), totaling 1 million words, which contains a legal academic subcorpus (330,000 words)“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: Oral and written texts distributed among 10 textual genres
  • Size: 25 million words | Access: n.a.
  • Editor: University for Foreigners of Perugia, Stefania Spina | Link: Homepage

4.5 Testi Amministrativi Chiari e Semplici (TACS)

„It is a monolingual corpus of original Italian administrative texts produced by a number of administrative bodies (municipalities, regions, provinces, universities, ministries) and its ‘translation’/rewriting in a simplified language in the wake of the simplification of legalese and legal administrative language.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: n.a.
  • Size: n.a. | Access: n.a.
  • Editor: University of Padua, Michele Cortelazzo | Link: Homepage

4.6 Polish Law Corpus

„The Polish Law Corpus is a monolingual corpus (PL) of 4 million words, built by Łucja Biel (University of Gdansk) which includes 211 codes and major legal acts related to contract, company, civil and criminal law (Biel 2010a). One of the main objectives of the author is describing nominal, verbal and adjectival collocations of legal terms within the context of an ongoing project aimed at compiling the Dictionary of Polish Legal Collocations for Translators.“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: n.a.

  • Size: 4 million words | Access: n.a.

  • Editor: University of Gdansk, Łucja Biel | Link: n.a.

4.7 Corpus de Procesos Penales (CPP)

„The Corpus de Procesos Penales (CPP) is a monolingual (ES) corpus of criminal trials built by Raquel Taranilla (University of Barcelona) of 98,943 words that collects 10 criminal trials held in Barcelona between 2009 and 2010. Its primary aim was the study of narrative elements in judicial discourse (cf. Taranilla 2011).“ (Gianluca Pontrandolfo; Legal corpora: an overview)

  • Content: Criminal trials

  • Size: 98,943 words | Access: n.a.

  • Editor: University of Barcelona, Raquel Taranilla | Link: n.a.