Legal Documents Dataset

Train a model to work with a filing or other legal document (usually lengthy). This dataset contains Australian legal cases from the Federal Court of Australia (FCA). The cases were downloaded from AustLII ([web link]) and cover all cases from 2006, 2007, 2008 and 2009. The dataset was built for experiments in automatic summarization and citation analysis. For each document, the authors collected catchphrases (keywords), citation sentences, citation catchphrases, and citation classes. Catchphrases are found in the document itself and serve as the gold standard for the summarization experiments. Citation sentences are found in later cases that cite the present case and are also used for summarization. Citation catchphrases are the catchphrases (where available) of both later cases that cite the present case and older cases cited by the present case. Citation classes are indicated in the document and describe how the cases cited by the present case are treated.

A second dataset contains labeled and unlabeled legal contracts for extracting contract elements. It provides POS tags as well as annotations for the various contract elements.
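To make the idea of using a case's keywords as a gold standard for summaries concrete, here is a minimal sketch that scores a candidate summary by how many distinct gold keyword tokens it contains. The function names, the simple token-recall measure, and the example strings are assumptions made for illustration; they are not the dataset authors' evaluation method and the strings are not taken from the dataset.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def catchphrase_recall(summary, catchphrases):
    """Fraction of distinct gold catchphrase tokens that occur in the summary."""
    gold = set()
    for phrase in catchphrases:
        gold.update(tokenize(phrase))
    if not gold:
        return 0.0
    found = gold & set(tokenize(summary))
    return len(found) / len(gold)


# Hypothetical example data, not taken from the dataset.
gold_catchphrases = ["costs", "application for leave to appeal"]
candidate = "The court refused the application for leave and awarded costs."

score = catchphrase_recall(candidate, gold_catchphrases)
print(f"catchphrase token recall: {score:.2f}")
```

A real experiment would more likely use an established summarization metric; this sketch only shows the direction of the comparison, with the gold keywords as reference and the generated summary as candidate.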

For more information, see the README. Train a model to summarize complex contract jargon or legal analysis.

NLP is still largely unexplored when it comes to complicated language such as legal contracts. Recently, researchers from Berkeley and the Nueva School have examined legal NLP in their latest work.

The dataset of German legal documents consists of court decisions from 2017 and 2018 that were published online by the Federal Ministry of Justice and Consumer Protection. The documents come from seven federal courts: Federal Labour Court (BAG), Federal Finance Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG). The dataset consists of 66,723 sentences with 2,157,048 tokens. The seven court-specific datasets range in size from 5,858 to 12,791 sentences and from 177,835 to 404,041 tokens. On a per-token basis, annotations account for about 19-23% of the data.
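The per-token annotation share quoted above (about 19-23%) can be read as the fraction of tokens that carry an entity label. The sketch below computes that fraction for a list of (token, tag) pairs. It assumes BIO-style tagging in which "O" marks unannotated tokens; the sentence and the tag names are hypothetical and do not reproduce the dataset's actual label set.

```python
def annotated_token_share(tagged_tokens):
    """Return the fraction of tokens whose tag is not the outside tag 'O'."""
    if not tagged_tokens:
        return 0.0
    annotated = sum(1 for _, tag in tagged_tokens if tag != "O")
    return annotated / len(tagged_tokens)


# Hypothetical tagged sentence; tag names are illustrative assumptions.
sample = [
    ("Der", "O"),
    ("Bundesgerichtshof", "B-ORG"),
    ("in", "O"),
    ("Karlsruhe", "B-LOC"),
    ("hat", "O"),
    ("die", "O"),
    ("Revision", "O"),
    ("gestern", "O"),
    ("zurückgewiesen", "O"),
    (".", "O"),
]

share = annotated_token_share(sample)
print(f"annotated tokens: {share:.0%}")
```

Applied to a whole court-specific file, the same count would give that file's annotation share.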

This section uses "legal" in the sense of "about the law" as opposed to "not illegal". The article for the updated dataset is here: [web link]. These datasets can be used to pre-train larger models or, alternatively, to construct artificial tasks. Another page offers millions of documents of all kinds for download. A further resource is a collection of references to datasets, tasks and benchmarks at the intersection of machine learning and law. Another source delivers all SEC filings in real time and supports analyzing and downloading the filing documents. There is also a collection of datasets and tasks for machine learning on legal documentation, including open source contracts.

The lack of large labeled datasets has been kryptonite for AI in many areas. Researchers believe the answer lies in large, specialized datasets. The problem is that large datasets can require thousands of annotations and are expensive.

For specialized fields, annotations tend to be even more expensive. The researchers published CUAD, the Contract Understanding Atticus Dataset, a dataset of legal contracts with expert annotations from lawyers. With a corpus of more than 13,000 labels across 510 commercial contracts, CUAD opens up new avenues in legal NLP. The contracts were manually labeled under the supervision of experienced lawyers. The annotators worked with contracts in different file formats, such as PDF, TXT, CSV and Excel, containing a variety of legal clauses. The value of the dataset is estimated at over $2 million. In addition, companies and individuals often sign contracts without even reading them, which leads to predatory behavior that harms consumers.

A Dataset of German Legal Documents for Named Entity Recognition (Leitner et al., LREC 2020)

It is necessary to understand how words are used in different positions and also to preserve the similarity between words. WordNet is a lexical database of semantic relations between words in more than 200 languages.

Trust is key to AI adoption.

It enables organizations to understand and explain recommendations and outcomes and to manage AI-based decisions within their organization while maintaining full accountability and protection of data and information.

The CUAD annotators also reviewed over 100 pages of annotation rules and standards created for CUAD. Three additional annotators reviewed each annotation to ensure label quality and consistency. The CUAD code uses the HuggingFace Transformers library and was tested with Python 3.8, PyTorch 1.7 and Transformers 4.3/4.4. The researchers also tested CUAD v1 with nine sophisticated pre-trained language models.

You can obtain all SEC filings made by publicly traded companies on the SEC's website: [web link]. In related news, a freely available corpus has been created automatically; it includes thousands of categories (labels) and over 1 million provisions based on SEC EDGAR filings (the same source as the corpus above). Data generated continuously and incrementally from a variety of sources can be considered continuous data.