In law firms we often see hundreds of contracts or tenders coming up for review. When reviewing a tender, the first thing we do is identify which clause each statement belongs to, using the prescribed playbook to find the clause that matches the content. This comes at a real cost in time: we spend a long time working through just one contract or tender, so imagine how long it takes to review hundreds of them. That is where automation comes in, and the way to automate it is to build an automated contract review system. Such a system combines several machine learning models: clause identification, risk assessment (the level of risk associated with a clause), and a check of whether the contract is compliant with the prescribed playbook. We will not cover all of them in this story; we will only go through clause identification.
To build a clause identification model we need data annotated with the required clauses. We will show you how Predictly built such a dataset and how it will be useful later.
Details about the clause classification dataset:
Number of data points/sentences: 17,000
Number of labels (clauses): 22
Input Type: Text
Dataset Type: Multi-class
Labels: Audit, Business Conduct, Compliance with HSS and Environment, Confidentiality, Data Protection, Force Majeure, General Conditions, Import Export Law, Insurance, Intellectual Property Infringement, Intellectual Property Ownership, Liability and Indemnity, Limitation of Liability, Obligations of Company, Obligations of Contractor, Payment Terms, Suspension, Termination, Taxes and Duties, Warranties, Work Order and Change Order
Where and How to collect the data?
The first thing we need to do is collect the data. Since the labels are the clauses described above, we need to collect legal contract statements that correspond to those clauses. In today’s digital world we have the internet, so we can get the data we need by scraping the relevant web pages.
Methods: Web Scraping, Data Collection, Data Storage, Data Management, Data Manipulation
On the internet you will find both good and bad information, so it’s essential to do a quality check before you start scraping. Identifying good sources is the key to building a quality dataset.
That’s why we did rigorous research with the help of our annotation team, which includes legal professionals, to find out which websites have the best information and quality content that we can legally scrape. Once we identified the necessary web pages, we did a thorough analysis of their site structure.
With an idea of what to scrape and where to scrape it from, we started scraping the websites using Python with Selenium, the requests library, and BeautifulSoup.
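As an illustration of the requests/BeautifulSoup path, here is a minimal sketch; the URL and the CSS selector are placeholders, not the actual sources we scraped.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder source page; the real URLs come from the research phase.
URL = "https://example.com/contract-clauses"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Assume each clause sentence sits in a <p> inside a container div;
# the selector has to be adapted to each site's actual structure.
sentences = [p.get_text(strip=True) for p in soup.select("div.clause-text p")]
print(f"Scraped {len(sentences)} sentences")
```

For pages that render content with JavaScript, the same extraction runs on `driver.page_source` from Selenium instead of `response.text`.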
Once the data is scraped, we need to put it into a storage system. We store it securely in JSON and CSV formats in our cloud storage, ready to be processed in the annotation work that follows.
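A minimal sketch of the local storage step, writing the same records to both formats (the field names are illustrative):

```python
import csv
import json

# In practice these records come from the scraping step above.
records = [
    {"sentence": "The Contractor shall maintain adequate insurance...", "source": "example.com"},
]

# JSON copy for archival in cloud storage.
with open("clauses_raw.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV copy for the annotation pipeline.
with open("clauses_raw.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sentence", "source"])
    writer.writeheader()
    writer.writerows(records)
```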
How to build a Clause Classification Dataset?
Methods: Data Labeling, Data Visualization, Model Development, Machine Learning/Deep Learning, Model Evaluation, Word Embeddings (GloVe, FastText, etc.), Active Learning
Technology/Library used: Python, CSV, JSON, Regex, PyTorch, NumPy, TensorBoard, Fast.ai, Scikit-learn, Matplotlib, Seaborn and the Predictly Text Annotation Platform
We used the above methods to turn the unstructured data into an annotated dataset. The process goes as follows.
Once we collected the data, we read it using the pandas library. After reading the data, we cleaned it and mapped it to the suitable clauses using Python and regular expressions.
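A minimal sketch of that cleaning pass (the file and column names are assumptions):

```python
import re
import pandas as pd

# Load the scraped data (file name is illustrative).
df = pd.read_csv("clauses_raw.csv")

def clean_sentence(text: str) -> str:
    """Collapse whitespace and strip leading clause numbers such as 12.3."""
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"^\d+(\.\d+)*\s*", "", text)
    return text.strip()

df["sentence"] = df["sentence"].astype(str).map(clean_sentence)
df = df.drop_duplicates(subset="sentence").reset_index(drop=True)
```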
To get an idea of what data we had collected, we plotted it using Matplotlib and Seaborn.
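For example, a quick look at how many sentences each clause has (assuming a provisional `clause` column in the DataFrame above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.countplot(y=df["clause"], order=df["clause"].value_counts().index)
plt.title("Sentences per clause")
plt.tight_layout()
plt.show()
```

A plot like this quickly shows class imbalance, which matters for a multi-class model.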
The next thing we had to do was annotate the data, so we forwarded around 40% of it to our annotation team for a quality check.
Our annotation team, which consists of law interns, used our text annotation tool to review whether each sentence in the dataset was labeled with the proper clause.
Once we had that 40% of the data annotated, we built a machine learning model for multi-class classification and trained it on that data. We tested several models using fastai, PyTorch, and scikit-learn, and kept the best-performing one to proceed further.
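As an illustration of what such a baseline can look like, here is a minimal scikit-learn pipeline; `labeled_df` and its column names are assumptions standing in for the verified 40%:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hold out part of the annotated data to measure the baseline.
X_train, X_val, y_train, y_val = train_test_split(
    labeled_df["sentence"], labeled_df["clause"],
    test_size=0.2, stratify=labeled_df["clause"], random_state=42,
)

# TF-IDF features + logistic regression: a simple multi-class baseline.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_val, model.predict(X_val)))
```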
Now we had a model that was not yet as accurate as it should be, but we would improve its performance step by step. So we took another 20% of the data and ran predictions on it using the ML model.
Those predictions then had to be verified by our annotation team, and once the verification was done we trained our model further with that extra 20% of data. The model was now trained on 60% of the data, which increased its accuracy.
We kept going this way until 100% of the data was annotated. At the end we also had a model that gives good predictions on new data.
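To make the loop concrete, here is a rough sketch of one way to run it, continuing from the baseline above; `unlabeled_df` and `annotation_team_verify` are placeholders (the latter stands in for the manual review step, not a real function):

```python
import pandas as pd

# Each pass: predict on the next ~20% batch, have humans verify the
# predicted clauses, then retrain on all verified data so far.
batch_size = int(0.2 * (len(labeled_df) + len(unlabeled_df)))

while len(unlabeled_df) > 0:
    batch = unlabeled_df.iloc[:batch_size].copy()
    batch["clause"] = model.predict(batch["sentence"])

    verified = annotation_team_verify(batch)  # placeholder: human verification
    labeled_df = pd.concat([labeled_df, verified], ignore_index=True)
    unlabeled_df = unlabeled_df.iloc[batch_size:]

    # Retrain on everything verified so far (40%, 60%, 80%, 100%).
    model.fit(labeled_df["sentence"], labeled_df["clause"])
```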
But we wanted a sophisticated model that can be used in real applications, so we trained a transformer-based model, BERT (the state-of-the-art NLP model at the time), which classifies clauses with a 0.90 F1 score.
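A sketch of what such a fine-tuning setup can look like with the Hugging Face transformers and datasets libraries; our actual training code and hyperparameters differ, and `labeled_df` with an integer-encoded `labels` column is an assumption:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # illustrative checkpoint choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=22)

# Build train/test splits from the fully annotated DataFrame.
ds = Dataset.from_pandas(labeled_df[["sentence", "labels"]]).train_test_split(test_size=0.2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clause-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()
print(trainer.evaluate())
```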
Where can we use this dataset?
As we mentioned at the beginning, we will be using this dataset to identify the clauses of a contract based on the prescribed playbook. Furthermore, we can build two more datasets, for risk assessment and playbook compliance, which help build the complete solution for reviewing contracts or tenders in an automated way.
Here’s what a typical review system looks like:
This is how the system works. First, you upload the contract. Once the contract is uploaded, the system extracts all the text and classifies it into the respective clauses. For each identified clause it then looks up the playbook, finds the associated risks, and checks whether the written contract statement is compliant with the playbook. Finally, it returns every detail it has found, based on the user’s requirements.
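Putting that flow together, here is a high-level sketch; `split_into_sentences`, `clause_model`, `assess_risk`, and `check_compliance` are all placeholders for the classifier above and the two models left for future stories:

```python
def review_contract(contract_text: str, playbook: dict, clause_model) -> list[dict]:
    """Sketch of the review flow: classify, look up the playbook, report."""
    results = []
    for sentence in split_into_sentences(contract_text):    # placeholder splitter
        clause = clause_model.predict([sentence])[0]        # 1. identify the clause
        entry = playbook.get(clause, {})                    # 2. look it up in the playbook
        results.append({
            "sentence": sentence,
            "clause": clause,
            "risk": assess_risk(sentence, entry),           # 3. risk level (future model)
            "compliant": check_compliance(sentence, entry), # 4. compliance (future model)
        })
    return results
```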