In law firms, hundreds of contracts or tenders often come up for review. For each contract statement, the first thing we do is identify which clause it belongs to, using the prescribed playbook to find the clause that matches the described content. This comes at the cost of time: we spend a lot of time working through a single contract or tender, so imagine how long it takes to review hundreds of them. That’s where automation comes in, in the form of an automated contract review system. Such a system would include several machine learning models: clause identification, risk assessment (the level of risk associated with a clause), and a check of whether the contract is compliant with the prescribed playbook. We will not cover all of them in this story; we will only go through clause identification.
To build a clause identification model, we need data annotated according to the required clauses. We will show you how Predictly built such a dataset and how it will be useful further on.
Details about the clause classification dataset:
Number of data points/sentences: 17000
Number of labels (Clauses): 22
Input Type: Text
Dataset Type: Multi-class
Labels: Audit, Business Conduct, Compliance with HSS and Environment, Confidentiality, Data Protection, Force Majeure, General Conditions, Import-Export Law, Insurance, Intellectual Property Infringement, Intellectual Property Ownership, Liability and Indemnity, Limitation of Liability, Obligations of Company, Obligations of Contractor, Payment Terms, Suspension, Termination, Taxes and Duties, Warranties, Work Order and Change Order
Where and How to collect the data?
First, we need to collect the data: legal contract statements corresponding to the clauses we listed as labels. In today’s digital world, we can gather the required data by scraping the relevant web pages.
Methods: Web Scraping, Data Collection, Data Storage, Data Management, Data Manipulation
You will find both good and bad information on the internet, so it’s essential to do a quality check before you start scraping. Identifying good sources is key to building a quality dataset.
That’s why we did rigorous research with the help of our annotation team, which includes legal professionals, to find out which websites have the best information and quality content that we can legally scrape. Once we identified the necessary web pages, we thoroughly analyzed their structure.
Once we knew what to scrape and from where, we scraped the websites using Python with Selenium, the requests library, and BeautifulSoup.
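As a rough illustration only, here is what a minimal scraping step could look like with requests and BeautifulSoup; the URL and CSS selector below are placeholders, not the actual sources we used.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: the URL and selector are placeholders, not real sources.
URL = "https://example.com/sample-contract"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every paragraph inside the (hypothetical) main content area.
paragraphs = [p.get_text(strip=True) for p in soup.select("div.contract-body p")]

# Keep only non-empty statements for the next stage (storage and annotation).
statements = [text for text in paragraphs if text]
print(f"Scraped {len(statements)} candidate statements")
```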
Once the data is scraped, it needs to go into a storage system. We store it securely in JSON and CSV formats in our cloud storage and process it for further annotation work.
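A small sketch of that storage step, with hypothetical records and file names:

```python
import csv
import json

# Hypothetical records; in practice these come from the scraping step above.
records = [
    {"source": "example.com", "text": "The Contractor shall maintain insurance..."},
    {"source": "example.com", "text": "Either party may terminate this agreement..."},
]

# Store as JSON for the annotation pipeline.
with open("scraped_statements.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Store the same records as CSV for quick inspection.
with open("scraped_statements.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "text"])
    writer.writeheader()
    writer.writerows(records)
```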
How to build a Clause Classification Dataset?
Methods: Data Labeling, Data Visualization, Model Development, Machine Learning/Deep Learning, Model Evaluation, Word Embeddings (GloVe, FastText, etc.), Active Learning
Technology/Library used: Python, CSV, JSON, Regex, PyTorch, NumPy, TensorBoard, Fast.ai, Scikit-Learn, Matplotlib, Seaborn, and Predictly Text Annotation Platform
We used the above methods to turn the unstructured data into an annotated dataset. The process goes as follows.
Once we collected the data, we read it using the pandas library. After reading the data, we clean it and organize it into suitable clauses using regular expressions and Python.
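A minimal sketch of that reading and cleaning step, assuming a hypothetical scraped_statements.csv file with a text column (as in the storage sketch above):

```python
import re
import pandas as pd

# Hypothetical file name; assumes the CSV produced in the storage step.
df = pd.read_csv("scraped_statements.csv")

def clean_text(text: str) -> str:
    """Basic cleanup: collapse whitespace and strip leading clause numbering."""
    text = re.sub(r"\s+", " ", str(text))           # collapse runs of whitespace
    text = re.sub(r"^\s*\d+(\.\d+)*\s*", "", text)   # drop prefixes like "12.3"
    return text.strip()

df["text"] = df["text"].apply(clean_text)
df = df.drop_duplicates(subset="text").dropna(subset=["text"])
print(df.head())
```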
To get an idea of what data we collected, we plotted it using Matplotlib and Seaborn.
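For example, a quick look at how many statements fall under each clause could be plotted like this; the clause column is an assumption about how the provisional labels are stored.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes a "clause" column holding the provisional label for each statement.
plt.figure(figsize=(10, 6))
sns.countplot(y=df["clause"], order=df["clause"].value_counts().index)
plt.title("Number of statements per clause")
plt.xlabel("Count")
plt.ylabel("Clause")
plt.tight_layout()
plt.show()
```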
The next step is to annotate the data, so we forwarded around 40% of it to our annotation team for a quality check.
Our annotation team, which consists of law interns, reviewed the dataset in our text annotation tool to check whether each statement was labeled with the proper clause.
Once we have that 40% annotated, we build a machine learning model for multi-class classification and train it on that data. We tested several models using fastai, PyTorch, and scikit-learn and picked the best-performing one to proceed further.
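As an illustration only (the actual models were built with fastai, PyTorch, and scikit-learn), a simple scikit-learn baseline for this multi-class setup might look like this, assuming df now holds the ~40% of data with verified clause labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Split the verified 40% into train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["clause"], test_size=0.2, stratify=df["clause"], random_state=42
)

# A simple TF-IDF + logistic regression baseline for the 22-way classification.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

baseline.fit(X_train, y_train)
print(classification_report(y_val, baseline.predict(X_val)))
```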
Now we have a model, which is not yet as accurate as it should be, but we will improve its performance step by step. We take another 20% of the data and run predictions on it using the model.
Those predictions are then verified by our annotation team. Once verification is done, we train the model further with that extra 20%, so the model is now trained on 60% of the data and its accuracy increases.
We keep going this way until 100% of the data is annotated. By the end, we also have a model that gives good predictions on new data.
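Continuing from the baseline sketch above, the iterative loop can be summarised roughly like this; the 20% slices and the human_review step are placeholders for work that actually happens in the annotation tool, not in code.

```python
def human_review(texts, predicted_labels):
    """Placeholder for the annotation team verifying or correcting predictions."""
    return predicted_labels  # corrected labels would come back from the annotators

labeled_texts = list(X_train)
labeled_clauses = list(y_train)

# Placeholder: pretend the remaining data arrives in successive 20% slices.
unlabeled_slices = [list(X_val)]

for batch in unlabeled_slices:
    predicted = baseline.predict(batch)           # pre-label the new slice with the model
    verified = human_review(batch, predicted)     # annotators confirm or fix the labels

    labeled_texts.extend(batch)                   # fold the verified slice back in
    labeled_clauses.extend(verified)
    baseline.fit(labeled_texts, labeled_clauses)  # retrain on the enlarged set
```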
But we wanted a model sophisticated enough for real applications, so we used a transformer-based model, BERT (a state-of-the-art NLP model), to train a classifier that identifies clauses at a 0.90 F1 score.
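The story does not say which BERT implementation was used; as one possible sketch, fine-tuning could be done with the Hugging Face transformers library along these lines. The example texts and label indices below are hypothetical placeholders for the labeled dataset.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"
NUM_CLAUSES = 22

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CLAUSES)

class ClauseDataset(Dataset):
    """Wraps (text, label_id) pairs as tokenized tensors for the Trainer."""
    def __init__(self, texts, label_ids):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = label_ids

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Tiny placeholder examples; in practice these come from the fully labeled dataset.
train_texts = [
    "The Contractor shall indemnify the Company against all claims.",
    "Payment shall be made within thirty days of invoice receipt.",
]
train_label_ids = [11, 15]  # hypothetical indices into the 22 clause labels

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clause-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ClauseDataset(train_texts, train_label_ids),
)
trainer.train()
```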
Where can we use this dataset?
As mentioned in the first question, we will use this dataset to identify the clauses of a contract based on the prescribed playbook. Furthermore, we can build two more datasets that help complete the solution and provide an automated way to review contracts or tenders.
Here’s what a typical review system looks like:
This is how the system works. First, you upload the contract.
Once the contract is uploaded, the system fetches all the text and classifies it into the respective clauses.
Once we have the clauses, it looks them up in the playbook, finds the associated risks, and checks whether the written contract statement is compliant with the playbook.
Then it returns every detail it has found, based on the user’s requirements.
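To make that flow concrete, here is a hypothetical sketch of how those steps could be wired together in code; the playbook structure, risk levels, and the naive compliance check are all placeholders, since only the clause classifier is covered in this story.

```python
def review_contract(contract_sentences, clause_model, playbook):
    """Classify each sentence, look it up in the playbook, and report risk/compliance."""
    report = []
    for sentence in contract_sentences:
        clause = clause_model.predict([sentence])[0]   # 1. identify the clause
        entry = playbook.get(clause, {})               # 2. look up the playbook entry
        report.append({
            "sentence": sentence,
            "clause": clause,
            "risk": entry.get("risk", "unknown"),      # 3. risk level from the playbook
            # 4. naive placeholder compliance check against approved wording
            "compliant": sentence in entry.get("approved_wording", []),
        })
    return report

# Hypothetical usage with the baseline classifier and a toy playbook entry:
# playbook = {"Payment Terms": {"risk": "medium", "approved_wording": ["..."]}}
# print(review_contract(uploaded_sentences, baseline, playbook))
```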