Why Contract/Tender Risk Analysis?
Contract risk analysis is the process of identifying the potential risks in a contract and verifying whether its terms follow the rules of the playbook. Any violation needs to be flagged and recorded in a separate document for further analysis or for redrafting the contract.
In a contract, we basically have to check what the clause says about an “act of God” event, what happens when a payment is delayed, under what circumstances the contract can terminate, and who is responsible in case of an accident, loss or damage. All of these clauses appear in a contract, and when reviewing one it is essential to verify exactly what the clauses are and how each one is worded. Are the clauses written according to what the company has specified, or is there a violation? We not only have to identify the violating clauses, we also need to find the level of risk associated with each clause.
So, whenever a company receives a lot of contracts, it has to check whether each one follows the prescribed playbook rules. Evaluating thousands of such contracts manually, finding the associated risks and writing down the exceptions while consulting the playbook again and again, is a painful task for any lawyer. Moreover, the process takes a lot of time and effort before it yields any result.
Therefore, it is essential to use a system that analyses contracts within minutes, not hours or days. We need a process that identifies the risks, recognizes the associated clauses according to the playbook and, more importantly, shows the high-risk statements together with their playbook rules, so that lawyers can verify in one go what is wrong and what is right.
AI-based Legal Contract Analysis
Using AI makes this work fast and easy, where a manual process often takes days to months. From extracting the data to detecting the risk level, identifying the clauses and showing the suitable playbook rules for high-risk sentences, the analysis of a contract can be done within a few minutes.
We will follow these steps to build our contract risk analysis system:
- Data Extraction
The first thing we need to do is collect the data: the legal contract statements that correspond to those clauses. In today’s digital world, we can get the required data by scraping the relevant web pages.
Methods: Web Scraping, Data Collection, Data Storage, Data Management, Data Manipulation
Technology/Libraries used: Python, Pandas, Selenium, BeautifulSoup, Requests, JSON, CSV
- On the internet you will find both good and bad information, so it is essential to do a quality check before starting to scrape. Identifying good sources is key to building a quality dataset.
- That is why, with the help of our annotation team, which includes legal professionals, we did rigorous research to find out which websites have the best information and quality content that we can legally scrape. Once we identified the necessary web pages, we did a thorough analysis of their structure.
- With a clear idea of what to scrape and from where, we scraped the websites using Python with Selenium, the Requests library and BeautifulSoup.
- Once the data is scraped, we need to put it into a storage system. Apart from the scraped data, we also fetched data from many contracts that were previously used in industry. Finally, we store everything securely in JSON and CSV formats in our cloud storage, ready to be processed for the annotation work.
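The storage step above can be sketched with the Python standard library alone. This is a minimal illustration, not our production code: the field names (source_url, clause_hint, text), the file names and the use of a local folder instead of cloud storage are all assumptions made for the example.

```python
import csv
import json
from pathlib import Path

# Hypothetical record layout: one scraped clause sentence per record.
FIELDS = ["source_url", "clause_hint", "text"]


def save_records(records, out_dir):
    """Write the same records to records.json and records.csv in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # JSON keeps the records as a list of objects.
    json_path = out / "records.json"
    with json_path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)

    # CSV holds the same fields as columns; newline="" is what the csv
    # module expects so that it controls line endings itself.
    csv_path = out / "records.csv"
    with csv_path.open("w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)

    return json_path, csv_path


def load_csv(csv_path):
    """Read the CSV back as a list of dicts (all values are strings)."""
    with Path(csv_path).open(encoding="utf-8", newline="") as fh:
        return list(csv.DictReader(fh))
```

Keeping both formats is convenient in practice: the CSV loads directly into Pandas for inspection, while the JSON copy can later hold nested annotation fields.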
- Data Annotation
To build the contract risk analysis system we will require the following labeled data.
- Clause Classification Dataset
To build this dataset we will prepare the fetched clause sentences and arrange them into multiple files. Then, with our expert annotators, we will verify whether each sentence carries the right clause label. Once a few thousand examples are confirmed, we will train a model on them to speed up annotation. Using the trained model we will keep improving the annotation in each iteration, and using our quality check methods we will keep catching wrongly labeled data.
- Risk Level Identification Dataset
In this dataset, we will label each contract statement with a level of risk. To make it intuitive, we defined three levels of risk.
Risks: Red, Yellow and Blue
Red: These are the statements that carry the highest risk. They are the most prone to contract violations and need to be verified carefully.
Yellow: These statements are not that risky. We can say they carry a mid-level risk.
Blue: These are the low-risk sentences. They need not be verified, but they can be verified if needed.
So, to build this dataset, we need to label each contract statement with its risk level. We will follow our earlier methodology to annotate the data, using machine learning and quality assurance systems.
- Associated Playbook Rule Recognizer
To build this dataset we have to formalize the whole playbook and build a mapping from each label to the complete rule. We will use the same dataset, but this time, instead of the clause or risk level, we will annotate the associated playbook rule number. A clause statement can have multiple rules associated with it, so we will annotate every rule that applies to it.
To build the above datasets, we will use a similar set of sentences, each carrying multiple labels. To annotate such a dataset quickly, our annotation platform lets us label complex hierarchical datasets with ease, and our data processing platform lets us build the required datasets from the result.
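One way to picture a single annotated sentence that carries all three labels is the record below. It is only a sketch: the class and field names, the rule numbers and the placeholder rule text are hypothetical, not our annotation platform's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class RiskLevel(Enum):
    RED = "Red"        # highest risk, verify carefully
    YELLOW = "Yellow"  # mid-level risk
    BLUE = "Blue"      # low risk


@dataclass
class LabeledStatement:
    text: str                 # the contract sentence
    clause: str               # clause label (multi-class: exactly one)
    risk: RiskLevel           # risk label (exactly one)
    rule_ids: List[str] = field(default_factory=list)  # zero or more rule numbers

    def rules(self, playbook: Dict[str, str]) -> List[str]:
        """Resolve the annotated rule numbers to the full rule text."""
        return [playbook[rule_id] for rule_id in self.rule_ids]


def unknown_rule_ids(statement: LabeledStatement, playbook: Dict[str, str]) -> List[str]:
    """Quality check: rule numbers that do not exist in the playbook mapping."""
    return [rule_id for rule_id in statement.rule_ids if rule_id not in playbook]


# Hypothetical playbook mapping: rule number -> full rule text (placeholders).
PLAYBOOK = {
    "R1": "<full text of playbook rule R1>",
    "R2": "<full text of playbook rule R2>",
}
```

Storing rule numbers rather than rule text keeps the annotation compact, and the mapping lets the full rule be shown later without re-annotating when the playbook wording changes.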
- Data Pre-Processing
Because the data is scraped from the web or fetched from unstructured formats such as DOCX and PDF, it is essential to clean it properly. We used several text cleansing methods to clean the data.
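As an illustration of what such cleansing can look like, here is a small sketch using only the standard library. The specific steps (Unicode normalization, re-joining words hyphenated across line breaks, dropping control characters, collapsing whitespace) are typical for text extracted from PDF and DOCX files, but they are our choice for this example rather than a complete list of the methods used.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize a sentence extracted from a scraped page, PDF or DOCX file."""
    # NFKC folds compatibility characters: ligatures such as "fi" (U+FB01)
    # become plain letters and non-breaking spaces become ordinary spaces.
    text = unicodedata.normalize("NFKC", raw)

    # Re-join words hyphenated across a line break ("termi-\nnation").
    # Caveat: this also merges genuinely hyphenated words split at a line end.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

    # Drop control and format characters (form feeds, soft hyphens, ...),
    # keeping newlines and tabs so the next step can treat them as whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )

    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

For example, clean_text("The termi-\nnation\u00a0clause") returns "The termination clause".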
- Model Development
- Language Model with Legal Data
We will use transfer learning to build a language model for legal contracts. We will take a pre-trained model and fine-tune it on our legal corpus, giving us a language model adapted to legal data that we can reuse in our later models to achieve better results. It behaves like a machine with knowledge of legal terms, contracts and clauses, which is then trained further on specific tasks to perform better.
- Clause Classification Model
In this stage we will train a multi-class classification model that identifies the playbook clause for a particular contract statement. To build it, we will start from our legal language model and train it on our annotated clause classification dataset using deep learning.
- Risk Level Identification Model
In this stage of model building, we will build a model that identifies the risk level associated with a clause statement. Again, we will train it starting from our pre-trained language model. Once trained, it will tell us whether a statement is a Red Risk, Yellow Risk or Blue Risk.
- Playbook Rule Recognizer Model
The final model recognizes which playbook rules are associated with a particular clause and risk. We will again train on top of our language model, using the dataset prepared in the data annotation step, to recognize the rules associated with a statement. This model tells us which rules the playbook prescribes for a particular selected clause.
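The three models differ in how their outputs are read. The clause and risk models pick exactly one label per statement, while a statement can be associated with several playbook rules at once. The sketch below shows one common way to decode both kinds of output, assuming each model emits one raw score (logit) per label. The function names, the sigmoid decoding and the 0.5 threshold are assumptions for illustration, not a description of our trained models.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_single(logits: Sequence[float], labels: Sequence[str]) -> str:
    """Single-label decoding (clause, risk level): take the highest score."""
    if len(logits) != len(labels):
        raise ValueError("need exactly one logit per label")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]


def decode_rules(logits: Sequence[float], rule_ids: Sequence[str],
                 threshold: float = 0.5) -> List[str]:
    """Multi-label decoding (playbook rules): keep every rule whose
    independent probability reaches the threshold."""
    if len(logits) != len(rule_ids):
        raise ValueError("need exactly one logit per rule")
    return [rid for rid, z in zip(rule_ids, logits) if sigmoid(z) >= threshold]
```

With a multi-label decoder the result can also be empty, which simply means that no playbook rule was recognized for that statement.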
- Inference and Deployment
Once all the models are trained, we will deploy them to build the end-to-end solution. The deployed system will work as follows:
- Upload your contract in the prescribed format.
- Once the contract is uploaded, the system will parse the document and fetch the sentences.
- The fetched sentences pass through the clause classification model, which labels each sentence with its clause label.
- Once we know the clause label for each sentence, the sentences pass through the second model, which identifies the risk level.
- Simultaneously, they pass through the playbook rule recognizer model, which fetches the rules associated with each clause, taking the clause label into account to find the relevant rules.
Once all the models have run, the system returns the highlighted risks with their colour codes, and we can see which playbook rule each statement violated based on the risk it poses.
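The flow above can be sketched as a small pipeline in which the three trained models are passed in as plain functions. Everything here is illustrative: the function names, the result layout and the naive sentence splitter are assumptions, and a real deployment would parse DOCX or PDF files and call the trained models instead.

```python
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.' or ';' followed by whitespace.
    Abbreviations such as 'No. 5' would need a proper sentence tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.;])\s+", text) if s.strip()]


def analyse_contract(
    text: str,
    clause_model: Callable[[str], str],            # sentence -> clause label
    risk_model: Callable[[str], str],              # sentence -> "Red" / "Yellow" / "Blue"
    rule_model: Callable[[str, str], List[str]],   # (sentence, clause) -> rule numbers
) -> List[Dict[str, object]]:
    """Run every sentence through the three models in the order described."""
    results = []
    for sentence in split_sentences(text):
        clause = clause_model(sentence)        # step 1: clause label
        risk = risk_model(sentence)            # step 2: risk level
        rules = rule_model(sentence, clause)   # step 3: rules, using the clause label
        results.append(
            {"sentence": sentence, "clause": clause, "risk": risk, "rules": rules}
        )
    return results


def high_risk(results: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only the Red statements, which a lawyer should review first."""
    return [r for r in results if r["risk"] == "Red"]
```

Passing the models in as callables keeps the pipeline independent of any particular framework, so each model can be retrained or replaced without touching the flow.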