Contract risk analysis is the process of identifying the potential risks associated with contracts and verifying whether the terms written in a contract follow the rules of the playbook. When there is a violation, it needs to be filtered out and placed in a separate document for further analysis or a redraft of the contract.
In a contract, we have to check what clause is written for an "act of God" event, what happens when there is a payment delay, under what circumstances the contract can terminate, and who is responsible in case of an accident or loss/damage. All of these clauses are mentioned in a contract, and when reviewing one it is essential to verify exactly which clauses are present and how each clause is worded. Are the clauses written according to what the company has specified, or is there a violation? We not only have to identify clause violations, we also need to determine the level of risk associated with each clause.
Whenever a company receives a lot of contracts, it has to verify whether those contracts follow the prescribed playbook rules. Evaluating thousands of such contracts manually, finding the associated risks, and writing down the exceptions by consulting the playbook again and again is a painful task for any lawyer. Moreover, this process requires a lot of time and effort to produce results.
Hence it is essential to use a system where the analysis of contracts is done within minutes, not hours or days. We need a process that identifies the risks, recognizes the associated clauses according to the playbook, and, more importantly, surfaces the high-level risks together with the playbook rules they relate to, so that lawyers can verify in one go what is wrong and what is right.
AI makes the data collection process easy and fast, where the manual process often takes days to months. From data collection and extraction to building models for risk-level detection and clause identification, and finally surfacing the applicable playbook rules for high-risk sentences, all of this is done within minutes.
We will follow these steps to build our contract risk analysis system:
The first thing we need to do is collect data: legal contract statements covering the clauses described above. In today's digital world, we can gather the required data by scraping the relevant web pages, as shown in the sketch after the list below.
Methods: Web Scraping, Data Collection, Data Storage, Data Management, Data Manipulation
Technology/Libraries used: Python, Pandas, Selenium, BeautifulSoup, Requests, JSON, CSV
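A minimal scraping sketch using Requests and BeautifulSoup, assuming the contract statements live in plain `<p>` tags on a hypothetical page (the URL and selector here are placeholders, not a real source):

```python
import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical source page; replace with the actual contract repository URL.
URL = "https://example.com/sample-contracts"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Assumption: each contract statement sits in its own <p> tag.
statements = [p.get_text(strip=True) for p in soup.find_all("p")]

# Store the raw statements for the downstream labeling steps.
with open("contract_statements.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["statement"])
    for s in statements:
        writer.writerow([s])
```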
In this dataset, we will label each contract statement with a level of risk. To keep it intuitive, we use three levels of risk.
Risks: Red, Yellow, and Blue
Red: These statements carry the highest risk, are the most prone to contract violations, and need to be verified carefully.
Yellow: These statements carry a mid-level of risk. They deserve a closer look, but are less urgent than the red ones.
Blue: These are low-risk sentences. They do not necessarily need to be verified, but can be if needed.
So to build this dataset, we will label each contract statement with its risk, following our earlier methodology for annotating data using machine learning and quality assurance systems. A sketch of what the labeled data might look like follows.
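An illustrative sketch of the risk-labeled dataset in Pandas; the statements and labels below are invented examples, and the real labels come from the annotation process:

```python
import pandas as pd

# Illustrative rows only; real labels come from the annotation process.
data = pd.DataFrame(
    {
        "statement": [
            "The supplier may terminate this agreement without notice.",
            "Payment is due within 45 days of invoice receipt.",
            "Notices shall be delivered in writing to the registered address.",
        ],
        "risk": ["Red", "Yellow", "Blue"],
    }
)

# Encode the three risk levels as integer class ids for model training.
risk_to_id = {"Red": 0, "Yellow": 1, "Blue": 2}
data["label"] = data["risk"].map(risk_to_id)
print(data)
```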
To build this dataset we have to formalize the whole playbook and build a mapping that ties each label to its complete rule. We will use the same dataset, but this time, instead of the clause or risk level, we will annotate the associated playbook rule numbers. A clause statement can have multiple rules associated with it, so we annotate every rule it follows, as in the sketch below.
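A sketch of the multi-rule annotation, assuming a hypothetical playbook of 20 numbered rules; since one statement can follow several rules, the target becomes a multi-label vector rather than a single class:

```python
import pandas as pd

# Illustrative mapping only; the rule ids refer to a hypothetical playbook.
annotations = pd.DataFrame(
    {
        "statement": [
            "The supplier may terminate this agreement without notice.",
            "Payment is due within 45 days of invoice receipt.",
        ],
        # A statement can follow several playbook rules, so rules are lists.
        "rule_ids": [[3, 12], [7]],
    }
)

# One-hot encode the rule ids into a multi-label target matrix.
num_rules = 20  # assumed total number of playbook rules
annotations["labels"] = annotations["rule_ids"].apply(
    lambda ids: [1.0 if r in ids else 0.0 for r in range(num_rules)]
)
```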
To build the above datasets, we will use a similar set of sentences but with multiple features. And we make sure to annotate such a dataset faster. Our annotation platform allows us to label such complex hierarchical datasets with ease and the data processing platform allows us to build the required dataset out of that.
As this involves scraped data and text fetched from unstructured formats such as DOCX and PDF files, it is essential to clean the data properly. We apply several text-cleaning methods to the raw data.
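A minimal cleaning sketch, assuming the python-docx and pdfminer.six libraries for extraction; the regex rules shown are generic examples of the kind of cleanup applied, not an exhaustive list:

```python
import re
from docx import Document                      # python-docx
from pdfminer.high_level import extract_text   # pdfminer.six

def read_docx(path: str) -> str:
    """Concatenate paragraph text from a .docx file."""
    return "\n".join(p.text for p in Document(path).paragraphs)

def read_pdf(path: str) -> str:
    """Extract raw text from a PDF."""
    return extract_text(path)

def clean_text(text: str) -> str:
    """Basic cleanup: drop control characters, rejoin words, collapse whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)  # stray control characters
    text = re.sub(r"-\n(\w)", r"\1", text)             # rejoin hyphenated line breaks
    text = re.sub(r"\s+", " ", text)                   # collapse runs of whitespace
    return text.strip()
```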
We will use transfer learning to build a language model on legal contracts. We take a pre-trained model, fine-tune it on our legal corpus, and obtain a language model trained on legal data that we can reuse in our downstream models for the best results. The result behaves like a model with knowledge of legal terms, contracts, and clauses, which we then train further on specific tasks.
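A sketch of this step with Hugging Face Transformers, assuming a BERT-style checkpoint and a plain-text legal corpus file (both names are placeholders); the masked-language-modeling objective adapts the model to legal vocabulary before any task-specific training:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"  # any BERT-style checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Assumption: the legal corpus is one contract sentence per line.
corpus = load_dataset("text", data_files={"train": "legal_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens so the model learns legal-domain language.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-lm", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("legal-lm")  # reused by the downstream classifiers
```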
At this stage, we train a multi-class classification model that identifies the playbook clause for a particular contract statement. We start from our earlier pre-trained model and train it with deep learning on the annotated clause classification dataset.
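A sketch of the clause classifier, loading the fine-tuned `legal-lm` checkpoint from the previous step and attaching a classification head; the tiny inline dataset and the clause ids are invented for illustration:

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

NUM_CLAUSES = 3  # assumed; one id per clause type in the playbook

# Tiny illustrative dataset; the real one is the annotated clause corpus.
raw = Dataset.from_dict({
    "text": [
        "Neither party shall be liable for delays caused by acts of God.",
        "Late payments accrue interest at 1.5% per month.",
        "Either party may terminate with 30 days written notice.",
    ],
    "label": [0, 1, 2],  # e.g. 0=force majeure, 1=payment, 2=termination
})

tokenizer = AutoTokenizer.from_pretrained("legal-lm")
encoded = raw.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=256), batched=True
)

# A fresh classification head is trained on top of the legal language model.
model = AutoModelForSequenceClassification.from_pretrained(
    "legal-lm", num_labels=NUM_CLAUSES
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clause-classifier", num_train_epochs=4),
    train_dataset=encoded,
)
trainer.train()
```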
Next, we build a model that identifies the risk level associated with a clause statement. Again, we train it on top of our pre-trained language model. Once trained, it tells us whether a statement carries a Red, Yellow, or Blue risk.
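Training mirrors the clause classifier above, only with the three risk labels as targets, so here is a sketch of the inference side instead; `risk-classifier` is the assumed output directory of that training run:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ID_TO_RISK = {0: "Red", 1: "Yellow", 2: "Blue"}

# Assumed checkpoint from a training run analogous to the clause classifier.
tokenizer = AutoTokenizer.from_pretrained("risk-classifier")
model = AutoModelForSequenceClassification.from_pretrained("risk-classifier")
model.eval()

def predict_risk(statement: str) -> str:
    """Map a contract statement to its Red/Yellow/Blue risk level."""
    inputs = tokenizer(statement, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    return ID_TO_RISK[int(logits.argmax(dim=-1))]

print(predict_risk("The supplier may terminate this agreement without notice."))
```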
The final model recognizes which playbook rules are associated with a particular clause and risk. We again fine-tune our language model, this time with the dataset prepared in the data annotation step, to recognize the rules associated with a statement. This model tells us which rules the playbook prescribes for a particular selected clause.
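Because a statement can match several rules at once, this is a multi-label problem; a sketch using Transformers' multi-label setup, with the rule count and threshold as assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_RULES = 20  # assumed total number of playbook rules

# Multi-label head: the model outputs an independent sigmoid score per rule,
# so several rules can fire for the same statement.
model = AutoModelForSequenceClassification.from_pretrained(
    "legal-lm",
    num_labels=NUM_RULES,
    problem_type="multi_label_classification",
)
tokenizer = AutoTokenizer.from_pretrained("legal-lm")
model.eval()

def matched_rules(statement: str, threshold: float = 0.5) -> list[int]:
    """Return the ids of all playbook rules scoring above the threshold."""
    inputs = tokenizer(statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return [i for i, p in enumerate(probs) if p >= threshold]
```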
Once we have all the models trained, we will deploy them to build the end-to-end solution. The final deployed pipeline works as follows:
Once all the models have processed a contract, the system returns the highlighted risks with their color codes, and we can see which playbook rule is violated based on the risk each statement poses.
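A hedged sketch of how the deployed pipeline chains the three models; `predict_clause` is assumed to be analogous to the `predict_risk` helper above, and the color codes are simple terminal escapes standing in for the real UI highlighting:

```python
# Assumed helpers from the earlier steps:
#   predict_clause(s) -> clause label, predict_risk(s) -> "Red"/"Yellow"/"Blue",
#   matched_rules(s)  -> list of playbook rule ids.

RISK_COLOR = {"Red": "\033[91m", "Yellow": "\033[93m", "Blue": "\033[94m"}
RESET = "\033[0m"

def analyze_contract(statements, playbook):
    """Annotate each statement with its clause, risk level, and matched rules."""
    report = []
    for s in statements:
        report.append({
            "statement": s,
            "clause": predict_clause(s),                    # clause classifier
            "risk": predict_risk(s),                        # risk classifier
            "rules": [playbook[r] for r in matched_rules(s)],  # playbook model
        })
    return report

def print_report(report):
    """Color-code each statement by risk and list the playbook rules it touches."""
    for row in report:
        color = RISK_COLOR[row["risk"]]
        print(f"{color}[{row['risk']}]{RESET} {row['statement']}")
        for rule in row["rules"]:
            print(f"    playbook: {rule}")
```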