Extracting Entities From Legal Text Using Python and spaCy

In this post, we will see how easy it is to use Python for extracting entities from a piece of legal text.

For the most part, this post will be useful for lawyers who are curious about Python and how it works with text. Let’s get started.

Python is very powerful not only because of its versatility and simple syntax, but also because of its huge community, which has produced a vast library of third-party packages. One such package is spaCy, an open-source software library for natural language processing (NLP), which we will use for this task.

Let’s say we have a random sentence from a legal document:

text = "This Agreement is executed in two counterparts on 25 September 2018 between Joseph Watts, representing Microsoft Inc., and E. V. Jovanovich, representing Google Inc."

We assign the string representing our sentence to the variable text.

Let’s see what entities spaCy can extract from this sentence.

import spacy


nlp = spacy.load("en_core_web_sm")

tagged_text = nlp(text)

extracted_entities = [(i.text, i.label_) for i in tagged_text.ents]

print(extracted_entities)

We import the spaCy module and load one of its English NLP models. We then process our text with the loaded model and store the result in the tagged_text variable. Next, we iterate through the result’s ents attribute to collect each entity’s text and label, and print the resulting list of (entity text, entity label) pairs. The output will be:

[('two', 'CARDINAL'), ('25', 'CARDINAL'), ('September 2018', 'DATE'), ('Joseph Watts', 'PERSON'), ('Microsoft Inc.', 'ORG'), ('E. V. Jovanovich', 'PERSON'), ('Google Inc.', 'ORG')]

As we can see, the date 25 September 2018 was not extracted completely: 25 was labeled separately as a CARDINAL number. We can add some more logic to our code to make sure that when a CARDINAL entity directly precedes a DATE entity, the number is prepended to (i.e. becomes part of) the DATE entity.

import re


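for current_index, (entity, label) in enumerate(extracted_entities):
    next_entity_index = current_index + 1
    # A CARDINAL in the last position has no following entity to merge with.
    if label == "CARDINAL" and next_entity_index < len(extracted_entities):
        next_entity, next_label = extracted_entities[next_entity_index]
        if next_label == "DATE":
            combined_entity = entity + " " + next_entity
            # Escape the text so that characters such as "." match literally.
            combined_entity_pattern = re.escape(combined_entity)
            if re.search(combined_entity_pattern, text):
                extracted_entities[next_entity_index] = (
                    combined_entity, next_label)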

print(extracted_entities)

We iterate through the extracted_entities list, checking each entity’s label in turn. If the label is CARDINAL, we look up the next entity using the current entity index + 1 and check its label. If that label is DATE, we assume that the current CARDINAL entity is part of the DATE entity, and build a search pattern from the combined text (the number, a space, and the date), which we store in the combined_entity_pattern variable.

This is only an assumption: if, for example, the extracted number is located much earlier in the text than the extracted date, combining the two would not be correct. One way to check is to use the pattern matching functionality of the re standard library module: if the pattern matches somewhere in the analyzed text, we can be more confident that the CARDINAL entity is part of the succeeding DATE entity. In that case, i.e. when re.search() returns a match, we prepend the CARDINAL entity to the DATE entity by replacing the latter’s entry in the list: the label stays DATE, but the text becomes the combined entity.

Note that if the date started with 25th, i.e. an ordinal number, we would simply check for the ORDINAL label instead.

Let’s see what result we get now:

[('two', 'CARDINAL'), ('25', 'CARDINAL'), ('25 September 2018', 'DATE'), ('Joseph Watts', 'PERSON'), ('Microsoft Inc.', 'ORG'), ('E. V. Jovanovich', 'PERSON'), ('Google Inc.', 'ORG')]
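The list above still carries the stand-alone ('25', 'CARDINAL') entry next to the merged date. A minimal sketch of one way to tidy this up, assuming we wrap the same check in a helper that builds a new list and drops the absorbed number. The name merge_numbers_into_dates and its number_labels parameter are illustrative and not part of spaCy; the entity list is copied from the first output above.

```python
import re

text = ("This Agreement is executed in two counterparts on 25 September 2018 "
        "between Joseph Watts, representing Microsoft Inc., and "
        "E. V. Jovanovich, representing Google Inc.")

# Entities as printed by the first extraction step.
extracted_entities = [
    ("two", "CARDINAL"), ("25", "CARDINAL"), ("September 2018", "DATE"),
    ("Joseph Watts", "PERSON"), ("Microsoft Inc.", "ORG"),
    ("E. V. Jovanovich", "PERSON"), ("Google Inc.", "ORG"),
]


def merge_numbers_into_dates(entities, source_text,
                             number_labels=("CARDINAL", "ORDINAL")):
    """Return a new list in which a number directly followed by a DATE is merged into it."""
    merged = []
    index = 0
    while index < len(entities):
        entity, label = entities[index]
        if label in number_labels and index + 1 < len(entities):
            next_entity, next_label = entities[index + 1]
            combined = entity + " " + next_entity
            # re.escape makes characters such as "." match literally.
            if next_label == "DATE" and re.search(re.escape(combined), source_text):
                merged.append((combined, next_label))
                index += 2  # skip the DATE we have just absorbed
                continue
        merged.append((entity, label))
        index += 1
    return merged


merged_entities = merge_numbers_into_dates(extracted_entities, text)
print(merged_entities)
```

Returning a new list instead of editing the original in place keeps the raw spaCy output available for comparison.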

We observe that the date 25 September 2018 is now fully extracted. Note that we could apply the same trick to PERSON and ORG entities, checking how close they are to each other in the text to deduce who represents which organization or company.
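One possible way to sketch this proximity idea is to locate each entity in the text and pair a PERSON with the ORG that follows it when only a short stretch of text lies between them. The helper pair_people_with_orgs and its max_gap threshold are assumptions made for illustration; real contracts would need more careful rules.

```python
text = ("This Agreement is executed in two counterparts on 25 September 2018 "
        "between Joseph Watts, representing Microsoft Inc., and "
        "E. V. Jovanovich, representing Google Inc.")

# Entities in the order in which they appear in the text.
extracted_entities = [
    ("two", "CARDINAL"), ("25", "CARDINAL"), ("September 2018", "DATE"),
    ("Joseph Watts", "PERSON"), ("Microsoft Inc.", "ORG"),
    ("E. V. Jovanovich", "PERSON"), ("Google Inc.", "ORG"),
]


def pair_people_with_orgs(entities, source_text, max_gap=20):
    """Pair each PERSON with an ORG that starts within max_gap characters after it."""
    pairs = []
    search_from = 0
    previous = None  # (text, label, end offset) of the last entity we located
    for entity, label in entities:
        start = source_text.find(entity, search_from)
        if start == -1:
            continue  # entity text not found after the previous one; skip it
        end = start + len(entity)
        if (label == "ORG" and previous is not None
                and previous[1] == "PERSON"
                and start - previous[2] <= max_gap):
            pairs.append((previous[0], entity))
        previous = (entity, label, end)
        search_from = end
    return pairs


representatives = pair_people_with_orgs(extracted_entities, text)
print(representatives)
# [('Joseph Watts', 'Microsoft Inc.'), ('E. V. Jovanovich', 'Google Inc.')]
```

In our sentence each name is separated from its company only by ", representing ", so a small gap limit is enough to pair them.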

We can now show our results to the user.

relevant_labels = ["DATE", "PERSON", "ORG"]

for relevant_label in relevant_labels:
    print("Extracted for label: " + relevant_label)
    for entity, label in extracted_entities:
        if label == relevant_label:
            print("- " + entity)
    # Print one blank line after each group of entities.
    print()

Let’s say we are interested in all labels except for CARDINAL, which by itself doesn’t tell us much. We create the relevant_labels list with the labels we need, then iterate through this list and, for each label, print only those entities from the extracted_entities list that carry it.

The output will be as follows:

Extracted for label: DATE
- 25 September 2018

Extracted for label: PERSON
- Joseph Watts
- E. V. Jovanovich

Extracted for label: ORG
- Microsoft Inc.
- Google Inc.
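The nested loops above rescan the whole entity list once per label. As an alternative sketch, assuming the entity list from the second output above, the entities can be grouped in a single pass with collections.defaultdict from the standard library. The name entities_by_label is illustrative.

```python
from collections import defaultdict

# Entity list as printed after the date fix.
extracted_entities = [
    ("two", "CARDINAL"), ("25", "CARDINAL"), ("25 September 2018", "DATE"),
    ("Joseph Watts", "PERSON"), ("Microsoft Inc.", "ORG"),
    ("E. V. Jovanovich", "PERSON"), ("Google Inc.", "ORG"),
]
relevant_labels = ["DATE", "PERSON", "ORG"]

# Group entity texts by label in one pass over the list.
entities_by_label = defaultdict(list)
for entity, label in extracted_entities:
    if label in relevant_labels:
        entities_by_label[label].append(entity)

for label in relevant_labels:
    print("Extracted for label: " + label)
    for entity in entities_by_label[label]:
        print("- " + entity)
    print()
```

The grouped dictionary is also easier to pass on to other code, for example to fill in a form or a spreadsheet.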

It’s not always this easy, however. Legal texts use capitalized defined terms a lot, which makes it harder for spaCy to recognize named entities: a capitalized term may be tagged as ORG or PERSON, for example. Conversely, a company name does not always start with a capital letter, in which case spaCy may not tag it as ORG at all. Misprints and bad formatting add to the complexity of entity extraction. On top of that, spaCy is not always correct: a company name might be tagged as PERSON, and vice versa. At the same time, spaCy provides a powerful foundation for basic entity extraction which, coupled with well-written custom logic, results in a useful tool.
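As one small illustration of such custom logic, the sketch below drops PERSON and ORG entities whose text is a known defined term of the contract. Both the defined_terms set and the sample entity list are hypothetical, made up to show the idea rather than taken from real spaCy output.

```python
# Hypothetical defined terms that a contract capitalizes but that are not names.
defined_terms = {"Agreement", "Party", "Effective Date", "Services"}

# Hypothetical tagger output in which two defined terms were mislabeled.
tagged_entities = [
    ("Agreement", "ORG"),
    ("Joseph Watts", "PERSON"),
    ("Party", "PERSON"),
    ("Microsoft Inc.", "ORG"),
]


def drop_defined_terms(entities, terms, labels=("PERSON", "ORG")):
    """Remove entities with the given labels whose text is a defined term."""
    lowered_terms = {term.lower() for term in terms}
    return [
        (entity, label)
        for entity, label in entities
        if not (label in labels and entity.lower() in lowered_terms)
    ]


cleaned_entities = drop_defined_terms(tagged_entities, defined_terms)
print(cleaned_entities)
# [('Joseph Watts', 'PERSON'), ('Microsoft Inc.', 'ORG')]
```

In practice the list of defined terms could be collected from the definitions clause of the agreement being analyzed.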

Note that for more advanced legal entity extraction, e.g. of a liability cap or an agreement duration, we would need to train our own entity classifier with custom labels, e.g. “LIABILITY_CAP” and “AGREEMENT_DURATION”, using a considerable amount of manually annotated data.
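To give a feel for what such annotated data might look like, the sketch below marks labeled spans by their character offsets, a layout commonly used for entity annotations. The clause, the helper annotate, and the chosen spans are hypothetical examples, and the exact format a training pipeline expects should be checked against the spaCy documentation.

```python
def annotate(sentence, labeled_phrases):
    """Return (sentence, {"entities": [(start, end, label), ...]}) for the phrases."""
    spans = []
    for phrase, label in labeled_phrases:
        start = sentence.find(phrase)
        if start == -1:
            raise ValueError("Phrase not found: " + phrase)
        spans.append((start, start + len(phrase), label))
    return sentence, {"entities": spans}


# Hypothetical clause, invented for illustration only.
sentence = ("The total liability of each party shall not exceed EUR 100,000 "
            "and this Agreement remains in force for three years.")

example = annotate(sentence, [
    ("EUR 100,000", "LIABILITY_CAP"),
    ("three years", "AGREEMENT_DURATION"),
])

# Show each annotated span next to the text it covers.
for start, end, label in example[1]["entities"]:
    print(label, "->", sentence[start:end])
```

A real training set would contain many such examples, each checked by a person who understands the clause being labeled.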

Sergii Shcherbak

Head of Software Development @ Synch