Python is a great programming language, not only because of its simple syntax but also because of its broad applicability: from simple web apps to complex neural networks, from statistics to text analytics. Not to mention the amazing community support and the thousands of “wheels”, i.e. packages, one doesn’t need to reinvent.
Synch, the law firm behind this blog, employs Python as the core language for developing LegalTech tools. The key reason is how widely the language is used in machine learning, which Synch’s apps rely on.
For the most part, this post will be useful for lawyers who are curious about Python and how it works with text. Let’s get started.
1. Opening and reading a text file
with open("ppc_ai_privacy_policy.txt") as f:
    text = f.read()
Let’s see the first 100 characters of the text we extracted.
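Getting the first 100 characters is a single slice. In the snippet below, text is a short stand-in string (an assumption for illustration; in the post, text already holds the full file contents):

```python
# Stand-in for the document text read above (assumption: only the opening lines).
text = ("\nPrivacy\nPolicy\nfor users of privacypolicycheck.ai,"
        "\nand persons that are identified in submitted privacy policies\nHi there,")

text[:100]  # slice: all characters from index 0 up to (but not including) 100
```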
We use “slicing” here, i.e. getting all the elements (characters) in a string (text) up to a certain index. Every character has a unique index, and the first index is always 0. The output is the following:
'\nPrivacy\nPolicy\nfor users of privacypolicycheck.ai,\nand persons that are identified in submitted pri'
Notice that we did not use the built-in print() function, inspecting the raw representation of the text instead. That’s why we see \n, the newline characters, in the text. Let’s print the same first 100 characters and see what changes.
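Printing is just a matter of wrapping the slice in print(). Here text is again a short stand-in string (an assumption for illustration; in the post it holds the full file contents):

```python
# Stand-in for the document text read above (assumption: only the opening lines).
text = ("\nPrivacy\nPolicy\nfor users of privacypolicycheck.ai,"
        "\nand persons that are identified in submitted privacy policies\nHi there,")

print(text[:100])  # print() renders each \n as an actual line break
```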
The output is different, since print() turns each newline character into an actual line break:
Privacy
Policy
for users of privacypolicycheck.ai,
and persons that are identified in submitted pri
For the curious, we can check how many characters our text has, using the built-in len() function.
The output will be the total number of characters in the document.
2. Working with paragraphs
As long as we have the text, we can do anything with it. Why not calculate how many paragraphs the document has?
For this task, we will use the standard library re module which helps us find information using regular expressions, i.e. text patterns.
Let’s define the paragraph pattern first. The simplest way to extract paragraphs is to split the text on the newline delimiter we saw above and then filter out the “empty” paragraphs, i.e. the empty strings produced by repeated newlines or by newlines at the very beginning or end of the document.
para_delimiter_regex = "\\n"
Notice the extra \: in a regular Python string literal, "\\n" produces the two characters \ and n, which the regular expression engine then interprets as the newline pattern. Special characters such as the backslash are “escaped” this way, by prepending another backslash (a raw string, r"\n", would express the same pattern).
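A quick check (not from the original post) shows that the two spellings really are the same two-character string:

```python
pattern = "\\n"           # two characters: a backslash followed by the letter n
print(len(pattern))       # → 2
print(pattern == r"\n")   # a raw string spells the same pattern, → True
```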
Now, let’s split the text with this delimiter using the above-mentioned re module.
import re

paras = [para for para in re.split(para_delimiter_regex, text) if len(para)]
Let’s stop here for a minute. First, we imported the re module. Second, we used a list comprehension: it starts with the opening “[” and ends with the closing “]”. Inside it, we iterate over every paragraph returned by the re.split() function, which uses para_delimiter_regex (the newline delimiter) to split the text into paragraphs. Finally, we keep only the non-empty paragraphs, hence the use of if len(para), which means “if a paragraph’s length is more than 0”. The resulting list is assigned to the paras variable, so that we can work with it later.
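The same pattern is easy to watch on a toy string (the sample value below is made up for illustration):

```python
import re

sample = "Privacy\n\nPolicy"                 # two paragraphs separated by a blank line
parts = re.split("\\n", sample)                # → ['Privacy', '', 'Policy']
paras = [para for para in parts if len(para)]  # the comprehension drops the empty string
print(paras)                                   # → ['Privacy', 'Policy']
```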
Now we have the list of paragraphs in the document. Let’s see how many there are.
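len() works on lists just as it does on strings. With a shortened stand-in text (an assumption for illustration) the whole pipeline is:

```python
import re

# Stand-in text (assumption): three paragraphs plus a leading newline.
text = "\nPrivacy\nPolicy\nfor users of privacypolicycheck.ai,"
paras = [para for para in re.split("\\n", text) if len(para)]
print(len(paras))  # → 3
```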
The output is the number of paragraphs in the document.
Let’s print the first 5 paragraphs in the list.
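Slicing works on lists exactly as it did on strings. With a short stand-in text (an assumption for illustration), the call looks like this:

```python
import re

# Stand-in for the document text (assumption: only the opening lines).
text = ("\nPrivacy\nPolicy\nfor users of privacypolicycheck.ai,"
        "\nand persons that are identified in submitted privacy policies\nHi there,")
paras = [para for para in re.split("\\n", text) if len(para)]
print(paras[:5])  # slicing a list: its first five elements
```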
The output is a list of the first five paragraphs in the document:
['Privacy', 'Policy', 'for users of privacypolicycheck.ai,', 'and persons that are identified in submitted privacy policies', 'Hi there,']
Let’s go a bit further and calculate the average paragraph length. To achieve this, we need the sum of all paragraphs’ lengths divided by the number of paragraphs. We will also round the result using the built-in round() function to get rid of the fractional part.
average_para_length = sum([len(para) for para in paras]) / len(paras)
average_para_length = round(average_para_length)
print(average_para_length)
Notice that we have used the built-in sum() function to calculate the sum of all numbers (paragraph lengths) in a list. Then we used the / arithmetic operator for division.
The output is:

162

which means that our paragraphs are 162 characters long on average.
Let’s calculate the minimum and maximum paragraph lengths in the document. This is even simpler thanks to the built-in min() and max() functions.
min_para_length = min([len(para) for para in paras])
max_para_length = max([len(para) for para in paras])
print(min_para_length)
print(max_para_length)
The output is:

3
646

meaning that the shortest paragraph has 3 characters, while the longest has 646.
3. Extracting emails and URLs
import re

email_regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
matched_emails = re.finditer(email_regex, text)
extracted_emails = [m.group(0) for m in matched_emails]
print(extracted_emails)
After importing the re module, we define the email regular expression (text pattern) and use it in the re.finditer() function which returns all the matches of the pattern in the text as one object (iterator) assigned to the matched_emails variable. Then, we iterate through this object and put each matched email in a list (starts with the opening “[” and ends with the closing “]”). The list is assigned to (stored in) the extracted_emails variable.
When we print the result, the output is a list with a single extracted email in it.
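For readers who want to experiment, the same re.finditer()/m.group(0) pattern works on any string; the addresses below are invented for illustration:

```python
import re

email_regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
sample = "Write to alice@example.com or bob@example.org for details"  # invented addresses
found = [m.group(0) for m in re.finditer(email_regex, sample)]
print(found)  # → ['alice@example.com', 'bob@example.org']
```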
Let’s do the same for URLs. Note that the re module has already been imported.
url_regex = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
matched_urls = re.finditer(url_regex, text)
extracted_urls = [m.group(0) for m in matched_urls]
print(extracted_urls)
Similarly, the output will be a list with a single URL in it.
The above steps show how, in just a few lines of Python code, we can extract interesting and useful information from the text of a legal document.