GPT-3, Esq? Evaluating AI Legal Summaries

David Victor Rodriguez

February 17, 2021

Introduction

GPT-3’s emergence as a state-of-the-art natural language processing algorithm has drawn headlines suggesting that lawyers are soon to be replaced. As a lawyer who spent the last year studying machine learning, I decided to put GPT-3 to the test as a legal summarizer to evaluate that claim. In this experiment, I input three excerpts of legal texts into GPT-3 to summarize:

LinkedIn’s Privacy Policy, an Independent Contractor Non-Disclosure Provision, and the hotly debated 47 U.S. Code § 230 (“Section 230”).

I conducted all of these experiments using the most advanced GPT-3 “davinci” model variation (accessible as a setting in the GPT-3 API). I varied the temperature hyperparameter, which controls how much randomness GPT-3 will output (measured from 0 to 1, low to high randomness), and recorded the temperature ranges with substantially different outputs. I also tested two different “instructions” per passage, pulled from the official GPT-3 documentation:

GPT-3 Summarizer Instructions #1: “My ten-year-old asked me what this passage means, I rephrased it for him, in plain language a ten-year-old can understand”

GPT-3 Summarizer Instructions #2: “tl;dr” (too long, didn’t read)

The instructions are priming text fed to GPT-3 to prompt the proper output form, which in this case is a summary (albeit not specifically a legal one).
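For readers who want to see how these pieces fit together, here is a minimal sketch of the kind of API call involved, written against the openai Python client available at the time. The prompt layout, token limit, and placeholder passage are illustrative assumptions rather than a record of my exact calls.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied from your own OpenAI account

# The "instruction" is priming text attached to the legal passage; the model's
# completion is treated as the summary.
instruction = ("My ten-year-old asked me what this passage means, "
               "I rephrased it for him, in plain language a ten-year-old can understand:")
passage = "Your profile is fully visible to all Members and customers of our Services..."

response = openai.Completion.create(
    engine="davinci",        # most advanced GPT-3 model variation at the time
    prompt=f"{passage}\n\n{instruction}\n",
    temperature=0.8,         # varied between 0 and 1 across runs
    max_tokens=100,          # illustrative cap on summary length
)

print(response.choices[0].text.strip())
```

Swapping the instruction string for “tl;dr” and sweeping the temperature value reproduces the structure of the runs described below.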

My aim in this experiment was to evaluate some of the strengths and weaknesses of GPT-3 as a legal text summarizer with different types of legal texts. In the Improving GPT-3 section, I discuss some ways GPT-3 could be improved for the legal use case. To follow the process, start by reading the Input Text (the legal text I input into GPT-3) and compare it to the GPT-3 Output (what GPT-3 returns) before reading My Comments.

LinkedIn Privacy Policy

Input Text:  

Your profile is fully visible to all Members and customers of our Services. Subject to your settings, it can also be visible to others on or off of our Services (e.g., Visitors to our Services or users of third-party search engines). As detailed in our Help Center, your settings, degree of connection with the viewing Member, the subscriptions they may have, their usage of our Services, access channels and search types (e.g., by name or by keyword) impact the availability of your profile and whether they can view certain fields in your profile.

GPT-3 Output - “10yr-old” Summary:

1.    @Temperature .8: “Anyone, even people you don't know, can find out about you on MommyMeet and find out about the relationships you have with other adults. They can also use a special computer to find out what street you live on and the town you live in. If they want to meet you, they can do that.”

2.    @Temperature .2: “Your profile is fully visible to all members of the site, and to anyone who uses a search engine to find the site. Your profile can be seen by anyone who uses the site, and by anyone who uses a search engine to find the site.”

3.    @Temperature 0: “If you don't want to be found, don't use the site.”

GPT-3 Output - “tl;dr” Summary:

1.    @Temperature .8: “You can hide from the search engines (and Facebook and Google), but not from on-site users.”

2.    @Temperature .2: “If you’re on Tinder, you’re on Tinder.”

My Comments:

10yr-old Summary

The “10yr-old” summary @Temperature .8 is filled with overstatements. GPT-3 incorrectly: (1) referred to the platform in question as “MommyMeet” even though the input text was from the LinkedIn Privacy Policy; (2) expanded the scope of who can find out information about the user’s profile from “all Members” to “Anyone,” even though, if anyone can sign up for MommyMeet, that may be true in effect; (3) claimed that “they” can meet you, the User, if they wish; (4) inferred that Users’ addresses would be disclosed to others; and (5) claimed that profiles were viewable through a “special computer.”

The MommyMeet errors point to GPT-3 reading beyond the lines of the input text to make its summary text prediction. MommyMeet and LinkedIn are not related, but perhaps there is some text in common between a MommyMeet policy in GPT-3’s training data and LinkedIn’s privacy policy. Legal language is often reused, so overfitting on certain reliable training data could benefit prediction accuracy by remaining consistent. On the other hand, if the prediction references text beyond the range of the input text in a similar, but not identical, segment of the training data, the prediction can fail, as it did here by referencing an unrelated “MommyMeet” policy. I will return to this trade-off in the Improving GPT-3 section.

The other errors @Temperature .8 signal a tendency to make attention-grabbing overstatements. Expanding the scope from “all Members” to “Anyone” is the way a 10-year-old might sensationalize a creepy subject like privacy. The summary is even more visceral with unsupported claims that “they” can meet Users and find out where they live. The source of these errors is less clear than the MommyMeet references, but the result is a summary that deemphasizes accuracy in favor of the type of claims that may interest a 10-year-old.

Although the 10yr-old summary at lower temperatures (@Temperature 0-.2) contained fewer mistakes by avoiding the overstatements found @Temperature .8, GPT-3 omitted an important caveat @Temperature .2 in stating “Your profile is fully visible to all members of the site, and to anyone who uses a search engine to find the site.” The Privacy Policy instead indicates that users can control who can see their profile. Default and configurable settings (opt-in/opt-out) are major issues in privacy policies, making that omission a significant one.

@Temperature 0, GPT-3 outputted the most accurate advice: “If you don't want to be found, don't use the site.” But the same could be said about almost any use of the internet. Furthermore, it is not very practical given that there is no substitute platform for LinkedIn, which is a vital resource for career advancement. It also does not explain why users are likely to be found if they use the site. Such an explanation may be beyond the scope of the instruction, but a good legal summary would explain this to allow users to make informed decisions.

tl;dr Summary

The tl;dr GPT-3 Instruction prompted summaries that were far less helpful and accurate than the “10yr-old” summaries. @Temperature .8, GPT-3 outputted “You can hide from the search engines (and Facebook and Google),” which may be true, but the summary again left out the key condition, “Subject to your settings.” Curiously, there is no mention of Google or Facebook in the Privacy Policy. GPT-3 may have inferred that “customers of our Services” and/or “third-party search engines” included them.

@Temperature .2, GPT-3 mistook the Privacy Policy to be from Tinder and provided no relevant information. Even though the Temperature parameter is a randomness value measured from low to high (0-1), the lower-value Temperature here produced a more random output than the higher value.

Independent Contractor Non-Disclosure Provision  

Input Text:  

Independent Contractor (including its affiliates, employees, agents and consultants) shall maintain in strict confidence for a period of One Year from today and not disclose any Confidential Information it receives from Client to any third party or use the Confidential Information for its own or any other party’s benefit, except in furtherance of its obligations to Client pursuant to any business transaction it may enter into with Client. Independent Contractor shall use, as a minimum, the same degree of care to avoid disclosure or use of the Confidential Information as it employs with respect to its own confidential, proprietary and secret information of like importance, but in any case, using no less than a reasonable degree of care. Independent Contractor shall limit access to all Confidential Information to only those of Independent Contractor’s personnel, agents and representatives who “need to know” such information for carrying out Independent Contractor’s obligations to Client pursuant to any business transaction it may enter into with Client and the Confidential Information will be used only for carrying out Independent Contractor’s obligations to Client pursuant to any business transaction it may enter into with Client. Independent Contractor shall ensure that its affiliates, employees, officers, directors, owners, agents, consultants, and representatives who are given access to the Confidential Information by or on behalf of Independent Contractor shall be bound by and shall comply with the terms of this Agreement.

GPT-3 Output - “10yr-old” Summary:

1.    @Temperature .5: “Don't tell anyone about this. Don't try to do anything with it. Don't talk about it at school.”

2.    @Temperature .2: “You can't tell anyone about this information, not even your mom or dad.”

GPT-3 Output - “tl;dr” Summary:

1.    @Temperature .9: “This is the section that demands you don’t cheat, and if you do, we’ll know and we’ll steal your candy. !!!Keep the arse low key and low profile!!!”

2.    @Temperature .5: “You’re not allowed to use the information you get from me for your own benefit. I’m not sure if this is a good idea or not. I’m not sure if it’s a good idea to have a non-compete clause in a contract with a freelancer.”

3.    @Temperature .3: “We’re not going to share your stuff with anyone else.”

4.    @Temperature .1: “i am a good person and i will never sell your data.” (in German, translated via Google Translate)

5.    @Temperature 0: “employees and partners of the company should not transfer information to the client and should not use it for their own benefit.” (in Russian, translated via Google Translate)

My Comments:

10yr-old Summary

The “10yr-old” summary, regardless of temperature, oversimplified the language of the agreement down to instructions to “[not] tell anyone about this” and “[not] do anything with it,” even though the agreement states that the Independent Contractor should limit access to those who “need to know” and exercise “no less than a reasonable degree of care.” The nuance of the contractual language was reduced too much, although the Independent Contractor would be in compliance if it followed the summary’s guidance. In context, this oversimplification is consistent with the “10yr-old” summary instructions. A ten-year-old would likely only need to know what to do to comply, as one could assume they are not capable of understanding the subtleties of a “need to know” basis.

This raises an interesting question of precisely what associations GPT-3 makes with these instructions. The salient questions to a human might be: what is a ten-year-old’s level of understanding of these concepts, and accordingly, how should I, the summarizer, phrase this in a way he or she will understand? The first question limits the universe of concepts; the second focuses the word choice.

Determining how GPT-3 calculates this judgment call is challenging, as GPT-3 has 175 billion parameters. Adhering to the “10yr-old” instructions may narrow GPT-3’s prediction to a more limited pool of words and phrases associated with ten-year-olds, in 175 billion different ways. Although I cannot tell which parameters GPT-3 used, the output here appears to be simple commands a ten-year-old could follow, peppered with contextual references to “mom or dad” and “school” for emphasis. The output reads more like a scare tactic than a summary.

tl;dr Summary

The “tl;dr” summaries varied more throughout the temperature range, but were not any more accurate, and most were nonsensical. @Temperature .9, GPT-3 produced a much more random response as anticipated, threatening to steal candy for noncompliance. @Temperature .5, the summary accurately captured the duty of loyalty, “[The Independent Contractor shall not] use the Confidential Information for its own or any other party’s benefit,” but then proceeded to opine as an advisor: “I’m not sure if this is a good idea or not.” The reason for the shift between these voices and roles is unclear, as the instructions remained unchanged. The “tl;dr” instructions do not appear to prompt GPT-3 to output in a consistent voice, whereas the “10yr-old” instructions show a pattern of using the second-person pronoun “you” and writing as an authoritative advisor.

At lower temperatures (@Temperature 0-.1), GPT-3 outputted text in Russian and German and wrote using first-person pronouns, as if it were the opposing party in the contract. Clearly, randomness is not tracking Temperature here, as the higher-Temperature outputs properly contained English text. In the English translation (via Google Translate), GPT-3 made a critical error @Temperature 0. It outputted “employees and partners of the company should not transfer information to the client,” whereas the original agreement states that the Independent Contractor should not disclose “except in furtherance of its obligations to Client” (emphasis added).

47 U.S. Code § 230 (“Section 230”)

Input Text:

(c) Protection for “Good Samaritan” blocking and screening of offensive material

(1) Treatment of publisher or speaker

No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.

(2) Civil liability

No provider or user of an interactive computer service shall be held liable on account of—

(A) any action voluntarily taken in good faith to restrict access to or availability of material that the provider or user considers to be obscene, lewd, lascivious, filthy, excessively violent, harassing, or otherwise objectionable, whether or not such material is constitutionally protected; or

(B) any action taken to enable or make available to information content providers or others the technical means to restrict access to material described in paragraph (1).

GPT-3 Output - “10yr-old”:

1.    @Temperature .9: “You should not be forced to read offensive words, nor have to see them, because doing that might hurt your feelings.” Finally, she specifically asked me: “Then why are those words even allowed?”

2.    @Temperature .3: “You can’t be sued for taking down a picture that someone else put up that you think is mean or rude.”

GPT-3 Output - “tl;dr”:

1.    @Temperature .9: “no one can be sued for owning a server that shows content another side doesn't like, if they are doing it only for the purpose of not hosting that content themselves.”

2.    @Temperature .3: “Internet Service Providers (ISPs) are not liable for the content of their users, even if they know that the content is illegal.”

3.    @Temperature .1: “ISPs are not liable for content they don't create.”

My Comments:

10yr-old Summary

The “10yr-old” summary @Temperature .9 was less accurate than at lower temperatures but captured some of the key text. It misconstrued the “Good Samaritan” protection, which shields providers of an “interactive computer service” from liability for the content on their platforms, to mean “You should not be forced to read offensive words.” It then mused, “Then why are those words even allowed?” That musing may be the thought of a 10-year-old on the policy of the law, but a good summary ought to answer that question (because the internet is decentralized, for example).

It is noteworthy that the input text I fed GPT-3 is not very long, totaling 130 words. There is a key line in the text that captures the main rule: “No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.” And yet, the 33-word “10yr-old” summary tells the reader they should not be forced to read or see disturbing content. Given that this is followed by the musing, word economy was not the issue. The summary missed the mark.

The “10yr-old” summary @Temperature .3 focused on the section heading, summarizing the “Good Samaritan” protection in a way more consistent with the text by outputting “You can’t be sued for taking down a picture that someone else put up.” The statute outlines that a “provider or user of an interactive computer service” cannot be held liable, so while this is not totally accurate, it is a marked improvement over the @Temperature .9 summary. It also discusses pictures that may be found to be “mean” or “rude,” which is relevant to a 10-year-old. However, it misses the key rule, and the fundamental way that Section 230 is now interpreted. Section 230 is a broad shield from liability that extends beyond protecting “Good Samaritans” to include online intermediaries and platforms.

tl;dr Summary

The “tl;dr” summary @Temperature .9 summarized the text as shielding the owner of a server. The statute’s language, “No provider or user of an interactive computer service,” is broad. The statute, drafted in 1996 in response to lawsuits against internet service providers (ISPs), is interpreted today to cover platforms like Facebook and search engines like Google. The statute contains a definition that I did not include in the excerpt text, which states that the term includes information services that “provide access to the Internet and such systems operated or services offered by libraries or educational institutions.” It is possible that the “tl;dr” summary had trouble capturing the breadth of the statute because the interpretation has evolved with the advent of new tech platforms. Applying old laws to new technology is a skill lawyers accomplish through analogy and cement through precedent. GPT-3 appears only to be reducing the word count and restating the text in more commonly used language. While this approach may be helpful in some instances, mistaking the key language of who is protected is a critical error.

Similarly, the “tl;dr” summary @Temperature .1-.3 errantly narrowed the statute to focus only on ISP liability, when it covers users of “interactive computer service[s].” Given the statute’s history and the fact that I did not include the definitions that are in Section 230, that is a reasonable interpretation. @Temperature .3, the GPT-3 output accurately captures the rule in stating that ISPs are not liable for the content of their users, even if they know the content is illegal. While Section 230 has some liability-exemption carve-outs that I did not include (copyright, child pornography), this is a decent output. It is unclear why GPT-3 added the scienter language stating that ISPs are not liable “even if they know that the content is illegal,” given that it is not in the statute. The input text excerpt from the statute only states that providers or users of an interactive computer service will not be held liable for being “Good Samaritans” by removing content. This points to GPT-3 looking beyond the input text, perhaps grabbing extraneous references to similar but different inputs. While this is acceptable for a general-purpose summarizer, referencing the summary of a similar but different rule is a bad approach for statutory interpretation.

@Temperature .1, GPT-3 produced output similar to @Temperature .3, but omitted the scienter line. This was the most succinct summary, even though it erroneously narrowed the statute to mean only ISPs. For the tl;dr summaries overall, the temperature range properly behaved as a measure of randomness, as the lower-temperature outputs were less random than the higher-temperature outputs. The lower-temperature summaries appear to reduce text more, leaving less room for the errors made at higher temperatures, which tended to be more verbose. However, given the inconsistency of temperature’s effect on output accuracy in the other texts, temperatures should still be varied in future experiments.

Conclusion  

GPT-3 performed best when summarizing the excerpt of the 47 U.S. Code § 230 (“Section 230”) statute at lower temperatures. It is possible that GPT-3 had some information on Section 230 in its training data, as it narrowed the statute’s broad “interactive computer service” language to specifically mean “Internet Service Providers” (ISPs), which is consistent with the history surrounding Section 230. However, Section 230 today is interpreted more broadly, covering search engines and social media platforms as well as ISPs. It is nonetheless impressive that at the lowest temperatures GPT-3 accurately summarized the liability exemption for ISPs.

The likely reason for this summary’s successes and shortcomings points to an important issue with the reliability of GPT-3’s training data. Training neural networks like GPT-3 on volumes of text instead of carefully vetted sources often results in outputs based on the text that appears most frequently in the training data. For example, it is possible that in the GPT-3 training data, Section 230 and “Internet Service Provider” were most frequently mentioned together, signaling GPT-3 to narrow the key definition to ISPs. The proper approach would have looked at the definition in the statute itself, as well as binding legal decisions, to determine that the definition is broader than ISPs.

In contrast to the Section 230 summary, GPT-3 performed poorly on the Independent Contractor Non-Disclosure Agreement and the LinkedIn Privacy Policy excerpts. Notably, those excerpts do not appear frequently on the internet and likely are not in GPT-3’s training data. This may explain why the LinkedIn Privacy Policy summaries contained strange references to MommyMeet and Tinder: GPT-3 may have relied on the wrong sources with similar boilerplate language to generate the summary.

The Independent Contractor Non-Disclosure Agreement also demonstrated how the GPT-3 Instructions affect the summary output. Whereas the “10yr-old” summary instructions were able to produce some sensible guidance, albeit highly oversimplified, the “tl;dr” summary instructions generated nonsensical outputs regardless of temperature. For example, @Temperature .9, GPT-3 outputted, “This is the section that demands you don’t cheat, and if you do, we’ll know and we’ll steal your candy,” and @Temperature .1 GPT-3 outputted “i am a good person and i will never sell your data,” but in German. While the “10yr-old” summaries stayed fairly consistent in keeping an advisory tone and speaking in the second-person “you” voice, the tl;dr summary outputs varied from first person “I” to second person “you” to third person plural “we,” and were not easily controllable via Temperature.

It is not a coincidence that “tl;dr” summaries on the internet vary considerably in word choice, tone, and voice. It is quite possible that GPT-3 is again merely mimicking the word patterns of its training data. Why that leads GPT-3 to switch voices given certain prompts is not clear, but it may mean that tl;dr summaries of certain topics tend to use a particular voice or tone. Further experimentation on summary instructions is needed to determine whether the output can be made more consistent.

Improving GPT-3  

Given that OpenAI designed GPT-3 as a general-purpose algorithm, it is not surprising that it does not perform well summarizing legal texts, as it is not optimized for this purpose. Because GPT-3 is closed source, it is difficult to pinpoint the reason for its errors and therefore to recommend specific modifications for the legal summarization use case. Nevertheless, GPT-3’s architecture may be a useful building block for a legal summarizer if a few points are addressed.

First, there must be greater transparency of the parameters that influence the text prediction and the means to modify them. More concretely, the sources GPT-3 references must at least be reviewable and perhaps allow for user input to follow a proper hierarchy of authorities. Although it is challenging to audit a 175-billion-parameter algorithm, it would be beneficial to understand the most influential semantic parameters used to generate the output text. This would ideally enable users to control word choice (perhaps as a level of sophistication), appropriate voice, and tone.

Second, legal summarization would benefit from standardized forms appropriate for the document type. Law schools often train students to write in the I-R-A-C format, wherein paragraphs begin with a statement of the factual issue and a citation of the rule(s) implicated, followed by analysis, and ending with a legal conclusion on the issue. That structure is well suited for the litigation context, allowing readers to quickly orient themselves with the factual issue and rule involved before reading the arguments over whether or not the facts satisfy the rule’s elements.

I intentionally avoided testing litigation texts here, as without a dataset of legal authorities, it would be unfair to expect GPT-3 to output anything relevant. In contrast, the excerpts I tested here lend themselves to textual analysis, which I hypothesized GPT-3 would fare better on. Unfortunately, perhaps due to the lack of standardization of summaries following the “10yr-old” and “tl;dr” instructions in GPT-3’s dataset, the output varied widely in form. It is possible that by adding a substantial amount of training data of uniformly structured summaries, GPT-3 may output in a more tailored and appropriate format. This is observable in GPT-3’s ability to write code in programming languages that are heavily represented in the training data. An alternative to supplementing the dataset with uniformly formatted training examples is to hardcode the output to summarize clause by clause, or section by section.
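To make the clause-by-clause idea concrete, here is a minimal sketch of that structure, assuming a hypothetical single-clause summarizer (for instance, a wrapper around a GPT-3 completion call) is passed in; the segmentation rule is a deliberately naive placeholder.

```python
import re
from typing import Callable

def summarize_by_clause(document: str, summarize: Callable[[str], str]) -> str:
    """Summarize a document clause by clause rather than all at once.

    `summarize` is any single-clause summarizer (hypothetical here); passing it
    in keeps this scaffolding independent of the underlying model.
    """
    # Naive clause segmentation: split on sentence-ending periods.
    # A real legal summarizer would use a clause- or section-aware splitter.
    clauses = [c.strip() for c in re.split(r"(?<=\.)\s+", document) if c.strip()]
    # Summarize each clause independently and reassemble in source order,
    # so the output mirrors the structure of the original document.
    return "\n".join(f"- {summarize(clause)}" for clause in clauses)
```

Hardcoding the output structure this way trades some fluency for predictability: each clause gets its own summary line, so omissions are easier to spot than in a single free-form summary.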

Finally, GPT-3 would benefit from a large dataset of good-quality legal summaries in order to improve the quality of its output. Legal summarization is not merely a task of reducing word count; the main task is to reduce complexity without distorting meaning or missing issues. Providing definitions, analogies, and examples of the implications helps readers more quickly grasp concepts. A feedback loop wherein lawyers could contribute to a database of clause-, section-, and document-level summaries would likely improve GPT-3’s output substantially. This is evidenced in OpenAI’s own paper, Language Models are Few-Shot Learners, which demonstrates that providing the model a few examples of the desired output substantially improves the outputs.
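As a rough illustration of that few-shot idea, the sketch below builds a prompt from example clause/summary pairs before appending the clause to be summarized. The example texts are invented placeholders, not real lawyer-contributed summaries, and the formatting is one assumption among many possible.

```python
# Few-shot prompt construction: each example pairs a clause with a
# lawyer-written summary; the model is then asked to continue the pattern.
examples = [
    ("Clause: The Receiving Party shall not disclose Confidential Information "
     "to any third party without the Disclosing Party's prior written consent.",
     "Summary: Don't share the other side's confidential information unless "
     "they agree in writing first."),
    ("Clause: This Agreement shall be governed by the laws of the State of Delaware.",
     "Summary: Delaware law controls any dispute under this contract."),
]

new_clause = "Clause: Independent Contractor shall limit access to all Confidential Information..."

prompt = "\n\n".join(f"{clause}\n{summary}" for clause, summary in examples)
prompt += f"\n\n{new_clause}\nSummary:"
```

The completion request itself would mirror the davinci call sketched earlier, with this prompt substituted in and a low temperature to keep the output close to the demonstrated form.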

Nevertheless, this does not mean that GPT-3, or any other neural network, is poised to replace lawyers in summarizing legal documents. The recommendations I outlined above are ways to nudge GPT-3 in the direction of the desired output through the volume of training examples, but the literature suggests that accuracy will plateau, likely well below the acceptable standard of the average lawyer. To get closer to that goal, the research community would benefit from high-quality legal datasets in which, for the summarization context, clauses are categorically labeled and summarized. Equipped with higher-quality data, architectures can be designed specifically around the legal summarization use case. I welcome messages from anyone looking to collaborate on this effort.
