My Journey From Law to Machine Learning
I also went into this journey irked, with a chip on my shoulder. In law school, an eerie quiet comes over the otherwise fast-firing Socratic dialogue in class when the central question hinges on a math formula. It's a running joke that lawyers aren't good at math. I resent that generalization. I felt a lightning bolt run down my spine when, over a beer immediately following our bar exam, a classmate said the reason we had suffered through that miserable test was that we didn't have the chops to be software programmers. Regrettably, I nodded approvingly and took another sip of my Belgian Tripel despite all of my insides shouting in retort, "The hell I can't!" When I pitched the idea of studying machine learning to my favorite law school professor and extraordinary privacy law scholar, Paul Ohm (who was a CS major in college and teaches a pioneering course, Coding for Lawyers, among other more traditional law courses), he cautioned me against retiring my legal pen for the machine learning calculator. The thunder roared in my bones once more. Oh no he didn't just tell me I shouldn't.
So I did. Professor Ohm had a fair point about how far along I was as an attorney versus as a software programmer or mathematician. But I subscribe to the adage that it ain't where you start, it's where you finish. The extent of my coding skills prior to diving into machine learning consisted of a game of hangman built in Python and a Tcl Internet Relay Chat botnet built back in the day. While I enjoyed my statistics and economics courses in college, I never studied calculus or linear algebra because I failed to foresee a use for them. So I came to this a bit behind the learning curve on the math, but I did have some rigorous law school experience stacked on top of a lifetime of grit. Law school taught me that if you can grasp definitions, follow the references and map the logic, any subject matter is conquerable. And so, after some hemming and hawing, I dove in.
Despite many frustrations and sleepless nights spent trying to crack a programming problem or math equation, I persevered through one programming assignment after another, and things started to click. I found my toes tickled when I took a step back to acknowledge just how amazingly powerful and elegant some machine learning formulas are. It is humbling to think there were bald (sorry, but true!), bearded (scruffy?) and intellectually bold researchers who ground through machine learning research for decades without much recognition outside of their community. Many of the early models had to be trained for weeks on end (if not longer) before increases in computing power shortened the training time. To those brave pioneers whose work now shapes the future of industry: I am so very grateful for your persistence. And to Andrew Ng and Laurence Moroney, thank you for assembling your magnificent courses, which trailblazed what otherwise would have been an Amazon-wide forest I'd have tossed myself into without much more than a compass (and some rock-skipping talent).
First, there were numbers. Then, there were curves.
I dove headfirst into Andrew Ng's Machine Learning course on Coursera, which explores the mathematical foundations of machine learning in Octave. Here are some of my model results visualized (produced in Octave through gradient descent primarily using vector math):
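The heart of those models is the vectorized gradient descent update. As a minimal sketch (in NumPy rather than the course's Octave, with variable names of my own choosing), batch gradient descent for linear regression looks like this:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1500):
    """Batch gradient descent for linear regression.

    X: (m, n) feature matrix (prepend a column of ones for the bias term).
    y: (m,) target vector.
    alpha: learning rate.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        predictions = X @ theta                    # hypothesis: X * theta
        gradient = (X.T @ (predictions - y)) / m   # vectorized gradient of the cost
        theta -= alpha * gradient                  # simultaneous parameter update
    return theta

# Toy usage: recover y = 2x + 1 from noisy data.
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])          # add the bias column
y = 2 * x + 1 + np.random.randn(50) * 0.1
theta = gradient_descent(X, y, alpha=0.02, iterations=5000)
print(theta)  # should land close to [1, 2]
```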
Armed with this quantitative foundation, I went on to get my hands dirty with the state of the art in Andrew Ng's Deep Learning Specialization, as well as Laurence Moroney's (@Google) TensorFlow Developer Professional Certification. The Deep Learning Specialization builds neural networks from scratch in NumPy before proceeding to leverage Keras. Moroney's TensorFlow course takes a more practical approach by implementing neural networks directly in TensorFlow.
Here is a glimpse of some of the audio-visual projects I worked on (because they're more scrollable than model accuracy metrics!):
Neural Style Transfer: images that make my Instagram-loathing heart smile
These neural style transfer images speak for themselves. At a high level of abstraction, the process uses a convolutional neural network to blend a content image (the original) with a style image. I found the Gram matrix (see below) endlessly fascinating. At different convolutional layers, the neural network pulls out patterns of different color gradients, lines, shapes, and even some less intuitive characteristics.
Gram Matrix at Different Layers
Layer One
Layer Four
In the examples above, in the first convolutional layer, the neural network is detecting patterns in color gradients and contrast lines. Deeper into the network at layer four, it is zooming out a bit to match the patterns of dog faces, animal feet and vehicle wheels. It's interesting to note the non-conforming photos (non-conforming to the humanly obvious definition of the category) in each section. For example, in layer four, cameras appear among the spiral patterns in the middle section, and there is an animal in the bottom-left section of vehicle wheels. The neural network is not categorizing these images according to any preprogrammed human intuition. It is learning these characteristics by finding RGB-pixel patterns at different levels of abstraction.
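For the technically curious, the Gram matrix itself is simple: it is the matrix of inner products between a layer's flattened filter activations, so it records which filters fire together (style) while discarding where in the image they fire (content). Here is a minimal TensorFlow sketch, assuming a channels-last activation tensor and one common normalization choice:

```python
import tensorflow as tf

def gram_matrix(activations):
    """Gram matrix of one convolutional layer's activations.

    activations: tensor of shape (height, width, channels).
    Returns a (channels, channels) matrix of filter co-activations.
    """
    h, w, c = activations.shape
    # Flatten the spatial dimensions: each column is one filter's responses.
    flat = tf.reshape(activations, (h * w, c))
    # Entry (i, j) measures how strongly filters i and j respond together,
    # regardless of where in the image they respond.
    return tf.matmul(flat, flat, transpose_a=True) / tf.cast(h * w, tf.float32)
```

The style loss then penalizes the squared difference between the Gram matrices of the style image and the generated image at each chosen layer.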
Image Recognition: You only look once
You only look once (YOLO) is a fast object detection algorithm for RGB camera images (the type you get from a typical camera or camera phone). Object detection is a critical component in the series of algorithms and programs that control autonomous vehicles. Computational efficiency is essential because critical navigation decisions (acceleration, braking, steering) hinge on split-second, accurate detection of surrounding objects, traffic signals and signage. What isn't visible at the low resolution of the output image above are the confidence values for each object. Here are the basic steps of how YOLO works:
1. Draw candidate bounding boxes (more than 1,800 here)
2. Calculate the confidence that an object is present in each box
3. Determine the class of the object via softmax
YOLO may look like a scattershot approach to drawing bounding boxes, but it's far more efficient than the "sliding window" approach of old, which marched a predefined box across the image a few pixels per step (much slower computationally). Armed with a robust detection model, these bounding boxes can identify nearly any street-level object. Tracking the boxes across multiple frames of video allows for trajectory mapping, which in turn enables safe path planning (object avoidance, for example). I couldn't get enough of YOLO, so I dove in to build a robot to implement it in a real-world setting.
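To make those three steps concrete, here is a hedged sketch of the post-processing stage in TensorFlow: score each box as objectness confidence times class probability, threshold the scores, then let non-max suppression prune overlapping boxes. The shapes, thresholds and function names are illustrative, not the exact course code.

```python
import tensorflow as tf

def filter_yolo_boxes(boxes, objectness, class_probs,
                      score_threshold=0.6, iou_threshold=0.5, max_boxes=10):
    """Post-process raw YOLO outputs.

    boxes:       (N, 4) corner coordinates, one row per candidate box.
    objectness:  (N,)   confidence that *any* object is in the box.
    class_probs: (N, C) per-class probabilities (the softmax output).
    """
    # Step 2: confidence that a specific class is present in each box.
    scores = objectness[:, None] * class_probs        # (N, C)
    best_class = tf.argmax(scores, axis=-1)           # Step 3: class per box
    best_score = tf.reduce_max(scores, axis=-1)       # (N,)

    # Drop low-confidence boxes before the more expensive overlap test.
    keep = best_score >= score_threshold
    boxes = tf.boolean_mask(boxes, keep)
    best_score = tf.boolean_mask(best_score, keep)
    best_class = tf.boolean_mask(best_class, keep)

    # Non-max suppression: among heavily overlapping boxes, keep the top scorer.
    selected = tf.image.non_max_suppression(
        boxes, best_score, max_boxes, iou_threshold=iou_threshold)
    return (tf.gather(boxes, selected),
            tf.gather(best_score, selected),
            tf.gather(best_class, selected))
```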
Long Short-Term Memory Poetry
The truth lives just beyond the mountain,
or the day to show heaven's view old,
end thee clears live shows thine,
ever used aside bright 'tis more pride told,
well more back lust another nought straight,
'tis tongue light expired,
light need part in me alone,
so say mine am, am forsworn thee in thee,
but so will see blind cross part,
know behold such young me shows thee bright,
so writ it now say none now be deeds ' was by every part,
are bear things brought 'tis done bear,
bear true crime,
prove true age,
prove best fire,
check true, true spirit
I generated this Shakespeare-esque poem by implementing a bidirectional long short-term memory (LSTM) model in TensorFlow. The model was trained on Shakespeare's sonnets and seeded with the trigger text "The truth lives just beyond the mountain." The only edit I made was adding the line breaks. LSTMs are modified recurrent neural networks, which in this case predict the next word from the previous words in the sequence. A notable aspect of word-based LSTMs like this one is that they base predictions not only on the word immediately before the prediction, but on earlier words in the sentence or paragraph as well. This gives LSTMs better context to factor into the prediction, and in this instance it produced more varied (dare I say creative) poetry than a simpler next-word prediction model would have.
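The model itself is compact in Keras. Here is a minimal sketch along the lines of what I described, with illustrative (not exact) vocabulary size, sequence length and layer widths:

```python
import tensorflow as tf

vocab_size = 3000   # distinct words in the training corpus (illustrative)
max_len = 10        # words per input sequence (illustrative)

model = tf.keras.Sequential([
    # Map each word index to a dense embedding vector.
    tf.keras.layers.Embedding(vocab_size, 100, input_length=max_len),
    # The bidirectional LSTM reads the sequence forwards and backwards,
    # so context comes from the whole line, not just the last word.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150)),
    # Output a probability distribution over the next word.
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```

Generation then loops: feed in the seed text, sample the predicted next word, append it to the sequence, and repeat.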
As someone who took to writing poetry at a young age as a means of hip-hop-fueled expression outside the confines of "proper" usage and grammar, I found coaxing human-like output out of this algorithm to be a fist-pumping occasion. Language undoubtedly has learnable patterns, but discerning meaning is still a challenge for machine learning algorithms. The state-of-the-art GPT-3 model has taken the natural language processing world by storm, but it is clear it has no idea what it's on about. Working on this algorithm further inspired me to leverage natural language processing in the practice of law. I earnestly believe AI is well suited to the many tasks of lawyering. I'll circle back on this below.
Long Short-Term Memory Model Jazz
I generated this jazz piano solo with an LSTM (see the image below of the model structure), using electrojazz music samples like this one as training data. I broke the sample data into 78 shorter snippets of around a quarter second each. The snippets allowed me to run measures and chords of the sample music through the model. In a sense, the model analyzes jazz "sounds" rather than the traditional (and more precise) note/chord/phrase structure found in music theory.
Given that music needs to sound "pleasing" (a hard criterion for an ML algorithm!) to be considered music, the model included some post-processing "massaging" of the output, mainly to avoid repetitive notes and large variations in pitch. While I'm no Miles Davis (I have zero musical ability), this one brought a smile to my face. I won't delve into the overarching definition of creativity or the essence of newness, but I will say that if I can make mash-up jazz, Kenny G might lose his worldwide elevator music contract.
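As an illustration of what that post-processing "massaging" might look like (the actual cleanup logic differed in its details), the idea is to nudge immediate repeats and fold oversized pitch leaps back toward the previous note:

```python
import numpy as np

def clean_up(sampled_notes, max_jump=12, rng=None):
    """Suppress immediate repeats and large pitch leaps in model output.

    sampled_notes: list of MIDI pitch numbers emitted by the model.
    max_jump: largest allowed interval in semitones (an octave here).
    """
    rng = rng if rng is not None else np.random.default_rng()
    cleaned = [sampled_notes[0]]
    for note in sampled_notes[1:]:
        if note == cleaned[-1]:
            # Nudge a repeated note by a small random interval.
            note += int(rng.choice([-2, -1, 1, 2]))
        if abs(note - cleaned[-1]) > max_jump:
            # Fold a large leap back to within an octave of the previous note.
            note = cleaned[-1] + int(np.sign(note - cleaned[-1])) * max_jump
        cleaned.append(note)
    return cleaned
```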
Stacking Convolutions, LSTMs and RNN Layers to Predict Sunspots
Stable @ lr=1e-5.5
Think of the learning rate graph above as the ideal minigolf putt. You're trying to hit the ball so that it travels on a path that gets closer to the hole at every timestep. Contrast this with a slow putt that may eventually arrive at the hole but wastes computational resources, and with a high-velocity putt that might get there faster but may also jump off the course in an unpredictable way. The optimal putting velocity is thus the optimal learning rate.
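A stacked model like the one named in the heading, along with the learning-rate sweep used to find that stable rate, can be sketched in Keras roughly as follows (window size and layer widths are illustrative, not my exact configuration):

```python
import tensorflow as tf

window_size = 60  # past observations fed in per prediction (illustrative)

model = tf.keras.Sequential([
    # A 1D convolution picks up short-range patterns in the series.
    tf.keras.layers.Conv1D(64, kernel_size=5, padding='causal',
                           activation='relu', input_shape=[window_size, 1]),
    # Stacked LSTM (recurrent) layers capture the longer cyclical structure.
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),  # predict the next value in the series
])

# During a sweep run, ramp the learning rate up each epoch (pass
# callbacks=[lr_sweep] to model.fit), then retrain at the largest rate
# where the loss curve stays stable -- here, about 1e-5.5.
lr_sweep = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10 ** (epoch / 20))

model.compile(loss=tf.keras.losses.Huber(),
              optimizer=tf.keras.optimizers.SGD(learning_rate=10 ** -5.5,
                                                momentum=0.9),
              metrics=['mae'])
```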
The results of my sunspot predictor are below on the right, next to the input data showing the cycles of low-temperature spots observed on the sun. The orange line is the prediction.
Daily Minimum Temperatures
Prediction MAE = 1.780626 (output)
Not too shabby!
Deep Learning on the horizon
These specializations made me feel like Dorothy in The Wizard of Oz at times, but going deeper down the yellow code road was well worth it. Not only is the moment your code finally works emancipatory, but some of the algorithms I built gave me a window into how the future of our world will be designed. There are philosophical (and technical) arguments over whether human cognition can be learned through gradient descent (think of the minigolf analogy applied to a multidimensional course), enabling machines to do everything we can. The central question of whether deep learning can scale up to human-level thinking hinges on whether our thinking is composed of a finite number of quantifiable parameters. I will not go near the question of whether computers may one day have a soul, but it is clear that deep learning is ripe to make rapid progress in mapping thought patterns, enabling us to read thoughts and to predict and manipulate behaviors.
Curious how machine learning could be applied to understanding and optimizing the brain, I dove into some research papers this year with a neurosurgery resident friend of mine, Anil Mahadavi. The signal-to-noise ratio in electroencephalogram technology (those weird skullcaps with wires popping out of them, and what Muse is using to help track sleep and optimize meditation) is not great, but it's rapidly improving, and surgically implanted solutions like the one Neuralink is developing, combined with neural networks, appear to be good enough to predict speech from thoughts. I can think of nothing cooler than a Matrix-style training simulator, like the one where Neo learns kung fu and Trinity learns to fly a helicopter in an instant. Even a primitive version of that would be a huge advancement in human education and performance evaluation. Imagine if any skill could be practiced in a virtual environment that reinforces and adapts to maintain the ideal flow state, manage fatigue and boost creativity. That's the holy grail of education.
I think the legal profession will be disrupted by deep learning far sooner than we crack the brain's code. One of the fundamental legal skills is issue spotting, which boils down to connecting facts to law. There is no doubt that this task lends itself to the type of pattern recognition deep learning excels at. Contract-language anomaly spotting has already taken off. Legal drafting may be a more distant frontier, but given GPT-3's writing abilities, we are not far from solid draft-level work.
A lot of the heavy lifting appears to be in the pre-processing, data-labeling stage. One of the key inflection points for the growth in the effectiveness of image recognition (and the accompanying manufacturing and autonomous-driving efforts) was the massive effort to manually label datasets for training. An expertly labeled legal dataset categorizing term types, law-fact analysis and other useful parameters would greatly help this effort. While there are no CAPTCHAs for lawyers to label a dataset the way there are for images, and there is a tendency toward Luddite traditionalism in the legal industry, disruption is inevitable.
I believe everyone should have a lawyer in their pocket. Jack Weatherford writes in Genghis Khan and the Making of the Modern World about how the Mongols promoted literacy in part to communicate the legal code to everyone under Mongol rule. While the law is far more complex today than it was during the reign of the Mongol Empire, some of that complexity is unnecessary. Legalese, combined with an abundance of references, inter alia, makes the law inaccessible to many. We need a search engine for legal questions, and we need to give regular people the power to file and contest legal actions and draft robust contracts. We also need to cut down on the incredibly expensive legal bills companies rack up on M&A transactions and IPOs. We need an AI lawyer.