In 2019, Merriam-Webster’s Word of the Year was “they,” a pronoun used to refer to nonbinary people. This reflects society’s growing recognition of fluidity in how gender is defined.
The 14th Amendment to the Constitution, originally passed to grant African Americans citizenship and equal rights, has been applied to protect civil rights under the phrase “all persons born or naturalized in the United States.” It was the basis for decisions in cases such as Brown v. Board of Education, and it was the foundation of the Civil Rights Act of 1964. Title VII of the Act prohibits workplace discrimination based on race, color, religion, national origin, and sex. However, the term “sex” is difficult to apply in the modern context of gender. Because the American legal system follows precedent, the Supreme Court’s rulings can either weaken or secure LGBTQ+ rights. Upcoming cases involve two gay men and a transgender woman who were fired from their workplaces because of their identities.
Our goal is to use natural language processing tools to analyze Supreme Court opinions and oral arguments, examining how the definitions and usage of gendered language have evolved over time in the Court.
To maximize efficiency, we split our team of nine students into four teams:
Team Cal: Identify landmark Supreme Court cases that have dealt with gender. Compare gathered information to the results produced from our data.
Team Go Bears: Extract Supreme Court written opinions and oral arguments from trustworthy sources.
Team Blue & Gold: Research and implement natural language processing models.
Team Oski: Provide quantitative analysis via clustering and metrics visualization.
Team Cal
This week I examined the gender composition of the current Supreme Court and noted how gender bias can affect decisions.
The Supreme Court has historically been a male-dominated space. Since its founding, only 4 of its 114 Justices, about 3.5 percent, have been women. The scarcity of women arguing before the Court is noticeable as well: the first woman did not argue before the Supreme Court until 1880, and in the 2017-2018 term, women made up only 12 percent of 163 appearances. This number has historically ranged from a low of 15 to a high of 19 percent.
Research by Christina L. Boyd, Lee Epstein, and Andrew D. Martin (2010) examined the impact of female judges on court decisions. Boyd et al. found that in nearly all types of cases, a judge’s gender was not a significant predictor of how they would vote. In sex discrimination cases, however, female judges were about 10 percentage points more likely than their male counterparts to vote in favor of the party alleging discrimination. Additionally, male judges were more likely to vote in favor of the plaintiff when a female judge was present on the court.
Price Waterhouse v. Hopkins (1989) set an important precedent regarding gender stereotyping. Ann Hopkins sued the accounting firm Price Waterhouse after her candidacy for partnership was not re-proposed, in part because of her male colleagues’ sexually discriminatory comments. Her colleagues perceived her demeanor as aggressive and unsuitable for a woman, and their comments affected her promotion. The Court, ruling in favor of Hopkins, established that gender stereotyping is a form of sex discrimination. The case has been applied recently to argue that workplaces cannot discriminate against LGBTQ+ people for not fitting a common perception of what a man or woman should look like.
Next week our goal is to look into more key cases and build up a qualitative timeline for when we expect to see fluctuations in bias.
Team Go Bears
Our group collected data from two sites that contain a wealth of information from our legal system: Oyez and Justia. In particular, we collected the oral argument transcripts and the written opinions of justices for Supreme Court cases argued from 2000 to 2018. Oyez had an undocumented but usable API that allowed us to easily collect JSON files containing transcripts for most of the cases we wanted. The Oyez API also allowed us to get links to Justia pages containing full written opinion texts. We used the BeautifulSoup library to extract the full opinions. By cross-referencing the Oyez and Justia data, we obtained full opinion texts, oral arguments, and assorted metadata for most of the cases we looked at.
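As a sketch of the opinion-extraction step, BeautifulSoup can pull plain text out of an opinion page’s HTML. The tag and class names below are illustrative assumptions, not Justia’s actual markup:

```python
from bs4 import BeautifulSoup

def extract_opinion_text(html: str) -> str:
    """Extract plain opinion text from an opinion page.

    NOTE: the 'opinion' class name is a hypothetical placeholder;
    the real Justia markup may differ.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="opinion")
    if container is None:
        return ""
    # Collapse the nested tags into a single whitespace-normalized string.
    return container.get_text(separator=" ", strip=True)

# Toy example with made-up markup:
sample = '<html><body><div class="opinion"><p>Justice X delivered the opinion.</p></div></body></html>'
print(extract_opinion_text(sample))
```

In practice we applied this kind of extraction to each opinion page linked from the Oyez metadata.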
This data is an interesting target for analysis for a number of reasons. Primarily, we plan to determine how the use of gendered language has changed over time in our legal system. We could perform this analysis separately for the Justices’ opinions and the oral arguments and compare how gender is treated in written versus spoken language. We could also use properties of each justice’s opinions in a model that predicts how that justice would vote on a case, or even generate a fake opinion for a new case based on opinions the justice has written in the past. To accomplish these goals, we plan to start with standard techniques such as word-frequency analysis, then move on to more advanced NLP techniques.
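As a first pass at the word-frequency idea, a minimal sketch that counts gendered pronouns in a text (the sample sentence is invented, not from our data):

```python
import re
from collections import Counter

def pronoun_counts(text: str) -> Counter:
    """Count occurrences of a few gendered pronouns, case-insensitively."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tracked = {"he", "she", "him", "her", "his", "hers", "they", "them"}
    return Counter(t for t in tokens if t in tracked)

# Invented example text, for illustration only:
opinion = "She argued that he had discriminated against her, and they agreed."
print(pronoun_counts(opinion))
```

Running this over opinions grouped by year would give a simple first picture of how pronoun usage shifts over time.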
Team Blue & Gold
Training models to find biases
The following plot was made using data from all Supreme Court cases in 2002:
The plot represents the words ‘she’, ‘he’, ‘guilty’ and ‘president’ as points, such that the distance between words corresponds to their similarity. It might be tempting to draw any of the following conclusions about biases in Supreme Court cases:
- ‘she’ is more similar to ‘guilty’ than ‘he’ is.
- ‘he’ is more similar to ‘president’ than ‘she’ is.
- ‘guilty’ is more similar to ‘president’ than ‘she’ is to ‘he’.
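For concreteness, similarity between embedding vectors is commonly measured with cosine similarity, which the plotted distances approximate. A toy numpy sketch with made-up 2-D vectors, not our trained embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-D vectors for illustration only:
vectors = {
    "she":    np.array([1.0, 0.2]),
    "he":     np.array([0.2, 1.0]),
    "guilty": np.array([0.9, 0.4]),
}

print(cosine_similarity(vectors["she"], vectors["guilty"]))
print(cosine_similarity(vectors["he"], vectors["guilty"]))
```

With real embeddings, the same two-line comparison is what lies behind each of the statements above.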
However, drawing any of the above conclusions from this image would be wrong for one simple reason: we cherry-picked it. Moreover, due to some technicalities, the distances are only an approximation of similarity. This leads us to the main question we worked on over the previous weeks.
How can we be confident in conclusions based on data analysis?
As scientists, we advocate the following approach: take whatever conclusion you have and try to prove it wrong. For example, if our analysis suggests that ‘she’ is more similar to ‘guilty’ than ‘he’ is, we perform experiments that could prove our conclusion is wrong.
- would we get the same result if we used data from 2001 instead of 2002?
- would we get the same result if we redid the analysis?
- would we get the same result if we redid the analysis on only 11 months of data, leaving out one month at random?
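The leave-one-month-out check can be sketched as a simple robustness loop. Here `conclusion_holds` is a hypothetical stand-in for whatever analysis produces the claim, and the monthly scores are invented:

```python
def conclusion_holds(months):
    """Hypothetical placeholder: re-run the analysis on `months` of data and
    return True if the conclusion still holds. Faked here with a toy rule."""
    return sum(months) / len(months) > 0.5

# Toy per-month scores standing in for twelve months of 2002 data:
monthly_scores = [0.6, 0.7, 0.4, 0.8, 0.6, 0.5, 0.9, 0.7, 0.6, 0.3, 0.8, 0.7]

# Leave out each month in turn and see whether the conclusion survives:
results = []
for i in range(len(monthly_scores)):
    subset = monthly_scores[:i] + monthly_scores[i + 1:]
    results.append(conclusion_holds(subset))

print(all(results))  # trustworthy only if the conclusion holds in every rerun
```

If even one rerun disagrees, the conclusion depends on a single month of data and deserves suspicion.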
The idea that we ‘attempt to prove ourselves wrong’ is fundamental in science; it goes by the name of falsification. In our case, the hypothesis would be that ‘she’ is more similar to ‘guilty’ than ‘he’ is, and we attempt to falsify it by performing experiments.
Visualization as a tool for falsification
The above image visualizes words represented as points; these representations are called word embeddings. Word embeddings represent each word as a high-dimensional vector. To visualize these vectors, we apply a few mathematical tricks that approximate the high-dimensional vectors in two dimensions.
How do we make word embeddings?
We start with a lot of text, in our case Supreme Court opinions from 2002. Our algorithm then tries to find word embeddings capable of predicting words left out of a sentence. For example, we might give the algorithm the following sentence:
Which word is left _ of this sentence? # Answer: 'out'
The algorithm might initially predict something arbitrary, such as ‘hat’. Since we know the correct answer for this sentence is ‘out’, we can compute the algorithm’s prediction “error” and update the word embeddings to reduce it. This procedure is repeated for all sentences in our dataset.**
The algorithm thus processes sentence after sentence, repeatedly improving its word embeddings.
To display this process, we made an animation that visualizes each iteration.* To simplify the plot, we normalized all distances relative to “she” and “he”.
Most of the time, the word ‘guilty’ is more similar to ‘she’ than ‘president’ is. That said, it is not clear if this is caused by an underlying bias in the Supreme Court opinions, or whether it is caused by our algorithm. Suppose we adopt the hypothesis:
The word 'guilty' is more similar to 'she' than 'president' due to an underlying bias in the Supreme Court opinions.
Let us now try to falsify this hypothesis. It turns out that the word embedding algorithm involves some amount of randomness, which means that re-running it can lead to different results. If different runs of the algorithm reach different conclusions, we will have successfully falsified our hypothesis!
The following animation shows three different runs of our word embedding algorithm in red, green, and blue.
Both the green and red word embeddings place ‘she’ closer to ‘guilty’ than to ‘president’. However, this is not the case for the blue embedding: there, ‘she’ is actually closer to ‘president’ than to ‘guilty’.****
This raises serious questions about our hypothesis. If the similarity were caused by an underlying bias in the Supreme Court opinions, why did our algorithm find it in only 2 out of 3 runs?
In the weeks ahead we will be working on building models that hopefully withstand our attempts at falsification.
Disclaimer: the above example is used purely for educational purposes, any conclusions or claims made are not founded in data.
*Usually, all vectors change during the training process, yet as you can see, “she” and “he” are not moving. We fixed these vectors for visual reasons, as they represent important pronouns for inferring gender bias later on.
**For reference, we used the Continuous Bag of Words (CBOW) variant of Word2Vec to train the word embeddings.
****Furthermore, we want to stress that even though the points seem to converge, this property cannot be extracted from our current plots. Constructing a quantitative method to evaluate inter-model similarity will also be part of our work.
Team Oski
For the visualization component, we aimed to use the trained word embeddings (see Team Blue & Gold’s section) to cluster similar words together across all the court case opinions in our dataset. Since trained word embeddings are high-dimensional vectors, we needed to reduce their dimensionality to visualize them on 2-dimensional scatter plots. Using common dimensionality reduction techniques (PCA, t-SNE, and UMAP), we reduced each word embedding to two dimensions. Afterward, we performed clustering on these reduced word embeddings using algorithms like k-means, HDBSCAN, and Gaussian mixture models. Based on the data and visual inspection, the clusters we obtained weren’t robust enough to infer anything about their overall meaning. In the future, we aim to improve the quality of the word embeddings and clustering algorithms to create more meaningful clusters.
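A minimal sketch of this reduce-then-cluster pipeline with scikit-learn, using random vectors in place of our trained embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Random stand-in for 200 trained word embeddings of dimension 100:
embeddings = rng.normal(size=(200, 100))

# Reduce to 2-D for plotting (t-SNE or UMAP could be substituted here):
coords = PCA(n_components=2).fit_transform(embeddings)

# Cluster the reduced points (HDBSCAN or a Gaussian mixture are alternatives):
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

print(coords.shape, labels.shape)
```

Each 2-D point can then be scattered and colored by its cluster label to produce the plots described above.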
In addition, we used our trained word embeddings to compute the “average” word vector for “feminine” and “masculine” words. We preselected words traditionally seen as ‘feminine’ (her, woman, female) or ‘masculine’ (man, him, male). Using our word embeddings, we took the component-wise mean to obtain a vector that captures the overall sense of a feminine or masculine word in our dataset. Interestingly, the 10 words most similar to the average feminine word vector included ‘hanson’, ‘lawyer’, and ‘expenditure’. It seems that ‘hanson’ is a reference to the Supreme Court case Hanson v. Denckla, which revolved around court jurisdiction over a trust a woman had created before her death. The presence of more general words like ‘lawyer’ and ‘expenditure’ may point to connections between gender and other topics. Among the 10 words most similar to the average masculine word vector, we see words like ‘armed’ and ‘knew’, which may be related to descriptions of crimes. Currently, there is considerable overlap between the feminine and masculine word vectors in our overall corpus. In our next blog post, we aim to use more robust models to draw more concrete conclusions from this analysis!
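The averaging-and-ranking step can be sketched with numpy. The embeddings below are random stand-ins, so the resulting ranking is meaningless, but the mechanics match what we described:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["her", "woman", "female", "man", "him", "male",
         "lawyer", "armed", "guilty", "president"]
# Random stand-ins for trained embeddings of dimension 50:
embeddings = {w: rng.normal(size=50) for w in vocab}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Component-wise mean of the preselected 'feminine' word vectors:
feminine_words = ("her", "woman", "female")
feminine_avg = np.mean([embeddings[w] for w in feminine_words], axis=0)

# Rank the remaining vocabulary by similarity to the average vector:
others = [w for w in vocab if w not in feminine_words]
ranked = sorted(others, key=lambda w: cosine(feminine_avg, embeddings[w]),
                reverse=True)
print(ranked)
```

With real trained embeddings, the top of this ranking is where words like ‘hanson’ and ‘lawyer’ appeared.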