Naming matters: Identifying common-law trademarks with machine learning

Introduced by Keeley Takimoto, Kevin Chow, Atul Madhugiri, Hannah Burak, and Andrew Nichol

Companies choosing a name spend thousands of dollars on reports that assess their legal risk of a trademark suit based on the proposed company name and industry. These reports are assembled by hand from the results of manual database searches, an extremely inefficient process. Our client, Naming Matters, produces such reports using state and federal trademark registries. However, these reports do not cover the risk of lawsuits from companies holding common-law trademarks.

A common-law trademark is owned by a company with an established reputation and history of operation in a particular industry. Common-law trademarks apply to companies providing similar products or services under identical or similar-sounding names. Our task was to take a proposed company name and industry pairing and identify existing common-law trademarks that could pose a risk of suit for the names being considered by Naming Matters customers.

Our initial plan was to use text data (either a full document or a more specific excerpt) to recognize common-law trademarks with natural language processing. Finding and creating the training datasets for this was a slow but defining part of the project. After weighing our options, we settled on two different approaches to modeling and to constructing training data, based on the databases and sources we were able to access during our research.

The first method uses a convolutional neural network to determine whether a company name appears in a block of unstructured text. We knew a neural network would be well suited to processing unstructured text, but that it would require an immense amount of training data marking positive and negative cases of words representing company names. The only large dataset that fit these requirements was the NYT Annotated Corpus, which contains a tag for organizations. This model is a valuable addition to the usual process for identifying common-law trademarks, especially when working with large chunks of unstructured text.
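As a rough illustration of what "positive and negative cases of words representing company names" could look like, the sketch below turns a piece of text and a list of tagged organization names into per-token labels. It is not the project's actual pipeline: the function names `tokenize` and `label_tokens`, the simple word-level tokenizer, and the example sentence are all assumptions made for illustration, and the real corpus annotations may be structured differently.

```python
import re

# Hypothetical word-level tokenizer; the project's real tokenization is not described.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into word tokens."""
    return TOKEN_RE.findall(text)


def label_tokens(text, organizations):
    """Return (token, label) pairs, where label is 1 if the token is part
    of a tagged organization name and 0 otherwise."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    for org in organizations:
        org_tokens = [t.lower() for t in tokenize(org)]
        n = len(org_tokens)
        if n == 0:
            continue
        # Slide a window over the text and mark every exact match.
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == org_tokens:
                labels[i:i + n] = [1] * n
    return list(zip(tokens, labels))


# Made-up sentence and organization tag, for illustration only.
example = label_tokens(
    "Shares of Acme Widgets rose after the merger.",
    ["Acme Widgets"],
)
```

Sequences labeled this way are the kind of input a network could be trained on, with the tagged tokens as positive cases and all remaining tokens as negative cases.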

The second method takes the text results of queries against article databases, processes them with feature engineering, and feeds them into models that produce a risk score for a trademark suit. We invested in creating a second, smaller training dataset better suited to logistic regression and tree-based models. In addition, we built a clean front end for this method. With a Random Forest model that leverages current tools to run the process start to finish, and a CNN/NLP model customized for trademarks that can be incorporated when full text is available, we have created two methods for identifying possible common-law trademarks and their associated risk.
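To make the feature-engineering step more concrete, here is a minimal sketch of how query results might be turned into numeric features. The specific features (`mention_rate`, `industry_overlap`), the function name `extract_features`, and the sample snippets are hypothetical; the project's actual features are not listed here.

```python
def extract_features(candidate, industry_terms, snippets):
    """Compute simple, illustrative features from article snippets.

    mention_rate: share of snippets that contain the candidate name.
    industry_overlap: share of those mentions that also contain an
    industry term.
    """
    cand = candidate.lower()
    terms = [t.lower() for t in industry_terms]
    mentions = 0
    industry_hits = 0
    for snippet in snippets:
        low = snippet.lower()
        if cand in low:
            mentions += 1
            if any(term in low for term in terms):
                industry_hits += 1
    total = len(snippets)
    return {
        "mention_rate": mentions / total if total else 0.0,
        "industry_overlap": industry_hits / mentions if mentions else 0.0,
    }


# Made-up query results, for illustration only.
snippets = [
    "Acme Widgets opens a new hardware store downtown",
    "Local bakery wins regional award",
    "Acme Widgets announces quarterly results",
]
features = extract_features("Acme Widgets", ["hardware"], snippets)
```

A feature dictionary like this, computed for each name and industry pairing, could then be passed to a tree-based classifier such as a Random Forest, with the predicted probability of the "risky" class serving as the risk score.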