AI Patent Classification Dataset

This repository provides a dataset that identifies and classifies AI patent filings from among the universe of USPTO utility patent applications, May 2007-March 2026.

The dataset is used in the following paper:

Chen, M.A. and Wang, J. (2026). Displacement Versus Augmentation: The Effects of AI on Employment Dynamics and Firm Value. Working paper. Available at SSRN: https://ssrn.com/abstract=5246388

Dataset Description

This is a patent-level dataset of AI-related U.S. patent applications with pre-grant publication dates from May 2007 to March 2026. We apply large language models (LLMs) to patent abstracts from the USPTO Bulk Data Storage System (BDSS) to identify AI-related patents and classify them into seven functional categories: Perception AI, Motor Control AI, Language AI, Decision-Making AI, Learning AI, Engagement AI, and Creativity AI, based on the type of cognitive task they perform.

Each observation is a unique AI patent application, identified by its USPTO application number and accompanied by its filing date, pre-grant publication date, grant number and grant date (if granted as of March 31, 2026), and AI functional type classification(s).

In Chen and Wang (2026), this dataset is used to construct firm-level measures of AI patenting by linking patent applications to their assignee firms via USPTO assignee records or PatentsView.

Conceptual Background: AI and Human Intelligence

A commonly held view is that AI consists of computing systems that can perform tasks traditionally requiring human intelligence (Simon, 1977; Korteling et al., 2021). The notion that artificial intelligence can or should do what human intelligence is capable of has featured prominently in the work of computer scientists since at least the mid-20th century. For example, in 1950, Alan Turing proposed a test based on an "imitation game" whereby a computer can be said to possess artificial intelligence if it adequately mimics human behavior under certain conditions (Turing, 1950). In recent years, major technical advances related to artificial neural networks, deep learning, and generative AI have enabled computers to equal or exceed human performance in narrow groups of cognitive tasks (LeCun, Bengio, and Hinton, 2015; Griffiths, 2020; Sternberg, 2024). Many computer scientists believe that progress may eventually lead to Artificial General Intelligence (AGI), a form of AI that, even without extensive pre-training, would be able to generalize across domains and think, reason, and create as humans do (Goertzel and Pennachin, 2006; Goertzel, 2014).

Motivated by the oft-discussed parallels between AI and human intelligence, we propose to classify AI according to key cognitive functions. In particular, we develop a functional typology that associates each instance of AI with one or more kinds of human-like cognition.

We draw upon the unified theory of cognition proposed by Allen Newell (1990), a founder of modern AI. According to Newell's theory, all cognitive behavior arises from mechanisms in six major areas: (1) problem-solving, decision-making, routine action; (2) memory, learning, skill; (3) perception, motor behavior; (4) language; (5) motivation, emotion; and (6) imagining, dreaming, daydreaming.

Extending Newell's theory into a parsimonious framework for classifying AI, we synthesize descriptions from the psychology, cognitive science, and computer science literatures to propose seven functional AI capabilities. Three of Newell's cognitive areas map directly to single AI capabilities: Language, Decision-Making, and Learning. In the case of Newell's "perception, motor behavior," it is useful to unpack this group and map it to two distinct AI capabilities: Perception and Motor Control. Engagement (with human users or the environment) represents Newell's motivation/emotion mechanisms, and Creativity (generation of novel content) embodies his imagining/daydreaming processes.

AI Patent Typology

Table 1. A Functional Typology of Artificial Intelligence

This table shows a typology of AI based on the cognitive capabilities described above. The first column lists the seven capabilities in our typology. The second column shows a sampling of relevant definitions from the psychology, cognitive science, and computer science literatures that, together with Newell's unified theory of cognition, characterize the seven AI types. The third column shows examples of real-world applications related to each AI type.

| Functional AI Capability | Definitions from Psychology, Cognitive Science, and Computer Science | Examples of AI Applications |
| --- | --- | --- |
| Perception | Reception of stimuli, extraction and storage of perceptual symbols, and updating of perceptual systems (Barsalou, 1999; Wang, Hahm, and Hammer, 2022) | Computer vision systems; Smart sensors; Autonomous vehicle systems; Diagnostic medical imaging; Augmented reality (AR); Virtual reality (VR) |
| Motor Control | Control, execution, and adjustment of goal-directed actions based on visual, proprioceptive, and kinesthetic feedback (Grush, 2004; Rosenbaum, 2009) | Physical AI; Autonomous ground vehicles (AGVs) and drones; AI-powered humanoid robots; Mobile vehicle swarms; Servo motor control and monitoring; Smart home sensors |
| Language | Phonology, phonotactics, reading, spelling, lexis, morphosyntax, formulaic language, language comprehension, grammaticality, sentence production, and syntax (Carruthers, 2002; Kravchenko, 2007) | Large language models (LLMs); Voice cloning and AI dubbing for video localization; Voiceprint analysis; Real-time voice translation; Live call-center interpretation |
| Decision-Making | Setting plans through a reasoning function and selecting subjectively rational choices based on information about uncertainties and preferences (Pomerol, 1997; Fox, Cooper, and Glasspool, 2013) | Automated credit decisioning; AI resume screening; AI-driven troubleshooting; Predictive maintenance; Autonomous vehicle navigation |
| Learning | Acquiring skills or experience in one situation and utilizing them in another situation (Greeno, Collins, and Resnick, 1996; Silver et al., 2021) | AlphaFold protein structure prediction; Personalized recommendation engines; Machine learning systems for healthcare diagnostics; Fraud and anomaly detection systems |
| Engagement | Selective engagement with the environment based on cognitive motivation and emotion (Simon, 1967; Deci and Ryan, 2000) | Siri; Cortana; Alexa; ChatGPT; Customer service AI; Videogame NPCs; Smart home devices; Self-driving delivery robots |
| Creativity | Actively and consciously pursuing an innovative solution or a new idea with the novelty derived from imagination (Pelaprat and Cole, 2011; Stokes, 2014) | ChatGPT; Adobe Firefly art generator; Google Veo video generator; OpenAI Sora video generator; Robotic painters and sculptors |

Classification Methodology

LLM-Based Classification

We use the open-source generative AI model Qwen2.5-14b-Instruct, developed by Alibaba Cloud and released in September 2024 under an open-source license for broad commercial and research use. The model features 14.8 billion parameters and was trained on approximately 18 trillion tokens (https://huggingface.co/Qwen/Qwen2.5-14B-Instruct). As of October 2024, it was the leading LLM in its size class, outperforming larger proprietary models such as Gemini-1.5-Pro, Claude 3-Opus, and GPT-4 Turbo on engineering benchmarks.

Classification proceeds in two steps:

1. AI identification: For each of the 6,265,985 patent applications screened (2007–2026), the LLM is prompted with a yes/no question combined with the patent abstract, e.g.:

   "Is the following invention directly related to artificial intelligence? Respond with just YES or NO. Abstract: [patent abstract text]"

2. AI type classification: For each of the 541,216 AI-identified patents, seven separate yes/no queries determine whether it belongs to each functional type, e.g., for Perception:

   "Is the following invention directly related to artificial intelligence that has perceptual ability? Respond with just YES or NO. [patent abstract text]"

A patent may be classified into multiple types or none of the seven types.
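The two-step prompting flow can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline code: the prompt wording is taken from the examples above, but the function names (`parse_yes_no`, `classify`), the `ask` callback, and the `fake_llm` stand-in are all hypothetical. Only the Perception phrasing is quoted in this README, so the sketch covers that single type; the wording for the other six types is not assumed here.

```python
from typing import Callable

# Step 1 prompt, as quoted in this README.
AI_PROMPT = (
    "Is the following invention directly related to artificial intelligence? "
    "Respond with just YES or NO. Abstract: {abstract}"
)
# Step 2 prompt template. Only the Perception phrase is documented in this README.
TYPE_PROMPT = (
    "Is the following invention directly related to artificial intelligence "
    "that has {ability}? Respond with just YES or NO. {abstract}"
)
PERCEPTION_ABILITY = "perceptual ability"


def parse_yes_no(response: str) -> int:
    """Map a model reply to 1 if it starts with YES, else 0 (assumed parsing rule)."""
    return int(response.strip().upper().startswith("YES"))


def classify(abstract: str, ask: Callable[[str], str]) -> dict:
    """Run Step 1, then the Step 2 Perception query only for AI-identified patents."""
    result = {"is_ai": parse_yes_no(ask(AI_PROMPT.format(abstract=abstract)))}
    if result["is_ai"]:
        prompt = TYPE_PROMPT.format(ability=PERCEPTION_ABILITY, abstract=abstract)
        result["perception"] = parse_yes_no(ask(prompt))
    return result


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model call: YES only if 'neural' appears."""
    return "YES" if "neural" in prompt.lower() else "NO"


print(classify("A neural network detects objects in camera images.", fake_llm))
print(classify("A hinge for a cabinet door.", fake_llm))
```

In practice `ask` would wrap whatever inference call is used to query the model; passing it in as a callback keeps the prompt logic testable without a model.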

Validation

We validate the LLM classifications using an LLM-as-a-judge approach (Zheng et al., 2023; Gu et al., 2025; Li et al., 2025). A stratified random sample of Qwen2.5-14b-Instruct inferences (500 patent applications per year, 2007-2026) is evaluated for correctness by four well-known commercial models. For AI type classifications, the sample is drawn from among patents previously identified as AI-related, yielding 70,000 type-level inferences (10,000 patents × 7 types). Each inference is re-prompted with an explanation request, and each judge model is asked to determine whether the response is CORRECT or INCORRECT.

Judge models:

| Judge Model | Developer |
| --- | --- |
| Gemini 2.5 Pro | Google DeepMind |
| Claude Opus 4.1 | Anthropic |
| GPT 4.1 | OpenAI |
| DeepSeek R1 0528 | DeepSeek |
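The sample sizes in the validation design follow from simple arithmetic: 500 patents per year over the 20 years 2007-2026 gives 10,000 patents, and 7 type queries per patent gives 70,000 type-level inferences. The sketch below illustrates one way such a per-year stratified draw could be done. It is an assumption-laden illustration: the population is synthetic, the `year` field and the `stratified_sample` helper are hypothetical, and the README does not say which date defines the year stratum or how the draw was seeded.

```python
import random
from collections import defaultdict

N_TYPES = 7  # the seven functional types in Table 1


def stratified_sample(records, per_year, seed=0):
    """Draw up to `per_year` records at random from each year stratum."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec["year"]].append(rec)
    sample = []
    for year in sorted(by_year):
        group = by_year[year]
        sample.extend(rng.sample(group, min(per_year, len(group))))
    return sample


# Synthetic population: 600 made-up AI patents per year, 2007 through 2026.
records = [
    {"id": f"{year}-{i}", "year": year}
    for year in range(2007, 2027)
    for i in range(600)
]

sample = stratified_sample(records, per_year=500)
print(len(sample))            # patents in the validation sample
print(len(sample) * N_TYPES)  # type-level inferences sent to the judges
```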

In addition to quantitative validation, we manually review numerous patent classifications and explanations from Qwen2.5-14b-Instruct. These reviews confirm that the model can differentiate fine shades of meaning in patent abstracts. For example, the Qwen2.5-14b-Instruct model can correctly identify perception-based AI in a patent that does not explicitly use the term "perception" or any of its synonyms. Also, the model can correctly recognize that a patent involving automated alerts is not an instance of engagement AI because it does not involve AI-driven conversational interaction with users. Illustrative examples for each AI type and more detailed validation results are provided in the Internet Appendix of Chen and Wang (2026).

Key Variables

Identifiers and Patent Status

| Variable | Description |
| --- | --- |
| `app_number` | USPTO patent application number, 12-digit numeric (YYYY + 8-digit serial, e.g., `200711618987`) |
| `filing_date` | Application filing date (YYYY-MM-DD) |
| `pub_number` | Pre-grant publication number, 11-digit numeric (YYYY + 7-digit sequence, e.g., `20070104618`) |
| `pub_date` | Pre-grant publication date (YYYY-MM-DD) |
| `patent_id` | USPTO patent grant number, 8-digit zero-padded numeric (e.g., `08877146`); empty if not granted as of March 31, 2026 |
| `grant_date` | Patent grant date (YYYY-MM-DD); empty if not granted as of March 31, 2026 |
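Because the identifiers are fixed-width digit strings, a quick format check can catch values that were mangled on import, for example a `patent_id` that lost its leading zero after being read as an integer. The following is a minimal sketch based only on the formats described in the table above; the helper names `is_valid` and `restore_patent_id` are hypothetical and not part of the dataset or any library.

```python
import re

# Expected formats, taken from the variable descriptions above.
ID_PATTERNS = {
    "app_number": re.compile(r"[0-9]{12}"),  # YYYY + 8-digit serial
    "pub_number": re.compile(r"[0-9]{11}"),  # YYYY + 7-digit sequence
    "patent_id": re.compile(r"[0-9]{8}"),    # zero-padded grant number
}


def is_valid(field: str, value: str) -> bool:
    """Check an identifier against its format; an empty patent_id means not granted."""
    if field == "patent_id" and value == "":
        return True
    return ID_PATTERNS[field].fullmatch(value) is not None


def restore_patent_id(value) -> str:
    """Re-pad a patent_id that lost leading zeros (e.g., after being read as an integer)."""
    return str(value).zfill(8)


print(is_valid("app_number", "200711618987"))
print(is_valid("pub_number", "20070104618"))
print(is_valid("patent_id", "8877146"))
print(restore_patent_id(8877146))
```

Reading the identifier columns as text in the first place avoids the need for re-padding.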

AI Classification

| Variable | Description |
| --- | --- |
| `is_ai` | 1 for all observations (dataset contains only AI patents) |
| `perception` | 1 if Perception AI, 0 otherwise |
| `motor_control` | 1 if Motor Control AI, 0 otherwise |
| `language` | 1 if Language AI, 0 otherwise |
| `decision_making` | 1 if Decision-Making AI, 0 otherwise |
| `learning` | 1 if Learning AI, 0 otherwise |
| `engagement` | 1 if Engagement AI, 0 otherwise |
| `creativity` | 1 if Creativity AI, 0 otherwise |

The seven functional type variables correspond to the seven AI capabilities in Table 1. A single patent may be assigned to more than one functional type.
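Since the seven indicators are not mutually exclusive, a patent's type profile is best read as a set of labels rather than a single category. The sketch below shows one way to recover that set and to tally patents with zero, one, or several types. The column names come from the table above; the rows are made up for illustration, and the helpers `make_row`, `assigned_types`, and `summarize` are hypothetical.

```python
from collections import Counter

TYPE_COLUMNS = [
    "perception", "motor_control", "language", "decision_making",
    "learning", "engagement", "creativity",
]


def make_row(*types):
    """Build an illustrative row with the given types set to 1 and the rest to 0."""
    row = {col: 0 for col in TYPE_COLUMNS}
    for t in types:
        row[t] = 1
    row["is_ai"] = 1
    return row


def assigned_types(row):
    """Return the functional types flagged for one patent."""
    return [col for col in TYPE_COLUMNS if int(row[col]) == 1]


def summarize(rows):
    """Count patents with no type, exactly one type, or multiple types."""
    labels = {0: "none", 1: "single"}
    return Counter(labels.get(len(assigned_types(r)), "multiple") for r in rows)


# Made-up rows, not drawn from the dataset.
rows = [
    make_row("perception", "decision_making"),
    make_row("language"),
    make_row(),
]

for r in rows:
    print(assigned_types(r))
print(dict(summarize(rows)))
```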

Suggested Uses

- Firm AI adoption: Link to USPTO assignee data or PatentsView to construct firm-year AI patenting measures as proxies for firm-level AI investment and adoption.
- Technology trends: Track the rise of AI patenting by functional type across industries and over time.
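As a rough illustration of the firm-year use case, the sketch below counts AI patent applications by firm and filing year. It assumes a separate link table mapping `app_number` to a firm identifier, which is not part of this dataset and would have to be built from USPTO assignee data or PatentsView. The `assignee_link` mapping, the firm labels, the application numbers, and the `firm_year_counts` helper are all hypothetical, and this is not necessarily how the measures in Chen and Wang (2026) are constructed.

```python
from collections import Counter


def firm_year_counts(patents, assignee_link):
    """Count AI patent applications per (firm, filing year)."""
    counts = Counter()
    for row in patents:
        firm = assignee_link.get(row["app_number"])
        if firm is None:
            continue  # applications without a firm link are skipped
        year = int(row["filing_date"][:4])  # filing_date is YYYY-MM-DD
        counts[(firm, year)] += 1
    return counts


# Made-up rows and a made-up link table, for illustration only.
patents = [
    {"app_number": "201500000001", "filing_date": "2015-03-02"},
    {"app_number": "201500000002", "filing_date": "2015-11-20"},
    {"app_number": "201600000003", "filing_date": "2016-01-15"},
    {"app_number": "201600000004", "filing_date": "2016-06-30"},
]
assignee_link = {
    "201500000001": "FIRM_A",
    "201500000002": "FIRM_A",
    "201600000003": "FIRM_B",
}

for key, n in sorted(firm_year_counts(patents, assignee_link).items()):
    print(key, n)
```

The same loop can be restricted to rows where a given type indicator equals 1 to obtain type-specific firm-year counts.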

Data Files

Two formats are provided. Each is available as a separate zip file in the data/ folder of this repository.

| File | Format | Notes |
| --- | --- | --- |
| `ai_patents_2007_2026_xlsx.zip` | Excel | Identifier columns pre-formatted as text; opens correctly in Excel without any special import steps. |
| `ai_patents_2007_2026_dta.zip` | Stata | Stata 14+ format. |

Data Notes

- The dataset contains only AI patents (541,216 observations). Non-AI patent applications are excluded.
- `filing_date` is used as the primary event date; `grant_date` and `patent_id` are non-empty only for applications granted as of March 31, 2026.
- AI classification is performed at the application level using LLMs applied to patent abstracts.
- The dataset covers utility patents from the USPTO Bulk Data Storage System (BDSS).
- A single patent may be assigned to more than one functional type.
- The `is_ai` indicator is 1 for all observations; the seven functional type variables (`perception`, `motor_control`, `language`, `decision_making`, `learning`, `engagement`, `creativity`) are assigned in Step 2 of the classification pipeline.
- `app_number`, `pub_number`, and `patent_id` are plain numeric identifiers with no letter prefixes or suffixes. All years use consistent formats: `app_number` is 12 digits (YYYY + 8-digit serial), `pub_number` is 11 digits (YYYY + 7-digit sequence), and `patent_id` is 8 digits (zero-padded).

License

This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the data for any purpose, provided appropriate credit is given.

How to Cite

If you use this dataset, please cite:

@unpublished{ChenWang2026,
  author  = {Chen, Mark A. and Wang, Joanna},
  title   = {Displacement Versus Augmentation: The Effects of AI on Employment Dynamics and Firm Value},
  note    = {Working paper. Available at SSRN: https://ssrn.com/abstract=5246388},
  year    = {2026},
}

References

Barsalou, L.W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22(4), 577-609.

Carruthers, P. (2002). The cognitive functions of language. Behavioral and Brain Sciences, 25(6), 657-674.

Deci, E.L., and Ryan, R.M. (2000). The "what" and "why" of goal pursuits: Human needs and the self-determination of behavior. Psychological Inquiry, 11(4), 227-268.

Fox, J., Cooper, R., and Glasspool, D. (2013). A canonical theory of dynamic decision-making. Frontiers in Psychology, 4, 150.

Goertzel, B. (2014). Artificial general intelligence: Concept, state of the art, and future prospects. Journal of Artificial General Intelligence, 5(1), 1-48.

Goertzel, B., and Pennachin, C. (2006). Artificial General Intelligence. Springer.

Greeno, J.G., Collins, A.M., and Resnick, L.B. (1996). Cognition and learning. In D.C. Berliner and R.C. Calfee (Eds.), Handbook of Educational Psychology (pp. 15-46). Macmillan.

Griffiths, T.L. (2020). Understanding human intelligence through human limitations. Trends in Cognitive Sciences, 24(11), 873-883.

Grush, R. (2004). The emulation theory of representation: Motor control, imagery, and perception. Behavioral and Brain Sciences, 27(3), 377-396.

Gu, J. et al. (2025). A survey on LLM-as-a-judge. arXiv:2411.15594v5. https://doi.org/10.48550/arXiv.2411.15594

Korteling, J.E., van de Boer-Visschedijk, G.C., Smets, N.J., and Neerincx, M.A. (2021). Human- versus artificial intelligence. Frontiers in Artificial Intelligence, 4, 622364.

Kravchenko, A.V. (2007). Essential properties of language, or, why language is not a code. Language and Communication, 27(2), 140-155.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

Li, D. et al. (2025). From generation to judgment: Opportunities and challenges of LLM-as-a-judge. arXiv:2411.16594v6. https://doi.org/10.48550/arXiv.2411.16594

Newell, A. (1990). Unified Theories of Cognition. Harvard University Press.

Pelaprat, E., and Cole, M. (2011). "Minding the gap": Imagination, creativity and human cognition. Integrative Psychological and Behavioral Science, 45(4), 397-418.

Pomerol, J.C. (1997). Artificial intelligence and human decision making. European Journal of Operational Research, 99(1), 3-25.

Rosenbaum, D.A. (2009). Human Motor Control (2nd ed.). Academic Press.

Silver, D., Singh, S., Precup, D., and Sutton, R.S. (2021). Reward is enough. Artificial Intelligence, 299, 103535.

Simon, H.A. (1967). Motivational and emotional controls of cognition. Psychological Review, 74(1), 29-39.

Simon, H.A. (1977). The New Science of Management Decision (revised ed.). Prentice-Hall.

Sternberg, R.J. (2024). How can we now best understand artificial intelligence? Intelligence, 102, 101815.

Stokes, P.D. (2014). Thinking inside the box: Creativity, constraints, and the coevolution of language and art. Perspectives on Psychological Science, 9(4), 438-451.

Turing, A.M. (1950). Computing machinery and intelligence. Mind, 59(236), 433-460.

Wang, Y., Hahm, S., and Hammer, J. (2022). What is AI literacy? Competencies and design considerations. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22). ACM.

Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36 (NeurIPS 2023).

Contact

For questions about the data, please contact:

Mark A. Chen, Georgia State University (machen@gsu.edu)
Joanna Wang, Peking University HSBC Business School (joannawang@phbs.pku.edu.cn)
