Patents have traditionally been used in the history of technology as an indication of the thinking process of the inventors, of the challenges or “reverse salients” they faced, or of the social groups influencing the construction of technology. More recently, historians of science and technology have also read them to interpret the way people described technology and how the specific inscriptions of inventions mattered for the justification and operation of the patent system. The digitization of historical patents opens up unique opportunities to assess the feasibility of unsupervised machine learning and natural language processing methods for such explorations. In this project, we analyze over a million historical US patents from 1830 to 1930 using a variety of text-based methods, with two major aims: 1) categorizing patents into coherent technical categories, and 2) identifying discourses of safety, reflexivity, and environmental concern in technological innovation. We use both frequency-based and context-based methods, and find that bag-of-words methods such as TF-IDF and topic modeling do not perform well on semantic categorization due to the linguistic peculiarities of patent specifications. This suggests that a successful approach to categorizing patents would require either contextual semantic representations from Transformer-based methods (e.g., BERT) or static embedding-based methods (e.g., Word2Vec, Doc2Vec), which have relatively low computational costs but are less expressive in some scenarios. We run early experiments using these methods and find that word embedding models are effective in learning semantics from the descriptions of the patents. In this poster, we describe our early results, as well as exploratory data analysis on this massive historical patent dataset.
Jan Frederik Jonas Florian Mai
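
As a rough illustration of the frequency-based side of the comparison in the abstract, the following is a minimal sketch of TF-IDF weighting with cosine similarity, written with the Python standard library only. It is not the pipeline used in the project: the three short texts are made up for illustration, and the function names `tfidf_vectors` and `cosine` are hypothetical. It shows one general limitation of bag-of-words representations, namely that two texts sharing only formulaic wording can score as more similar than two texts on related technology with no words in common.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF times smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up toy texts (assumption: NOT taken from the patent dataset).
docs = [
    "improvement in steam engine valve",
    "improvement in sewing machine needle",
    "locomotive boiler piston apparatus",
]

vecs = tfidf_vectors(docs)

# Texts 0 and 1 share only the formulaic words "improvement" and "in".
sim_shared_boilerplate = cosine(vecs[0], vecs[1])
# Texts 0 and 2 concern related technology but share no words at all.
sim_related_topic = cosine(vecs[0], vecs[2])

print(sim_shared_boilerplate > sim_related_topic)
```

Embedding-based methods such as Word2Vec or BERT aim to address this kind of mismatch by representing words through their contexts rather than their surface form, which is the direction the abstract points to.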