Digital infrastructures: the Alfred P. Sloan Foundation is funding a project by the Sant'Anna School of Advanced Studies in Pisa to improve the compression and search capabilities of the Software Heritage Archive

The two-year project has two main objectives: designing new algorithmic solutions for data compression, and developing a search engine with unique features to improve accessibility to the Software Heritage Archive. This is the world's largest archive collecting, preserving, and providing free and perpetual access to the source code of millions of software libraries publicly available on the Web. The project is coordinated by Paolo Ferragina, full professor of Computer Science at the Sant'Anna School of Advanced Studies and the University of Pisa, and is funded by the Alfred P. Sloan Foundation, a US philanthropic organization and one of the most important supporters of research in science, technology, engineering, mathematics and economics.
“It is a great satisfaction to see such a challenging and ambitious project funded, and to be able to contribute with two research and software development efforts that highlight the expertise of the algorithmic school in Pisa: data compression, aimed at saving the enormous storage space of the Software Heritage Archive and thus improving the sustainability of its digital infrastructure; and the development of a search engine capable of efficiently and effectively identifying, within the vast amount of software in the archive, those code fragments that are possibly syntactically different from a searched snippet but as computationally equivalent as possible to it. The incredible size and uniqueness of the archive will give our algorithmic solutions a significant impact, given the crucial role software plays today in scientific research and in industrial processes and products,” says Paolo Ferragina.
The Software Heritage Archive, the most important database of source code
The Software Heritage Archive (SWH) was founded in 2016 as a non-profit initiative promoted by INRIA (the French National Institute for Research in Computer Science and Automation), in collaboration with UNESCO, and coordinated by Prof. Roberto Di Cosmo. Currently, the archive holds over 23 billion files, from more than 350 million software projects publicly available on the Web and created by more than 85 million programmers. It is a true intangible heritage of Computer Science and a unique access point to a vast technological knowledge base, necessary to support digital transformation and innovation. In fact, software today plays a crucial role in various scientific and industrial activities.
Why it is crucial to improve the compression and search capabilities of SWH
What does it mean to preserve the source code of publicly available software? It means not only knowing the history of Computer Science and its software products, but also coming into contact with an enormous amount of information that goes far beyond the “source code” and can therefore give a great boost to innovation and new technological frontiers.
Interest in source code is also growing in the field of Artificial Intelligence. Researchers and engineers are building pre-trained models for code generation and improving the performance of the best Large Language Models (LLMs), such as OpenAI's GPT-4 or Google's Gemini, by training them on both natural language and the source code of publicly available software.
The challenge of the project coordinated by the Sant'Anna School of Advanced Studies in Pisa is to find a way to navigate this gigantic virtual library with more efficient and effective tools. In this scenario, the infrastructure represented by SWH is a great opportunity to address a wide range of needs and applications, ranging from the development of AI-supported coding and its “explainability” to the detection and tracking of code snippets that are plagiarized (for the protection of intellectual property) or potentially “harmful” (for cybersecurity).
The statements
“Professor Ferragina and his team at the Sant'Anna School of Advanced Studies in Pisa are bringing cutting-edge techniques in AI, data compression, and digital infrastructure design to upgrade and augment one of the world's most important collections of software source code. Their efforts will help ensure that the more than 23 billion files being safeguarded by the Software Heritage Archive will remain open and accessible to all,” says Dr. Joshua M. Greenberg, Program Director, Alfred P. Sloan Foundation.
“The project will involve young researchers who will thus be able to deepen their knowledge and skills in the field of data compression and the development of new generation search engines. The project will give them the opportunity to collaborate with international research groups, allowing them to engage with globally renowned research centers and companies. I believe that this is a great opportunity for the personal, scientific and professional growth of all our young talents,” Ferragina concludes.
“The archive built by Software Heritage aims to preserve and make easily accessible the technical, scientific and collaborative knowledge that is increasingly found in the source code of software libraries. Over the last twenty years, the amount of original code has doubled every two years on average, creating a significant challenge for the sustainability of the archive. We are therefore extremely pleased that Prof. Ferragina is making his experience and expertise available for this mission,” says Roberto Di Cosmo, scientific director of the Software Heritage Archive.