A Bibliography Database for Machine Learning

2 minute read

Published: December 16, 2024

Getting the correct bibtex entry for a conference paper (e.g. published at NeurIPS, ICML, ICLR) is annoyingly hard: if you search for the title, you will often find a link to arxiv or to the pdf file, but not to the conference website that contains the bibtex.

To simplify this, I created a single bib file with all published papers for NeurIPS, ICML and ICLR. The files are available on Github.

Some remarks and caveats:

The bibtex entries are taken from the official proceedings for NeurIPS and ICML, and from DBLP for ICLR.
The ICML proceedings on https://proceedings.mlr.press/ start in 2013, even though there have been editions of ICML since 1980. Papers from the editions before 2013 are not included in the database.
Scraping the bibtex entries and merging the individual years was done in a semi-automated way, hence there may be bugs/errors/missing entries. Please let me know if you encounter one of these.
I will update the database over time (and maybe add other conference proceedings like COLT and AISTATS).

Some data insights

Having a database of papers published at the major ML conferences, we can also do some simple data analysis. For this, I used the library bibtexparser to create a csv file with title, authors and year of each paper. (Note that the ICML bib even contains the abstracts, which would allow for a more detailled analysis.)

First, let’s plot the number of papers per venue each year. Unsurprisingly, the ML paper factory is growing exponentially fast.

Fig. 1: Number of accepted papers per year and conference.

Not only do we have more papers, but also the average number of authors per paper increased from (approximately) three to five within 2010-2024.

Fig. 2: Number of authors per paper (computed across all conferences).

To wrap up, a lazy approach to finding historical trends in ML topics is to count papers that have specific keywords in their title.

Maybe because I started to do research in the machine learning field rather late, I sometimes find it quite hard to understand the historical context of certain topics; below is a selection of well-known keywords over time, that might serve as a proxy.

Fig. 3: Historical timelines: percentage of papers (per conference and year) with the paper title containing certain keywords.

Share on

Twitter Facebook LinkedIn

Fabian Schaipp

A Bibliography Database for Machine Learning

Some data insights

Share on

You May Also Enjoy

How to jointly tune learning rate and weight decay for AdamW

Optimization Nuggets: Stochastic Polyak Step-size, Part 2

Decay No More

Solve it all and solve it fast: using numba for optimization in Python