In this semilab, we will see how far we can get in solving a single problem: If we want to determine the age distribution of a city (i.e., the proportion of individuals aged 1, 2, 3, etc.) but can only know the answers to three questions—such as "What is the average age?" or "How many individuals are aged 5?"—which three questions (each resulting in a single value) would allow us to construct the most accurate approximation of the true distribution?
This problem is far from trivial. As we make progress, we will explore concepts central to information theory, including how to compute "distance" between distributions and how to update approximations once we have an answer to a question. We will encounter functions of many variables (hundreds!) and implement code that, in effect, identifies optimal questions from the space of possible questions. These methods are foundational to statistical physics and advanced statistical inference, and are actively used in professional migration estimation.
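To make the two ideas above concrete, here is a minimal sketch on a made-up five-age "city". It assumes Kullback-Leibler divergence as the notion of "distance" (one common choice in information theory; the semilab may settle on something else), and it uses a deliberately simple update rule: once we learn the proportion of one age, we fix that value and spread the remaining probability evenly. The toy numbers and the function names `kl_divergence` and `update_with_proportion` are illustrative assumptions, not course data or course code.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def update_with_proportion(n_bins, index, value):
    """Fix one bin at its known value and spread the rest evenly (assumed rule)."""
    rest = (1 - value) / (n_bins - 1)
    return [value if i == index else rest for i in range(n_bins)]


# Hypothetical toy city with ages 1..5 (made-up proportions).
ages = [1, 2, 3, 4, 5]
true_dist = [0.10, 0.20, 0.40, 0.20, 0.10]

# With no questions answered, a natural starting guess is uniform.
uniform = [1 / len(ages)] * len(ages)

# Question: "What proportion of individuals are aged 3?" (index 2)
answer = true_dist[2]
updated = update_with_proportion(len(ages), 2, answer)

before = kl_divergence(true_dist, uniform)
after = kl_divergence(true_dist, updated)
print(f"distance before the answer: {before:.4f} bits")
print(f"distance after the answer:  {after:.4f} bits")
```

In this toy case the answer moves the approximation closer to the true distribution. Which question shrinks the distance the most, on average across many cities, is exactly the kind of thing we will be searching for.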
In practice, we will work with data from 200 cities: 100 for training and 100 for evaluation. A $100 prize is offered to the first student who proposes an actionable improvement to our collective approach, which we will then implement together and verify on the 100 evaluation cities. The prize exists because a meaningful improvement could have real research implications.
Prerequisites: comfort with exponents, logarithms, multivariable functions, and basic probability theory (discrete distributions, computing the mean of a distribution, etc.). Some knowledge of calculus is recommended, and knowledge of Python (or a similar mathematical programming language) is highly recommended.