In this semilab, we will see how far we can get in solving a single problem: If we want to determine the age distribution of a city (i.e., the proportion of individuals aged 1, 2, 3, etc.) but can only know the answers to three questions—such as “What is the average age?” or “How many individuals are aged 5?”—which three questions (each resulting in a single value) would allow us to construct the most accurate approximation of the true distribution?
This problem is far from trivial. As we make progress, we will explore concepts central to information theory, including how to compute 'distance' between distributions and how to update approximations once we have an answer to a question. We will encounter functions of many variables (hundreds!), and implement code that, in effect, identifies optimal questions from the space of possible questions.
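To make the idea of 'distance' between distributions concrete, here is a minimal sketch using the Kullback-Leibler divergence, one common information-theoretic choice. The description above does not say which measure the semilab will use, so treat KL divergence, the function name `kl_divergence`, and the four-age toy city as illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q), in bits, between two discrete distributions.

    Assumes p and q are equal-length lists of probabilities and that
    q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    # Terms with p[i] == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy city with only four ages (1 to 4); not real data.
ages = [1, 2, 3, 4]
true_dist = [0.4, 0.3, 0.2, 0.1]

# One possible "question": what is the average age?
mean_age = sum(a * p for a, p in zip(ages, true_dist))

# A naive approximation that knows nothing: every age equally likely.
uniform_guess = [0.25, 0.25, 0.25, 0.25]

# How far is the naive approximation from the truth?
distance = kl_divergence(true_dist, uniform_guess)
print(f"mean age = {mean_age:.2f}, distance = {distance:.4f} bits")
```

A better set of questions should let us build an approximation whose distance from the true distribution is smaller than that of the uniform guess. Note that KL divergence is not symmetric, so it is a 'distance' only in a loose sense.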
In practice, we will use data from 200 cities. One hundred of these cities will form our training set, allowing us to “learn” from the data which questions are informative. If we succeed in resolving all of the necessary theoretical and implementation challenges, we will then evaluate our optimized questions on the remaining 100 cities.
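The split into training and testing cities might look like the sketch below. How the 200 cities are actually divided is not specified here, so the random shuffle, the fixed seed, and the integer city IDs are all assumptions made for illustration.

```python
import random

N_CITIES = 200  # total number of cities, as stated above
N_TRAIN = 100   # cities used to learn which questions are informative

# Hypothetical city identifiers 0..199; real data would use actual city records.
city_ids = list(range(N_CITIES))

# A fixed seed keeps the split reproducible from run to run.
rng = random.Random(0)
rng.shuffle(city_ids)

train_ids = city_ids[:N_TRAIN]  # used to choose the questions
test_ids = city_ids[N_TRAIN:]   # held out until the final evaluation
```

Keeping the testing cities untouched until the questions are fixed is what makes the final evaluation a fair measure of how well the questions generalize.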
[Bonus] A $100 prize is offered to the first student who makes an actionable improvement to our collective approach, which we will implement together and verify on the test-set cities. This challenge exists because the proposed solution may have room for improvement, and identifying a better set of questions could have substantial research implications. Because the problem has these real-world applications, the challenge is especially formidable, and any genuine improvement is more than worthy of the reward.
Prerequisites: comfort with exponents, logarithms, multivariable functions, and basic probability theory (discrete distributions, computing the mean of a distribution, etc.). Some knowledge of calculus is recommended, and knowledge of Python (or a similar mathematical programming language) is highly recommended.