Final Project proposal details


  1. Find a dataset that seems interesting.

    • Use at least one dataset that you aren’t familiar with.

      • Using data from a primary source is preferred.

    • Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.

    • Note the JupyterHub limits.

  2. Inspect the data a bit.

  3. Come up with a question that the data is capable of answering and isn’t trivial to answer.

    • If you aren’t sure, ask.

  4. Come up with a hypothesis (that is, your guess at the answer to the question).

  5. Submit the proposal as a new Conversation under the Final Project proposals Discussion, using the format below.
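For step 2, a first look at the data can be as short as the sketch below. It assumes the dataset is a CSV; the inline text, column names, and values are made up for illustration and stand in for a file or URL you would pass to `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up stand-in for a downloaded CSV file (hypothetical columns and values).
# With real data, pass the file path or URL to pd.read_csv instead.
csv_text = """district,fiscal_year,diversion_rate
BK04,2016,14.2
BK04,2019,
MN01,2016,21.5
MN01,2019,23.1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # type pandas inferred for each column
print(df.head())        # first few rows
print(df.isna().sum())  # count of missing values per column
print(df.describe())    # summary statistics for numeric columns
```

Missing values, unexpected types, and odd ranges that turn up here are worth listing as caveats in the proposal.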

If the proposal shows effort and follows the format below, full credit will be given.


  • What dataset(s) are you going to use?

    • Please include link(s).

  • What’s the question you are trying to answer?

    • It should be specific, and objectively answerable through the data available.

  • What columns of the dataset(s) are you going to use to answer that question?

  • If you’re using multiple datasets: What columns are you going to merge/join them on?

  • What’s your hypothesis?
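If you plan to use multiple datasets, it may help to try the join on a few rows before committing to the columns. The sketch below uses two made-up tables (all names and values are hypothetical) and an outer merge with `indicator=True`, which shows the keys that appear in only one table.

```python
import pandas as pd

# Made-up tables for illustration only; the key columns are named
# differently on purpose, which is common across real datasets.
rates = pd.DataFrame({
    "District": ["BK04", "MN01", "QN07"],
    "Diversion Rate": [14.2, 21.5, 18.9],
})
population = pd.DataFrame({
    "district_code": ["BK04", "MN01", "BX02"],
    "population": [112000, 63000, 52000],
})

merged = rates.merge(
    population,
    left_on="District",
    right_on="district_code",
    how="outer",     # keep unmatched rows from both sides
    indicator=True,  # adds a _merge column: both, left_only, or right_only
)

# Rows marked left_only or right_only failed to match on the key.
print(merged["_merge"].value_counts())
```

A large share of unmatched rows usually means the key columns are formatted differently (for example, padding or casing) or cover different sets of entities.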


  • Your question/hypothesis doesn’t need to be something novel; confirming something you read in the news is fine.

  • You won’t be graded on the scientific soundness of your work.

    • That said, please think through and note assumptions/caveats/unknowns of your approach.

  • The sooner you post your proposal, the sooner you’ll get feedback.

Simplified example

  • Dataset: Recycling Diversion and Capture Rates

  • Question: From 2016 to 2019, which community district increased its diversion (recycling) rate the most?

  • Columns: District, Fiscal Year, Diversion Rate-Total (Total Recycling / Total Waste)

  • Hypothesis: Bushwick, because it’s gentrified over that time, and hipsters love to recycle.
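One way the question above might be answered is sketched here. The column names come from the example, but the rows are made up and the district labels are hypothetical. The sketch also assumes that each District value identifies exactly one community district and that an unweighted mean of the rows within a fiscal year is an acceptable yearly rate. Both assumptions would need checking against the real data.

```python
import pandas as pd

RATE = "Diversion Rate-Total (Total Recycling / Total Waste)"

# Made-up rows for illustration only (two rows per district per year).
df = pd.DataFrame({
    "District": ["BK04", "BK04", "BK04", "BK04", "MN01", "MN01", "MN01", "MN01"],
    "Fiscal Year": [2016, 2016, 2019, 2019, 2016, 2016, 2019, 2019],
    RATE: [12.0, 14.0, 18.0, 20.0, 20.0, 22.0, 23.0, 25.0],
})

# Average the rate per district and year, then put the years side by side.
yearly = df.groupby(["District", "Fiscal Year"])[RATE].mean().unstack("Fiscal Year")

# Change in percentage points from 2016 to 2019, largest increase first.
change = (yearly[2019] - yearly[2016]).sort_values(ascending=False)

print(change)
print(change.idxmax())  # district with the largest increase
```

Measuring the change in percentage points rather than as a relative increase is itself a choice that can change the answer, so it belongs with the assumptions noted in the proposal.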

Another example

  • Dataset: data about people’s trash

  • Question: Is recycling better now than before?

  • Hypothesis: probably

What’s wrong with this proposal?

Even the question can bake in assumptions. For example:

What ZIP code has the highest number of food poisoning cases?

assumes a relationship between food-borne illness and geography. What assumptions does your question make?