Final Project proposal details
Process
Find a dataset that seems interesting.
Use at least one dataset that you aren’t familiar with.
Using data from a primary source is preferred.
Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
Note the JupyterHub limits.
Inspect the data a bit.
Come up with a question that the data is capable of answering and isn’t trivial to answer.
If you aren’t sure, ask.
Come up with a hypothesis (that is, a guess at the answer to the question).
Submit the proposal to a new Conversation under the Final Project proposals Discussion, using the format below.
If the proposal shows effort and follows the format below, full credit will be given.
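As a rough illustration of the "find a dataset" and "inspect the data" steps above, here is a minimal sketch using pandas. The inline CSV text and its column names (`district`, `fiscal_year`, `diversion_rate`) are made up for the example; with a real dataset you would pass a file path or URL to `pd.read_csv()` (or use `pd.read_json()` for JSON).

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file; the columns and values
# are invented for illustration.
raw_csv = io.StringIO(
    "district,fiscal_year,diversion_rate\n"
    "A,2016,14.2\n"
    "A,2019,15.0\n"
    "B,2016,\n"
    "B,2019,18.5\n"
)

df = pd.read_csv(raw_csv)

# Size, column types, and missing values give a quick sense of which
# questions the data can support.
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()

print(df.head())
print(df.dtypes)
print(missing_per_column)
```

Checking for missing values early helps when deciding whether the data can actually answer the question you have in mind.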
Format
What dataset(s) are you going to use?
Please include link(s).
What’s the question you are trying to answer?
It should be specific, and objectively answerable through the data available.
What columns of the dataset(s) are you going to use to answer that?
If you’re using multiple datasets: What columns are you going to merge/join them on?
What’s your hypothesis?
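If you plan to merge multiple datasets, it can help to check how well the join columns line up before committing to them. The sketch below assumes two hypothetical tables that share a `zip_code` column; the table names, columns, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical datasets sharing a "zip_code" column. The codes are kept as
# strings so that any leading zeros are preserved.
inspections = pd.DataFrame(
    {"zip_code": ["10001", "10002", "10003"], "violations": [12, 7, 3]}
)
population = pd.DataFrame(
    {"zip_code": ["10001", "10002", "10004"], "residents": [21000, 76000, 3000]}
)

# An outer merge with indicator=True adds a "_merge" column showing which
# keys matched and which appear in only one dataset.
merged = inspections.merge(population, on="zip_code", how="outer", indicator=True)
match_counts = merged["_merge"].value_counts()

print(merged)
print(match_counts)
```

A large number of `left_only` or `right_only` rows suggests the join columns may not mean the same thing in both datasets, which is worth noting as a caveat in the proposal.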
Tips
Your question/hypothesis doesn’t need to be something novel; confirming something you read in the news is fine.
You won’t be graded on the scientific soundness of your work.
That said, please think through and note assumptions/caveats/unknowns of your approach.
The sooner you post your proposal, the sooner you’ll get feedback.
Simplified example
Question: From 2016 to 2019, which community district increased its diversion (recycling) rate the most?
Columns: District, Fiscal Year, Diversion Rate-Total (Total Recycling / Total Waste)
Hypothesis: Bushwick, because it’s gentrified over that time, and hipsters love to recycle.
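One possible way to answer a question like this with pandas is sketched below. It uses the column names listed above, but the district labels and rate values are made up for illustration and are not real figures. It also assumes one row per district per fiscal year.

```python
import pandas as pd

RATE = "Diversion Rate-Total (Total Recycling / Total Waste)"

# Made-up values for illustration only; not real figures.
df = pd.DataFrame(
    {
        "District": ["A", "A", "B", "B", "C", "C"],
        "Fiscal Year": [2016, 2019, 2016, 2019, 2016, 2019],
        RATE: [10.0, 12.5, 15.0, 21.0, 20.0, 19.0],
    }
)

# Reshape to one row per district and one column per fiscal year.
by_year = df.pivot(index="District", columns="Fiscal Year", values=RATE)

# Change in diversion rate from 2016 to 2019, in percentage points.
change = by_year[2019] - by_year[2016]
top_district = change.idxmax()

print(change.sort_values(ascending=False))
print(top_district)
```

Note that measuring the increase in percentage points is itself an assumption; a relative (percent) change could rank the districts differently.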
Another example
Dataset: data about people’s trash
Question: Is recycling better now than before?
Hypothesis: probably
What’s wrong with this proposal?
Even the question can bake in assumptions. For example, “What ZIP code has the highest number of food poisoning cases?” assumes a relationship between food-borne illness and geography. What assumptions does your question make?