This is an edited version of the project guidelines used for the course.
If you wish to pursue an independent data science project, this outline may be a useful guide.
The Final Project will give you the chance to explore a topic of your choice and to expand your analytical skills. By working with real data of your choosing you can examine questions of particular interest to you.
The broad objectives for the project are to:
The basic project steps (broken down in more detail below):
The project proposal includes the following sections:
RESEARCH QUESTION: What is your research question? Include the specific question you’re setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)
BACKGROUND & PRIOR WORK: This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.
Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.
References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)
HYPOTHESIS: What is your main hypothesis, and what do you predict the answer to your question will be? Briefly explain your thinking. (2-3 sentences)
DATA: Here, you are to think about and describe the ideal dataset (or datasets) you would need to answer this question:
Note: For the project proposal, you do NOT have to find the actual dataset(s) needed for your project. For the first checkpoint and onward, you will.
ETHICS & PRIVACY: Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out Deon’s Ethics Checklist. In particular:
The proposal should be written clearly and at a level understandable by a typical undergraduate student.
This is a short but detailed proposal meant to give us time to assess and critique your Final Project idea (further described below), in order to give you time to improve upon it throughout the quarter.
Remember to proofread your Project Proposal. Do not use overly flowery and/or vague language.
Time to put it all together! The main products of the final project are 1) a report submitted as a single Jupyter Notebook on GitHub and 2) a 3-5 minute video communicating your group project.
This single notebook should include all the code you used for all components of the project (cleaning, visualization, analysis). Because we won’t be running the code in your notebook, it is important to make sure your notebook as submitted to GitHub has the code evaluated and outputs present (e.g., plots) so that we can read the project as is.
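One way to check this before submitting is to scan the saved notebook file for code cells that were never run. The following is a minimal sketch, not a course requirement: it assumes the standard Jupyter notebook JSON layout (nbformat 4, with "cells", "cell_type", "source", and "execution_count" keys), and the function name and the sample notebook are hypothetical.

```python
import json


def unexecuted_code_cells(notebook):
    """Return the indices of non-empty code cells that were never executed.

    `notebook` is the dictionary obtained by loading an .ipynb file as JSON.
    Assumes the nbformat 4 layout, where an unexecuted code cell has
    "execution_count" set to null (None in Python).
    """
    unexecuted = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # "source" may be stored as one string or as a list of lines.
        source = "".join(cell.get("source", []))
        if source.strip() and cell.get("execution_count") is None:
            unexecuted.append(index)
    return unexecuted


# Hypothetical in-memory notebook standing in for a real .ipynb file.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["Overview text"]},
        {"cell_type": "code", "source": ["x = 1"], "execution_count": 1, "outputs": []},
        {"cell_type": "code", "source": ["print(x)"], "execution_count": None, "outputs": []},
    ]
}

result = unexecuted_code_cells(sample_notebook)
print(result)
```

To run the same check on a real file, load it with json.load on the opened notebook and pass the resulting dictionary to the function. An empty list suggests every code cell was run before saving; it does not guarantee that the cells were run in order.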
Each of the following sections corresponds to a section in the file FinalProject_groupXXX.ipynb (template is in your group’s GitHub repo).
For sections included in your proposal and previous checkpoints, you can copy and paste into your final project, but be sure to edit these sections with feedback you received on your proposal or additional information you learned throughout the project. This report should read clearly from start to finish, explaining what you did, why you did it, and what you learned. This should be a concise and well-written report.
PERMISSIONS: Specify whether you want your group project to be made publicly available. Place an X in the square brackets where appropriate.
OVERVIEW: Include 3-4 sentences summarizing your group’s project and results.
NAMES: See proposal specifications.
RESEARCH QUESTION: See proposal specifications.
BACKGROUND & PRIOR WORK: See proposal specifications.
HYPOTHESIS: See proposal specifications.
DATASET(S): Same as Checkpoint #1.
SETUP: See Checkpoint #1.
DATA CLEANING: See Checkpoint #1.
DATA ANALYSIS & RESULTS: This section should include markdown text and code walking us through the following:
ETHICS & PRIVACY: See proposal specifications. Be sure to update this section with what you actually did to take the ethical considerations into account in your analysis.
CONCLUSION & DISCUSSION: Discuss your project. Summarize your data and question. Briefly describe your analysis. Summarize your results and conclusions. Be sure to mention any limitations of your project. Discuss the impact of this work on society. (2-3 paragraphs)
See Prof. Voytek’s write-up of excellent class projects from the Spring 2017 instance of COGS 108 here, all of which received perfect scores.
Additionally, previous projects can be viewed from when this course ran in Spring 2017, Winter 2018, Spring 2019, Fall 2019, Winter 2020, Spring 2020, Fall 2020, or Winter 2021. Note, first, that these projects are of variable quality and, second, that if you get inspiration or code from previous projects, you must note this in your project and give attribution to the former groups’ work.
The purpose of this project is to find a real-world problem and dataset (or likely, datasets!) that can be analyzed with the techniques learned in class and those you learn on your own. It is imperative that by doing so you believe extra information will be gained — that you believe you can discover something new!
You must use at least one dataset containing approximately 1000 observations or more, unless your data are smaller but you feel they are sufficient. You are welcome (and in fact encouraged) to find multiple datasets!
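A quick size check can confirm whether a candidate dataset meets this guideline. The sketch below is only an illustration: it assumes the data are in CSV form with a single header row, uses Python's standard csv module, and runs on made-up in-memory data. The names MIN_OBSERVATIONS and count_observations are hypothetical.

```python
import csv
import io

# Threshold taken from the guideline above (approximately 1000 observations).
MIN_OBSERVATIONS = 1000


def count_observations(csv_file):
    """Count the data rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    # Blank lines are parsed as empty lists and are not counted.
    return sum(1 for row in reader if row)


# Made-up in-memory data standing in for a real file: 1200 rows plus a header.
sample_csv = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1200))
)

n_rows = count_observations(sample_csv)
meets_guideline = n_rows >= MIN_OBSERVATIONS
print(n_rows, meets_guideline)
```

For a file on disk, open it with newline="" as the csv module documentation recommends and pass the file object to the function.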
The best datasets are the ones that can help you answer your question of interest.
Your question could be just for fun: Using text mining of song lyric websites to identify the most commonly used phrases and sentiments by decade.
Your question could be scientific: Scraping data from animal taxonomies and Wikipedia to figure out whether larger animals are more likely to be carnivores.
Or, ideally, your question can be aimed at civic or social good: Using mapping, transit, and car accident data to identify which parts of San Diego are most in need of dedicated bike lanes.
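To make the first of these examples concrete, identifying commonly used phrases can start with simple n-gram counting. The sketch below is one possible starting point, not a prescribed method: it counts two-word phrases using only the standard library, the function name top_phrases is hypothetical, and the lyric lines are made up for illustration. A real project would also need to collect the lyrics, group them by decade, and handle sentiment separately.

```python
import re
from collections import Counter


def top_phrases(texts, n_words=2, top_k=3):
    """Return the top_k most frequent n_words-long phrases across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep only runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        # Slide a window of n_words over the word list to form phrases.
        counts.update(
            " ".join(words[i:i + n_words])
            for i in range(len(words) - n_words + 1)
        )
    return counts.most_common(top_k)


# Made-up lines standing in for scraped lyrics from a single decade.
sample_lyrics = ["la la la", "La la!"]

most_common = top_phrases(sample_lyrics, n_words=2, top_k=1)
print(most_common)
```

Running the same function once per decade and comparing the results would give a first look at how common phrases change over time.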
To help you find datasets, we have collected a list of websites that have a considerable number of open source data sets and included them at the end of this document.
Here is a list of potential locations to find datasets and problems to investigate. If you have another dataset or search location, that is great!
Natural Language Processing