Introduction to Data Science

Fall 2020

University of Edinburgh

Learn to explore, visualize, and analyze data to understand natural phenomena, investigate patterns, model outcomes, and make predictions, and do so in a reproducible and shareable manner. Gain experience in data collection, wrangling, and visualization, exploratory data analysis, predictive modeling, and effective communication of results while working on problems and case studies inspired by and based on real-world questions. The course will focus on the R statistical computing language. No statistical or computing background is necessary.

This is the website for the Introduction to Data Science course offered in 2020. Click here if you’re looking for the 2019 version of the course.

Timetable

Videos released on Mondays, code alongs on Thursdays, workshops on Fridays. Access official course information here.

Monday

Videos released

Tuesday

Student hours - TBA

Wednesday

Student hours - TBA

Thursday

Code along sessions - 11:10-12:00

Friday

Workshops - 10:00-10:50, 11:10-12:00, or 12:10-13:00

Things are going to look a little different than expected this year, but we got you!

Course Schedule

Overview

This is a tentative course schedule. The flow of topics might change slightly depending on how quickly / slowly it feels right to …

Week 1 - Welcome to IDS

Get acquainted with the course, the technology, the workflow, and the skills you will acquire throughout the semester.

Week 2 - Visualizing data

Data visualization and interpretation of graphical information.

Week 3 - Wrangling and tidying data

Data wrangling, joining, and tidying.

Week 4 - Importing and recoding data

Importing data, data types and classes, recoding.

Week 5 - Communicating data science results effectively

Tips for effective data visualization, communication of results, and collaboration.

Week 6 - Web scraping and programming

Harvesting data from the web, writing functions, and iteration.

Week 7 - Data science ethics

Misrepresentation of findings, data privacy, and algorithmic bias.

Week 8 - Modelling data

Linear models for predicting numerical data from single and multiple variables.

Week 9 - Classification and model building

Logistic regression for predicting categorical data and model building.

Week 10 - Model validation and uncertainty quantification

Evaluating models with cross validation and uncertainty quantification with bootstrap confidence intervals.

Week 11 - Looking beyond IDS

Topics requested by you!

Syllabus

Course components

Weekly structure

  • Monday: Lecture videos for the week released
  • Tuesday and Wednesday: Student hours with course organisers on Zoom
  • Thursday: Code along sessions on Zoom
  • Friday: Workshops on Zoom
  • Sunday: Weekly quiz due

Pre-recorded videos

These are released on Mondays and consist of course content and weekly “State of the IDS” videos (recap of previous week, what’s coming in the new week, frequently asked questions). You’re expected to watch (and learn from) them before Friday’s workshops. To keep your learning active, you’ll have the chance to work on application exercises that you can complete on RStudio Cloud.

Student hours

Course organisers will each hold student hours on Tuesdays (with David, 10-11am) and Wednesdays (with Mine, 12-1pm) starting in Week 2. These will be on Zoom and will not be recorded. It’s a great time to come and get real time answers to your questions or just say hi!

Code along sessions

These will be held on Thursdays on Zoom, and they will be recorded and posted. These live coding sessions are yet another opportunity to ask questions. They’re also a good time to learn about workflow and debugging techniques since there will inevitably be lots of wrinkles to iron out during unstructured live coding.

Workshops / labs

These will be held on Fridays on Zoom, and they will not be recorded. We expect you to attend your assigned workshop session every week. During these sessions you will work in teams on computing lab exercises; you will finish the exercises after the workshop and turn in your lab reports by Tuesday 4pm UK time. Labs will be submitted as GitHub repositories, and the lab with the lowest score for each student will be dropped. Tip: Set up weekly team meetings between Friday and Tuesday.

A frequently asked question: “What happens if I can’t make it to a workshop one week because I’m sick or have another obligation at that time?” Answers below:

  • First, if you have another obligation every week at the time of your workshop, you should change into another group (workshops are held at three separate hours on Friday; pick a time slot that works for your schedule). If you can’t make any of the workshop times, you should drop this class.
  • Chances are you asked this question because you’re only missing one or two workshops throughout the semester:
    • If you’re missing a workshop day due to short-term illness or some other reason, you should communicate this with your team and attend a team meeting before the deadline for the assignment to contribute to the teamwork. If you have made 0 commits towards a lab assignment, you will receive a 0 for that assignment, so you need to participate both for being a team player and also for your own individual score.
    • If you’re unable to contribute to a lab assignment because of an illness taking you away from school work for an extended period of time, you should let your team know that you won’t be able to contribute to that lab and either make this your dropped lab score or apply for special circumstances.

Overall, these policies are put in place to ensure communication between team members, respect for each other’s time, and also to give you a safety net in the case of illness or other reasons that keep you away from attending class once or twice.

Homework assignments

Beyond the in-class activities, you will be assigned larger fortnightly programming tasks throughout the semester. These assignments will be completed individually and submitted as GitHub repositories. The homework assignment with the lowest score for each student will be dropped. Tip: Do the (optional) R tutorials, which will introduce you to the datasets and topics covered in the homework assignments.

Quizzes

These weekly multiple choice quizzes will help you evaluate your learning continuously. The quiz with the lowest score for each student will be dropped. Tip: Don’t leave it till the last minute!

Final project

You will be responsible for the completion of an open ended final project for this course, the goal of which is to tackle an “interesting” problem using the tools and techniques covered in this class. Additional details on the project will be provided as the course progresses. Each team’s work will also be shared with and evaluated by at least one other team at an earlier stage in order to provide feedback in the form of code review. You must complete the final project and be in class to present it in order to pass this course. Tip: Stick to optional interim deadlines.

Teams

For all of the team based assignments in this class you will be randomly assigned to teams of 3 or 4 students - these teams will change after each assignment. You will work in these teams during class and on the homework assignment. For team based assignments, all team members are expected to contribute equally to the completion of each assignment and you will be asked to evaluate your team members after each assignment is due. Failure to adequately contribute to an assignment will result in a penalty to your mark relative to the team’s overall mark.

Students are expected to make use of the provided GitHub repository as their central collaborative platform. Commits to this repository will be used as a metric (one of several) of each team member’s relative contribution for each homework.

Grading

Scheme

Your overall course grade will be made up of the following components, with these weights:

  • Homework: 40%
  • Lab: 20%
  • Project: 30%
  • Quiz: 10%
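
As a rough illustration of how these weights combine, here is a minimal sketch in Python (used purely for the arithmetic; the course itself works in R). The function names and sample marks are hypothetical. The sketch assumes every component is a mark out of 100 and that the lowest homework, lab, and quiz scores are dropped before averaging, as the syllabus describes. It ignores extra credit and any moderation or scaling.

```python
# Component weights from the grading scheme.
WEIGHTS = {"homework": 0.40, "lab": 0.20, "project": 0.30, "quiz": 0.10}


def average_dropping_lowest(scores):
    """Return the mean of the scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def course_grade(homework, labs, quizzes, project):
    """Combine component marks (each out of 100) into an overall mark."""
    components = {
        "homework": average_dropping_lowest(homework),
        "lab": average_dropping_lowest(labs),
        "quiz": average_dropping_lowest(quizzes),
        "project": project,  # single mark, nothing to drop
    }
    return sum(WEIGHTS[name] * mark for name, mark in components.items())


# Hypothetical marks: one missed homework (0) is absorbed by the drop rule.
example = course_grade(
    homework=[80, 70, 90, 0],
    labs=[60, 80, 100],
    quizzes=[50, 100, 100],
    project=75,
)
print(round(example, 1))
```

In this made-up example the missed homework does not affect the result, because it is the dropped score.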

Moderation and scaling

Please review the official University and School policies here.

Policies

Zoom expectations

  • When in a large session you should:
    • have your microphone muted by default
    • use the raise your hand feature or type in the chat for questions and comments
  • In the small team sessions you should:
    • have your camera turned on as much as possible
    • engage with your team mates via voice and text chat
    • take turns sharing your screen when necessary

Collaboration policy

Only work that is clearly assigned as team work should be completed collaboratively. Individual assignments must be completed individually; you may not directly share or discuss answers / code with anyone other than the instructors and tutors. You are welcome to discuss the problems in general and ask for advice.

Sharing / reusing code

I am well aware that a huge volume of code is available on the web to solve any number of problems. Unless I explicitly tell you not to use something, the course’s policy is that you may make use of any online resources (e.g. StackOverflow), but you must explicitly cite where you obtained any code you directly use (or use as inspiration). Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism. On individual assignments you may not directly share code with another student in this class, and on team assignments you may not directly share code with another team in this class. You are welcome to discuss the problems together and ask for advice, but you may not send or make use of code from another team.

Academic integrity

The University takes academic misconduct very seriously and is committed to ensuring that so far as possible it is detected and dealt with appropriately. Find out more about the University’s official policies around academic misconduct here.

Cheating or plagiarising on assignments, lying about an illness or absence, and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the University policies, and will not be tolerated. Such incidents will result in a 0 grade for all parties involved. Additionally, there may be penalties to your final class grade along with being reported to the School Academic Misconduct Office.

Late work, extensions, and special circumstances

All work is due on the stated due date. Due dates are there to help guide your pace through the course and they also allow us (the course staff) to return marks and feedback to you in a timely manner. However, sometimes life gets in the way and you might not be able to turn in your work on time. Note, first of all, that we drop the lowest score of quizzes, labs, and homework assignments. So if you miss one assignment, this can be your dropped score.

  • Late work policy: Some assignments cannot be turned in late and some assignments can be turned in past the deadline with a late penalty:
    • Quizzes: No late work accepted
    • Labs: No late work accepted
    • Homework assignments: Late work accepted up to 4 days past the deadline (i.e. Monday after the deadline, 16:00 UK time), with 5% penalty for each day
    • Project proposal: Late work accepted up to 4 days past the deadline (i.e. Monday after the deadline, 16:00 UK time), with 5% penalty for each day
    • Project re-proposal: No late work accepted (this is an optional assignment)
    • Project: No late work accepted for the presentation, late work accepted for the write-up up to 7 days after the deadline, with 5% penalty for each day
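
The late penalty arithmetic can be sketched as follows, in Python purely for illustration. The function name and dates are hypothetical. The sketch assumes that any part of a day counts as a full day late, that the 5% is taken from the maximum obtainable mark (as stated for the project write-up), and that work submitted after the late window scores 0. The official University and School policies take precedence over this sketch.

```python
import math
from datetime import datetime


def late_penalised_mark(raw_mark, deadline, submitted, max_days_late,
                        max_mark=100.0, penalty_per_day=0.05):
    """Apply a per-day late penalty; return 0 once the late window has closed."""
    if submitted <= deadline:
        return raw_mark  # on time, no penalty
    seconds_late = (submitted - deadline).total_seconds()
    days_late = math.ceil(seconds_late / 86400)  # assumption: round up to whole days
    if days_late > max_days_late:
        return 0.0  # assumption: work past the window is treated as not submitted
    penalty = penalty_per_day * max_mark * days_late
    return max(0.0, raw_mark - penalty)


# Hypothetical homework: submitted 1 day 17 hours late, which counts as 2 days.
deadline = datetime(2020, 10, 15, 16, 0)
submitted = datetime(2020, 10, 17, 9, 0)
example = late_penalised_mark(80.0, deadline, submitted, max_days_late=4)
print(example)
```

For the project write-up the same function would be called with `max_days_late=7`.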

Please review the official University and School policies here. If you intend to submit work late for the project, you must notify the course organizer twice: before the original deadline, and again as soon as the completed work is submitted on GitHub.

  • Extensions: The University has an extension policy whereby you can request an extension for any assignments where late work is accepted. If your extension request is approved, you can turn in the assignment late and not incur the late penalty. As outlined above you can request an extension for homework assignments, the project proposal, and the project write up. Extensions are not granted for quizzes, labs, project re-proposal, and project presentation under any circumstances. To request an extension you must visit the Extensions and Special Circumstances website and Apply for an extension there. A decision will be made within 2 days and the team will notify the student of their decision. Note that decisions are made by an external committee, not the course teaching staff, so requests for extensions must go through this form and not through course organisers and tutors.

  • Special circumstances: You can think of special circumstances as one level above an extension request, where there is a documented reason why you’re unable to complete any assignment in the course. Special circumstances decisions are made at the end of the semester by an external committee. To request a special circumstances waiver you must visit the Extensions and Special Circumstances website and Apply for special circumstances there.

If you’re not sure whether your personal circumstance should be filed under an extension or special circumstances, we recommend you reach out to your Personal Tutor and/or Student Support Officers (studentsupport@maths.ed.ac.uk).

Regrade requests

Regrade requests must be made within one week of when the assignment is returned, and must be typed up and submitted via email to David Elliott (david.elliott@ed.ac.uk). These will be honoured if points were tallied incorrectly, or if you feel your answer is correct but it was marked wrong. No regrade will be made to alter the number of points deducted for a mistake. There will be no grade changes after the final project presentations.

Diversity & inclusion

It is my intent that students from all diverse backgrounds and perspectives be well-served by this course, that students’ learning needs be addressed both in and out of class, and that the diversity that the students bring to this class be viewed as a resource, strength and benefit. It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Your suggestions are encouraged and appreciated. Please let me know ways to improve the effectiveness of the course for you personally, or for other students or student groups.

Furthermore, I would like to create a learning environment for my students that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture). To help accomplish this:

  • If you have a name that differs from those that appear in your official University of Edinburgh records, please let me know!
  • Please let me know your preferred pronouns.
  • If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. If you prefer to speak with someone outside of the course, your personal tutor is an excellent resource.
  • I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it.

Learning during a pandemic

I want to make sure that you learn everything you were hoping to learn from this class. If this requires flexibility, please don’t hesitate to ask.

  • You never owe me personal information about your health (mental or physical) but you’re always welcome to talk to me. If I can’t help, I likely know someone who can.

  • I want you to learn lots of things from this class, but I primarily want you to stay healthy, balanced, and grounded during this crisis.

Note: If you’ve read this far in the syllabus, email me a puppy or kitten picture! Could be yours, or one you found online.

Help

Most of you will need help at some point and we want to make sure you can identify when that is without getting too frustrated and feel comfortable seeking help.

  • Piazza: The best way to get answers to any questions on course content, technology, logistics, or policies is to post your question on Piazza. You are encouraged to answer each other’s questions here as well. When you post a question on Piazza, you can choose to do so anonymously to your classmates. Note that the course instructors and tutors can always see your name, and this is for a good reason! We want to be able to identify students who might be struggling so that we can extend help. Similarly, we want to know who you are if you’re providing great answers to others’ questions! Piazza will be available soon!
  • Teams: We will use Teams for synchronous course communication. Feel free to post questions you need short and quick answers to there. Note that while Teams is great for quick clarifications, it’s not a great venue for lengthy questions that require extended discussion.
  • Student hours: Course organisers will hold student hours on Tuesdays (David, 10-11am) and Wednesdays (Mine, 12-1pm) on Zoom. Please feel free to call in with any questions, or just to say hi! I am also available to meet by appointment; please use the link below to request one.
  • Email: Please refrain from emailing any course content questions (those should go on Piazza or Teams), and only use email for questions about personal matters that may not be appropriate for the public course forum (e.g. illness, concessions).
  • For more general support and advice, please make use of the following resources:

Make good use of this support system, it is there for you! And if you’re not sure where to go for help, just ask any academic or administrative member of the course team.

Extra credit

There will be four extra credit opportunities in this course.

The first two are related to the Virtual Exchange we’re doing with University of Florida. More on that soon…

The third one is reporting typos / errata for the course. You can do so by filling out this form. You can submit as many as you like.

The fourth, and final, one is about my cats. Undoubtedly they’ll make appearances in videos, during code-alongs and workshops, and undoubtedly, I’ll call out their names saying “___, don’t do that!”. If you get the names of all four of them right, you get the extra credit. Doing so requires engaging in all course content closely.

Each extra credit opportunity is worth one percentage point on your homework average.

Project

Showcase your inner data scientist

TL;DR

Pick a dataset, any dataset…

…and do something with it. That is your final project in a nutshell. More details below.

May be too long, but please do read

The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.

The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here.

The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (tidyverse) is required. You do not need to visualize all of the data at once. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R.

Data

In order for you to have the greatest chance of success with this project, it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and between 10 and 20 variables (exceptions can be made but you must speak with me first). The variables in the data should include categorical variables, discrete numerical variables, and continuous numerical variables.

If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into R, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.

Note on reusing datasets from class: Do not reuse datasets used in examples, homework assignments, or labs in the class.

Below is a list of data repositories that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:

Deliverables

  1. Proposal - due Tuesday, 27 Oct, at 16:00
  2. Presentation - due Friday, 4 Dec, at 09:00 as pre-recorded video or live presentation in workshop
  3. Write-up - due Friday, 4 Dec, at 09:00

Proposal

  • Section 1 - Introduction: The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.).

  • Section 2 - Data: Place your data in the /data folder, and add dimensions and codebook to the README in that folder. Then print out the output of glimpse() or skim() of your data frame.

  • Section 3 - Data analysis plan:

    • The outcome (response, Y) and predictor (explanatory, X) variables you will use to answer your question.
    • The comparison groups you will use, if applicable.
    • Very preliminary exploratory data analysis, including some summary statistics and visualizations, along with some explanation on how they help you learn more about your data. (You can add to these later as you work on your project.)
    • The statistical method(s) that you believe will be useful in answering your question(s). (You can update these later as you work on your project.)
    • What results from these specific statistical methods are needed to support your hypothesized answer?

Each section should be no more than 1 page (excluding figures). You can check a print preview to confirm length.

Presentation

5 minutes maximum, and each team member should say something substantial. You can either present live during your workshop or pre-record and submit your video to be played during the workshop.

Prepare a slide deck using the template in your repo. This template uses a package called xaringan, and allows you to make presentation slides using R Markdown syntax. There isn’t a limit to how many slides you can use, just a time limit (5 minutes total). Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”), instead it should convey what choices you made, and why, and what you found.

Before you finalize your presentation, make sure your chunks are turned off with echo = FALSE.

Presentation schedule: Presentations will take place during the last workshop of the semester. You can choose to do your presentation live or pre-record it. During your workshop you will watch presentations from other teams in your workshop and provide feedback in the form of peer evaluations. The presentation line-up will be generated randomly.

Write-up

Along with your presentation slides, we want you to provide a brief summary of your project in the README of your repository.

This write-up, which you can also think of as a summary of your project, should provide information on the dataset you’re using, your research question(s), your methodology, and your findings.

Repo organization

Your project repository should contain the following folders and files:

  • presentation.Rmd + presentation.html: Your presentation slides
  • README.md: Your write-up
  • /data/*: Your dataset in csv or RDS format, in the /data folder.
  • /proposal: Your proposal from earlier in the semester

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.

Tips

  • You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.
  • Review the marking guidelines below and ask questions if any of the expectations are unclear.
  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).
  • Set aside time to work together and apart (physically).
  • When you’re done, review the documents on GitHub to make sure you’re happy with the final state of your work. Then go get some rest!
  • Code: In your presentation your code should be hidden (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code such that if I re-knit your R Markdown file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
  • Teamwork: You are to complete the assignment as a team. All team members are expected to contribute equally to the completion of this assignment and team evaluations will be given at its completion - anyone judged to not have sufficiently contributed to the final product will have their grade penalized. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

Marking

Total 100 pts
Proposal 10 pts
Presentation 50 pts
Write-up 15 pts
Reproducibility and organization 10 pts
Team peer evaluation 10 pts
Classmates’ evaluation 5 pts

Criteria

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

Team peer evaluation

You will be asked to fill out a survey where you rate the contribution and teamwork of each team member out of 10 points. You will additionally report a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than 20% of the work, please provide some explanation. If any individual gets an average peer score indicating that they did less than 10% of the work, this person will receive half the grade of the rest of the group.
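
The 10% rule can be sketched as follows, in Python purely for illustration. The function name, team members, and reported percentages are hypothetical. The sketch assumes the "average peer score" is the plain mean of the contribution percentages that teammates report for a person. How the course staff actually aggregate the survey is not specified here.

```python
def apply_peer_evaluation(team_grade, reported_shares):
    """Halve the grade of any member whose average reported share is under 10%.

    reported_shares maps each member to the contribution percentages
    that their teammates reported for them.
    """
    adjusted = {}
    for member, shares in reported_shares.items():
        average_share = sum(shares) / len(shares)
        if average_share < 10:
            adjusted[member] = team_grade / 2  # half the grade of the rest of the group
        else:
            adjusted[member] = team_grade
    return adjusted


# Hypothetical team of four; member "D" is judged to have done about 5% of the work.
example = apply_peer_evaluation(
    team_grade=80,
    reported_shares={
        "A": [40, 35, 45],
        "B": [30, 35, 30],
        "C": [25, 25, 20],
        "D": [5, 5, 5],
    },
)
print(example)
```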

Late work policy

  • There is no late submission / make up for the presentation. You must be in class on the day of the presentation to get credit for it, or pre-record and submit your presentation by 9am on the morning of the presentations.

  • The late work policy for the write-up is 5% of the maximum obtainable mark per calendar day up to seven calendar days after the deadline. If you intend to submit work late for the project, you must notify the course organizer twice: before the original deadline, and again as soon as the completed work is submitted on GitHub.

People

Course organisers