Topic outline

  • General

    MIS 480/680 - Practical Computing for Business Analytics

    Instructor: Mark Isken
    isken@oakland.edu
    Mon, Wed 6:00-9:00p in 202 EH

    My hselab website: http://hselab.org/
    My LinkedIn Profile: www.linkedin.com/pub/mark-isken/a/633/48/

  • About data science and business analytics

  • Learning R

    In addition to our "R for Everyone" text, here are some other resources for learning R. There is a TON of stuff out there on the web.

  • Learning Python

    In addition to our textbooks, "Practical Computing for Biologists" and "Python for Data Analysis", there are a ton of free web-based resources to help you learn Python.

  • The pcba virtual machine

    In order to give us all a common computing environment and to allow you to use Linux from a Windows machine, I've created a "virtual appliance" that I've named pcba (for practical computing for business analytics). See the Software section of the syllabus for an overview of the pcba virtual appliance. In this section I'll have links to various resources related to getting, installing, using, and updating the pcba appliance. The pcba appliance has already been installed in the 202 EH computer lab. Below you can find information on using pcba on your own computer.

  • Publicly available data

    There's a proliferation of publicly available data out there to use for data science practice. Nevertheless, it can be hard to find exactly what you want (see, we really are swimming in data). Here are a few links to freely available datasets.

  • Sessions 1 and 2 - Intro to analytics, text files and Linux (Week 1)

    Intro to the course, the field of analytics and the computing environment and tools we'll be using. We'll also dive right in and learn a little about text files and the Linux bash shell. Text files are ubiquitous. Let's dig into them a bit and we'll start to learn how to get them ready for analysis using regular expressions. We'll also get our first look at the Linux file system and learn the basics of using the shell for basic file management.
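    As a small taste of what regular expressions can do, here's a sketch in Python (the same patterns work in shell tools like grep and sed); the messy sample line is made up:

```python
import re

# Hypothetical messy line from a text file: extra spaces, mixed separators
line = "Smith,  John ;  2014-05-13 ;  42.5"

# Collapse whitespace around separators and normalize them all to commas
clean = re.sub(r"\s*[;,]\s*", ",", line.strip())
fields = clean.split(",")
# fields is now a tidy list, ready for analysis
```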

    Depending on how things go, we may start R in the second half of the Wed 5/13 class. See details below in the next Session block.

  • Session 3 (May 18) - Intro to R and R Studio

    Overview of R and R Studio, basic interactive use, R data types, installing packages, reading data into and working with dataframes. We'll learn how to create and use basic R scripts to document a series of analysis steps. We'll also be writing our own "tutorials" or "learning guides" using the magic of R Markdown documents along with the knitr package. That's how I create those cool R tutorials on my hselab.org site.

  • Session 4 (May 20) - Exploratory Data Analysis with R

    R Studio projects, summary statistics, basic plots with ggplot2, simple R scripts including writing our own functions, data transformations

    We are going to explore a dataset related to New York City condo evaluations for fiscal year 2011-2012. It was obtained from the NYC Open Data initiative - https://data.cityofnewyork.us/. This data has spawned a bunch of apps through a site called BigApps NYC (http://nycbigapps.com/). It's a little like Kaggle (http://www.kaggle.com/) in that there are data and app dev related challenges with prize money attached.

  • Session 5 (Wed 5/27) - Group by analysis and more basic stats

    We'll learn about various ways to do "pivot" or group-by analysis with a focus on the widely used plyr package (another Hadley Wickham creation). Related topics of data reshaping and string manipulation will also be covered. We'll get our first look at probability distribution functions in R in the context of generating random data and doing a few basic normal distribution computations (things you've probably done with NORMDIST(), NORMSDIST(), NORMINV(), and NORMSINV() in Excel).
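    For reference, the same kinds of normal distribution computations can be done in Python's standard library (Python 3.8+); the mean and standard deviation below are made-up values:

```python
from statistics import NormalDist

# Excel's NORMSDIST(x): standard normal CDF
std = NormalDist()            # mean 0, sd 1
p = std.cdf(1.96)             # roughly 0.975

# Excel's NORMSINV(p): inverse standard normal CDF
z = std.inv_cdf(0.975)        # roughly 1.96

# NORMDIST(x, mean, sd, TRUE) and NORMINV(p, mean, sd) use a shifted/scaled normal
d = NormalDist(mu=100, sigma=15)
p2 = d.cdf(130)
x = d.inv_cdf(p2)             # round-trips back to 130
```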

  • Session 6 (Mon 6/1) - Linear models

    Linear models are among the workhorses of data mining. We'll do things like multiple linear regression for numeric predictions and logistic regression as a classifier for binary response variables. We'll use these relatively simple models as a way to also learn about important modeling topics such as partitioning data into training and test sets, model training, validation, and diagnostics. We'll also use regression to introduce the notion of parameter estimation and error metrics for assessing model fit and comparing candidate models against each other. These topics underlie all statistical learning algorithms.
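    To make the train/test idea concrete, here's a minimal sketch in pure Python on made-up synthetic data: fit a simple linear regression by least squares on a training set, then score it with an error metric (RMSE) on a held-out test set:

```python
import random

random.seed(1)
# Synthetic data: y = 3x + 5 plus noise (made-up example)
xs = [i / 10 for i in range(100)]
ys = [3 * x + 5 + random.gauss(0, 0.5) for x in xs]

# Partition into training and test sets
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

# Fit simple linear regression by least squares on the training set only
n = len(train_x)
mx = sum(train_x) / n
my = sum(train_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx

# Evaluate with RMSE on the held-out test set
rmse = (sum((intercept + slope * x - y) ** 2
            for x, y in zip(test_x, test_y)) / len(test_x)) ** 0.5
```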

    We'll be moving on to Python after this week.

  • Session 7 (Wed 6/3) - Data mining with R

    In class we'll finish up some linear regression modeling and then spend some time learning about using logistic regression for binary classification problems - i.e. when our response variable has two possible outcomes (e.g. customer defaults on loan or does not default on loan). We will also discuss a famous classification problem that has been used as a Kaggle learning challenge for new data miners - predicting survivors of the sinking of the Titanic.

    We'll look at a few classic data mining techniques such as k-nearest neighbors and cluster analysis. I'll also show you a nice R-based GUI for doing data mining called Rattle.
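    The idea behind k-nearest neighbors is simple enough to sketch in a few lines of Python; the toy 2-D points and labels below are made up:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of ((x, y), label) pairs; distance is Euclidean."""
    neighbors = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two tiny, well-separated clusters (hypothetical data)
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
```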

  • Session 8/9 (Mon 6/8, Wed 6/10) - Intro to Python

  • Session 11 (Mon 6/15) - Data analysis and plotting in Python - 1

    Pandas, developed by Wes McKinney, is the go-to library for doing data manipulation and analysis in Python. It's not really a statistics library (à la R); for that, StatsModels is the Python library of choice for now. For more advanced stuff like machine learning and data mining algorithms, scikit-learn is the answer.
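    A minimal sketch of the pandas style of analysis, assuming pandas is installed; the tiny dataset is made up in the spirit of the NYC condo data:

```python
import pandas as pd

# Tiny made-up dataset
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", "Manhattan", "Brooklyn"],
    "value_per_sqft": [250.0, 180.0, 310.0, 200.0],
})

# Group-by aggregation - the pandas analog of plyr-style summaries in R
means = df.groupby("borough")["value_per_sqft"].mean()
```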

    The de facto standard plotting library for Python is called matplotlib and it's one of the key reasons that Python has become such a major force in the analytics world. Recently, the venerable ggplot2 package was ported from R to Python by an analytics group called yhat.
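    Here's about the smallest possible matplotlib example, assuming matplotlib is installed; the data, labels, and output filename are made up:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; draws to memory/file
import matplotlib.pyplot as plt

# Minimal line plot of made-up data
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A first matplotlib plot")
fig.savefig("first_plot.png")  # written to the current directory
```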

  • Session 12 (Wed 6/17) - Intro to machine learning and datetime analysis in Python

    We'll do a number of things tonight including:

    1. Briefly review the stream temperature logging app to see how a simple Python script can use file globbing, pandas, and matplotlib to process a whole directory full of data files, creating statistical summaries and plots, both of which are saved to files (csv and png). Good example of the power of languages like Python for common analytical tasks.
    2. Start building a simulation model of the Monty Hall three-door problem.
    3. Using R from within IPython for analysis and visualization (using the rpy2 package)
    4. Talk briefly about Python package management. Before the break, we'll use conda to install seaborn, a nice data visualization package that is built on top of matplotlib. It takes a little while for the update, so we can do it during the break.
    5. Use the Python scikit-learn modules for doing statistical analysis and machine learning
    6. If time, talk a little about datetime analysis in Python (see below)
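    The directory-processing pattern in item 1 can be sketched with just the standard library (the class example uses pandas and matplotlib; the directory and file names here are made up):

```python
import csv
import glob
import os

# Create a couple of small demo CSV files (stand-ins for logger data files)
os.makedirs("demo_logs", exist_ok=True)
for name, temps in [("a.csv", [10.0, 12.0]), ("b.csv", [20.0, 22.0])]:
    with open(os.path.join("demo_logs", name), "w", newline="") as f:
        csv.writer(f).writerows([[t] for t in temps])

# Glob the directory and compute a per-file summary statistic
summaries = {}
for path in sorted(glob.glob("demo_logs/*.csv")):
    with open(path, newline="") as f:
        values = [float(row[0]) for row in csv.reader(f) if row]
    summaries[os.path.basename(path)] = sum(values) / len(values)
```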

    The scikit-learn module is a full-featured Python module for all kinds of data analysis and predictive modeling algorithms. We'll do an overview of this widely used module and get a bit more exposure to statistical learning algorithms.

    Python has terrific libraries for dealing with time series (notably pandas). However, there be dragons in the confluence of pandas, numpy, and base Python date and time handling. Given the ubiquitous nature of datetime data in business, slaying these dragons is a calling we cannot avoid. I've developed a pretty extensive IPython notebook and accompanying blog post on this topic, so we probably won't spend a ton of time in class on it. You can explore it at your leisure. I come back to this notebook time and time again - no pun intended. :)
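    A tiny illustration of the three overlapping datetime worlds, assuming numpy and pandas are installed; the timestamp is arbitrary:

```python
from datetime import datetime

import numpy as np
import pandas as pd

# The same instant in three different representations
py_dt = datetime(2015, 6, 17, 18, 0)            # base Python
np_dt = np.datetime64("2015-06-17T18:00")       # numpy
pd_ts = pd.Timestamp("2015-06-17 18:00")        # pandas

# pandas Timestamps convert cleanly to and from the other two,
# which is often the easiest way through the dragons
back_to_python = pd_ts.to_pydatetime()
back_to_numpy = pd_ts.to_datetime64()
```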


  • Session 13 (Mon 6/22) - Data acquisition and more analysis in Python

    We'll do a number of things tonight including:

    1. Learn about various approaches to getting data from websites including automated downloading, web scraping of HTML and XML, and using web APIs. We'll see examples both in Python and R.
    2. A little bit on working with dates and times in Python.
    3. Finish the simulation model of the Monty Hall three-door problem (here's where we left off Monte-680.ipynb)
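    A pure-Python sketch of the Monty Hall simulation (not necessarily how the class notebook does it), showing why switching wins about 2/3 of the time:

```python
import random

def monty_trial(switch, rng):
    """One round of the Monty Hall three-door game.
    Returns True if the player wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the player's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 10_000
switch_wins = sum(monty_trial(True, rng) for _ in range(n)) / n
stay_wins = sum(monty_trial(False, rng) for _ in range(n)) / n
# switch_wins hovers around 2/3, stay_wins around 1/3
```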


  • Session 14 (Wed 6/24) - Hadoop and MapReduce

    Obviously, we can't do this topic justice in one session. However, I can give you a sense of what it's all about and then you can dig into it later as you wish. I believe I mentioned in class that I took an online, 4-week short course on this through Statistics.com. It was outstanding. Here's a link to the course I took: http://www.statistics.com/hadoop/.
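    The core MapReduce idea - a map step that emits key/value pairs, then a reduce step that aggregates by key - can be sketched in a few lines of plain Python (no Hadoop involved), using the classic word-count example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Shuffle/reduce: group the pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# In real Hadoop these lines would be split across many machines
lines = ["big data big ideas", "data beats ideas"]
word_counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```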