Introduction to Using Python for Data Analysis.

This lesson will just be a few examples for how to use IPython to do some basic statistics problems that one might see in an introduction to statistics course.

Suppose you buy a container of Jif peanut butter. It will say the net weight is 18oz, but when the container is filled up there will be slight deviations from this: maybe too little peanut butter goes in, maybe too much goes in. Suppose the company notes that the amount of peanut butter is normally distributed with the mean amount of peanut butter $\mu = 18$ oz and the standard deviation is $\sigma = 0.09$ oz. Suppose, then, that we weigh our peanut butter and it's only 17.1 oz; we call up Jif, but they say that kind of thing happens all the time. Let's use Python to see exactly how often we would expect to get a container which weighs 17.91 oz or less.

Let's open up our Spyder and start a new file to get started. As usual, we'll have some imports. The only new one below is `scipy` which will let us easily work with normal distributions. In fact, it has an entire section for statistics which we will import as follows:

import numpy as np import matplotlib.pyplot as plt import scipy.stats as ss

Actually, we won't be using matplotlib, but I tend to import it just in case I wind up wanting to make a plot of something. It doesn't hurt, in any case.

To make a normal distribution with $\mu = 18$ and $\sigma = 0.09$ as your mean and standard deviation, we use the `scipy.stats` command:

ss.norm(loc = 18, scale=0.09)

Here, `ss.norm` says that I want to use the normal function from the `ss.norm` library. When we say `loc = 18, scale = 0.09` this is giving us the mean and standard deviation, respectively, for the normal distribution we're creating. I've then saved this entire distribution as the variable `pb` (for peanut butter!). In order to the percentage of times we would get peanut butter less than or equal to 17.91 oz, we may use the cumulative distribution function (cdf). The cdf will give us the area under our normal curve *to the left* of whatever point we put in (so, in our case, this will give us the area under the normal curve to the left of 17.91; this is exactly the area we want for less-than-or-equal-to). Surprise, surprise, scipy has a normal cdf function, so we are able to easily find this value and complete the code. The way we do this is to just tack on "cdf" after the `ss.norm` part and before the ()'s — this will make a function which you can call and put your value into. The function will look like this: `ss.norm.cdf(x, loc = 18, scale = 0.09)`. We can make it a function like so:

import numpy as np import matplotlib.pyplot as plt import scipy.stats as ss def f(x): return ss.norm.cdf(x, loc = 18, scale=0.09) print f(17.91)

Typing $f(17.91)$ plugs in the value $x = 17.91$ into `f` which we defined to return the value of `ss.norm.cdf(x, loc=18, scale=0.09)`. The returned answer should be `0.158655253931`, meaning that around $15.87\%$ of the time we would expect to get this amount of peanut butter or less.

Start a new file. We don't need to import anything this time.

Suppose Johnny Nostudy got a 60 on his test. His mother, Mama Nostudy, was angry at Johnny — but did she have a right to be? Standardizing scores is a common technique that is used in statistics to look at data in terms of the distribution it is sitting in. As an example, let's make a function in Python to standardize Johnny's score assuming that the test scores were normally-distributed.

Recall that for a normal distribution, the way to standardize a score is to find $z = \dfrac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation; sometimes, this value is called a *z-score*. Either way, we can make a neat Python function that finds this for us. (Note: there is actually a function in one of our imports which does this, but it's a good idea to be able to know how to construct functions on-the-fly in Python.)

One such function can be given by:

def standardize(x, mean, stdev): return (x - mean) / stdev

If the construction of this function was mysterious to you, check out some of this awesome site. We'll be here when you get back.

Now, let's pretend that Johnny Nosutdy tells his mom that the class mean was actually 40 and the class standard deviation was 10. Plugging the values into our function and printing,

print standardize(60, 40, 10)

we find that the program will return `2`. This means Johnny scored *2 standard deviations above the mean*. This is pretty darn good. You may repeat the last example to see how many students scored below him (and, therefore, calculate his *percentile*) and how many students scored above him. Good job, Johnny.

Now, suppose that Johnny's mom goes in to see the teacher and the teacher notes the class average was actually 90 with a standard deviation of 5. How did Johnny do in this case?

Suppose that we own a mexican fast-food resturaunt and we have 8 employees. Suppose that their salaries are given in the following list:

`sals = [8.75, 8.75, 8.90, 9.00, 9.00, 9.00, 9.00, 10.15]`

You'd like to know the mean, standard deviation, and some other useful things. Also, you love Python. Let's start a new file in Spyder and figure this stuff out.

Scipy, as you'd expect, has a neat function for giving summary statistics; it's at `scipy.stats` and is called `describe`, but it's a little strange, so we'll go through it. First, we import everything, and then we print out the describe function with `sals` as a parameter:

import numpy as np import matplotlib.pyplot as plt import scipy.stats as ss sals = [8.75, 8.75, 8.90, 9.00, 9.00, 9.00, 9.00, 10.15] desc = sp.describe(sals)

After you've done this, print out desc and see what it looks like. It should look like this:

[9, 1, 8, 4.8888888888888893, 6.6111111111111107]

This is a bit upsetting. We don't know what these values are. This is where the object inspector shines! If you don't have the object inspector in Spyder, you can also look up "scipy stats describe" and find the documentation. In it, it will say things like "Parameters," and "Returns," as headings. Looking at the "Returns" heading, we see that this function returns a whole bunch of things in a list: the size of the data, the min, max, the mean, and so forth. We only care about the size, min, max, mean, and standard deviation, so we'll only use the first five of these. There's a number of nice ways to label these things, but I'll use the following method. The entire thing will look like:

import numpy as np import matplotlib.pyplot as plt import scipy.stats as ss sals = [8.75, 8.75, 8.90, 9.00, 9.00, 9.00, 9.00, 10.15] desc = ss.describe(sals) names = ["length", "min", "max", "mean", "stdev"] desc = [desc[0], desc[1][0],desc[1][1], desc[2],desc[3]] for i in range(5): print "%s: %d" % (names[i], desc[i])

If you're a bit confused by this, check out the Learning Python the Hard Way tutorial (lessons 5 – 9), which does an excellent job explaining this kind of formatting.

Last in this section, notice that we could easily make this into a function to accept *any* list of data; moreover, this would need only some minor tweaking from the original code we have. Try to do this yourself. My completed function looks like this:

import numpy as np import matplotlib.pyplot as plt import scipy.stats as ss def summary(listofdata): desc = ss.describe(listofdata) names = ["length", "min", "max", "mean", "stdev"] desc = [desc[0], desc[1][0],desc[1][1], desc[2],desc[3]] for i in range(5): print "%s: %d" % (names[i], desc[i])

In my case, I could test this function by writing `summary([1,2,3,4,5])` and see if it returns the correct values. It should!

⇐ Back to 1.1 | Home | Onwards to 1.3 ⇒ |