Getting Deeper into NumPy and Pandas.

We proceed with these projects in the following organized fashion:

- (Ask a Question.)
- Find the data.
- Import the data.
- Analyze the data.
- Conclude.

I call this the **FIAC** system because, you know, the first letters of each of the statements. The first one is in ()'s since sometimes we will not know what kinds of questions we *can* ask until we have the data in front of us; other times, we will have a specific question in mind which we'll need to find the appropriate data for.

For this project, my question was, **"How well are kids in Chicago Public Schools (CPS) learning mathematics?"**

For this one, instead of Googling (which I could have done), I asked on a social networking site whether anyone had data for the CPS system; I have friends who work there, and I thought the CPS might not give data out to just anyone who asks. Luckily, I was wrong: there is a hub for various datasets related to CPS. At the bottom, you'll see the algebra exit exams. For now, don't bother clicking it; I've made a simplified version which cuts out the elements unnecessary for this project. Moreover, even though the file is naturally in Excel format, I've converted it to CSV (remember: you can easily convert CSV and Excel files back and forth by opening the file in Excel or any of the open-source Excel-like programs). Download the file here:

Now that we have our file, we need to do the next part of **FIAC**, which is to import the data.

Oh, pandas, what a delight you are to use! We will use the `read_csv` function that you can find by either searching through the list of pandas functions or by googling something like, "csv file in pandas". Here's what it'll look like:

>>> import pandas as pd
>>> db = pd.read_csv('fileLocation')

Where, of course, 'fileLocation' should be replaced by the string (in quotes!) of where the file is. If it works, great! If not, pandas may raise a strange-looking error about the path. The usual culprit is the backslash: in a normal Python string, `\` starts an escape sequence (for example, the `\a` at the start of `algexit` becomes a bell character!). The fix is to double every backslash (so, `'C:\\MyFiles\\algexit2009.csv'`) or to use a raw string (`r'C:\MyFiles\algexit2009.csv'`). Either way, once you've got it into the variable `db`, we've successfully imported the data.
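To see the backslash problem for yourself, here's a quick pure-Python illustration (the path is invented for the example):

```python
# In an ordinary string, backslashes start escape sequences:
# '\n' becomes a newline, '\a' becomes a bell character, etc.
plain = 'C:\new_data\algexit2009.csv'      # '\n' silently turns into a newline!
raw = r'C:\new_data\algexit2009.csv'       # raw string: backslashes kept as-is
doubled = 'C:\\new_data\\algexit2009.csv'  # doubling each backslash also works

print('\n' in plain)   # the "path" now secretly contains a newline
print(raw == doubled)  # both spellings give the same literal path
```

This is why a path that looks fine to you can confuse `read_csv`: by the time pandas sees it, some of the backslashes have turned into other characters entirely.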

We now move on to the (arguably!) most important part of **FIAC**, which is to analyze the data.

First, let's look at the data. We can just print it out, right?

>>> db
<class 'pandas.core.frame.DataFrame'>
Int64Index: 135 entries, 133 to 123
Data columns (total 4 columns):
school_name       135 non-null values
total_highpass    135 non-null values
total_pass        135 non-null values
total             135 non-null values
dtypes: int64(3), object(1)

Well, that's certainly not what we want. It gives us the column headings (which was nice of it) and how many values there are, but not what those values *are*. Hm. Looking through the things in pandas we can apply to `db`, we find one called `values` (technically an attribute, not a function). Try this one out.

>>> db.values

Well. That's kind of what we want. But it's super-ugly. Just a list of lists. That's not friendly-looking. It turns out that pandas has two nice commands, `head` and `tail`, which allow us to work with extremely large data sets (even though this one isn't too large): `head(number)` gives you the first `number` rows of your dataframe, while `tail(number)` gives you the last `number` rows. Try them out!

>>> db.head(20)
>>> db.tail(15)

You can mess around with different numbers and get different lists. Ultimately, this will help us look at most of the data in a relatively clean and not memory-intensive way.

So, we've listed the data out. Great. Fantastic. That's not quite the analysis we were going for, though. So, let's ask three questions and get three answers.

- Which school(s) had the largest number of passing students? What about high-passing students?
- Which school(s) had the largest *percentage* of passing students? What about "high pass" students?
- Which school had the largest number of both pass and high pass students? What about percentage?

Here, I'm assuming that "passing" and "highpassing" students are disjoint; that is, if someone "highpasses", they count that student in the "total_highpass" column, but not the "total_pass" column.

For the first question, we could do a few things: we could find the maximum value, find out where that value sits in the index, and then return the school name associated with that index. But that's a lot of work, and, really, all we need to do is sort the data. So let's do that.
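For completeness, the "long way" just described can be sketched in a couple of lines; this uses `idxmax` and `.at` on a tiny stand-in dataframe (the school names and numbers here are invented for illustration):

```python
import pandas as pd

# A tiny stand-in for the CPS dataframe (made-up numbers)
toy = pd.DataFrame({
    'school_name': ['YOUNG, W', 'KENWOOD', 'SHOOP'],
    'total_pass': [300, 250, 40],
})

# The "long way": find the max, then its index, then the name at that index.
best_idx = toy['total_pass'].idxmax()          # index label of the largest value
best_school = toy.at[best_idx, 'school_name']  # school name at that index
print(best_school)
```

It works, but sorting gets us the same answer plus the full ranking, which is more useful here.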

We look for any kind of sorting function in pandas by typing `db.` and looking at the autocomplete options (or looking at the documentation). The `sort_values()` function looks pretty promising (in older versions of pandas this was called `sort`). We'd like to sort by just the "total_pass" column first to answer part of the first question, which would look something like this:

>>> db.sort_values(by=["total_pass"])

If you look at the list again, you'll find that nothing happened. What? That's because this hands back a sorted *copy*, which we'd need to save *as another variable*. That's annoying, so pandas allows us to sort the list *in place*, which means that it will automatically save the sorting to the current dataframe. The way you do this is, as you might have guessed by looking at the documentation,

>>> db.sort_values(by=["total_pass"], inplace=True)

This gives us a delightfully sorted list. If we do `db.head(20)` again, we'll find a sorted list, but with the smallest number of passes on top. Ugh. Well, looking at the sort function's documentation, we can sort in descending order instead by setting the "ascending" option to False, like this:

>>> db.sort_values(by=["total_pass"], inplace=True, ascending=False)

Now when we look at the data with `db.head`, we find that *YOUNG, W* and *KENWOOD* are at the top. If we were to sort by "total_highpass" instead:

>>> db.sort_values(by=["total_highpass"], inplace=True, ascending=False)

That's pretty neat. Note, though, something important: yes, *YOUNG, W* has the greatest number of highpass and pass students — but they also have the greatest number of students. This idea leads us to the second question regarding the *percent* of students. This one will take a bit more work.

First, let's make a function which, given an index, will find a percent for that school. For example, if we looked at our sorted list and put in "0", we'd get a percent for *YOUNG, W* since that's at the 0-th index.

The way we look at a value at the position `[row index, column index]` is to use `.iat`. I found this by doing a simple Google search, but it's also one of the more common pandas commands. For example, plugging in `db.iat[0,0]` should give us *YOUNG, W*. Try plugging in different indices and see if you get what you expect!
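If you want to experiment with `.iat` without the CPS file in front of you, here's a self-contained sketch with the same column order as our file (the values are invented):

```python
import pandas as pd

# Toy dataframe with the same column order as the CPS file (values invented)
toy = pd.DataFrame({
    'school_name': ['YOUNG, W', 'KENWOOD'],
    'total_highpass': [120, 90],
    'total_pass': [300, 250],
    'total': [650, 400],
}, columns=['school_name', 'total_highpass', 'total_pass', 'total'])

print(toy.iat[0, 0])  # row 0, column 0: 'YOUNG, W'
print(toy.iat[1, 2])  # row 1, column 2: 250
```

Note that `.iat` is purely *positional*: it counts rows and columns from zero, regardless of what the index labels are.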

Okay. Now that we know how to get values, something we'd want to do is to find percentages. For *YOUNG, W*, for example, the percentage of passing students (not highpassing!) is equal to

>>> db.iat[0,2] / float(db.iat[0,3])

The command `float` here is just to make sure Python knows this answer should be a number with a decimal point as opposed to rounding down to a whole number. Python 3 does this for us (without needing `float`), but Python 2 rounds integer division down, so better safe than wrong.
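Here's a quick pure-Python check of that behavior (the numbers are invented):

```python
# In Python 2, 315 / 486 is integer division and gives 0;
# in Python 3, / always gives a float. Wrapping one side in
# float() behaves the same way in both versions.
print(315 / float(486))  # a decimal, roughly 0.648
print(315 // 486)        # // is explicit floor division: 0 in both versions
```

So `float` costs nothing in Python 3 and saves you from silently getting `0` everywhere in Python 2.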

Either way, we should get a solution like `0.64800...` and that's exactly what we wanted. That's the percentage of passing (but not high passing) students *for index 0*. If you've got some experience programming already, you'll probably have jumped the gun here: this is practically begging us to make a for loop out of it. Let's make a list for these values:

>>> passpercent = []
>>> for i in range(len(db)):
...     passpercent.append(db.iat[i,2] / float(db.iat[i,3]))

You will have to press enter a few times after the last line here to make sure the for loop runs. When you get back to the `>>>` prompt, you're good. Make sure to indent with a few spaces (four is conventional) when you start the `append` line, or else Python will complain about indents; indentation is an important theme in Python.

The above commands shouldn't seem too mysterious to you at this point. We make an empty list by using `passpercent = []` and "append" elements to it using the `append` command. In addition, `len(db)` is the number of rows in `db`. If you run this, you'll fill up `passpercent` with values. Check to make sure.

You might think it would be complicated to add a new column to our dataframe, but it's actually really simple:

>>> db['pass percent'] = passpercent

This creates the column 'pass percent' and pastes our `passpercent` list into it. If you check by typing `db.head(10)`, you'll see we've created an appropriate pass-percent column! Nice. Before we sort, let's create another column for the highpass percent.

>>> highpasspercent = []
>>> for i in range(len(db)):
...     highpasspercent.append(db.iat[i,1] / float(db.iat[i,3]))

followed by

>>> db['high pass percent'] = highpasspercent
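As an aside, each of those loop-and-append constructions can be collapsed into one line of column arithmetic: pandas divides whole columns elementwise and gives floats automatically. A sketch on an invented dataframe with the same columns as ours:

```python
import pandas as pd

# Stand-in for the CPS dataframe (made-up numbers)
toy = pd.DataFrame({
    'school_name': ['YOUNG, W', 'KENWOOD'],
    'total_highpass': [120, 90],
    'total_pass': [300, 250],
    'total': [650, 400],
})

# Whole columns divide elementwise; no loop and no float() needed
toy['pass percent'] = toy['total_pass'] / toy['total']
toy['high pass percent'] = toy['total_highpass'] / toy['total']
print(toy)
```

The loop version works fine, and it's worth knowing; but the column version is both shorter and faster on big dataframes.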

Great. Now let's sort. First, by pass percent:

>>> db.sort_values(by=['pass percent'], inplace=True, ascending=False)

Looking at the data (remember, `db.head()`), we now see something kind of neat: *SHOOP, BLACK, STOCKTON* have taken the lead, since nearly all of their students passed! Interesting. But, we note, they have *very few students* compared to the other schools. This will have to be taken into account when making any kind of conclusion.

Now, by high pass percent:

>>> db.sort_values(by=['high pass percent'], inplace=True, ascending=False)

We now see that *GALILEO SCHOLASTIC, THORP O A, GRAY* have taken the lead in terms of high pass percent — again, note that these schools have very few students.

[As a non-mathematical aside, we see here why taking averages and ranking schools is difficult: we've got four different ways of looking at passing rates, and the schools on "top" are drastically different depending on what we considered important to look at. Ideally, we would be able to appropriately weight these rankings in some way and create a "reasonable" measure of "goodness". For now we will not worry about these considerations and just focus on the data.]

Last, we look at the final question. We need to take a sum of these two columns, but by now we've probably got a good idea of how to do that. Guess first, then check below (there's more than one right answer!):

>>> totalpass = []
>>> for i in range(len(db)):
...     totalpass.append(db.iat[i,1] + db.iat[i,2])

and then

>>> db['total pass'] = totalpass

Sorting by the total pass number...

>>> db.sort_values(by=['total pass'], inplace=True, ascending=False)

We find that *YOUNG, W* and *KENWOOD* are back on top. Doing percentages again...

>>> totalpasspercent = []
>>> for i in range(len(db)):
...     totalpasspercent.append(db.iat[i,6] / float(db.iat[i,3]))

and then...

>>> db['total pass percent'] = totalpasspercent

and finally sorting...

>>> db.sort_values(by=['total pass percent'], inplace=True, ascending=False)

And an interesting thing happens: *THORP O A* overtakes *GALILEO*.

For now, we'll leave drawing conclusions to the reader. This project was a small exercise in how to manipulate data on a basic level (indeed, there are better ways to do most of what we did — though, what we did *works* and is fine). You may not be impressed with this data, but it may also be instructive to see how poorly the "worst" schools have done.

We won't make any graphs for this project since I feel I've introduced enough new material. For future projects, we'll want to think about how best to represent the data we've collected and look into how to create the relevant charts and graphs using `matplotlib`.

Below is some data on the dropout rates for most of the schools in the Chicago Public School system (the headings give the school name, the type of school, and the dropout (DO) rate, as a percentage, for 2010, 2011, and 2012).

What can you find out about this data? What schools have the highest and lowest dropout percentages? Do the higher/lower schools tend to be in any particular network? If so, what other data might we look at to try to find a cause for this?
