Teaching Pandas and Jupyter to Northwestern journalism students (californiacivicdata.org)
86 points by palewire on June 7, 2017 | 45 comments


So many people don't realize pandas can be horribly slow if you use it "wrong" -- i.e., for computations that don't vectorize in the way that's native to pandas. Also, working with dataframes that contain millions of rows is like playing Russian roulette -- there are usually many ways to do the same thing in pandas. If you guess right, you'll wait a minute or two until the computation is done; if you guess wrong, it'll run out of RAM, segfault, or never finish.
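A minimal sketch of the "right" versus "wrong" way, assuming a made-up table of prices and quantities (the column names and sizes are illustrative only). Both spellings give the same numbers; the row-wise one calls a Python function per row, which is where the slowness tends to come from.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows of prices and quantities.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=n),
    "qty": rng.integers(1, 10, size=n),
})


def row_total(row):
    # Called once per row when used with apply(axis=1).
    return row["price"] * row["qty"]


# Row-wise: only run on the first 1,000 rows on purpose, because it is slow.
slow = df.head(1_000).apply(row_total, axis=1)

# Vectorized: one multiplication over the whole columns.
fast = df["price"] * df["qty"]

# The two approaches agree on the rows they share.
agree = np.allclose(slow.to_numpy(), fast.head(1_000).to_numpy())
```

Timing the two lines with `%timeit` in a notebook is an easy way to see the gap on your own machine.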

For big datasets, I stopped using pandas a few years back for anything other than printing dataframes, datetime-indexed series, quick plots, or working with tiny/toy datasets -- in favor of numpy structured/record arrays. They're kind of the same thing, without all the groupby/index fluff, but very fast.

Just last week, I helped a colleague speed up her code (a numerical solver for financial data) by more than 100x; the biggest part of that was ditching pandas entirely and using numpy.


So I've been learning Pandas after mostly using either standard Python, R or VB to do our analysis, and I'm glad I read this because I thought I was going crazy.

I have a data set of about 4 million rows I routinely analyze. I have 32 GB of space on my desktop, and the only time I've really run out is when I write incredibly poor code. In the short while I've been trying to use Pandas, I've run out of memory and been killed by the OOM killer, or completely frozen my system for half an hour, while processing what I thought were simple operations.

I was honestly beginning to believe I was way worse at programming than I thought due to all of the issues I was having. I wasn't even doing anything particularly complex, I was just loading a dataframe from a sql query and playing around with basic manipulation.


I'm glad you are sharing this. I've had the same experience -- in our code, we ditched Pandas entirely for structured arrays. We also used numpy record arrays at first but found them to somehow be significantly slower than structured arrays, and since the former just add syntactic sugar to the latter, we're now running entirely on structured numpy arrays.
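For anyone who hasn't seen them, here is a rough sketch of a numpy structured array and its record-array view. The trade schema and values are invented for illustration; only the numpy calls themselves are real.

```python
import numpy as np

# Hypothetical schema: one record per trade.
trade_dtype = np.dtype([("symbol", "U8"), ("price", "f8"), ("qty", "i8")])

trades = np.array(
    [("AAPL", 150.0, 10), ("MSFT", 70.0, 5), ("AAPL", 151.0, 20)],
    dtype=trade_dtype,
)

# Field access by name gives a plain ndarray, so the math is vectorized.
notional = trades["price"] * trades["qty"]

# Boolean masks filter rows, much like a DataFrame.
aapl = trades[trades["symbol"] == "AAPL"]
aapl_qty = aapl["qty"].sum()

# A record array is the same data with attribute access layered on top.
rec = trades.view(np.recarray)
same_prices = np.array_equal(rec.price, trades["price"])
```

The record-array view is the "syntactic sugar" mentioned above: `rec.price` instead of `trades["price"]`.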


  But pandas’ magical simplicity makes things like computed columns immediately intuitive:
  > data['% of total'] = data.amount / data.amount.sum()
Is that immediately intuitive? I'm staring at this trying to understand what it's doing. Is the / operator overloaded? data.amount is one particular amount, and data.amount.sum() is the sum of all amounts? Why does the "computed column" property go on the same data object as the actual data? Maybe it's immediately intuitive if you've used pandas.
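To unpack the quoted line, here is a small sketch with made-up amounts. `data.amount` is the whole column (a pandas Series), `.sum()` reduces it to a single number, and `/` is overloaded on Series to divide every row at once. The explicit loop at the end is the same computation in plain Python.

```python
import pandas as pd

# Made-up amounts, just to have something to divide.
data = pd.DataFrame({"amount": [50.0, 30.0, 20.0]})

# The whole column reduced to one number.
total = data.amount.sum()

# Element-wise division: every row's amount divided by the total.
data["% of total"] = data.amount / total

# The same thing written as an explicit loop.
explicit = [amount / total for amount in data["amount"]]
```

Assigning to `data["% of total"]` adds a new column to the same table, which is why the computed values live next to the original data.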


OTOH I think it's immediately intuitive if you are not a programmer. :)

When you see amount / sum, you think of how a list can be divided by what appears to be a scalar.

When they see it, they parse it out for what they naturally understand a percentage to mean. And all is well.


Exactly this. I'm the author of the post and was a programmer by trade for a long time before I became a journalist. I _don't_ actually find this more intuitive than more explicit and fundamental programming techniques. But my students grokked it immediately, whereas even simple structures like loops seem to be harder for them to get their heads around.

Given I had ten weeks to cram a lot of material in but did want to show them some amount of programming, this worked pretty nicely.


I've been very troubled by coming to this stuff as a programmer. I'm having the same instant dis-satisfactory response that your students are having with looping structures.

I've recently started working on some projects where I need to do a lot of data visualization, storytelling, and investigation "into the data". As a programmer, getting into this stuff is far worse than I expected. Nothing works the way I would expect. My biggest problem is that I'm thinking like a programmer, not like a mathematician. I expect objects, segregation or elimination of state, application and reduction, re-usability, and algorithms.

Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)?

What follows, below this line, is my groveling about the things that have bothered me. Be warned if you don't like rambling and complaining. -------

Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators"). Matplotlib is unintuitive and poorly documented for anyone who isn't a mathematician (.plot(lons, lats, latlons=True) is correct). Dealing with anything more than 100,000 data points is a pain to iterate on. State everywhere it shouldn't be (matplotlib.pyplot).

While I've been working on this project I probably (each spin) spend an hour or two getting the data out of a format that doesn't make sense from a programmer's perspective, I spend another 5 to 10 minutes writing an application/reduction, then I spend another hour to go back into the strange data formats that matplotlib will take. All the while re-running expensive computations and waiting because I have no good persistence layer for my project.

There are just things in this community that are common that I'd never dream of. What follows is a list of these things.

1. Functions with 20-40 arguments are the norm for some reason. They also love to throw in a few insane defaults, undocumented options, and even magical flags (not enums).

Things like "draw a line, connect the dots" require knowing 5 to 7 arguments of a massive function. In C/Java when I need some flags they probably look like this:

    some_operation(some_data, DO_A | DO_C | DO_Z)
Or, if someone was feeling really nice and defined an enum & used varargs, it looks more like this:

    some_operation(some_data, SomeOperationFeatures.DO_A, SomeOperationFeatures.DO_C, SomeOperationFeatures.DO_Z)
Where all of these have appropriate documentation. My IDE plays nice and can complete these things. My compiler likes it and can typecheck these things. I like it because I know all of the options available (SomeOperationFeatures.).

With matplotlib you have things like `linestyle=""`. You have to go to a webpage, look through the docs, and figure out what you want. It's worth reading the docs [1] if you never have. This could very easily have been LineStyle.DOTTED, LineStyle.DASHED, LineStyle.BLANK. IDEs would have played nice. The 3.6 runtime's typechecking would have played nice. You would be able to see what your options are (LineStyle.).
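Nothing stops you from wrapping the strings yourself. The sketch below is a hypothetical `LineStyle` enum and `line_kwargs` helper (neither is part of matplotlib); the enum values are the linestyle strings matplotlib accepts, so an IDE can complete `LineStyle.` and the result can still be passed straight through.

```python
from enum import Enum


class LineStyle(Enum):
    """Hypothetical wrapper; the values are matplotlib's linestyle strings."""
    SOLID = "-"
    DASHED = "--"
    DASHDOT = "-."
    DOTTED = ":"
    BLANK = ""


def line_kwargs(style: LineStyle, width: float = 1.5) -> dict:
    """Build keyword arguments that can be passed on to a plot call."""
    return {"linestyle": style.value, "linewidth": width}


kwargs = line_kwargs(LineStyle.DOTTED)
# With matplotlib installed, something like: ax.plot(x, y, **kwargs)
```

It doesn't fix the library, but it keeps the magic strings in one place in your own code.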

2. Non-standard ways of treating python-isms

Pandas, for some reason, cannot stick to python-isms. I can't do simple things like...

    if not df: # Check if DF is empty
        return ...

    for row in df: # Iterate through the rows of a DF
        row.date = datetime(row.year, row.month, row.day, ...) # Create a new column in the row based on the row's data.

    subset = [a for a in df if some_condition(a)] # Do simple filtering
Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between two, ways of doing things.
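For reference, a sketch of how the three snippets above are usually spelled in pandas. The frame and its column names (`year`, `month`, `day`, `value`) are assumptions chosen to mirror the example; the condition `value > 0` stands in for `some_condition`.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2016],
    "month": [6, 1, 12],
    "day": [7, 15, 31],
    "value": [10, -3, 8],
})

# Emptiness is an explicit attribute rather than truthiness.
is_empty = df.empty

# A new column built from other columns in one vectorized call.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Filtering uses a boolean mask instead of a list comprehension.
subset = df[df["value"] > 0]

# Row-by-row iteration is still available when it is really needed.
labels = [
    f"{row.date:%Y-%m-%d}: {row.value}"
    for row in df.itertuples(index=False)
]
```

Note that `for row in df` iterates over column names, not rows, which is probably the most surprising part for people coming from plain Python.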

3. All these libraries separate logically grouped concepts.

Let's say I have time series data from 10 sensors.

    class SomeMagicalSample:
        def __init__(self, a, b, c, d, ..., occurred):
            self.a = a
            ...
            self.occurred = occurred

With this code I can generate very complex filtering, combinations, and what not. Things like extracting "real" meaning from measured values become easy to express.

    def get_magical_scalar(self): return ... some interpolation ...

    def is_some_magical_type(self): return ... some check ...

Now I can use my already tried and true reduction and application.

    sum(map(SomeMagicalSample.get_magical_scalar,
            filter(SomeMagicalSample.is_some_magical_type, samples)))
Pandas, matplotlib, numpy, scipy and the lot are designed to make me avoid this style of organization. I'm instead forced to do something like this.

    a = [...]
    b = [...]
    c = [...]
    d = [...]
    ....
    occurred = [...]
Then I have to jump through hoops to keep all of this data in the same order, shift it around together.
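One possible middle ground, sketched under assumptions: keep the objects, and build a single DataFrame from them when the columnar tools are needed, so the fields stay aligned row by row instead of living in separate lists. The `Sample` class, its fields, and the "interpolation" (a plain average here) are all invented for illustration.

```python
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class Sample:
    """Hypothetical sensor sample; field names are made up."""
    a: float
    b: float
    occurred: str

    def magical_scalar(self) -> float:
        # Stand-in for "some interpolation".
        return (self.a + self.b) / 2

    def is_magical(self) -> bool:
        # Stand-in for "some check".
        return self.a > 0


samples = [
    Sample(1.0, 3.0, "2017-06-07"),
    Sample(-1.0, 5.0, "2017-06-08"),
    Sample(2.0, 2.0, "2017-06-09"),
]

# Object style: filter, map, reduce.
object_total = sum(s.magical_scalar() for s in samples if s.is_magical())

# Columnar style: one DataFrame keeps all fields in the same row order.
df = pd.DataFrame([asdict(s) for s in samples])
column_total = ((df["a"] + df["b"]) / 2)[df["a"] > 0].sum()
```

Both totals come out the same; the difference is only in where the logic lives.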

4. Because everything is meaningless lists of numbers there are no ways to reuse code.

Most of the code I have written to show off a single value over time, or pull some data out of some other data and visualize it, is never going to be used again. Unless I want to look at this exact same thing this code will not be useful. If there was some way to pass objects around, hide the internals, and process them independently of their meaning then this would not be the case.

The one case where this was not true in the past few days was when I rendered a model's prediction into a pcolormesh and drew it onto a basemap. By passing it a basemap it will automatically find the place to generate data for with the model. This was an undocumented feature that I had to read the source of basemap to find was possible (pulling the top left and bottom right Lat Lons from a basemap regardless of projection).

Maybe these warts just hurt for a little while? Do these go away? Are there alternatives that can handle >10 million data points? I don't have a good analysis framework setup for the work I'm doing. Maybe this is the issue. I don't even know what a good analysis framework would look like.

[1] - https://matplotlib.org/api/lines_api.html#matplotlib.lines.L...


I'm actually so with you on a lot of this. The inability to use Pythonisms with Pandas is insane and I had to do a data analysis where I really, genuinely needed to do some looping and some simple map/reduce and it almost drove me insane.

You might like [Agate](http://agate.readthedocs.io/) better.

I haven't done a ton of Jupyter in the newsroom yet, but what I've found myself doing is abstracting out the stuff I want to do in normal Python into one or more utility modules and having those return dataframes into my notebook. That way I can mostly write normal Python but have access to some of the nicer pandas features and get to do more exploratory work.

I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I or more likely a teammate would build in D3 around the specifics of the data.


> The inability to use Pythonisms with Pandas is insane and I had to do a data analysis where I really, genuinely needed to do some looping and some simple map/reduce and it almost drove me insane.

I recently started a project that I got to write from the ground up by myself. I was happy with the processing side of things. I was very sad with the data I was getting in and putting out. There's some impedance mismatch that doesn't need to exist.

> You might like [Agate](http://agate.readthedocs.io/) better.

I looked at the front page and definitely wasn't enjoying what I was seeing. It, at first, looked like more complexity piled up on top of things that don't need it. Then I saw this link: http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html#l...

This is definitely worth a try. Much closer to what I was thinking.

> I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I or more likely a teammate would build in D3 around the specifics of the data.

Sadly in my field matplotlib is the professional tool (hah!). The end goal is the matplotlib plots. I'd be all fine for tweaking things in a designing program and putting it up but I'd be upset with myself.

My end goal is to have a single script in a repository that installs, runs, and then compiles my papers. I don't want anyone to need to look at sub-standard copies of my plots. I want anyone to be able to jump in and check my work and create derivative works.

Sadly this is not common in science today so there aren't really good tools for this sort of thing at the composition side. Even worse plotting isn't common in the computer world so tools for that don't exist either.


> I recently started a project that I got to write from the ground up by myself. I was happy with the processing side of things. I was very sad with the data I was getting in and putting out. There's some impedance mismatch that doesn't need to exist.

Impedance mismatch is a great way to put it. For me, if I can deal with that mismatch so that newbies/journalism colleagues don't have to, I'll do it.

> Sadly in my field matplotlib is the professional tool (hah!). The end goal is the matplotlib plots. I'd be all fine for tweaking things in a designing program and putting it up but I'd be upset with myself.

I used to work in science and have found journalism to have better solved many of these issues (at the expense, of course, of specialization and depth -- even a yearlong project isn't quite the same as decades of experience working in a single area). The solutions aren't pure or pretty -- they're more about workflow and held together with duct tape and baling wire. But the competitive pressure to deliver data that has a good user experience on deadline is very powerful and has led to some effective practices.


I am a programmer and I use Pandas quite a bit. While I agree that it's a little counter-intuitive at times, I have found it to be an extremely useful and important Python package that there is just no reasonable substitute for.


">Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)?"

I use SAS for this in my day job. It's not a free program, but it is powerful for this type of stuff.

I typically use SQL queries (via SAS's proc sql command) to manipulate and process my data, but you can also programmatically manipulate your data sets using SAS's "datastep" language.

SAS has support for macro expansions, which make some of your examples (like manipulating 10 sensors at once) pretty trivial. But this is getting into programming language territory; I would not expect someone new/unfamiliar with programming to grasp all of this intuitively.

edit: Here's some code I have in production that counts how many (of 8) sensors are reading high in a given time frame.

    array aads (*) TP_AD1_TOP_STACK_TC1 -- TP_AD1_TOP_STACK_TC8;
    NO_AD1_TEMPERATURES_HIGH = 0;
    do j = 1 to dim(aads);
        if aads(j) gt 160 then NO_AD1_TEMPERATURES_HIGH = NO_AD1_TEMPERATURES_HIGH + 1;
    end;

The downside is that SAS is a commercial package and is not free. I have heard a lot of good things about "R", which is supposedly quite similar, but have not had the opportunity to use it myself.


As someone who has used SAS for many, many different projects: it is terrible, vastly inferior to Pandas or R, and the only reason to ever use it is when you're forced to. Even simple stuff like functions that operate on data have to be hacked on with macros.

Case in point, your production SAS code could be replaced with this Pandas code (and the R code would look very similar):

  temperatures[[TEMPERATURE_COLUMNS]].apply(lambda t: (t > 160).sum(), axis=1)
or if your data is in proper long form

  data.temperature.gt(160).groupby(data.time).sum()
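A runnable sketch of both forms, using invented readings for three sensors (the column names `tc1`..`tc3` and the times are placeholders, not the production schema). The comparison is done first and the grouping second, since the comparison is a Series method.

```python
import pandas as pd

# Hypothetical wide-form readings: one column per sensor, one row per time.
temperatures = pd.DataFrame(
    {"tc1": [150, 170, 165], "tc2": [161, 140, 180], "tc3": [100, 162, 190]},
    index=pd.Index(["09:00", "09:05", "09:10"], name="time"),
)
sensor_columns = ["tc1", "tc2", "tc3"]

# Wide form: compare every cell to the threshold, then count per row.
high_wide = (temperatures[sensor_columns] > 160).sum(axis=1)

# Long form: one row per (time, sensor) reading.
data = temperatures.reset_index().melt(
    id_vars="time", var_name="sensor", value_name="temperature"
)

# Flag readings above the threshold, then count the flags per time.
high_long = data["temperature"].gt(160).groupby(data["time"]).sum()
```

Both results hold the number of sensors reading above 160 at each time.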


I'd like to get my analysis systems as "inclusive" as possible. I'd be using my internal SQL server and just fall into python for my processing if I didn't care about sharing my work.

SAS looks good though. I've looked at it many times and it is a clean solution if you really are in the "big games".


Yeah, that is a good point about trying to separate analysis from the database.

Unfortunately, my work is going in the opposite direction: we are starting to use Hadoop, which makes it quite difficult to do things "outside of the database" because there is just too much data to work with locally.


SQL plus R is a good combo.


Funny you talk about SAS that way.

In my former team, we used SAS for a while and once I introduced the team to Pandas, they happily ditched SAS.


> Pandas, for some reason, cannot stick to python-isms. I can't do simple things like...
>
>     if not df: # Check if DF is empty
>         return ...

This part is a gotcha, but it's also a reflection that allowing if checks for things other than empty leads to subtle bugs. (there are long mailing list posts about it and about the bugs that were uncovered). See here for some explanation about why numpy does it: https://github.com/numpy/numpy/issues/8622
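A short sketch of the ambiguity, with a throwaway frame: asking for the truth value of a DataFrame or a multi-element array raises, because "is it empty?", "is anything nonzero?" and "is everything nonzero?" are all plausible readings. The explicit spellings say which question is meant.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [0, 0]})

# bool() on a DataFrame or a multi-element array raises ValueError.
errors = []
for obj in (df, np.array([0, 0])):
    try:
        bool(obj)
    except ValueError:
        errors.append(type(obj).__name__)

# The explicit spellings say which question is being asked.
is_empty = df.empty            # no rows or no columns?
any_nonzero = df["x"].any()    # at least one truthy value?
all_nonzero = df["x"].all()    # every value truthy?
```

Here the frame is not empty, yet contains only zeros, so a bare `if not df:` could reasonably be expected to go either way.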


I feel your pain. You can pretty much blame either MATLAB origins or a relentless pursuit of runtime efficiency for most of these problems.


Many of the things you list are indeed annoyances when doing data analysis in Python and they make things harder than they should be, but others are typical grievances I see from people new to it, and these do actually go away once you've been working with e.g. Pandas for longer.

> Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators").

What makes Pandas so great is that you can apply arbitrary functions to rows and columns, with the full expressivity of Python. In some cases it might be clunkier (though you should almost never need `.loc` and other indexing methods) but mostly it's just `df.groupby(...).apply(...)` or vectorized methods like `df.column + df.other_column`. This is a huge improvement over having half of your analysis in database queries and half in a programming language.
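As a rough illustration of those two patterns, here is a sketch on an invented donations table (the column names and the `spread` function are hypothetical):

```python
import pandas as pd

# Hypothetical donations table.
df = pd.DataFrame({
    "donor_state": ["CA", "CA", "NY", "NY", "NY"],
    "amount": [100.0, 300.0, 50.0, 50.0, 100.0],
    "fee": [1.0, 3.0, 0.5, 0.5, 1.0],
})

# Vectorized arithmetic on whole columns.
df["net"] = df["amount"] - df["fee"]

# Split-apply-combine with a built-in reduction.
net_by_state = df.groupby("donor_state")["net"].sum()


def spread(amounts):
    # Any Python function that reduces one group's values to a number.
    return amounts.max() - amounts.min()


# The same grouping with an arbitrary function applied to each group.
spread_by_state = df.groupby("donor_state")["amount"].apply(spread)
```

No `.loc` or hand-written index juggling is involved in either step.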

> Matplotlib is unintuitive and poorly documented

Try https://seaborn.pydata.org/ for statistical graphics.

> Pandas also implements it's own versions of standard python objects! You need to know, and go back and forth between two, ways of doing things.

This sucks but is unavoidable, because Python does not have fast data types with support for missing values built in, so all your columns would have to be of mixed type (the actual type + None) and everything would slow down and simple things like computing the mean of a column with missing values would not work.
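A tiny sketch of the missing-value point, with made-up numbers: a plain list containing `None` can't be averaged directly, while the pandas column stores the gap as NaN and skips it by default.

```python
import pandas as pd

raw = [1.0, None, 3.0]

# Plain Python: None does not take part in arithmetic.
try:
    sum(raw) / len(raw)
    plain_python_ok = True
except TypeError:
    plain_python_ok = False

# pandas stores the gap as NaN in a float64 column and skips it by default.
series = pd.Series(raw)
mean_skipping_missing = series.mean()
```

The mean here is taken over the two present values only.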

Note that you don't actually "need to go back and forth" because Pandas will happily convert plain Python objects to their Numpy equivalents for you.

> 3. All these libraries separate logically grouped concepts.

It's not functional, you're just going to have to deal with that. But split-apply-combine and similar patterns are quite elegant in Pandas: http://pandas.pydata.org/pandas-docs/stable/groupby.html

> 4. Because everything is meaningless lists of numbers there are no ways to reuse code.

A lot of data analysis is throw-away code. Some of it can be abstracted into reusable code, some of it can't.

Lastly, don't forget that Python does have a lot of things going for it when it comes to data analysis, from geospatial tools (http://toblerity.org/shapely/) to Bayesian modeling (http://pymc-devs.github.io/pymc3/index.html), as well as interactive coding with Jupyter and Hydrogen for the Atom editor (https://github.com/nteract/hydrogen).


Er, do you really think they "grokked" it immediately? Does anyone truly grok anything, especially in programming, immediately?


The bit I like about this one is that it's also either wrong or highly misleading, depending on your viewpoint. If I have a row that says:

"% of total" : 0.01

I would not expect that to be 1%.

At least, this could easily be the source of an inaccurate calculation elsewhere. This is not a major criticism, but perhaps would be a good point to introduce the idea of testing some of your code, even as a few simple cells that calculate things you expect.
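A sketch of what such test cells might look like, using invented amounts. The checks make the fraction-versus-percent question explicit rather than leaving it to the column name.

```python
import math

import pandas as pd

# Made-up amounts where the first row is exactly 1% of the total.
data = pd.DataFrame({"amount": [1.0, 4.0, 95.0]})
data["% of total"] = data.amount / data.amount.sum()

# Check 1: the shares should add up to one whole.
assert math.isclose(data["% of total"].sum(), 1.0)

# Check 2: the column holds fractions between 0 and 1, so 0.01 means 1%.
assert data["% of total"].between(0, 1).all()

# Scale explicitly when an actual percentage is wanted.
data["percent"] = 100 * data["% of total"]
assert math.isclose(data.loc[0, "percent"], 1.0)
```

If the column is later changed to hold real percentages, the second check fails loudly instead of the error surfacing in a published number.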


> Maybe it's immediately intuitive if you've used pandas.

I don't find this formula any different than anything in any of the math classes I've ever had. Haven't used pandas.


It's vectorized!


For installation of Jupyter, Anaconda works well across all platforms, even most slightly older OSes.

http://jupyter.readthedocs.io/en/latest/install.html

It does work better for people to install Jupyter with Anaconda, rather than use virtual environments, because there's not the overhead of also having to learn about virtual environments. People tend to think of them as just associated with the class and don't use them as much for their own work outside of the workshop or course.


I spend about 8 months of the year teaching pandas to journalism students, and it's a wild ride! Despite some of the iffy syntax and pandas' seeming inability to standardize parameter names, the students seem to grok the workflow much more quickly than wrangling lists and dictionaries in the "normal" world of Python.

I know everyone loves the reproducibility Notebooks supposedly bring to the table, but without a doubt my favorite part is the ability to export super-unattractive matplotlib charts as PDF, clean them up in Illustrator, and suddenly find yourself with publication-quality graphics. Knowing you're producing something more than just some numbers to toss in a story can be a strong sell to a lot of folks.


I really like Jupyter, but somehow I'm not in love with it. Like, every time I fire it up to use it for quick data analysis, I seem to inevitably end up back in sublime + bash, sending plots to disk. Am I the odd one out?


If you know what kind of short analysis you want to do, the benefits of Jupyter are not obvious. If you have to do a lot of exploration, and do longer analyses then it becomes indispensable.


It's also really valuable for sharing. At NPR, I did an analysis of Trump's tweets that was used in a digital post and Morning Edition piece. The notebook was easy to share with the reporter, editor, and readers and accessible enough for them to understand (https://github.com/nprapps/trump-tweet-analysis/blob/master/...).


I guess I could see that. Maybe I need to learn more of the keyboard shortcuts and magic methods.


One thing that might help:

%run the_script_you_wrote_in_sublime.py

will evaluate the script and expose the script globals in the interactive namespace. Then you can mess around with the values and do plots. This gives you the interactivity of the notebook as well as the benefits of the editor you already use
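`%run` itself only works inside IPython/Jupyter. As a loose plain-Python analogue (a different mechanism, not how `%run` is implemented), the standard library's `runpy.run_path` also executes a script and hands back its globals. The script contents below are made up.

```python
import runpy
import tempfile
from pathlib import Path

# Stand-in for the script written in an external editor (contents made up).
script = "values = [1, 2, 3]\ntotal = sum(values)\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "the_script.py"
    path.write_text(script)
    # run_path executes the file and returns its global names as a dict.
    script_globals = runpy.run_path(str(path))

# The script's variables are now available for poking around.
total = script_globals["total"]
```

In a notebook, `%run` is the more convenient route because the names land directly in the interactive namespace.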


I really like what's offered by Jupyter Lab. It's in alpha right now, but I haven't had too many problems with it. It allows you to open text files, terminals, and notebooks in the interface.


I'll give this a shot, I like the idea of editing a file within a separate tab of the web interface.

Anything to help refactor things into and from file and the notebook is nice.


You're not the only one. I don't want notebooks, I want my own damned editor. For Atom, there is https://github.com/nteract/hydrogen which embeds Jupyter/iPython output right inside of your editor, not so different from how RStudio works.


There is also a way to embed Jupyter inside emacs: https://github.com/millejoh/emacs-ipython-notebook


My main criticism is the ipynb files. I don't like that it stores input and output in the same file. Ideally I'd like at least an option for it to put the output in a directory, with images stored as normal, separate files. It's commonly known that the current approach is terrible for version control, for one thing.
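A common workaround is to strip outputs before committing. Here is a minimal sketch that works on the notebook's JSON directly; `strip_outputs` is a hypothetical helper, and the tiny hand-written notebook stands in for what `json.load()` would return for a real version 4 `.ipynb` file.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a v4 notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hand-written notebook, standing in for json.load() on a real .ipynb file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 3,
                }
            ],
        },
    ],
}

stripped = strip_outputs(notebook)

# Stable key order keeps diffs small when the result is written back out.
stripped_text = json.dumps(stripped, indent=1, sort_keys=True)
```

This only addresses the version-control half of the complaint; it does not give you outputs as separate files.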


I'm with you, but I still end up using notebooks because I haven't found anything better for doing analysis. The two things I want the most are:

1. A variable window where I can browse through the values of each variable (like RStudio)
2. The ability to set breakpoints

So basically something in-between PyCharm and Jupyter.


You have created some excellent bash scripts over the years to do data analysis? I find pandas much easier to use in general, especially with:

pd.read_clipboard()

pd.read_excel()


It is hard to overstate just how ferociously bad the experience of getting Jupyter from blank computer to the equivalent of "Hello world" actually is.


I have a strategy that works pretty consistently - close your eyes and ignore the best practices like using Anaconda, Python 3, virtualenv (or venv in py3... oh wait it's a module?) and just install Python 2.7 with pip into default locations (I even run pip with sudo, the horror). It works really well! I run all sorts of CV, ML, deep learning notebooks with no problems.


I agree, I never use virtualenv. I might if I was building a production system, but for my own laptop I feel perfectly capable of remembering/tracking/checking what is in my ~/.local. (I always install with `--user`)

If I really need to containerize something, I use Docker.


I couldn't agree any less:

    % mkvirtualenv -p `which python3` notebook
    (notebook) % pip install jupyter notebook scipy pandas matplotlib pdbpp ipython
(not sure if all of them are really necessary)


I've found that most of the queries that journalists are trying to run are pretty basic, mostly filtering and histograms. Setting up a virtualenv, dependencies, etc can be tough. And RTFM isn't sufficient for someone getting started. I was surprised that nothing existed for this, so I built it.

It has the basics of a Jupyter notebook - filter, sum, average, plot. So far it's attracted a pretty interesting audience including journalists, but also lawyers and consultants.

www.CSVExplorer.com


Side note, I googled "pandas" and get a lot of results related to the python library, and very few related to the large mammal. Bing doesn't give me any related to the python library. Google knows me too well.


Excellent share.



