Some Data Munging with Pandas

Posted on August 14, 2018

When doing data projects, Pandas is the tool that I always reach for first. Recently, I was working with Capital Bikeshare's public API, and along with my usual method of cleaning up data, I made a discovery that makes certain jobs easier.

First, here is my usual way of gathering data from a JSON feed.

I like to create Pandas Series first, instead of starting with a DataFrame, because I believe it makes the data collection more modular and makes it easier later on to build more DataFrames with different conditions attached to them. I create the Series first, and then I iterate over the entire JSON feed to grab the rest of the data.
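A minimal sketch of that collection step, not the exact code from this project. The `feed` dictionary and its field names (`stations`, `name`, `bikes_available`, `bikes_disabled`) are hypothetical stand-ins for the real JSON feed, which may be shaped differently.

```python
import pandas as pd

# Hypothetical sample standing in for the parsed JSON feed;
# the real feed's keys and nesting are assumptions here.
feed = {
    "stations": [
        {"name": "Station A", "bikes_available": 5, "bikes_disabled": 1},
        {"name": "Station B", "bikes_available": 10, "bikes_disabled": 0},
        {"name": "Station C", "bikes_available": 8, "bikes_disabled": 2},
    ]
}

stations = feed["stations"]

# One Series per field, each built by iterating over the whole feed.
names = pd.Series([s["name"] for s in stations], name="Station")
bikes_available = pd.Series(
    [s["bikes_available"] for s in stations], name="Bikes Available"
)
bikes_disabled = pd.Series(
    [s["bikes_disabled"] for s in stations], name="Bikes Disabled"
)
```

Keeping each field in its own Series means any subset of them can be combined into a DataFrame later without touching the feed again.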

Next, I combine the Series into a DataFrame. This step is simple, but it is worth spelling out anyway.

The reason I have astype('int32') there is to standardize the data and to make sure that the dtype does not appear in the Pandas DataFrame. The dtype appearing in the DataFrame can sometimes be an issue, and astype('int32') solves it nicely.
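A sketch of how the combining step might look, with astype('int32') applied to the numeric columns. The sample values are made up, and the column names other than "Bikes Available" and "Bikes Disabled" are assumptions.

```python
import pandas as pd

# Made-up sample Series; one column arrives as strings to show the cast.
names = pd.Series(["Station A", "Station B", "Station C"])
bikes_available = pd.Series([5, 10, 8])
bikes_disabled = pd.Series(["1", "0", "2"])

# Build the DataFrame from the Series, casting the counts to a common dtype.
docksDf = pd.DataFrame({
    "Station": names,
    "Bikes Available": bikes_available.astype("int32"),
    "Bikes Disabled": bikes_disabled.astype("int32"),
})

print(docksDf.dtypes)
```

After the cast, both count columns share the same integer dtype, so arithmetic and comparisons between them behave consistently.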

Belatedly, I discovered a nice trick for using NumPy on DataFrames. I tried to do the usual a < b < c style comparison on a Pandas DataFrame with where(), but I got an error. After some googling, I discovered that np.logical_and lets you build these kinds of logic statements when working with DataFrames. With np.logical_and, I was able to do the comparison that I wanted.
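A small sketch of the idea, using made-up station data and hypothetical bounds `low` and `high`. It uses plain boolean indexing rather than where() to keep the example short.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
docksDf = pd.DataFrame({
    "Station": ["A", "B", "C", "D"],
    "Bikes Available": [2, 7, 12, 20],
})

low, high = 5, 15  # hypothetical bounds

# Writing low < docksDf["Bikes Available"] < high raises a ValueError,
# because Python tries to reduce the first comparison to a single bool.
# np.logical_and combines the two comparisons element by element instead.
in_range = np.logical_and(
    docksDf["Bikes Available"] > low,
    docksDf["Bikes Available"] < high,
)

# Keep only the rows where both conditions hold.
mid_sized = docksDf[in_range]
print(mid_sized)
```

The mask is a boolean Series with one entry per row, so it can be reused for filtering, counting, or further combination with other conditions.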

The virtue of this trick is that, while simple, it opens up more ways to look at the data. For example, just by doing "Disabled to Available Ratio": (docksDf["Bikes Disabled"] / docksDf["Bikes Available"]).astype('float'), I am able to find the ratio of disabled to available bikes. This is convenient because this data point is not given directly, yet some simple work in Pandas reveals it. Doing this, I was able to find out that at some point on Sunday, around 10% of all Capital Bikeshare bikes in D.C. were disabled.
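A sketch of the ratio calculation on made-up numbers. The system-wide figure at the end is one possible way to get an overall percentage (total disabled divided by total available); that definition is an assumption, not necessarily how the 10% figure was computed.

```python
import pandas as pd

# Made-up sample counts for illustration.
docksDf = pd.DataFrame({
    "Bikes Available": [9, 18, 3],
    "Bikes Disabled": [1, 2, 0],
})

# Per-station ratio, built the same way as the expression in the text.
ratioDf = pd.DataFrame({
    "Disabled to Available Ratio": (
        docksDf["Bikes Disabled"] / docksDf["Bikes Available"]
    ).astype("float"),
})

# Assumed definition of a system-wide ratio: total disabled over total available.
overall_ratio = docksDf["Bikes Disabled"].sum() / docksDf["Bikes Available"].sum()
print(ratioDf)
print(overall_ratio)
```

A station with zero available bikes would produce an infinite or NaN ratio, so real data may need those rows filtered out first.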

While more complicated munging is possible with lambdas and other tools in Python, np.logical_and is a very convenient tool for doing some data exploration.