Pandas for Manipulating and Reshaping Data in Python

Pandas is the foremost data analysis library in Python, especially for handling rectangular data in DataFrames. Reshaping can nevertheless seem trickier than it should be, because pandas places a heavy emphasis on DataFrame indices.

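Gotcha: a NumPy-backed integer pd.Series cannot hold NaN, so an integer column with missing values is upcast to float64. Either drop the missing values with dropna, or convert the column to strings or to the nullable Int64 dtype.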
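One reading of this gotcha is that integer columns are silently turned into floats as soon as a missing value appears. The sketch below, using made-up values, shows that upcast and two possible workarounds; which one fits depends on the data.

```python
import numpy as np
import pandas as pd

# Illustrative values only: integers with one missing entry.
s = pd.Series([1, 2, np.nan])

# The missing value forces a float dtype, because NumPy int64 has no NaN.
original_dtype = str(s.dtype)

# Option 1: drop the missing rows, then cast back to a plain integer dtype.
dropped = s.dropna().astype("int64")

# Option 2: use the pandas nullable integer dtype (capital "I"),
# which keeps the missing entry as <NA>.
nullable = s.astype("Int64")

print(original_dtype, dropped.dtype, nullable.dtype)
```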

Figure: a table showing the French Ligue 1 football standings.

The closest equivalent to Excel in the Python data processing ecosystem, pandas has built-in functions for reshaping, aggregating, cleaning and plotting.

Here we import pandas as pd.

Pandas is built on top of the NumPy library, which provides efficient vectorised operations. Operations on whole columns therefore run in compiled code rather than in Python loops, which keeps them fast.
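A minimal sketch of the import and of a vectorised operation follows. The column names and values are invented for illustration and are not taken from the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"goals_for": [10, 7, 3], "goals_against": [4, 7, 9]})

# A vectorised operation: the subtraction is applied to whole columns
# at once, with no explicit Python loop.
df["goal_difference"] = df["goals_for"] - df["goals_against"]

# The column values can be retrieved as a NumPy array.
values = df["goal_difference"].to_numpy()
print(values)
```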

Each column in a pandas DataFrame is a pd.Series.

Most methods in pandas work equally well on Series objects, but the most valuable way of working with data is to have the Series together in a DataFrame.
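To make the relationship concrete, here is a small sketch with invented data: selecting one column returns a Series, and the same method (sum, in this case) is available on both objects.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"team": ["A", "B", "C"], "points": [30, 25, 20]})

# Selecting a single column with square brackets returns a Series.
points = df["points"]
print(type(points))

# The same method works on a Series and on a DataFrame.
total_from_series = points.sum()            # a single number
totals_from_frame = df[["points"]].sum()    # a Series with one entry per column
```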

Here we introduce some of the more rudimentary operations and their associated syntax.

pd.DataFrame() is the DataFrame constructor. It takes as input a Python dict or list, or no data at all (None) to initialise an empty DataFrame.

The DataFrame's column labels and row labels are subsequently accessed through the df.columns and df.index attributes.
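A short sketch of the constructor and the two attributes, again with made-up data. The variable names are only for illustration.

```python
import pandas as pd

# From a dict: keys become column labels.
from_dict = pd.DataFrame({"team": ["A", "B"], "points": [30, 25]})

# From a list of rows: column labels are supplied separately.
from_list = pd.DataFrame([["A", 30], ["B", 25]], columns=["team", "points"])

# With no data: an empty DataFrame.
empty = pd.DataFrame()

# Column labels and row labels are attributes of the DataFrame itself.
column_labels = list(from_dict.columns)
row_labels = list(from_dict.index)
print(column_labels, row_labels)
```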

The groupby syntax is consistently unremarkable. If you know SQL or R, the main conceptual shift comes from handling the underlying data structures that are the building blocks of pandas DataFrames. Consider grouping by column1.

df.groupby([column1]).sum()

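The .sum() takes the sum of the columns in df within each group of column1, and column1 becomes the new index. Older pandas versions silently dropped non-numeric columns here; from pandas 2.0 you need to pass numeric_only=True to restrict the sum to numeric columns.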

Specify the column to sum using square-bracket indexing notation as usual: df.groupby([column1])[SpecificColumnToSum].sum()
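Both forms can be tried on a small invented DataFrame. Here "team" stands in for column1 and "goals" for SpecificColumnToSum; every column other than the key is numeric, so the sketch behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "goals": [2, 1, 0, 3],
    "cards": [1, 0, 2, 2],
})

# Sum every remaining column within each team; "team" becomes the index.
summed = df.groupby(["team"]).sum()

# Sum a single column only; the result is a Series indexed by team.
goals_only = df.groupby(["team"])["goals"].sum()

print(summed)
print(goals_only)
```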

Speaking of indices: they have a special role in pandas, more so than they would in R. They act more like the primary key in SQL, without the uniqueness constraint, and generally indicate the grain of the table.
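The sketch below illustrates both points with invented data: an index is allowed to contain duplicates, and choosing the index columns is a way of stating the grain (here, one row per team per season).

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "team": ["A", "A", "B"],
    "season": [2020, 2021, 2020],
    "points": [70, 65, 80],
})

# An index on team alone is legal even though "A" appears twice.
by_team = df.set_index("team")
team_index_is_unique = by_team.index.is_unique

# Looking up a duplicated label returns every matching row.
rows_for_a = by_team.loc["A"]

# Indexing on team and season together matches the grain of the table.
by_grain = df.set_index(["team", "season"])
grain_index_is_unique = by_grain.index.is_unique
```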

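The pd.DataFrame.merge method has a validate parameter that checks the mapping of matching keys across the two DataFrames and raises a MergeError if it is not of the expected type, preventing unexpected bloating of the resultant DataFrame. The allowed options for validate are 'one_to_one' (or '1:1'), 'one_to_many' ('1:m'), 'many_to_one' ('m:1') and 'many_to_many' ('m:m').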
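As a rough illustration of validate, the sketch below merges a results table onto a teams lookup. The tables and column names are invented; the second merge deliberately states the wrong relationship to show the check failing.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: one row per team, and several results per team.
teams = pd.DataFrame({"team": ["A", "B"], "city": ["Paris", "Lyon"]})
results = pd.DataFrame({"team": ["A", "A", "B"], "goals": [2, 1, 3]})

# Many results map to one team, so this check passes.
merged = results.merge(teams, on="team", how="left", validate="many_to_one")

# Claiming a one-to-one mapping fails, because "A" repeats in results.
try:
    results.merge(teams, on="team", validate="one_to_one")
    raised = False
except MergeError:
    raised = True

print(merged)
print("one_to_one check raised:", raised)
```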

Also note that, in contrast to SQL join behaviour, nulls will match each other as join keys, despite np.nan not equalling itself elsewhere in the Python ecosystem.
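A small sketch of this behaviour, using invented key values. Each side has one row with a missing key, and the inner merge pairs those two rows together.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each side has one row whose key is missing.
left = pd.DataFrame({"key": [1.0, np.nan], "left_val": [1, 2]})
right = pd.DataFrame({"key": [1.0, np.nan], "right_val": [10, 20]})

# The rows with missing keys are matched against each other.
merged = left.merge(right, on="key", how="inner")

# By contrast, NaN does not compare equal to itself.
nan_equals_itself = np.nan == np.nan

print(merged)
print(nan_equals_itself)
```

If the null matches are unwanted, one option is to drop rows with missing keys (for example with dropna on the key column) before merging.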

Joins in the SQL sense are covered above. For appending, or stacking, DataFrames on top of one another (or beside one another), pd.concat is different. Note that it is a top-level function, not a DataFrame method: the DataFrames to combine are passed as a list to the first parameter.

😆 This is confusing, but a valiant attempt to describe the join-like behaviour of pd.concat

The default behaviour is to stack DataFrames on top of one another vertically (axis=0 or axis='index'), in which case the rows are concatenated and the join parameter controls how the columns are aligned. The default for the join parameter is to keep the union of the columns appearing in either DataFrame (join='outer'), filling the gaps with NaN. To keep only those columns that are common to both, set join='inner'.
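The vertical case can be sketched as follows, with two invented DataFrames that share only the "team" column.

```python
import pandas as pd

# Hypothetical data: the two frames share only the "team" column.
df1 = pd.DataFrame({"team": ["A", "B"], "points": [30, 25]})
df2 = pd.DataFrame({"team": ["C"], "wins": [5]})

# Default: stack rows and keep the union of the columns (join="outer").
outer = pd.concat([df1, df2])

# Keep only the columns present in both frames.
inner = pd.concat([df1, df2], join="inner")

# The original row labels are kept unless ignore_index=True is passed.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(outer)
print(inner)
```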

The set logic mentioned above, inner (intersection) or outer (union), is applied to the labels of the other axis, which act as the join key.

The axis='columns' or axis='index' parameter gives the option to 'join' on either the index or the column names. With axis='columns' the DataFrames are placed side by side and aligned on their index, so the index acts like a first-class join column. With axis='index' they are stacked and aligned on their column names, so the column names play the role of the index.

So for axis='columns' we see the red scenario in the figure below. With join='inner', only the rows whose index labels appear in both DataFrames are returned.
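A minimal sketch of the side-by-side case, using invented DataFrames whose indices only partly overlap.

```python
import pandas as pd

# Hypothetical data: "A" appears only in the left frame.
left = pd.DataFrame({"points": [30, 25, 20]}, index=["A", "B", "C"])
right = pd.DataFrame({"wins": [9, 6]}, index=["B", "C"])

# Side by side, aligned on the index; unmatched rows are filled with NaN.
outer = pd.concat([left, right], axis="columns")

# Only the index labels present in both frames are kept.
inner = pd.concat([left, right], axis="columns", join="inner")

print(outer)
print(inner)
```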

For comparison, the Python documentation describes os.path.join(path, *paths) as follows: Join one or more path components intelligently. The return value is the concatenation of path and any members of *paths with exactly one directory separator following each part except the last, meaning that the result will only end in a separator if the last part is empty. If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.

In the above the *paths is any number of unnamed comma separated arguments.

The **kwargs will give you all keyword arguments except for those corresponding to a formal parameter as a dictionary.
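The two conventions can be seen in a small sketch. The describe function and the file names below are hypothetical and exist only to show what ends up in each collection.

```python
import os

def describe(*paths, **kwargs):
    """Hypothetical helper: return the collected arguments unchanged."""
    # paths is a tuple of the positional arguments;
    # kwargs is a dict of the keyword arguments.
    return paths, kwargs

# os.path.join accepts any number of positional path components.
joined = os.path.join("data", "raw", "ligue1.csv")

positional, keyword = describe("data", "raw", sep="/")
print(joined)
print(positional, keyword)
```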

This was a quick summary of the more important aspects of reshaping a data table. Hope it was fun!
