When I started working with pandas I noticed that there are many ways to subset, filter and join data, but I was lacking a systematic overview: how do the different approaches differ, and when should you use which?
In this blog post we'll look at different ways of subsetting, filtering and combining DataFrames.
Subsetting Data: Selecting subsets of rows and columns by labels and positions
`.iloc` stands for integer-location based indexing: it takes integer input referring to the position of the rows or columns you want to select. You can pass a single integer (`0`) to select a single row, a list of integers (`[0, 4, 9]`) or slice objects (`1:2`, `3:5`). Here are some examples:
```python
# rows
df.iloc[0]      # first row of the data frame
df.iloc[1]      # second row of the data frame
df.iloc[-1]     # last row of the data frame
df.iloc[0:10]   # first 10 rows of the data frame

# columns
df.iloc[:, 0]     # first column
df.iloc[:, 1]     # second column
df.iloc[:, -1]    # last column of the data frame
df.iloc[:, 0:10]  # first 10 columns of the data frame
```
`.loc` is label based. You can use a single label (`'id'`), a list of labels (`['column_a', 'column_b']`) or a slice object (`'row_a':'row_c'`, `'column_b':'column_d'`). Note that if you pass an integer value to `.loc`, it will be interpreted as a label, not as a position.
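A minimal sketch with a made-up DataFrame (all names here are placeholders for illustration):

```python
import pandas as pd

# toy DataFrame, purely for illustration
df = pd.DataFrame({'column_a': [1, 2, 3],
                   'column_b': ['x', 'y', 'z'],
                   'column_c': [0.1, 0.2, 0.3]},
                  index=['row_a', 'row_b', 'row_c'])

df.loc['row_a']                                 # single row by label
df.loc[:, ['column_a', 'column_b']]             # list of column labels
df.loc['row_a':'row_c', 'column_b':'column_c']  # label slices
```

Unlike `.iloc`, label slices in `.loc` include both endpoints.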
If you want to select rows or columns whose labels contain a certain string (e.g. a prefix or suffix), you can use `filter()`. You pass the string you are looking for to the `like` or `regex` parameter. The `axis` parameter determines whether you search the index (`axis=0`) or the column names (`axis=1`). For example, you can select all columns that contain the string `foo` with `df.filter(like='foo', axis=1)`.
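A minimal, hypothetical sketch of both parameters:

```python
import pandas as pd

df = pd.DataFrame({'foo_1': [1], 'foo_2': [2], 'bar': [3]})

df.filter(like='foo', axis=1)         # columns whose name contains 'foo': foo_1, foo_2
df.filter(regex=r'^foo_\d$', axis=1)  # columns matching a regular expression
```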
Here is a brief overview of the subsetting options:
| index | how | example with slice objects |
|---|---|---|
| `.iloc` | by position | `df.iloc[<row_nr_start>:<row_nr_end>, <col_nr_start>:<col_nr_end>]` |
| `.loc` | by label | `df.loc[<row_label_start>:<row_label_end>, '<col_name_start>':'<col_name_end>']` |
| `filter` | by label and regexes | `df.filter(like=<some_string>, axis=0)` |
Compare `iloc`, `loc` and `filter`
All three ways above allow you to subset your data. However, sometimes one or the other may be easier to use or lead to fewer errors, e.g. when columns are deleted from your DataFrame. To compare the three approaches, let's look at the diabetes dataset:
```python
from sklearn import datasets
import pandas as pd

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df.head()
```
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
(1) Select the `age` column
| | code | returns |
|---|---|---|
| `iloc` | `df.iloc[:, 0]` | `Series` |
| `loc` | `df.loc[:, ['age']]` | `DataFrame` |
| `filter` | `df.filter(items=['age'])` | `DataFrame` |
Note that `iloc` returns a `Series` if you select only one column. Also, if the position of the first column ever changes (e.g. because you reset the index), you will suddenly retrieve a different column. If you need a specific column, I suggest you use `loc` or `filter` and pass the name of that column explicitly.
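A quick sketch of that pitfall, using the diabetes `df` from above:

```python
df.iloc[:, 0].name      # 'age'

df2 = df.reset_index()  # inserts the old index as a new first column named 'index'
df2.iloc[:, 0].name     # 'index' -- position 0 is no longer 'age'
df2.loc[:, 'age'].name  # 'age'   -- unaffected by the shift
```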
(2) Select the columns `s1` to `s6`
| | code | returns |
|---|---|---|
| `iloc` | `df.iloc[:, 4:10]` | `DataFrame` |
| `loc` | `df.loc[:, 's1':'s6']` | `DataFrame` |
| `filter` | `df.filter(regex=r's\d')` | `DataFrame` |
Again, if the positions of the columns change, `iloc` will lead to errors. If your columns follow a pattern rather than a fixed position or name, I suggest you use `filter`, as it is less error-prone.
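For example, the regex keeps matching even if the column order changes (a sketch based on the diabetes `df`):

```python
# reorder the columns; the s-columns no longer sit at positions 4 to 9
shuffled = df[['s3', 'bmi', 's1', 'age', 's2', 'sex', 's6', 'bp', 's4', 's5']]

shuffled.iloc[:, 4:10].columns.tolist()         # mixed columns, not what we wanted
shuffled.filter(regex=r's\d').columns.tolist()  # still all six s-columns
```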
Filtering Data: Selecting subsets of rows based on conditions
There are multiple ways to filter data with pandas. We'll look at how boolean indexing works and how to filter with the `query()` function.
If you are familiar with SQL, `query` might be the most intuitive approach for you. You just pass the condition as a string to the `query` function. The condition is almost the same code as you would put in the `WHERE` clause of a SQL query; only the operators are slightly different:
| SQL operator | pandas equivalent |
|---|---|
| `OR` | `\|` |
| `AND` | `&` |
| `NOT`, `!` | `~` |
So you could, for example, use `df.query("column_a == 'foo' & column_b == 1")`.
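A runnable sketch with made-up data; note that `query` can also reference local Python variables with the `@` prefix:

```python
import pandas as pd

toy = pd.DataFrame({'column_a': ['foo', 'foo', 'bar'],
                    'column_b': [1, 2, 1]})

toy.query("column_a == 'foo' & column_b == 1")  # one matching row

threshold = 1
toy.query("column_b > @threshold")              # @ references a local variable
```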
Boolean masking is another powerful way to access data. The idea is to index the DataFrame with a boolean mask: a vector with one `True` or `False` value for each row in the DataFrame.
```python
new_df = df[<some boolean mask>]
```

Only the rows where the mask is `True` will be returned. Of course, you could pass such a vector directly (here for a hypothetical DataFrame with 13 rows)

```python
my_mask = [True, False, True, False, True, False, True, False, True, False, True, False, True]
df[my_mask]
```

or you can create it with conditions:

```python
df[df.<colname> == 'foo']
```
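A concrete sketch with a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['foo', 'bar', 'foo'],
                    'value': [1, 2, 3]})

mask = toy['name'] == 'foo'  # boolean Series: True, False, True
toy[mask]                    # keeps rows 0 and 2

# combine conditions with & (and), | (or) and ~ (not);
# each condition needs its own parentheses
toy[(toy['name'] == 'foo') & (toy['value'] > 1)]
```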
Compare `query` vs. boolean masking
Let's compare the two approaches, this time using the Boston housing dataset.
```python
boston = datasets.load_boston()  # note: deprecated and removed in scikit-learn 1.2
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()
```
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
Here are three examples of how to achieve the same thing with both approaches.
| | query | boolean mask |
|---|---|---|
| one condition | `df.query("AGE < 50")` | `df[df.AGE < 50]` |
| two conditions | `df.query("AGE < 50 & RAD == 2")` | `df[(df.AGE < 50) & (df.RAD == 2)]` |
| multiple values | `df.query('RAD in [1, 2]')` | `df[df.RAD.isin([1, 2])]` |
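You can verify that both variants return identical results with `equals()` (assuming the Boston `df` from above):

```python
df.query("AGE < 50").equals(df[df.AGE < 50])               # True
df.query("RAD in [1, 2]").equals(df[df.RAD.isin([1, 2])])  # True
```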
One could argue that `query` is more readable and requires less typing, as you do not have to repeat the name of the DataFrame in the condition.
Combining Data: joins
Pandas has three functions for combining tables: `join`, `merge` and `concat`. `join` and `merge` are similar to joining tables in SQL with a (left, right, etc.) join: `join` will attempt a left join on the indices, keeping all columns, while `merge` will perform an inner join on the common columns. Both combine exactly two tables. `concat` is the most flexible option and lets you combine the rows or columns of multiple DataFrames: the datasets are simply stitched together along an axis, either the row axis or the column axis. So you can use `concat` both for joining DataFrames and for SQL-like unions.
| function | combines | default | how |
|---|---|---|---|
| `join` | 2 dataframes by indices | left | left, right, inner, outer |
| `merge` | 2 dataframes by columns or index | inner | left, right, inner, outer |
| `concat` | n dataframes by columns, index, or rows | append rows | inner, outer (via the `join` parameter) |
Compare `join` vs. `concat` vs. `merge`
Let's create some datasets and see what happens if we combine them with their default parameters. Note that the indices differ!
```python
A = pd.DataFrame({'ID': [1, 2, 3, 4],
                  'Name': ['Tom', 'Lisa', 'Sara', 'John'],
                  'Age': ['10', '11', '12', '11']},
                 index=[0, 1, 2, 3])

B = pd.DataFrame({'ID': [1, 3, 4, 6],
                  'Gender': ['m', 'f', 'm', 'm']},
                 index=[0, 1, 5, 6])
```
join
`A.join(B)` will throw an error: `columns overlap but no suffix specified: Index(['ID'], dtype='object')`. Since the `ID` column exists in both datasets, we have to specify a suffix:

```python
A.join(B, lsuffix="a")
```
```
   IDa  Name Age   ID Gender
0    1   Tom  10  1.0      m
1    2  Lisa  11  3.0      f
2    3  Sara  12  NaN    NaN
3    4  John  11  NaN    NaN
```
Note that `join` performs a left join on the index and completely ignores the `ID` column.
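If you would rather join on the `ID` column, one option is to move it into the index first (a sketch, not the only way):

```python
A.set_index('ID').join(B.set_index('ID'))
#      Name Age Gender
# ID
# 1     Tom  10      m
# 2    Lisa  11    NaN
# 3    Sara  12      f
# 4    John  11      m
```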
merge
Per default, `merge` performs an inner join on the common columns; it does not care about the index.

```python
A.merge(B)
```
```
   ID  Name Age Gender
0   1   Tom  10      m
1   3  Sara  12      f
2   4  John  11      m
```
Using the `how` parameter, you can determine the type of join.
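For example, a left join keeps all rows of `A`:

```python
A.merge(B, how='left')
```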
```
   ID  Name Age Gender
0   1   Tom  10      m
1   2  Lisa  11    NaN
2   3  Sara  12      f
3   4  John  11      m
```
concat
Per default, `concat` stacks the datasets on top of each other (a union of the rows), keeping each DataFrame's original index, and fills in `NaN`s for columns that are missing in one of the datasets.

```python
pd.concat([A, B])
```
```
   ID  Name  Age Gender
0   1   Tom   10    NaN
1   2  Lisa   11    NaN
2   3  Sara   12    NaN
3   4  John   11    NaN
0   1   NaN  NaN      m
1   3   NaN  NaN      f
5   4   NaN  NaN      m
6   6   NaN  NaN      m
```
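Two useful variations (a sketch): `axis=1` stitches columns side by side, aligning rows on the index, and `ignore_index=True` builds a fresh row index when stacking:

```python
pd.concat([A, B], axis=1)             # columns side by side, aligned on the index
pd.concat([A, B], ignore_index=True)  # stacked rows, re-indexed 0..n-1
```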
I typically use `merge` for joining DataFrames in a SQL-like way and `concat` for stitching DataFrames together or for SQL-like unions.
Conclusion
We have seen several ways to subset, filter and join DataFrames. Some of the functions will get you to the same results, while others have small differences you need to be aware of.