When I started working with pandas I noticed that there are many ways to subset, filter and join data, but I was lacking a systematic overview. How do the different approaches differ, and when should you use which?

In this blogpost we’ll look at different ways of subsetting, filtering and combining DataFrames.

Subsetting Data: Selecting subsets of rows and columns by labels and positions

.iloc stands for integer-location based indexing, which means it takes integer input referring to the position of the rows or columns you want to select. You can pass a single integer (0) to select a single row, a list of integers ([0,4,9]) or slice objects (1:2, 3:5). Here are some examples:

# rows
df.iloc[0]       # first row of the data frame
df.iloc[1]       # second row of the data frame
df.iloc[-1]      # last row of the data frame
df.iloc[0:10]    # first 10 rows of the data frame
# columns
df.iloc[:, 0]    # first column
df.iloc[:, 1]    # second column
df.iloc[:, -1]   # last column of the data frame
df.iloc[:, 0:10] # first 10 columns of the data frame

.loc is label based. You can use a single label ('id'), a list of labels (['column_a', 'column_b']) or a slice object (row_a:row_c, column_b:column_d). Note that, unlike positional slices, label slices include both endpoints, and that an integer passed to .loc is interpreted as a label, not as a position.
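
Here is a minimal sketch (assuming a DataFrame df with columns named 'age' and 'bmi' and a default integer index):

df.loc[:, 'age']           # single column by label -> Series
df.loc[:, ['age', 'bmi']]  # list of labels -> DataFrame
df.loc[:, 'age':'bmi']     # label slice; unlike .iloc, BOTH endpoints are included
df.loc[0]                  # the row with *label* 0, not necessarily the first row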

If you want to select rows or columns whose labels contain a certain string (e.g. a prefix or postfix), you can use filter(). You pass the string you are looking for to either the like or the regex parameter. The axis parameter determines whether you search the index (axis = 0) or the column names (axis = 1). For example, you can select all columns that contain the string foo with df.filter(like='foo', axis=1).
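
As a small, hypothetical example, assume a df whose columns include foo_a, foo_b and bar:

df.filter(like='foo', axis=1)   # foo_a, foo_b: names containing 'foo'
df.filter(regex='_b$', axis=1)  # foo_b: names ending in '_b'
df.filter(items=['bar'])        # exactly the columns listed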

Here is a brief overview of the three subsetting methods:


| method | selects | example with slice object |
|---|---|---|
| .iloc | by index (position) | df.iloc[<row_nr_start>:<row_nr_end>, <col_nr_start>:<col_nr_end>] |
| .loc | by label | df.loc[<row_label_start>:<row_label_end>, '<col_name_start>':'<col_name_end>'] |
| filter | by label and regexes | df.filter(like=<some_string>, axis=0) |

Compare iloc, loc and filter

All three methods above let you subset your data. However, sometimes one or the other is easier to use or leads to fewer errors, e.g. if you delete columns from your DataFrame. To compare the three methods, let’s look at the diabetes dataset:

from sklearn import datasets
import pandas as pd
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df.head()

age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641

(1) Select the age column


| method | code | returns |
|---|---|---|
| iloc | df.iloc[:, 0] | Series |
| loc | df.loc[:, ['age']] | DataFrame |
| filter | df.filter(items=['age']) | DataFrame |

Note that iloc returns a Series if you select only a single column. Also, if your first column ever changes (e.g. because you reset the index or drop a column), you will suddenly retrieve a different column. If you need a specific column, I suggest you use loc or filter, which let you pass the name of that column explicitly.
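
A quick sketch of that pitfall, using the diabetes df from above:

df.iloc[:, 0].name                        # 'age'
df.drop(columns=['age']).iloc[:, 0].name  # 'sex' -- position 0 now points to another column
df.loc[:, 'age']                          # always 'age', or a clear KeyError if it is gone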

(2) Select the columns s1 to s6


| method | code | returns |
|---|---|---|
| iloc | df.iloc[:, 4:10] | DataFrame |
| loc | df.loc[:, 's1':'s6'] | DataFrame |
| filter | df.filter(regex=r's\d') | DataFrame |

Again, if the positions of the columns change, iloc will lead to errors. If your columns follow a pattern rather than a fixed position or name, I suggest you use filter, as it is less error prone.

Filtering Data: Selecting subsets of rows based on conditions

There are multiple ways to filter data with pandas. We’ll look at how boolean indexing works and how to filter with the query() function.

If you are familiar with SQL, query might be the most intuitive way for you. You just pass the condition as a string to the query function. The condition is almost the same code as you would put in the WHERE clause of a SQL query; just the operators are slightly different:


| SQL operator | pandas equivalent |
|---|---|
| OR | \| |
| AND | & |
| NOT, ! | ~ |

So you could for example use df.query("column_a == 'foo' & column_b == 1")
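
A few more examples with the same hypothetical columns, including the @ syntax for referencing a Python variable inside a query string:

df.query("column_a == 'foo' | column_b == 1")  # OR
df.query("~(column_a == 'foo')")               # NOT
limit = 1
df.query("column_b > @limit")                  # reference a local variable with @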

Boolean masking is another powerful way to access data. The idea is to index the DataFrame with a boolean mask: a vector with one True or False value per row of the DataFrame.

new_df = df[<some boolean mask>]

Only the rows with True values will be returned. Of course, you could pass such a vector directly (it needs exactly one entry per row)

my_mask = [True, False, True, False, True, False, True, False, True, False, True, False, True]
df[my_mask]  

or you can create it from conditions:

df[df.<colname> == 'foo']
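
When you combine several conditions, note that & (and), | (or) and ~ (not) bind more tightly than comparisons like ==, so each condition needs its own parentheses. A sketch with hypothetical columns:

mask = (df.column_a == 'foo') & ~(df.column_b == 1)
df[mask]
# without the parentheses, 'foo' & df.column_b would be evaluated first and raise an error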

Compare query vs. boolean masking

Let’s compare the two approaches, this time using the boston dataset.

boston = datasets.load_boston()  # note: load_boston was removed from scikit-learn in version 1.2
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Here are three examples of how to achieve the same thing with both approaches:


| | query | boolean mask |
|---|---|---|
| one condition | df.query("AGE < 50") | df[df.AGE < 50] |
| two conditions | df.query("AGE < 50 & RAD == 2") | df[(df.AGE < 50) & (df.RAD == 2)] |
| multiple values | df.query("RAD in [1, 2]") | df[df.RAD.isin([1, 2])] |
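
If you want to convince yourself that both columns of the table return the same result, equals() gives a quick check (using the boston df from above):

df.query("AGE < 50").equals(df[df.AGE < 50])  # True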

One could argue that query is more readable and requires less typing, as you do not have to repeat the name of the DataFrame in every condition.

Combining Data: joins

Pandas has three functions for combining tables: join, merge and concat. join and merge are similar to an SQL (left, right, etc.) join and combine two tables: join by default performs a left join on the indices, keeping all columns, while merge by default performs an inner join on the common columns. concat is the most flexible option: it stitches any number of DataFrames together along an axis, either the row axis or the column axis. So you can use concat both for joining and for SQL-like unions of DataFrames.


| function | combines | default | how options |
|---|---|---|---|
| join | 2 DataFrames by indices | left | left, right, inner, outer |
| merge | 2 DataFrames by columns or index | inner | left, right, inner, outer |
| concat | n DataFrames by columns, index, or rows | append rows | axis, join ('inner'/'outer') |

Compare join vs. concat vs. merge

Let’s create two small datasets and see what happens if we combine them with their default parameters. Note that the indices differ!

A = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Tom', 'Lisa', 'Sara', 'John'],
                    'Age': ['10', '11', '12', '11']},
                   index=[0, 1, 2, 3])


B = pd.DataFrame({'ID': [1,3,4,6],
                  'Gender': ['m', 'f', 'm', 'm']},
                  index=[0, 1, 5, 6])

join

A.join(B)

will throw an error: columns overlap but no suffix specified: Index(['ID'], dtype='object'). Since we have the ID column in both DataFrames, we have to specify a suffix:

A.join(B, lsuffix = "a")


   IDa  Name Age   ID Gender
0    1   Tom  10  1.0      m
1    2  Lisa  11  3.0      f
2    3  Sara  12  NaN    NaN
3    4  John  11  NaN    NaN

Note that join does a left join on the index and completely ignores the ID column.
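
If you actually want join to match on the ID column, one option (just a sketch; merge is usually simpler here) is to move ID into the index of B first:

A.join(B.set_index('ID'), on='ID')  # matches A.ID against B's new index; Lisa (ID 2) gets NaN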

merge

By default, merge performs an inner join on the common columns. It does not care about the index.

A.merge(B)


   ID  Name Age Gender
0   1   Tom  10      m
1   3  Sara  12      f
2   4  John  11      m

Using the how parameter, you can determine the type of join, e.g. a left join that keeps all rows of A:

A.merge(B, how = 'left')

   ID  Name Age Gender
0   1   Tom  10      m
1   2  Lisa  11    NaN
2   3  Sara  12      f
3   4  John  11      m

concat

By default, concat stacks the datasets on top of each other, keeping the original index labels (note the duplicates below) and creating NaNs for columns that are missing in one of the datasets.

pd.concat([A, B])

   ID  Name  Age Gender
0   1   Tom   10    NaN
1   2  Lisa   11    NaN
2   3  Sara   12    NaN
3   4  John   11    NaN
0   1   NaN  NaN      m
1   3   NaN  NaN      f
5   4   NaN  NaN      m
6   6   NaN  NaN      m
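
If the duplicated index labels bother you, ignore_index=True creates a fresh index; with axis=1, concat instead stitches the DataFrames together side by side, aligning rows on the index. A short sketch:

pd.concat([A, B], ignore_index=True)  # rows renumbered 0..7
pd.concat([A, B], axis=1)             # columns of A and B next to each other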

I typically use merge for joining DataFrames in a SQL-like way and concat for stitching DataFrames together or SQL-like unions.

Conclusion

We have seen several ways to subset, filter and join DataFrames. Some of these functions get you to the same result, while others have small differences you need to be aware of.