When I started working with pandas I noticed that there are many ways to subset, filter and join data, but I was lacking a systematic overview: how do the different approaches differ, and when should you use which?
In this blog post we'll look at different ways of subsetting, filtering and combining DataFrames.
Subsetting Data: Selecting subsets of rows and columns by labels and positions
`.iloc` stands for integer-location based indexing: it takes integer input referring to the position of the rows or columns you want to select. You can pass a single integer (`0`) to select a single row, a list of integers (`[0, 4, 9]`) or slice objects (`1:2`, `3:5`). Here are some examples:
```python
# rows
df.iloc[0]      # first row of the data frame
df.iloc[1]      # second row of the data frame
df.iloc[-1]     # last row of the data frame
df.iloc[0:10]   # first 10 rows of the data frame

# columns
df.iloc[:, 0]     # first column
df.iloc[:, 1]     # second column
df.iloc[:, -1]    # last column of the data frame
df.iloc[:, 0:10]  # first 10 columns of the data frame
```
`.loc` is label based. You can use a single label (`'id'`), a list of labels (`['column_a', 'column_b']`) or a slice object (`'row_a':'row_c'`, `'column_b':'column_d'`). Note that if you pass an integer value to `.loc`, it will be interpreted as a label, not as a position.
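A minimal sketch with a made-up DataFrame (all names here are placeholders for illustration):

```python
import pandas as pd

# toy DataFrame, purely for illustration
df = pd.DataFrame({'column_a': [1, 2, 3],
                   'column_b': ['x', 'y', 'z'],
                   'column_c': [0.1, 0.2, 0.3]},
                  index=['row_a', 'row_b', 'row_c'])

df.loc['row_a']                                 # single row by label
df.loc[:, ['column_a', 'column_b']]             # list of column labels
df.loc['row_a':'row_c', 'column_b':'column_c']  # label slices
```

Unlike `.iloc`, label slices in `.loc` include both endpoints.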
If you want to select rows or columns whose labels contain a certain string (e.g. a prefix or suffix), you can use `filter()`. You pass the string you are looking for to the `like` or `regex` parameter. The `axis` parameter determines whether you search the index (`axis=0`) or the column names (`axis=1`). For example, you can select all columns that contain the string `foo` with `df.filter(like='foo', axis=1)`.
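A minimal, hypothetical sketch of both parameters:

```python
import pandas as pd

df = pd.DataFrame({'foo_1': [1], 'foo_2': [2], 'bar': [3]})

df.filter(like='foo', axis=1)         # columns whose name contains 'foo': foo_1, foo_2
df.filter(regex=r'^foo_\d$', axis=1)  # columns matching a regular expression
```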
Here is a brief overview of the subsetting options:
| index | how | example with slice objects |
|---|---|---|
| `.iloc` | by position | `df.iloc[<row_nr_start>:<row_nr_end>, <col_nr_start>:<col_nr_end>]` |
| `.loc` | by label | `df.loc[<row_label_start>:<row_label_end>, '<col_name_start>':'<col_name_end>']` |
| `filter` | by label and regexes | `df.filter(like=<some_string>, axis=0)` |
Compare `iloc`, `loc` and `filter`
All three ways above allow you to subset your data. However, sometimes one or the other may be easier to use or lead to fewer errors, e.g. when columns are deleted from your DataFrame. To compare the three approaches, let's look at the diabetes dataset:
```python
from sklearn import datasets
import pandas as pd

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df.head()
```
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
(1) Select the `age` column
| | code | returns |
|---|---|---|
| `iloc` | `df.iloc[:, 0]` | `Series` |
| `loc` | `df.loc[:, ['age']]` | `DataFrame` |
| `filter` | `df.filter(items=['age'])` | `DataFrame` |
Note that `iloc` returns a `Series` if you select only one column. Also, if the position of the first column ever changes (e.g. because you reset the index), you will suddenly retrieve a different column. If you need a specific column, I suggest you use `loc` or `filter` and pass the name of that column explicitly.
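A quick sketch of that pitfall, using the diabetes `df` from above:

```python
df.iloc[:, 0].name      # 'age'

df2 = df.reset_index()  # inserts the old index as a new first column named 'index'
df2.iloc[:, 0].name     # 'index' -- position 0 is no longer 'age'
df2.loc[:, 'age'].name  # 'age'   -- unaffected by the shift
```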
(2) Select the columns `s1` to `s6`
| | code | returns |
|---|---|---|
| `iloc` | `df.iloc[:, 4:10]` | `DataFrame` |
| `loc` | `df.loc[:, 's1':'s6']` | `DataFrame` |
| `filter` | `df.filter(regex=r's\d')` | `DataFrame` |
Again, if the positions of the columns change, `iloc` will lead to errors. If your columns follow a pattern rather than a fixed position or name, I suggest you use `filter`, as it is less error-prone.
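For example, the regex keeps matching even if the column order changes (a sketch based on the diabetes `df`):

```python
# reorder the columns; the s-columns no longer sit at positions 4 to 9
shuffled = df[['s3', 'bmi', 's1', 'age', 's2', 'sex', 's6', 'bp', 's4', 's5']]

shuffled.iloc[:, 4:10].columns.tolist()         # mixed columns, not what we wanted
shuffled.filter(regex=r's\d').columns.tolist()  # still all six s-columns
```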
Filtering Data: Selecting subsets of rows based on conditions
There are multiple ways to filter data with pandas. We'll look at how boolean indexing works and how to filter with the `query()` function.
If you are familiar with SQL, `query` might be the most intuitive approach for you. You just pass the condition as a string to the `query` function. The condition is almost the same code as you would put in the `WHERE` clause of a SQL query; only the operators are slightly different:
| SQL operator | pandas equivalent |
|---|---|
| `OR` | `\|` |
| `AND` | `&` |
| `NOT`, `!` | `~` |
So you could, for example, use `df.query("column_a == 'foo' & column_b == 1")`.
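A runnable sketch with made-up data; note that `query` can also reference local Python variables with the `@` prefix:

```python
import pandas as pd

toy = pd.DataFrame({'column_a': ['foo', 'foo', 'bar'],
                    'column_b': [1, 2, 1]})

toy.query("column_a == 'foo' & column_b == 1")  # one matching row

threshold = 1
toy.query("column_b > @threshold")              # @ references a local variable
```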
Boolean masking is another powerful way to access data. The idea is to index the DataFrame with a boolean mask: a vector with one `True` or `False` value for each row in the DataFrame.
```python
new_df = df[<some boolean mask>]
```

Only the rows where the mask is `True` will be returned. Of course, you could pass such a vector directly (here for a hypothetical DataFrame with 13 rows)

```python
my_mask = [True, False, True, False, True, False, True, False, True, False, True, False, True]
df[my_mask]
```

or you can create it with conditions:

```python
df[df.<colname> == 'foo']
```
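A concrete sketch with a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['foo', 'bar', 'foo'],
                    'value': [1, 2, 3]})

mask = toy['name'] == 'foo'  # boolean Series: True, False, True
toy[mask]                    # keeps rows 0 and 2

# combine conditions with & (and), | (or) and ~ (not);
# each condition needs its own parentheses
toy[(toy['name'] == 'foo') & (toy['value'] > 1)]
```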
Compare `query` vs. boolean masking
Let's compare the two approaches, this time using the Boston housing dataset.
```python
boston = datasets.load_boston()  # note: deprecated and removed in scikit-learn 1.2
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()
```
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
Here are three examples of how to achieve the same thing with both approaches.
| | query | boolean mask |
|---|---|---|
| one condition | `df.query("AGE < 50")` | `df[df.AGE < 50]` |
| two conditions | `df.query("AGE < 50 & RAD == 2")` | `df[(df.AGE < 50) & (df.RAD == 2)]` |
| multiple values | `df.query('RAD in [1, 2]')` | `df[df.RAD.isin([1, 2])]` |
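You can verify that both variants return identical results with `equals()` (assuming the Boston `df` from above):

```python
df.query("AGE < 50").equals(df[df.AGE < 50])               # True
df.query("RAD in [1, 2]").equals(df[df.RAD.isin([1, 2])])  # True
```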
One could argue that `query` is more readable and requires less typing, as you do not have to repeat the name of the DataFrame in the condition.
Combining Data: joins
Pandas has three functions for combining tables: `join`, `merge` and `concat`. `join` and `merge` are similar to joining tables in SQL with a (left, right, etc.) join: `join` will attempt a left join on the indices, keeping all columns, while `merge` will perform an inner join on the common columns. Both combine exactly two tables. `concat` is the most flexible option and lets you combine the rows or columns of multiple DataFrames: the datasets are simply stitched together along an axis, either the row axis or the column axis. So you can use `concat` both for joining DataFrames and for SQL-like unions.
| function | combines | default | how |
|---|---|---|---|
| `join` | 2 dataframes by indices | left | left, right, inner, outer |
| `merge` | 2 dataframes by columns or index | inner | left, right, inner, outer |
| `concat` | n dataframes by columns, index, or rows | append rows | inner, outer (via the `join` parameter) |
Compare `join` vs. `concat` vs. `merge`
Let's create some datasets and see what happens if we combine them with their default parameters. Note that the indices differ!
```python
A = pd.DataFrame({'ID': [1, 2, 3, 4],
                  'Name': ['Tom', 'Lisa', 'Sara', 'John'],
                  'Age': ['10', '11', '12', '11']},
                 index=[0, 1, 2, 3])

B = pd.DataFrame({'ID': [1, 3, 4, 6],
                  'Gender': ['m', 'f', 'm', 'm']},
                 index=[0, 1, 5, 6])
```
join
`A.join(B)` will throw an error: `columns overlap but no suffix specified: Index(['ID'], dtype='object')`. Since the `ID` column exists in both datasets, we have to specify a suffix:

```python
A.join(B, lsuffix="a")
```
```
   IDa  Name Age   ID Gender
0    1   Tom  10  1.0      m
1    2  Lisa  11  3.0      f
2    3  Sara  12  NaN    NaN
3    4  John  11  NaN    NaN
```
Note that `join` performs a left join on the index and completely ignores the `ID` column.
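If you would rather join on the `ID` column, one option is to move it into the index first (a sketch, not the only way):

```python
A.set_index('ID').join(B.set_index('ID'))
#      Name Age Gender
# ID
# 1     Tom  10      m
# 2    Lisa  11    NaN
# 3    Sara  12      f
# 4    John  11      m
```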
merge
Per default, `merge` performs an inner join on the common columns; it does not care about the index.

```python
A.merge(B)
```
```
   ID  Name Age Gender
0   1   Tom  10      m
1   3  Sara  12      f
2   4  John  11      m
```
Using the `how` parameter, you can determine the type of join.
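For example, a left join keeps all rows of `A`:

```python
A.merge(B, how='left')
```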
```
   ID  Name Age Gender
0   1   Tom  10      m
1   2  Lisa  11    NaN
2   3  Sara  12      f
3   4  John  11      m
```
concat
Per default, `concat` stacks the datasets on top of each other (a union of the rows), keeping each DataFrame's original index, and fills in `NaN`s for columns that are missing in one of the datasets.

```python
pd.concat([A, B])
```
```
   ID  Name  Age Gender
0   1   Tom   10    NaN
1   2  Lisa   11    NaN
2   3  Sara   12    NaN
3   4  John   11    NaN
0   1   NaN  NaN      m
1   3   NaN  NaN      f
5   4   NaN  NaN      m
6   6   NaN  NaN      m
```
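Two useful variations (a sketch): `axis=1` stitches columns side by side, aligning rows on the index, and `ignore_index=True` builds a fresh row index when stacking:

```python
pd.concat([A, B], axis=1)             # columns side by side, aligned on the index
pd.concat([A, B], ignore_index=True)  # stacked rows, re-indexed 0..n-1
```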
I typically use `merge` for joining DataFrames in a SQL-like way and `concat` for stitching DataFrames together or for SQL-like unions.
Conclusion
We have seen several ways to subset, filter and join DataFrames. Some of the functions will get you to the same results, while others have small differences you need to be aware of.