Working with Pandas DataFrame

Pandas is a Python library which is very useful for playing with data: running complex queries over tables, filtering data, and extracting useful information. It helps a lot when you have a lot of data and need to do data mining. In this article, I will discuss some queries which can help anyone doing data mining or data engineering over large amounts of data to get useful results with pandas. My main focus will be on the pandas DataFrame and the pandas Series.

I'll start with a little introduction to the pandas DataFrame and Series. You should already be familiar with pandas and NumPy.

Let's first import the required libraries we are going to work with.

import pandas as pd
import numpy as np

Pandas Series

A pandas Series is a one-dimensional array which is able to contain data of any dtype. A Series is much like a 1-D NumPy array, except that it can carry custom index labels. For example, below an array is converted into a pandas Series with the indexes 'a', 'b' and 'c'.

ser = pd.Series([1,2,3], index = ['a', 'b', 'c'])
ser

Out:

a    1
b    2
c    3
dtype: int64

Then we can access the Series by its index labels, like the following:

ser['a']

Out:

1
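
We can also grab several elements at once by passing a list of labels, which returns a smaller Series rather than a scalar. A quick sketch:

ser[['a', 'c']]

Out:

a    1
c    3
dtype: int64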

Pandas DataFrame

A pandas DataFrame is a data structure which can contain any type of data. It is much like a table where we can store strings, numbers and even classes and methods. A DataFrame helps a lot while working with huge amounts of data, when we have to filter it and extract useful information from it. Entries of a pandas DataFrame can be accessed by its row and column names; in pandas, row names are represented by the Index. A pandas DataFrame provides a lot of methods for doing statistical analysis over data, and it offers the flexibility to iterate through either rows or columns. In this section, I will give an introduction to the pandas DataFrame structure.

First we will create a pandas DataFrame using a NumPy random array. Let's first build the NumPy array in such a form that, later, we can tell the columns apart at a glance: column 0 will hold values in [0, 1), column 1 in [1, 2) and column 2 in [2, 3).

arr = np.concatenate([np.random.rand(10, 1), np.random.rand(10, 1) + 1, np.random.rand(10, 1) + 2], axis = 1)
df = pd.DataFrame(arr)
df

Out:

As we can notice, the values in column 0 are in the range [0, 1), while columns 1 and 2 lie in the ranges [1, 2) and [2, 3) respectively.
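
A quick way to sanity-check those ranges is to aggregate each column; a small sketch, where agg simply applies both reductions column-wise:

df.agg(['min', 'max'])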

By default, pandas creates the index and column names as integers starting from 0. Let's see how we can set custom ones.

df = pd.DataFrame(arr, index = np.arange(1,11, 1), columns = ['A', 'B', 'C'])
df

Out:

As we can see, now the indexes are 1 through 10 and the column names are A, B and C.
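
We can confirm the new labels by inspecting the DataFrame's index and columns attributes; a quick check (the outputs list the integer index 1 to 10 and the labels A, B and C):

df.index
df.columns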

Getting Pandas Columns by Name

We can access any column of the DataFrame by passing its name, as follows:

df['A']

Out:

1     0.050407
2     0.908353
3     0.313918
4     0.903563
5     0.276211
6     0.553173
7     0.274751
8     0.559364
9     0.755465
10    0.814105
Name: A, dtype: float64

As we can see, we got a pandas Series containing the entries of column A of the DataFrame df, and the Series name is also A. We can't get two columns with this call; see next how to get two or more columns of a pandas DataFrame.
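
For instance, here is a sketch of the failing call: pandas treats the pair of names as a single (multi-level) column label, and since no such column exists it raises an error.

df['A', 'B']  # raises KeyError: ('A', 'B')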

If we pass a string between the brackets of the DataFrame, it returns a Series, and as shown above we can't pass two column names that way, because pandas considers them a single multi-level column position. To get a column of the DataFrame returned as a DataFrame, instead of passing a string we'll pass a list of column names, as follows:

df[['A']]

Out:

           A
1   0.050407
2   0.908353
3   0.313918
4   0.903563
5   0.276211
6   0.553173
7   0.274751
8   0.559364
9   0.755465
10  0.814105

And in this way, we can also get multiple columns of the DataFrame, as follows:

df[['A', 'B']]

Out:

Getting a Pandas Subset/Sub-DataFrame by Index and Column Names

We can also get particular row(s) of any particular column of a pandas DataFrame by its .loc method, used as df.loc[rows, columns] with index and column names. Note that .loc slices by label and includes both endpoints. Let's say we want to get the first 5 rows of column A.

df.loc[:5, 'A']

Out:

1    0.050407
2    0.908353
3    0.313918
4    0.903563
5    0.276211
Name: A, dtype: float64

To get a DataFrame instead, we will proceed as:

df.loc[:5, ['A']]

Out:

          A
1  0.050407
2  0.908353
3  0.313918
4  0.903563
5  0.276211

We can also select distinct, non-contiguous indexes, as follows:

df.loc[[1,3,5], ['A']]

Out:

          A
1  0.050407
3  0.313918
5  0.276211

Getting a Pandas DataFrame Subset/Sub-DataFrame by Index and Column Position

Instead of indicating column or index names, we can also get DataFrame entries by position, starting from 0, using the .iloc method, used as df.iloc[row positions, column positions]. Unlike .loc, a positional slice excludes its endpoint, so :5 means positions 0 through 4. For example:

df.iloc[:5, 1]

Out:

1    1.070382
2    1.560524
3    1.207124
4    1.527835
5    1.279891
Name: B, dtype: float64

We got a pandas Series with the name B, which is the DataFrame's 2nd column, at position 1.

Getting a DataFrame instead of a pandas Series:

df.iloc[:5, [1]]

Out:

          B
1  1.070382
2  1.560524
3  1.207124
4  1.527835
5  1.279891

Getting multiple columns:

df.iloc[:5, [0,1]]

Out:

          A         B
1  0.050407  1.070382
2  0.908353  1.560524
3  0.313918  1.207124
4  0.903563  1.527835
5  0.276211  1.279891
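
As an aside, since .iloc is purely positional, the usual Python indexing conventions apply, including negative positions; a small sketch:

df.iloc[-1, :]  # the last row (position -1), returned as a Series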

Getting a Pandas Sub-DataFrame by Conditions

We can get those entries of a pandas DataFrame where some condition of ours holds.

Checking a Condition over the Whole DataFrame

We can check any condition on the DataFrame, and it returns a DataFrame of the same shape with True where the condition holds and False otherwise. Let's see how:

df > 1.5

Out:

        A      B     C
1   False  False  True
2   False   True  True
3   False  False  True
4   False   True  True
5   False  False  True
6   False   True  True
7   False  False  True
8   False   True  True
9   False   True  True
10  False  False  True

And so we can get those entries where the condition is True, with all the others set to NaN, as follows:

Getting DataFrame Entries with a Condition

df[df > 1.5]

Out:

We can also get rows, columns or a sub-DataFrame by a condition over the entries of the DataFrame. For this purpose, we use the .loc method, applying the condition over rows or columns. First I'll explain how it works.

How Conditions Work over DataFrame Rows or Columns

Instead of indicating row or column names in the .loc method, we can pass an array of booleans at the row or column position: True for the entries we want to get and False for the others. Suppose we want to get the 2nd column; then we will pass an array with the same size as the number of columns in df, with the booleans set according to whether or not we want each column.

To demonstrate how this works, we'll get column B, which is at position 1; as the total number of columns is 3, we'll write an array as follows:

get_col_b = [False, True, False]
df.loc[:, get_col_b]

Out:

As we can see, we got column B, which was at position 1, because we set True at position 1 and False at the others.

Now let's apply conditions over both rows and columns. This time we'll get those rows where the value of column B is greater than 1.5. First we will obtain a boolean array showing where this condition holds:

cond_on_rows = df['B'] > 1.5
cond_on_rows

Out:

1     False
2      True
3     False
4      True
5     False
6      True
7     False
8      True
9      True
10    False
Name: B, dtype: bool

So we'll apply this condition and get those rows where it is True, as follows:

df.loc[cond_on_rows, :]

Out:

We can also target any particular column; below we will get those values of column B which are greater than 1.5:

df.loc[cond_on_rows, ['B']]

Out:
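
By the way, the intermediate variable isn't required; the same selection can be written inline in a single step, as in this quick sketch:

df.loc[df['B'] > 1.5, ['B']]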

Applying Conditions with and/or Operators on a Pandas DataFrame

Instead of only one condition, we can also apply a set of conditions combined with and and or operators on a pandas DataFrame. But here in pandas, instead of and and or we use the symbols & and | respectively. Let's see how.

Let's say we want to get those rows of the pandas DataFrame where the values of column B lie in the interval (1.25, 1.75), by applying the & operator.

First we combine two conditions with the & operator to get a boolean output array which is True only where both conditions hold.

Here, we will apply the condition on column B that its values lie in the interval (1.25, 1.75). Remember that a NumPy array or pandas Series accepts & and | instead of and and or.

first_cond = (df['B'] > 1.25).rename('C1')
second_cond = (df['B'] < 1.75).rename('C2')
dual_conds = (first_cond & second_cond).rename('C1 & C2')
dual_conds

Out:

1     False
2      True
3     False
4      True
5      True
6      True
7      True
8      True
9     False
10     True
Name: C1 & C2, dtype: bool

Below is a demonstration table showing which rows we will get when we apply the condition over the range (1.25, 1.75):

pd.concat([first_cond, second_cond, dual_conds], axis = 1)

Out:

       C1     C2  C1 & C2
1   False   True    False
2    True   True     True
3   False   True    False
4    True   True     True
5    True   True     True
6    True   True     True
7    True   True     True
8    True   True     True
9    True  False    False
10   True   True     True

We can apply this dual condition to get the DataFrame rows where it is satisfied, as follows:

df.loc[dual_conds, :]

Out:

We can also apply it directly on the DataFrame, instead of using the .loc method, as follows:

df[dual_conds]

Out:
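
For completeness, the | (or) operator works the same way. For example, a quick sketch selecting the complementary rows, where column B falls outside the interval (1.25, 1.75):

outside = (df['B'] <= 1.25) | (df['B'] >= 1.75)
df[outside]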

Applying Conditions over Multiple Columns

We can get those rows of a pandas DataFrame where conditions on multiple columns are True. Here we'll use another approach, instead of the one above where we applied the conditions to a single column. As you can see, there we first obtained a Series where both conditions were True for a single column; here we'll test the condition on 2 columns. In this case we need a 2-dimensional array, while a Series is 1-dimensional. So first we'll apply the condition to 2 columns of the DataFrame and then check where both columns are True.

In this example, we will check the condition on columns A and B that their values lie in the interval (0.5, 1.5). See how:

first_cond = df[['A', 'B']] > 0.5
second_cond = df[['A', 'B']] < 1.5
dual_conds = first_cond & second_cond
dual_conds

Out:

        A      B
1   False   True
2    True  False
3   False   True
4    True  False
5   False   True
6    True  False
7   False   True
8    True  False
9    True  False
10   True   True

Since we tested the condition on two columns, we will calculate the sum of each row: where the sum is 2, it means the condition is True on both columns, and we want only those rows.

So now we'll sum each row and apply another condition to check whether both entries in the row are True or not.

true_on_2_columns = dual_conds.sum(axis = 1) == 2
true_on_2_columns

Out:

1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
dtype: bool

As we can see, there is only one row, i.e. the 10th, where the condition on both columns is True. Let's see what this row is:

df.loc[true_on_2_columns, ['A', 'B']]

Out:

Finally, we got the row where the values of columns A and B lie in the interval (0.5, 1.5).
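
As a side note, the row-sum trick can be replaced by the arguably more idiomatic .all method, which directly checks that every entry in a row is True; a small sketch producing the same mask and the same final row:

true_on_2_columns = dual_conds.all(axis = 1)
df.loc[true_on_2_columns, ['A', 'B']]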

Filling in Missing Values

Often, when working with big data, we need to fill in missing values. For example, we may fetch data keyed by date for the last 30 days, but the database didn't save a value of 0 for the days on which nothing was received. In such cases a pandas DataFrame is very helpful for filling in the missing data quickly.

Let's first define a DataFrame for an item which is bought by users on different days. Suppose we only receive values for the days on which the item was bought, and how many were bought, but we do not get a 0 when it was not bought, because our server saved no value for that day. While analyzing the data we don't want to miss any day's value: if the item was not bought on a day, that day must still appear with a value of 0.

Let's first define a DataFrame for 7 days of data, where the index holds the dates on which the item was bought:

df = pd.DataFrame([1,2,6,9,2],
                  index = ['2020-10-01', '2020-10-02', '2020-10-04', '2020-10-06', '2020-10-07'],
                  columns = ['Total Items Saled'])
df

Out:

            Total Items Saled
2020-10-01                  1
2020-10-02                  2
2020-10-04                  6
2020-10-06                  9
2020-10-07                  2

We requested 7 days of data but got values for only 5, which means that our item was not sold on 2 days, in this case 2020-10-03 and 2020-10-05. For our analysis we want to fill in those days' values as 0, so we will proceed with filling in the missing data as follows.

To fill in the missing data, we'll append a pandas Series whose index holds the dates; its values can be anything, as they don't matter here, so we'll set them to 0. First we'll define a pandas Series covering our selected date range.

We can create a pandas Series from a dictionary, where the keys and values of the dictionary become the index and values of the Series:

dicti = dict()
for day in np.arange(1,8):
    dicti.update({f'2020-10-{day:02d}': 0})
dates = pd.Series(dicti, name = 'date')
dates

Out:

2020-10-01    0
2020-10-02    0
2020-10-03    0
2020-10-04    0
2020-10-05    0
2020-10-06    0
2020-10-07    0
Name: date, dtype: int64
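
As an aside, the dictionary loop can be replaced with pandas' own date helper; a quick sketch building the same Series with pd.date_range:

# 7 consecutive days starting 2020-10-01, formatted back to strings
dates = pd.Series(0, index = pd.date_range('2020-10-01', periods = 7).strftime('%Y-%m-%d'), name = 'date')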

Now let's append this dates Series to our DataFrame, so that any missing date gets added. We'll append along the columns, but as our dates run along the index of df, first we'll take the transpose of the DataFrame so that rows and columns are interchanged, as follows:

df.T

Out:

                   2020-10-01  2020-10-02  2020-10-04  2020-10-06  2020-10-07
Total Items Saled           1           2           6           9           2

Then we can append the dates Series to fill in the missing dates' values, and finally we'll bring the DataFrame back to its original shape. Let's append dates:

df = df.T.append(dates)
df

Out:

As we can observe, the days whose values were missing have been appended with a value of NaN. Now we'll replace NaN with 0 and remove the date row, which is not of interest and not part of our original DataFrame.

df = df.fillna(0).drop('date')
df

Out:

Now, finally, we'll sort the columns into ascending date order and take the transpose again to recover the original shape, with the missing dates' values as 0:

df = df.sort_index(axis = 1).T
df

Out:

            Total Items Saled
2020-10-01                1.0
2020-10-02                2.0
2020-10-03                0.0
2020-10-04                6.0
2020-10-05                0.0
2020-10-06                9.0
2020-10-07                2.0

As we can see, we have filled in the values for the dates 2020-10-03 and 2020-10-05 as 0, in just a few steps and with little effort. We can even do the whole process in a single line, as follows:

df.T.append(dates).fillna(0).drop('date').sort_index(axis = 1).T

Isn’t it amazing?
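
One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. The same result can be obtained without any transposing by using reindex, which conforms the DataFrame to a new index and fills the holes in one call; a minimal sketch, starting from the original five-row df and reusing the dates index built above:

df.reindex(dates.index, fill_value = 0)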

If this article helped you, then please give me a clap. To stay connected with me and get more articles on Big Data, Machine Learning, Deep Learning and Python, please follow me. I'll keep updating you with my experience.

Thank you

