Working with the Pandas DataFrame

Pandas is a Python library that is very useful for working with data: running complex queries over tables, filtering, and extracting exactly the values you need. It helps a lot when you have a large amount of data to mine. In this article, I will discuss some queries that can help anyone doing data mining or data engineering over large datasets. My focus will be on the pandas DataFrame and the pandas Series.

I'll start with a short introduction to the pandas DataFrame and Series. Some basic familiarity with pandas and NumPy is assumed.

Let's first import the required libraries.

import pandas as pd
import numpy as np

Pandas Series

ser = pd.Series([1,2,3], index = ['a', 'b', 'c'])
ser

Out:

a    1
b    2
c    3
dtype: int64

We can then access the Series by its index labels, as follows:

ser['a']

Out:

1
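
As a quick aside (not part of the original example): the same element can be fetched by position with iloc, which indexes by integer location rather than by label.

ser.iloc[0]   # first element by position; returns 1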

Pandas DataFrame

First we will create a pandas DataFrame from a NumPy random array. Let's build the array so that each column falls in a distinct range, which will make the column values easy to recognize later.

arr = np.concatenate([np.random.rand(10, 1), np.random.rand(10, 1) + 1, np.random.rand(10, 1) + 2], axis = 1)
df = pd.DataFrame(arr)
df

Out:

[Screenshot: a 10×3 DataFrame with default integer index 0–9 and integer column labels 0, 1, 2]

Notice that the values in column 0 lie in the range (0, 1), while columns 1 and 2 lie in the ranges (1, 2) and (2, 3) respectively.
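
As a quick sanity check (an addition of mine, not part of the original construction), agg can apply several reductions per column at once and confirm these ranges:

df.agg(['min', 'max'])   # per-column minimum and maximum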

By default, pandas numbers both the index and the columns starting from 0. Let's see how to set custom labels.

df = pd.DataFrame(arr, index = np.arange(1,11, 1), columns = ['A', 'B', 'C'])
df

Out:

[Screenshot: the same DataFrame, now with index 1–10 and columns A, B, C]

As we can see, the index now runs from 1 to 10 and the columns are named A, B and C.

Getting Pandas Columns by Name

df['A']

Out:

1     0.050407
2     0.908353
3     0.313918
4     0.903563
5     0.276211
6     0.553173
7     0.274751
8     0.559364
9     0.755465
10    0.814105
Name: A, dtype: float64

As we can see, we got a pandas Series containing the values of column A of df, and the Series name is also A. We can't get two columns with this call; next we'll see how to get two or more columns of a DataFrame.
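
A small aside, not from the original text: when a column name is a valid Python identifier, pandas also allows attribute access, though bracket indexing is the safer general form (it works for any name and never collides with DataFrame methods).

df.A   # same Series as df['A']; only works for valid identifier names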

If we pass a single string between the brackets, we get back a Series. We can't pass two column names as df['A', 'B']; that raises a KeyError, because pandas interprets ('A', 'B') as a single key of the kind used with a MultiIndex. To get a column back as a DataFrame rather than a Series, we pass a list of column names instead of a string:

df[['A']]

Out:

[Screenshot: a one-column DataFrame containing column A]

In the same way, we can get multiple columns of the DataFrame:

df[['A', 'B']]

Out:

[Screenshot: a DataFrame containing columns A and B]

Getting a Pandas Sub-DataFrame by Index and Column Names

df.loc[:5, 'A']

Out:

1    0.050407
2    0.908353
3    0.313918
4    0.903563
5    0.276211
Name: A, dtype: float64

Note that loc is label-based and its slices include the end label, so :5 here selects the rows labeled 1 through 5. To get a DataFrame back instead of a Series, we again pass the column names as a list:

df.loc[:5, ['A']]

Out:

[Screenshot: a one-column DataFrame with rows 1–5 of column A]

We can also select specific index labels:

df.loc[[1,3,5], ['A']]

Out:

[Screenshot: rows 1, 3 and 5 of column A, as a one-column DataFrame]

Getting a Pandas Sub-DataFrame by Index and Column Position

df.iloc[:5, 1]

Out:

1    1.070382
2    1.560524
3    1.207124
4    1.527835
5    1.279891
Name: B, dtype: float64

We get a pandas Series named B, the second column of df, which sits at position 1. Note that iloc is position-based and, unlike loc, its slices exclude the end position, so :5 selects the first five rows.

Getting the result as a DataFrame instead of a pandas Series, again by passing a list of positions:

df.iloc[:5, [1]]

Out:

[Screenshot: a one-column DataFrame with the first five rows of column B]

Getting multiple columns:

df.iloc[:5, [0,1]]

Out:

[Screenshot: the first five rows of columns A and B]
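
Equivalently (an alternative spelling I'm adding), the column positions can be given as a positional slice, which returns the same sub-DataFrame:

df.iloc[:5, 0:2]   # same as df.iloc[:5, [0, 1]]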

Getting a Pandas Sub-DataFrame by Conditions

Checking a condition over the whole DataFrame:

df > 1.5

Out:

[Screenshot: a 10×3 DataFrame of booleans, True where the entry is greater than 1.5]

We can then keep the entries where the condition is True, with every other entry replaced by NaN.

Getting DataFrame Entries with a Condition

df[df > 1.5]

Out:

[Screenshot: the DataFrame with entries greater than 1.5 kept and all other entries shown as NaN]
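
For reference, and as an addition of mine rather than part of the original walkthrough: DataFrame.where performs the same masking, and it also accepts a replacement value to use instead of NaN.

df.where(df > 1.5)      # same result as df[df > 1.5]
df.where(df > 1.5, 0)   # put 0 in place of the masked entries instead of NaN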

We can also select rows, columns, or a sub-DataFrame by putting a condition on the entries of the DataFrame. For this we use loc with a boolean mask over the rows or columns. First, I'll explain how such a mask works.

How conditions work over DataFrame rows or columns

To demonstrate, let's select column B, which sits at position 1. Since there are three columns in total, we write a boolean list of length 3:

get_col_b = [False, True, False]
df.loc[:, get_col_b]

Out:

[Screenshot: a DataFrame containing only column B]

As we can see, we get column B: we put True at position 1 and False everywhere else.
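
Instead of writing the boolean list by hand, we could build it from the column labels; this is a small variation I'm adding that uses the same masking idea:

get_col_b = df.columns == 'B'   # array([False, True, False])
df.loc[:, get_col_b]            # again selects only column B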

Now let's apply a condition to the rows. This time we'll get the rows where the value of column B is greater than 1.5. First we obtain a boolean Series marking where this condition holds:

cond_on_rows = df['B'] > 1.5
cond_on_rows

Out:

1     False
2      True
3     False
4      True
5     False
6      True
7     False
8      True
9      True
10    False
Name: B, dtype: bool

Now we apply this mask to keep the rows where the condition is True:

df.loc[cond_on_rows, :]

Out:

[Screenshot: the rows of df where column B is greater than 1.5]

We can also pick out particular columns at the same time; below we get the values of column B that are greater than 1.5:

df.loc[cond_on_rows, ['B']]

Out:

[Screenshot: a one-column DataFrame with the values of B greater than 1.5]

Applying Conditions with the & and | Operators on a Pandas DataFrame

Suppose we want the rows of the DataFrame where the values of column B lie in the open interval (1.25, 1.75). We can combine the two conditions with the & operator to get a boolean Series that is True only where both conditions hold. Remember that NumPy arrays and pandas Series require & and | rather than Python's and and or.

first_cond = (df['B'] > 1.25).rename('C1')
second_cond = (df['B'] < 1.75).rename('C2')
dual_conds = (first_cond & second_cond).rename('C1 & C2')
dual_conds
Out:

1     False
2      True
3     False
4      True
5      True
6      True
7      True
8      True
9     False
10     True
Name: C1 & C2, dtype: bool
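
As a side note (an alternative I'm adding, not used in the original): recent pandas versions (1.3+) let Series.between express the same range check in one call, with the inclusive parameter controlling whether the endpoints count.

dual_conds = df['B'].between(1.25, 1.75, inclusive = 'neither').rename('C1 & C2')   # open interval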

Below is a demonstration table showing which rows we will get when we apply the condition for the range (1.25, 1.75):

pd.concat([first_cond, second_cond, dual_conds], axis = 1)

Out:

[Screenshot: a table with boolean columns C1, C2 and C1 & C2 for rows 1–10]

We can apply this combined condition to get the DataFrame rows where it is satisfied:

df.loc[dual_conds, :]

Out:

[Screenshot: the rows of df where B lies in (1.25, 1.75)]

We can also apply it directly to the DataFrame instead of going through loc:

df[dual_conds]

Out:

[Screenshot: the same rows as above]
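
In practice this is often written inline without the intermediate variables. Note the parentheses: they are required because & binds more tightly than the comparison operators.

df[(df['B'] > 1.25) & (df['B'] < 1.75)]   # same rows as above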

Applying Conditions over Multiple Columns

In this example we will check a condition on columns A and B, looking for rows where both values lie in the interval (0.5, 1.5). Here is how:

first_cond = df[['A', 'B']] > 0.5
second_cond = df[['A', 'B']] < 1.5
dual_conds = first_cond & second_cond
dual_conds

Out:

[Screenshot: a two-column boolean DataFrame over A and B, True where the value lies in (0.5, 1.5)]

Since we tested the condition on two columns, we can sum each row of the boolean result: where the sum is 2, the condition is True on both columns, and those are exactly the rows we want. So we sum along each row and compare the total with 2:

true_on_2_columns = dual_conds.sum(axis = 1) == 2
true_on_2_columns

Out:

1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
dtype: bool

As we can see, there is only one row, the 10th, where the condition is True on both columns. Let's see what this row is:

df.loc[true_on_2_columns, ['A', 'B']]

Out:

[Screenshot: row 10 with its A and B values]

Finally, we get the row where the values of both columns A and B lie in the interval (0.5, 1.5).
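
A more direct spelling of the same check, added here as an alternative: the all method is True exactly where every value in a row is True, so it replaces the sum-and-compare step.

true_on_2_columns = dual_conds.all(axis = 1)   # equivalent to dual_conds.sum(axis = 1) == 2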

Filling In Missing Values

Let's first define a DataFrame recording how many units of an item users bought on different days. Suppose we receive values only for the days on which the item was bought: if it was not bought, our server saved nothing for that day rather than a 0. When analyzing the data we don't want to lose any day, so a day without a sale should appear with the value 0.

First we define the DataFrame for our 7-day window, indexed by the dates on which the item was actually bought:

df = pd.DataFrame([1,2,6,9,2],
                  index = ['2020-10-01', '2020-10-02', '2020-10-04', '2020-10-06', '2020-10-07'],
                  columns = ['Total Items Sold'])
df

Out:

[Screenshot: the DataFrame with a Total Items Sold column and five date-indexed rows]

We asked for 7 days of data but got only 5, which means the item was not sold on two days: 2020-10-03 and 2020-10-05. Since we want those days to appear as 0 in our analysis, we will fill in the missing dates as follows.

To fill in the missing dates, we'll append a pandas Series whose index covers every date in the range. Its values don't matter for this purpose; here we'll set them all to 0. First we define this Series over our selected date range.

We can create a pandas Series from a dictionary, where each key becomes an index label and each value becomes the corresponding Series value:

dicti = dict()
for day in np.arange(1,8):
    dicti.update({f'2020-10-{day:02d}': 0})
dates = pd.Series(dicti, name = 'date')
dates

Out:

2020-10-01    0
2020-10-02    0
2020-10-03    0
2020-10-04    0
2020-10-05    0
2020-10-06    0
2020-10-07    0
Name: date, dtype: int64
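
The same Series can be built without a loop using pd.date_range; this is an alternative sketch of mine, where strftime keeps the index as strings so it matches the DataFrame's string index:

idx = pd.date_range('2020-10-01', periods = 7).strftime('%Y-%m-%d')
dates = pd.Series(0, index = idx, name = 'date')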

Now let's append this dates Series to our DataFrame so that any missing date gets added. append adds rows, but our dates run along the index of df, so we first take the transpose to interchange rows and columns:

df.T

Out:

[Screenshot: the transposed DataFrame, with the dates as columns and Total Items Sold as the single row]

Then we append the dates Series, which brings in the missing dates; afterwards we'll restore the original shape. Let's append:

df = df.T.append(dates)
df

Out:

[Screenshot: two rows, Total Items Sold with NaN for the missing dates and date with all zeros; all seven dates appear as columns]

We can observe that the previously missing days have been added with the value NaN. Now we'll replace NaN with 0 and drop the date row, which is not part of our original DataFrame:

df = df.fillna(0).drop('date')
df

Out:

[Screenshot: a single Total Items Sold row with 0 filled in for the missing dates]

Finally, we sort the columns into ascending date order and transpose again to recover the original shape, with the missing dates filled as 0:

df = df.sort_index(axis = 1).T
df

Out:

[Screenshot: the original layout with all seven dates and 0 for 2020-10-03 and 2020-10-05]

As we can see, we have filled in the values for 2020-10-03 and 2020-10-05 as 0 in just a few steps. We can even do the whole process in a single line:

df.T.append(dates).fillna(0).drop('date').sort_index(axis = 1).T

Isn’t it amazing?
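
Two closing notes, added by me rather than taken from the walkthrough above. First, DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the same chain (starting from the original df) would use pd.concat. Second, reindex can do the whole fill in one step, with no transposing at all.

# pandas 2.0+ replacement for the append-based chain
pd.concat([df.T, dates.to_frame().T]).fillna(0).drop('date').sort_index(axis = 1).T

# or in one step: align df to the full date index, filling gaps with 0
df.reindex(dates.index, fill_value = 0)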

If this article helped you, please give me a clap. To stay in touch and get more articles on Big Data, Machine Learning, Deep Learning and Python, please follow me. I'll keep sharing what I learn.

Thank you
