# Working with Pandas DataFrame

Pandas is a Python library that is very useful for working with data: running complex queries over tables, extracting useful subsets, filtering, and so on. It helps a lot when you have large amounts of data and need to do data mining. In this article, I will discuss some fairly complex queries that can help anyone doing data mining or data engineering over large datasets, and show how to get useful data using the pandas DataFrame. The main focus will be on the pandas DataFrame and pandas Series.

In this article, I’ll give a short introduction to the pandas DataFrame and Series. You should already be familiar with pandas and NumPy.

Let's first import the required libraries we are going to work with.

```python
import pandas as pd
import numpy as np
```

# Pandas Series

A pandas Series is an array that can contain data of any dtype. A Series is much like a 1-D NumPy array, except that its indexes can be renamed. For example, below the array `[1,2,3]` is converted into a pandas Series with indexes `a`, `b`, `c`.

```python
ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser
```

Out:

```
a    1
b    2
c    3
dtype: int64
```

Then we can access the Series by its indexes, like the following:

```python
ser['a']
```

Out:

```
1
```

# Pandas DataFrame

A pandas DataFrame is a two-dimensional data set that can contain any type of data. It is much like a table in which we can store strings, numbers, and even objects such as classes and methods. A DataFrame helps a lot when working with huge amounts of data that we have to filter to extract useful information. Entries of a pandas DataFrame can be accessed by row and column names; in pandas, row names are represented by the Index. A pandas DataFrame provides many methods for statistical analysis of data, and it offers the flexibility to iterate either by rows or by columns. In this section, I will give an introduction to the pandas DataFrame structure.

First we will create a pandas DataFrame from a NumPy random array. Let's first build the NumPy array in such a form that we can later check the column values at a glance.

```python
arr = np.concatenate([np.random.rand(10, 1),
                      np.random.rand(10, 1) + 1,
                      np.random.rand(10, 1) + 2], axis=1)
df = pd.DataFrame(arr)
df
```

Out:

As we can notice, the values in column `0` are in the range `(0,1)`, while columns `1` and `2` lie in the ranges `(1,2)` and `(2,3)` respectively.
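These claimed ranges can also be checked programmatically rather than by eye; a quick sanity check (a sketch, using the same construction as above):

```python
import numpy as np
import pandas as pd

# Same construction as above: three columns drawn from [0,1), [1,2) and [2,3)
arr = np.concatenate([np.random.rand(10, 1),
                      np.random.rand(10, 1) + 1,
                      np.random.rand(10, 1) + 2], axis=1)
df = pd.DataFrame(arr)

# Per-column minima and maxima stay inside each shifted interval
print(df.min())
print(df.max())
```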

By default, pandas creates index and column names starting from `0`. Let's see how we can set custom ones.

```python
df = pd.DataFrame(arr, index=np.arange(1, 11, 1), columns=['A', 'B', 'C'])
df
```

Out:

As we can see, the indexes are now `1-10` and the column names are `A`, `B`, and `C`.

# Getting Pandas Columns by Name

We can access any single column of the DataFrame by passing its name, as follows:

`df['A']`

Out:

```
1     0.050407
2     0.908353
3     0.313918
4     0.903563
5     0.276211
6     0.553173
7     0.274751
8     0.559364
9     0.755465
10    0.814105
Name: A, dtype: float64
```

As we can see, we got a pandas Series containing the entries of column `A` of DataFrame `df`, where the Series name is also `A`. We can't get two columns with this call; see next how to get two or more columns of a pandas DataFrame.

If we pass a string as the key of the DataFrame, it returns a Series. We can’t pass two column names in that position, as in `df['A', 'B']`; this raises an error because pandas interprets it as a multi-level column key. To get the column back as a DataFrame, instead of passing a string we pass a list of column names, as follows:

`df[['A']]`

Out:

And in this way, we can also get multiple columns of the DataFrame:

`df[['A', 'B']]`

Out:
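The difference between the single-bracket and double-bracket forms is simply the return type; a small check (a sketch with an illustrative random `df`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])

# A plain string key yields a Series; a list of keys yields a DataFrame
print(type(df['A']).__name__)       # Series
print(type(df[['A']]).__name__)     # DataFrame
print(df[['A', 'B']].shape)         # (10, 2)
```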

# Getting a Pandas Subset/Sub DataFrame by Index and Column Names

We can also get particular row(s) of any particular column(s) of a pandas DataFrame with the `loc` accessor, as `df.loc[index name(s) of row(s), column name(s)]`. Say we want to get the first 5 rows of column `A`.

`df.loc[:5, 'A']`

Out:

```
1    0.050407
2    0.908353
3    0.313918
4    0.903563
5    0.276211
Name: A, dtype: float64
```

To get a DataFrame back, we proceed as follows:

`df.loc[:5, ['A']]`

Out:

We can also select specific, non-contiguous indexes, as follows:

`df.loc[[1,3,5], ['A']]`

Out:

# Getting a Pandas Subset/Sub DataFrame by Index and Column Position

Instead of indicating column or index names, we can also get DataFrame entries by position (which starts from `0`) using the `iloc` accessor, defined as `df.iloc[index/row position, column position]`, as follows:

`df.iloc[:5, 1]`

Out:

```
1    1.070382
2    1.560524
3    1.207124
4    1.527835
5    1.279891
Name: B, dtype: float64
```

We get a pandas Series named `B`, which is the 2nd column of DataFrame `df`, at position `1`.

To get multiple columns, or to get a DataFrame instead of a pandas Series:

`df.iloc[:5, [1]]`

Out:

Getting multiple columns:

`df.iloc[:5, [0,1]]`

Out:
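One subtlety worth keeping in mind: `loc` slices by label and *includes* the end label, while `iloc` slices by position and *excludes* the end position. A small sketch with the same index layout as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3),
                  index=np.arange(1, 11), columns=['A', 'B', 'C'])

# loc[:5] keeps labels 1..5 (end label included) -> 5 rows
print(len(df.loc[:5]))                   # 5
# iloc[:5] keeps positions 0..4 (end position excluded) -> also 5 rows,
# but only because the labels happen to start at 1
print(len(df.iloc[:5]))                  # 5
# Label 5 sits at position 4
print(df.loc[5, 'A'] == df.iloc[4, 0])   # True
```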

# Getting pandas Sub DataFrame by Conditions

We can get those entries of a pandas DataFrame where certain conditions hold.

## Checking condition over whole DataFrame

We can check any condition on the DataFrame, and it returns a DataFrame of the same shape, with `True` where the condition holds and `False` otherwise. Let's see how.

`df > 1.5`

Out:

And so we can get those entries where the condition is `True`, with the others as `NaN`, as follows:

## Getting DataFrame Entries with Condition

`df[df > 1.5]`

Out:
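Masking with a boolean DataFrame like this is equivalent to calling `DataFrame.where`, which some find more explicit; a sketch with illustrative random data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3) * 3, columns=['A', 'B', 'C'])

masked = df[df > 1.5]            # entries failing the condition become NaN
explicit = df.where(df > 1.5)

print(masked.equals(explicit))   # True: the two forms give the same result
```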

We can also get rows, columns, or a sub DataFrame by a condition over the entries of the DataFrame. For this purpose, we use the `loc` accessor, applying the condition over rows or columns. First, I'll explain how it works.

## How conditions work over DataFrame rows or columns

Instead of indicating row or column names in `loc`, we can pass an array of booleans in the row or column position: `True` for what we want to keep and `False` elsewhere. Suppose we want to get the 2nd column; then we pass an array with the same size as the number of columns in `df`, with booleans indicating whether to keep each column.

To demonstrate how this works, we’ll get column `B`, which is at position `1`; since the total number of columns is 3, we write the array as follows:

```python
get_col_b = [False, True, False]
df.loc[:, get_col_b]
```

Out:

As we can see, we get column `B`, which was at position `1`, because we set `True` at position `1` and `False` at the others.

Now let's apply conditions over both rows and columns. This time we’ll get those rows where the value of column `B` is greater than `1.5`. First we obtain a boolean array where this condition holds:

```python
cond_on_rows = df['B'] > 1.5
cond_on_rows
```

Out:

```
1     False
2      True
3     False
4      True
5     False
6      True
7     False
8      True
9      True
10    False
Name: B, dtype: bool
```

Then we apply this condition to get those rows where it is `True`, as follows:

`df.loc[cond_on_rows, :]`

Out:

We can also select a column at the same time; below we get those values of column `B` which are greater than `1.5`:

`df.loc[cond_on_rows, ['B']]`

Out:

## Applying Conditions with and/or Operators on a Pandas DataFrame

Instead of only one condition, we can apply a set of conditions on a pandas DataFrame combined with *and*/*or* logic. But in pandas, instead of the `and` and `or` keywords, we use the symbols `&` and `|` respectively. Note that each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators. Let's see how.

Say we want to get those rows of the pandas DataFrame where the values of column `B` lie in the interval `(1.25,1.75)`, using the `&` operator, as follows.

First we combine the conditions with the `&` operator to get a boolean array that is `True` where both conditions are `True`.

Here, we apply the condition that the values of column `B` lie in the interval `(1.25,1.75)`. Remember that NumPy arrays and pandas Series accept `&` and `|` instead of `and` and `or`.

```python
first_cond = (df['B'] > 1.25).rename('C1')
second_cond = (df['B'] < 1.75).rename('C2')
dual_conds = (first_cond & second_cond).rename('C1 & C2')
dual_conds
```

Out:

```
1     False
2      True
3     False
4      True
5      True
6      True
7      True
8      True
9     False
10     True
Name: C1 & C2, dtype: bool
```

Below is a demonstration table showing which rows we get when we apply the condition over the range `(1.25,1.75)`.

`pd.concat([first_cond, second_cond, dual_conds], axis = 1)`

Out:

We can apply this dual condition to get the DataFrame rows where it is satisfied, as follows:

`df.loc[dual_conds, :]`

Out:

We can also apply it directly on the DataFrame, instead of through `loc`, as follows:

`df[dual_conds]`

Out:

## Applying Conditions over Multiple Columns

We can get those rows of a pandas DataFrame where conditions on multiple columns are `True`. Here we'll use a different approach from the one above, where we applied the conditions to a single column. There, we first obtained a Series where both conditions were `True` for a single column; here we'll test the condition on `2` columns. In that case we need a `2`-dimensional result, but a Series is `1`-dimensional. So first we apply the condition to the columns of the DataFrame, and then check where both columns are `True`.

In this example, we check the condition that the values of columns `A` and `B` lie in the interval `(0.5,1.5)`. See how:

```python
first_cond = df[['A', 'B']] > 0.5
second_cond = df[['A', 'B']] < 1.5
dual_conds = first_cond & second_cond
dual_conds
```

Out:

Since we tested the condition on two columns, we can take the sum of each row: where the sum is `2`, the condition holds on both columns, and those are the rows we want. So we'll sum each row and apply another condition to check whether both entries are `True`.

```python
true_on_2_columns = dual_conds.sum(axis=1) == 2
true_on_2_columns
```

Out:

```
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
dtype: bool
```

As we can see, there is only one row, i.e. the `10`th, where the condition holds on both columns. Let's see what this row is:

`df.loc[true_on_2_columns, ['A', 'B']]`

Out:

Finally, we get the row where the values of columns `A` and `B` lie in the interval `(0.5,1.5)`.
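As an aside, the row-sum trick above can also be expressed with `DataFrame.all(axis=1)`, which is `True` exactly where every column of the row is `True`; a sketch reproducing the same selection on illustrative random data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3) * 3,
                  index=np.arange(1, 11), columns=['A', 'B', 'C'])

dual_conds = (df[['A', 'B']] > 0.5) & (df[['A', 'B']] < 1.5)

# all(axis=1) tests each row for "every column True" directly,
# selecting the same rows as the sum(axis=1) == 2 test
by_all = dual_conds.all(axis=1)
by_sum = dual_conds.sum(axis=1) == 2

print(by_all.equals(by_sum))  # True
```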

# Filling Missing Values

Often, when working with big data, we need to fill in missing values. For example, we might fetch data keyed by date for the last 30 days, but the database never stored a 0 for the days on which no value was received. In such cases, a pandas DataFrame is very helpful for filling in the missing data quickly.

Let's first define a DataFrame for an item bought by users on different days. Suppose we only receive values for the days on which the item was bought, and no 0 for the days it was not, because our server saved no value for those days. When analyzing the data, we don't want to miss any day: if the item was not bought on a day, that day's value must be `0`.

Let's first define a DataFrame with 7 days of data, where the index is the date on which the item was bought.

```python
df = pd.DataFrame([1, 2, 6, 9, 2],
                  index=['2020-10-01', '2020-10-02', '2020-10-04',
                         '2020-10-06', '2020-10-07'],
                  columns=['Total Items Sold'])
df
```

Out:

We requested 7 days of data but got only 5, which means our item was not sold on 2 of the days, in this case `2020-10-03` and `2020-10-05`. When analyzing the data we want to fill in those days' values as `0`, so we proceed as follows for the missing data.

**To fill the missing data** we’ll append a pandas Series whose indexes are the dates in our selected range; the values don't matter, so here we'll set them to 0. First we define such a Series.

We can create a pandas Series from a dictionary, where each key/value pair of the dictionary becomes an index/value pair of the Series.

```python
dicti = dict()
for day in np.arange(1, 8):
    dicti.update({f'2020-10-{day:02d}': 0})
dates = pd.Series(dicti, name='date')
dates
```

Out:

```
2020-10-01    0
2020-10-02    0
2020-10-03    0
2020-10-04    0
2020-10-05    0
2020-10-06    0
2020-10-07    0
Name: date, dtype: int64
```

Now let's append this `dates` Series to our DataFrame, so that any missing date gets added. We'll append along the rows, but our dates lie along the index of `df`, so first we take the transpose of the DataFrame, interchanging rows and columns:

`df.T`

Out:

Then we can append the `dates` Series to fill in the missing dates, and afterwards we'll restore the original shape. Let's append the dates.

```python
df = df.T.append(dates)
df
```

Out:

As we can observe, the days that were missing have been appended with value `NaN`. Now we'll replace `NaN` with 0 and remove the `date` row, which is not of interest and was not part of our original DataFrame.

`df = df.fillna(0).drop('date')`

df

Out:

Now, finally, we’ll sort the columns in ascending date order and take the transpose again to get the original shape, with the missing dates' values as 0.

`df = df.sort_index(axis = 1).T`

df

Out:

As we can see, we have filled in the values for dates `2020-10-03` and `2020-10-05` as `0` in just a few steps and with little effort. We can even do the whole process in a single line, as follows:

`df.T.append(dates).fillna(0).drop('date').sort_index(axis = 1).T`

Isn’t it amazing?
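One caveat: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, the same fill can be done without any transposing via `reindex`, which adds the missing date labels and fills them directly (a sketch using the same data):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 6, 9, 2],
                  index=['2020-10-01', '2020-10-02', '2020-10-04',
                         '2020-10-06', '2020-10-07'],
                  columns=['Total Items Sold'])

# reindex inserts the missing dates and fills their values with 0
all_days = [f'2020-10-{day:02d}' for day in range(1, 8)]
filled = df.reindex(all_days, fill_value=0)

print(filled['Total Items Sold'].tolist())  # [1, 2, 0, 6, 0, 9, 2]
```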

If this article helped you, please give it a clap. To stay connected and get more articles on Big Data, Machine Learning, Deep Learning, and Python, please follow me. I’ll keep sharing my experience with you.