Debug School

rakesh kumar

Pandas: Understanding DataFrames

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

We will take a brief look at the basic operations that can be performed on a Pandas DataFrame:

Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Iterating over rows and columns

In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, from a dictionary, from a list of dictionaries, and so on. Here are some of the ways to create a DataFrame:

Creating a DataFrame using a list: A DataFrame can be created from a single list or from a list of lists.

import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
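
The introduction also mentions creating a DataFrame from a list of dictionaries, which the walkthrough does not show. A minimal sketch, assuming made-up sample records (the names `records` and `df_records` are only for illustration):

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
# The records below are made-up sample data.
records = [
    {"Name": "Tom", "Age": 20},
    {"Name": "Nick", "Age": 21},
    {"Name": "Krish"},  # no "Age" key: pandas fills the gap with NaN
]

df_records = pd.DataFrame(records)
print(df_records)
```

A row with a missing key does not raise an error; the corresponding cell is simply left as NaN.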

Creating a DataFrame from a dict of ndarrays/lists: To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

# Python code demonstrating how to create a
# DataFrame from a dict of ndarrays/lists,
# using the default index.

import pandas as pd

# initialise data of lists
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output
print(df)
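
The rule about index length can be checked directly. The sketch below reuses the same data and passes an explicit index; the labels `rank1` to `rank4` are arbitrary and chosen only for illustration:

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# Four rows, so the index needs exactly four labels.
df_indexed = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df_indexed)

# An index of the wrong length is rejected with a ValueError.
mismatch_error = None
try:
    pd.DataFrame(data, index=['rank1', 'rank2'])
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```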

Because a DataFrame arranges data in rows and columns, we can perform basic operations on both, such as selecting, deleting, adding, and renaming.

Column Selection: To select a column in a Pandas DataFrame, we access it by its column name.

# Import pandas package
import pandas as pd

# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# select two columns
print(df[['Name', 'Qualification']])
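
Adding, deleting, and renaming columns are mentioned above but not demonstrated. One possible way to do each, sketched on a trimmed copy of the employee data (the new column name `City` is an arbitrary choice for illustration):

```python
import pandas as pd

data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32]}
df = pd.DataFrame(data)

# Add a column by assigning a list with one value per row.
df['Address'] = ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']

# Rename a column; rename() returns a new DataFrame by default.
df = df.rename(columns={'Address': 'City'})

# Delete a column; drop() also returns a new DataFrame.
df_without_age = df.drop(columns=['Age'])

print(df)
print(df_without_age)
```

Both `rename()` and `drop()` leave the original object untouched unless the result is assigned back, which is why the sketch reassigns `df` after renaming.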

Row Selection: Pandas provides the DataFrame.loc[] indexer to retrieve rows from a DataFrame by label. Rows can also be selected by integer position using the iloc[] indexer.

Note: The examples below use the nba.csv file.

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
Output: two Series are returned, one per call, because each call passes a single row label.

Indexing in pandas simply means selecting particular rows and columns of data from a DataFrame. It can mean selecting all of the rows and some of the columns, some of the rows and all of the columns, or some of each. Indexing is also known as subset selection.

Indexing a DataFrame using the indexing operator [] :
The term "indexing operator" refers to the square brackets following an object. The .loc and .iloc indexers also use square brackets to make selections, but in this section "indexing operator" refers to df[].

To select a single column, we simply put the name of the column between the brackets:

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving columns by indexing operator
first = data["Age"]

print(first)
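
What the indexing operator returns depends on what goes between the brackets. The sketch below uses a small made-up frame in place of nba.csv (the player names and values are placeholders, not data from that file):

```python
import pandas as pd

# Small made-up frame standing in for nba.csv
players = pd.DataFrame(
    {"Age": [25, 22, 30], "Team": ["A", "B", "A"]},
    index=pd.Index(["Player One", "Player Two", "Player Three"], name="Name"),
)

age_series = players["Age"]    # a single column name gives a Series
age_frame = players[["Age"]]   # a list of names gives a DataFrame

print(type(age_series).__name__)
print(type(age_frame).__name__)
```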

Indexing a DataFrame using .loc[ ] :
This indexer selects data by the labels of the rows and columns. The df.loc indexer selects data differently from the plain indexing operator: it can select subsets of rows or of columns, and it can also select subsets of rows and columns simultaneously.

Selecting a single row
To select a single row using .loc[], we pass a single row label to .loc[].

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving rows by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)

Output: two Series are returned, one per call, because each call passes a single row label.
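
Selecting several rows, or rows and columns together, follows the same pattern. A minimal sketch on a made-up frame, since nba.csv is not included here (all names and values are placeholders):

```python
import pandas as pd

# Small made-up frame standing in for nba.csv
players = pd.DataFrame(
    {"Team": ["A", "B", "A"],
     "Age": [25, 22, 30],
     "Position": ["PG", "SG", "C"]},
    index=pd.Index(["Player One", "Player Two", "Player Three"], name="Name"),
)

# A list of row labels returns a DataFrame instead of a Series.
two_rows = players.loc[["Player One", "Player Three"]]

# Row labels and column labels can be given together.
subset = players.loc[["Player One", "Player Two"], ["Team", "Age"]]

# A single row label plus a single column label returns one value.
age = players.loc["Player Two", "Age"]

print(two_rows)
print(subset)
print(age)
```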

Indexing a DataFrame using .iloc[ ] :
This indexer retrieves rows and columns by position. To use it, we specify the integer positions of the rows we want and, optionally, the positions of the columns. The df.iloc indexer is very similar to df.loc but uses only integer locations to make its selections.

Selecting a single row
To select a single row using .iloc[], we pass a single integer to .iloc[].

import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving a row by iloc (position 3 is the fourth row)
row2 = data.iloc[3]

print(row2)
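
Row and column positions can also be combined in one .iloc[] call. The sketch below assumes a small made-up frame rather than nba.csv (names and values are placeholders):

```python
import pandas as pd

# Small made-up frame standing in for nba.csv
players = pd.DataFrame(
    {"Team": ["A", "B", "A"],
     "Age": [25, 22, 30],
     "Position": ["PG", "SG", "C"]},
    index=pd.Index(["Player One", "Player Two", "Player Three"], name="Name"),
)

first_row = players.iloc[0]            # single integer: one row as a Series
first_two = players.iloc[0:2]          # slice: positions 0 and 1 (stop is excluded)
corner = players.iloc[[0, 2], [0, 1]]  # rows 0 and 2, columns 0 and 1
cell = players.iloc[1, 1]              # row 1, column 1: a single value

print(first_row)
print(first_two)
print(corner)
print(cell)
```

Note that an .iloc[] slice excludes its stop position, as ordinary Python slicing does, whereas a label slice with .loc[] includes both endpoints.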

Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a significant problem in real-life scenarios. In pandas, missing values are also referred to as NA (Not Available) values.

Checking for missing values using isnull() and notnull() :
To check for missing values in a Pandas DataFrame, we use the isnull() and notnull() methods. Both help check whether a value is NaN or not. They can also be used on a Pandas Series to find null values in the series.

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# using isnull() function
print(df.isnull())
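
The notnull() method is not demonstrated above. A minimal sketch on the same scores, also showing one common way to count missing values and to filter on them (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

# notnull() is the element-wise opposite of isnull().
print(scores.notnull())

# Count missing values per column: True counts as 1 when summed.
missing_per_column = scores.isnull().sum()
print(missing_per_column)

# Keep only the rows where 'First Score' is present.
has_first = scores[scores['First Score'].notnull()]
print(has_first)
```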

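Converting lists to a pandas table

# job_title, job_location, company_name and exp_Reqd are assumed to be
# equal-length lists defined earlier
df = pd.DataFrame({'Job_title': job_title, 'Job_location': job_location,
                   'Company_name': company_name, 'Experience': exp_Reqd})

A self-contained version using a list of lists:
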
import pandas as pd

# Create a list of elements
data_list = [
    ["Alice", 25],
    ["Bob", 30],
    ["Charlie", 35],
    ["David", 40]
]

# Convert the list into a Pandas DataFrame
df = pd.DataFrame(data_list, columns=["Name", "Age"])

# Display the DataFrame
print(df)
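
The outline at the top also lists iterating over rows and columns, which the walkthrough does not demonstrate. A minimal sketch using `iterrows()` and `items()` on a two-row version of the frame above (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame([["Alice", 25], ["Bob", 30]], columns=["Name", "Age"])

# Row by row: iterrows() yields (index label, row as a Series).
row_summaries = []
for idx, row in df.iterrows():
    row_summaries.append(f"{idx}: {row['Name']} is {row['Age']}")
print(row_summaries)

# Column by column: items() yields (column name, column as a Series).
column_lengths = {name: len(col) for name, col in df.items()}
print(column_lengths)
```

Row-wise loops are convenient for small frames but tend to be slow on large ones, where column-wise operations are usually preferred.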