In this tutorial we will discuss the DataFrame, a powerful pandas object for handling complex data. What is a DataFrame? It is like a table or Excel sheet with rows and columns: a two-dimensional, size-mutable structure capable of holding heterogeneous data.
The DataFrame is the primary pandas object for any data analyst or scientist. You can read a CSV, Excel or other formatted file into a dataframe and start accessing or modifying the data as you need. For more information about reading a CSV file, refer to my earlier blogs.
This tutorial covers a wide variety of dataframe basics: getting different kinds of information about a dataframe, reading values based on index and column, and modifying values in a dataframe.
Let us have a quick look at the various attributes and methods of a dataframe.
1. View the data format in the Dataframe
Once you read a file into a dataframe, view the data to get a quick look at how it is arranged. The head method displays the top 5 rows by default, and you can change the number of rows by passing a value, for example head(10).
Similar to head, the tail method displays the bottom rows of the dataframe; the default number of rows displayed is also 5.
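Here is a minimal sketch of head and tail. The dataframe df1 is a hypothetical stand-in with 9 rows and 3 columns; the `city` column and all the values are assumptions for illustration, not the data from the original file.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

top_default = df1.head()      # first 5 rows (the default)
top_three = df1.head(3)       # first 3 rows
bottom_default = df1.tail()   # last 5 rows (the default)
bottom_two = df1.tail(2)      # last 2 rows

print(top_three)
print(bottom_two)
```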
2. Get Dataframe Column names
Not all column names are displayed by head or tail when there are too many columns to fit on a line. In such cases, use the columns attribute, which lists all the column names in the dataframe.
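A short sketch of the columns attribute, again on a hypothetical df1 (the column names are assumptions):

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# columns is an attribute, so there are no parentheses.
print(df1.columns)

# Convert to a plain Python list when that is easier to work with.
column_names = list(df1.columns)
print(column_names)  # ['date', 'city', 'value']
```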
If you would like to view the index range and column names at the same time, use the axes attribute.
For a dataframe with 9 rows, the index range starts at 0 and stops at 9, and the stop value 9 itself is not included.
In other words, the last index in the dataframe is 8; there is no index 9.
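The sketch below shows what axes returns for a hypothetical 9-row df1 (the data is an assumption for illustration) and confirms that the stop value is excluded.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# axes is a list: first the row index, then the column index.
row_axis, column_axis = df1.axes
print(row_axis)           # RangeIndex(start=0, stop=9, step=1)
print(list(column_axis))  # ['date', 'city', 'value']

# The stop value is not part of the index; the last label is 8.
last_label = df1.index[-1]
has_nine = 9 in df1.index
print(last_label, has_nine)
```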
3. Column data types
In a dataframe each column holds a single data type. To view the data type of all columns in a dataframe, use the dtypes attribute.
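As a rough illustration, dtypes on a hypothetical df1 (column names and values are assumptions) returns one data type per column:

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# dtypes is a Series indexed by column name.
print(df1.dtypes)

# Look up the data type of a single column.
value_dtype = df1.dtypes["value"]
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df1["date"])
print(value_dtype, date_is_datetime)
```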
If you would like to view only the columns of a given data type, use the select_dtypes() method.
Selecting the float data type, for example, displays only the column with that type, i.e. ‘value’.
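A minimal sketch of select_dtypes, assuming a hypothetical df1 in which ‘value’ is the only float column:

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# Keep only the columns whose data type is float64.
float_columns = df1.select_dtypes(include="float64")
print(float_columns.head())

# exclude works the other way round: drop the float columns.
non_float_columns = df1.select_dtypes(exclude="float64")
print(list(non_float_columns.columns))
```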
If you want to convert a column to a specific data type, you can use the astype method.
After applying astype(str) to the date column, it is converted from datetime to string. Using astype on the whole dataframe, you can also convert all of its columns to a single data type.
Converting df1 this way leaves it with all columns of string data type.
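The two conversions can be sketched as follows. The data is hypothetical, and the results are stored in new variables (`converted`, `all_strings`, both illustrative names) so the original df1 stays untouched.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# Convert a single column: datetime -> string.
converted = df1.copy()
converted["date"] = converted["date"].astype(str)
print(type(converted["date"].iloc[0]))

# Convert every column of the dataframe to string in one call.
all_strings = df1.astype(str)
print(all_strings.dtypes)
```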
4. Shape, Size, Dimension & other attributes
a. To get the number of dimensions of a dataframe, use the ndim attribute.
b. To get the shape of the dataframe, i.e. how many rows and columns it has, use the shape attribute.
Here the dataframe has 9 rows and 3 columns.
c. To get the size of a dataframe, use the size attribute. It gives the number of cells in the dataframe, i.e. the number of rows multiplied by the number of columns.
Here that is 9 rows multiplied by 3 columns, which gives 27.
d. To check whether a dataframe is empty, use the empty attribute.
The empty attribute returns a Boolean value. The result False indicates the dataframe is not empty.
shape, size, empty and ndim are attributes, so you don’t need to include parentheses when using them.
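The four attributes above can be sketched together on a hypothetical 9 x 3 df1 (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

dimensions = df1.ndim   # 2 (rows and columns)
shape = df1.shape       # (9, 3)
size = df1.size         # 27, i.e. 9 * 3
is_empty = df1.empty    # False

print(dimensions, shape, size, is_empty)

# A dataframe with no rows or columns reports empty as True.
print(pd.DataFrame().empty)
```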
5. Get complete information of Dataframe
The info() method provides detailed information about a dataframe: the index range, column names, data types, number of rows and columns, number of non-null values in each column, and memory usage.
For df1, it shows 9 rows, 3 columns, an index running from 0 to 8, and the column names along with the memory occupied by the entire dataframe. This is a simple way to view information about the dataframe.
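A small sketch of info() on a hypothetical df1. Normally you would just call df1.info() and read the printed summary; here the output is also captured in a string (via the `buf` parameter) so it can be inspected.

```python
import io

import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# info() prints to the console by default and returns None.
df1.info()

# Passing a buffer captures the same summary as text.
buffer = io.StringIO()
df1.info(buf=buffer)
summary = buffer.getvalue()
```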
6. Memory Usage of Dataframe
If you would like to know the memory usage of each column, the memory_usage method will help you.
The memory usage of each column is displayed in bytes. By default, the memory usage of the index is also included; you can leave it out by passing index=False. The total memory usage is displayed by DataFrame.info by default, and it can be suppressed by setting pandas.options.display.memory_usage to False.
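As a sketch, here is memory_usage with and without the index on a hypothetical df1. The exact byte counts depend on the data types; the 72 bytes noted below simply follows from 9 float64 values at 8 bytes each.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# Default: one entry per column plus an "Index" entry, all in bytes.
with_index = df1.memory_usage()
print(with_index)

# index=False leaves the index out of the result.
without_index = df1.memory_usage(index=False)
print(without_index)

value_bytes = without_index["value"]  # 9 float64 values * 8 bytes = 72
```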
7. Copying a dataframe
Copying a dataframe creates a new dataframe from an existing one, using the copy method.
Setting deep=True ensures that changes to the new copy are not reflected in the original dataframe. If we set deep to False instead, the copy shares its data with the original, so changes made to one may be reflected in the other, depending on the pandas version and whether Copy-on-Write is enabled. The default value of deep is True.
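A minimal sketch of a deep copy, using a hypothetical df1 and an illustrative name `df2` for the copy. It only demonstrates the deep=True case, since the behaviour of a shallow copy varies between pandas versions.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# deep=True is the default; it is spelled out here for clarity.
df2 = df1.copy(deep=True)

# Change a value in the copy only.
df2.loc[0, "value"] = 100.0

print(df1.loc[0, "value"])  # 1.5, the original is unchanged
print(df2.loc[0, "value"])  # 100.0
```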
8. Access and modify data:
Now, we discuss the most popular way of accessing data in a dataframe, i.e. loc. Most people get confused by the behaviour of loc, so let me explain it briefly.
All you need to know is one simple thing: we pass the rows and columns of the data that we would like to access. So, the general syntax of loc is dataframe.loc[rows, columns]. An important point to observe is that we use square brackets and not parentheses.
The row value is based on the index label.
Follow the rules below for passing values to rows and columns:
- To access a single cell value, pass the row label and the column name directly.
- To select rows in arbitrary order, pass them in a list.
- To select rows in sequence or in a range, provide the start and the end row separated by a colon. With loc, both the start and the end are included.
- Columns must be referred to by name (text), not by position.
- To select multiple columns, pass the column names in a list.
a. If you want to select a specific cell value at a given row and column, pass the row label and the column name to loc.
b. If you want to select a sequence of rows, use a colon between the start and end rows.
c. If you want to select multiple rows that are non-sequential, use a list to provide the row labels.
d. If you need to select multiple columns, use a list to provide the names.
The data is returned in the order of the values in the lists we provide.
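Cases a to d can be sketched as follows on a hypothetical df1 (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# a. A single cell: row label 2, column "value".
cell = df1.loc[2, "value"]          # 3.25

# b. A sequence of rows: labels 2 to 4, both ends included.
row_range = df1.loc[2:4]

# c. Non-sequential rows, returned in the order of the list.
picked_rows = df1.loc[[6, 0, 3]]

# d. Multiple columns, also returned in the order of the list.
picked_cols = df1.loc[[6, 0, 3], ["value", "date"]]

print(cell)
print(row_range)
print(picked_cols)
```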
e. In loc the columns should be referred to by name only; passing a column number instead typically raises a KeyError.
f. If we don’t pass any column, all columns are returned by default.
g. We can select rows based on a condition.
Only the records matching the condition will be selected.
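Here is a rough sketch of cases e to g on a hypothetical df1. The threshold of 5 in the condition is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# e. A column number is not a valid label here, so loc raises KeyError.
got_key_error = False
try:
    df1.loc[0, 2]
except KeyError:
    got_key_error = True

# f. No column given: all columns of row 0 are returned (as a Series).
first_row = df1.loc[0]

# g. Condition: keep only rows where value is greater than 5.
high_values = df1.loc[df1["value"] > 5]

print(got_key_error)
print(first_row)
print(high_values)
```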
h. We can modify values in a dataframe using loc.
i. We can also use loc to set all values in a column to a constant value.
There are a lot of ways to modify data; you just need to try the different options for accessing it.
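A minimal sketch of modifying data with loc, on a hypothetical df1. The new values (10.0, 0.0 and "Z") are arbitrary and only for illustration.

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# h. Modify a single cell.
df1.loc[0, "value"] = 10.0

# Modify only the rows that match a condition.
df1.loc[df1["city"] == "B", "value"] = 0.0

# i. Set every value in a column to the same constant.
df1.loc[:, "city"] = "Z"

print(df1)
```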
9. Accessing and modifying data based on Integer values
With loc, we passed the name of the column as a string. With iloc we use only integers to access the data. This is really useful when you have a huge number of columns, as typing column names can cause unwanted errors due to case sensitivity, spelling and other issues. With numbered columns, accessing data becomes much easier. Also, we can now set up a range for columns as well.
iloc follows the same rules as loc, mentioned earlier in this blog, with one difference: in an iloc range the end position is not included.
a. To read a single cell value, pass the row and column numbers.
b. To display rows and columns in a specific range, pass a range for each. For example, the column range 0:2 selects columns 0 and 1. The first column in a dataframe is always 0.
c. To select data from non-sequential rows and columns, pass lists of row and column numbers.
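The three iloc cases can be sketched as follows, again on a hypothetical df1 (column order and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data: 9 rows, 3 columns (date, city, value).
df1 = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "city": list("ABCABCABC"),
    "value": [1.5, 2.0, 3.25, 4.0, 5.5, 6.0, 7.75, 8.0, 9.5],
})

# a. A single cell: row position 2, column position 2 ("value").
cell = df1.iloc[2, 2]               # 3.25

# b. Ranges: rows 0 to 2 and columns 0 to 1 (end positions excluded).
block = df1.iloc[0:3, 0:2]

# c. Non-sequential rows and columns, given as lists of positions.
picked = df1.iloc[[6, 0, 3], [2, 0]]

print(cell)
print(block)
print(picked)
```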
In this tutorial we learned the basics of the dataframe, including a wide variety of attributes and methods for retrieving information about a dataframe, modifying its values, viewing its records as needed and other tasks.
Hope you enjoyed this episode on dataframe basics. Keep reading. Keep Learning!!!