Pandas, a Powerful Tool for Analysts


In this tutorial, we will learn some tricks to handle data better while reading files in Python using the read_csv method. This is a continuation of my prior tutorials on read_csv; you can view those for some foundation on reading files and other tips and tricks.

Here, we will cover some of the more advanced features of pandas.

The following topics will be covered in this blog:

1. Data Type Specification at the time of importing data

2. Converters

3. Iterators

4. Chunk Size

The read_csv method of Pandas provides a way to specify the data type of the columns at the time of importing data. The dtype parameter helps us achieve this. It can be set in two ways.

In the first way, you set the data type of all columns at once by passing a single type as the value of the dtype parameter, as below.
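A minimal sketch of this first way. The column names and values below are a small inline stand-in for the cars dataset used throughout this tutorial (in practice you would pass your own file path to read_csv):

```python
import io
import pandas as pd

# Inline sample standing in for the cars file used in this tutorial.
csv_data = io.StringIO(
    "MPG,Weight,Origin\n"
    "18.0,3504,US\n"
    "15.0,3693,US\n"
    "24.0,2372,Japan\n"
)

# dtype=str makes read_csv load every column as a string (object) column,
# instead of inferring numeric types for MPG and Weight.
df = pd.read_csv(csv_data, dtype=str)
print(df.dtypes)  # MPG, Weight and Origin all show as object
```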

We can observe that all the columns now share the same data type, object, as specified in the dtype parameter.

The second procedure involves passing a dictionary containing the column names as keys and the data types as values. Using this procedure, we can set the data type of selected columns only.
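A sketch of the dictionary form, again using a small inline sample in place of the tutorial's file:

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "MPG,Weight,Origin\n"
    "18.0,3504,US\n"
    "15.0,3693,US\n"
)

# Only the listed columns get an explicit type; the rest are inferred as usual.
df = pd.read_csv(csv_data, dtype={"MPG": float, "Weight": float})
print(df.dtypes)  # MPG and Weight are float64, Origin stays object
```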

This changes the data type of the ‘MPG’ and ‘Weight’ columns from object to float.

This rich feature takes the read_csv method to the next level of data pre-processing. It is an incredible feature, yet little used in the analyst community due to a lack of awareness. So let us explore this parameter and add it to our pandas toolkit.

The converters parameter lets us execute functions, passed to it in a dictionary, while the file is being read. Running functions at the time of reading? Isn’t that an amazing feature? This powerful parameter makes many pre-processing tasks easy that are not possible via other parameters.

Let us learn this by example. Suppose we have a column ‘Origin’ with the country code ‘US’.

Now, if we want to replace the country code with ‘USA’, it is not possible with the other parameters. But with the help of converters, we can achieve this.

Let us have a look at the code of this simple task.
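A runnable sketch of the idea, with an inline sample in place of the tutorial's file:

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "MPG,Origin\n"
    "18.0,US\n"
    "24.0,Japan\n"
)

# The lambda runs on every raw value of 'Origin' as the file is read,
# replacing 'US' with 'USA' and leaving other values untouched.
df = pd.read_csv(
    csv_data,
    converters={"Origin": lambda value: "USA" if value == "US" else value},
)
print(df["Origin"].tolist())  # ['USA', 'Japan']
```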

This code converts all values of ‘US’ in the column ‘Origin’ to ‘USA’. We achieved this by writing a simple lambda function and passing it as the value for the key ‘Origin’, which is the column name in the file.

Quick Note: converters supersedes dtype, i.e. if both parameters are supplied for the same column, the data type rules from converters are applied.

This parameter helps us read a small piece of data at a time from a huge file, so we can control how much data we pull. Once the file is opened with the iterator parameter set to True, we can work with the data in chunks, pulling them in desired (and possibly varying) sizes using the get_chunk method.

Below is the code snippet to apply iterators on a file.
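A minimal sketch of the pattern, with a six-row inline sample standing in for the 406-row dataset from this tutorial:

```python
import io
import pandas as pd

# Six rows standing in for the 406-row cars dataset.
csv_data = io.StringIO(
    "MPG,Weight\n"
    "18.0,3504\n15.0,3693\n18.0,3436\n"
    "16.0,3433\n17.0,3449\n15.0,4341\n"
)

# iterator=True makes read_csv return a reader instead of a dataframe.
reader = pd.read_csv(csv_data, iterator=True)

first = reader.get_chunk(3)   # rows 0-2
second = reader.get_chunk(2)  # rows 3-4, no overlap with the first pull
rest = reader.get_chunk(10)   # asking for more rows than remain returns only row 5
print(len(first), len(second), len(rest))  # 3 2 1
```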

The dataset used in this tutorial has 406 rows, and each get_chunk call pulls fresh data with no overlap with the previous pull. Even if the number of requested records exceeds what remains in the dataframe, it simply returns all remaining records, ensuring the pull of the data is complete.

chunksize works similarly to iterator: both parameters let us read a small piece of data from the file at a time. The key difference is that with chunksize all the chunks are of the same size, whereas with iterator the chunk sizes can vary.
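A sketch of the chunksize pattern. Here a ten-row inline sample with chunksize=4 stands in for reading the 406-row file with chunksize=101, but the behavior is the same:

```python
import io
import pandas as pd

# Ten rows standing in for the 406-row dataset; chunksize=4 stands in for 101.
csv_data = io.StringIO("MPG\n" + "\n".join(str(v) for v in range(10)))

# Iterating over the reader yields a dataframe of (at most) chunksize rows each time.
sizes = [len(chunk) for chunk in pd.read_csv(csv_data, chunksize=4)]
print(sizes)  # [4, 4, 2] -- all chunks equal except the last, smaller one
```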

Reading our file with a chunk size of 101 means pandas returns 101 rows at a time. Looping over the chunks to check their sizes, we observe that all chunks are of equal size except the last one: there are not enough records left to fill it, so it holds only 2 rows (406 = 4 × 101 + 2).

Note: one important point to notice is that with both iterator and chunksize, the object returned by read_csv is not a dataframe anymore. Once we apply these parameters, a TextFileReader object gets created.
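We can verify this directly (the inline sample is again a stand-in for the tutorial's file):

```python
import io
import pandas as pd

csv_data = io.StringIO("MPG\n18.0\n15.0\n")

reader = pd.read_csv(csv_data, iterator=True)
# The result is a pandas TextFileReader, not a DataFrame.
print(type(reader).__name__)  # TextFileReader
```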

In this tutorial, we learned advanced concepts of reading files in Python that help us change data types and execute functions on columns at the time of importing data.

Also, we learned how to read data in small chunks using chunksize and iterator, along with a short discussion of the differences between these parameters.

Hope you enjoyed this tutorial. Thank you for reading. Keep Learning !!!
