Reading files as you like it with Pandas

In this tutorial, we go through several parameters of the pandas read_csv function. They expose some of the more advanced capabilities of the pandas library and let you handle most routine file-reading tasks in Python. Please refer to my previous blog, Learning for Beginners in Pandas, for how to start reading files in Python. This section covers the following items.

  1. Reading the file path into a variable
  2. Working with headers in files
  3. Loading only selected columns from files
  4. Skipping top rows
  5. Skipping bottom rows
  6. Loading selected number of records

First of all, we store the path of the file in a variable, so that we can use the variable instead of typing the entire file path multiple times in the code. This not only saves time but also makes the code less error-prone, because any change to the path is made in a single location rather than in several places in the code.

Now we can use this variable in place of the file location. If the file path changes, you just need to update the variable.
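As a minimal sketch of this idea, the snippet below stores the path in a variable and passes it to read_csv. The file name and its contents are made up for illustration and are written to a temporary folder so the example can run anywhere; in practice the variable would point at your own file.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical sample file (assumption): replace with the path to your own CSV.
file_path = Path(tempfile.gettempdir()) / "sample_sales.csv"
file_path.write_text("id,product,price\n1,pen,1.5\n2,book,12.0\n3,bag,30.0\n")

# Reuse the variable wherever the file location is needed.
df = pd.read_csv(file_path)
print(df.shape)
```

If the file moves, only the line that defines file_path has to change.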

The pandas read_csv function has a built-in header parameter that sets the header at a line of your choice in the input CSV (comma-separated values) file. So, what is a header? You can think of the header as the list of labels that serve as the column names of the dataframe. Ideally, the header sits at the top of the file, but for various reasons it might be at a different position than expected. In such cases, the header parameter is really handy for setting the header of the dataframe.

By default, read_csv infers the header, which behaves like header=0 (the first line of the file) when no column names are passed explicitly. You can change this to match the formatting of the CSV file by setting header to the zero-based line number that holds the column names. To view the header, just call head() on the dataframe, which displays its top 5 rows along with the column labels.
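Here is a small sketch of the header parameter in action. The CSV text is invented for the example: a title line sits above the real column names, so the header is on the second line (line number 1 when counting from zero).

```python
import io

import pandas as pd

# Hypothetical data (assumption): a title line precedes the real header.
csv_text = "Sales report 2020,,\nid,product,price\n1,pen,1.5\n2,book,12.0\n"

# Default behaviour: the first line is taken as the header.
df_default = pd.read_csv(io.StringIO(csv_text))

# header=1: the column names are on the second line of the file.
df = pd.read_csv(io.StringIO(csv_text), header=1)

# head() shows the column labels and up to the first 5 rows.
print(df.head())
```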

Pandas uses the first row of the file as the header by default. There may be cases where your data has no header and you don’t want pandas to treat the first row as one. In such cases, pass header=None, which ensures that no header is picked from the CSV file. Pandas then labels the columns with integers starting from 0.
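A quick sketch of header=None, using made-up CSV text that has no header line:

```python
import io

import pandas as pd

# Hypothetical data (assumption): every line is a record, there is no header.
csv_text = "1,pen,1.5\n2,book,12.0\n3,bag,30.0\n"

# header=None keeps the first line as data and labels the columns 0, 1, 2.
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.columns.tolist())
print(len(df))
```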

The read_csv function also lets you choose the columns you want to work with. There will be situations where a file has hundreds of columns but you need only a few of them. Beginner analysts often read the entire CSV file and then drop all the unwanted columns, but that is not an ideal way to handle them. The better way is to read only the desired columns. Of course, you do this only if you are certain which columns you will need. This can save a lot of time otherwise spent reading extra columns and deleting them manually later. We achieve this with the usecols parameter of read_csv.
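The following sketch loads only two columns out of five. The column names and values are assumptions made up for the example.

```python
import io

import pandas as pd

# Hypothetical data (assumption): five columns, of which we need two.
csv_text = (
    "id,product,price,region,discount\n"
    "1,pen,1.5,north,0.0\n"
    "2,book,12.0,south,0.1\n"
)

# usecols limits the read to the listed columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "price"])
print(df.columns.tolist())
```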

One more parameter can control the header location: skiprows. Passing an integer to skiprows skips that number of lines at the start of the file. Since the header defaults to the first line that is read, the first line after skipping n lines becomes the header of the dataframe. To be more specific, skiprows=1 skips the first line of the file and makes the second line the header.
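A minimal sketch of skiprows=1, assuming an invented file whose first line is a note rather than the header:

```python
import io

import pandas as pd

# Hypothetical data (assumption): a note on line 1, the header on line 2.
csv_text = "exported by reporting tool\nid,product,price\n1,pen,1.5\n2,book,12.0\n"

# Skip the first line; the next line ("id,product,price") becomes the header.
df = pd.read_csv(io.StringIO(csv_text), skiprows=1)
print(df.columns.tolist())
```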

But when using skiprows, you need to be cautious with the usecols option. When rows are skipped, the header may change, so the column names listed in usecols may no longer be found, and read_csv raises a ValueError. So, make sure the column names in usecols match the header that is actually read.
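The sketch below reproduces the problem on made-up data. Skipping the real header turns the first record into the header, so the names passed to usecols no longer exist and read_csv fails.

```python
import io

import pandas as pd

# Hypothetical data (assumption): the header is on the very first line.
csv_text = "id,product,price\n1,pen,1.5\n2,book,12.0\n"

error_message = None
try:
    # skiprows=1 drops the real header, so "id" and "price" cannot be found.
    pd.read_csv(io.StringIO(csv_text), skiprows=1, usecols=["id", "price"])
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```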

Similar to skiprows, which handles rows at the top of the file, the skipfooter parameter handles lines at the bottom of the file. With skipfooter we can skip n lines from the bottom.
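As a sketch, assume a file with a header and 10,000 records (generated here in memory, not the author's actual file). Skipping 9990 lines from the bottom leaves only the first 10 records.

```python
import io

import pandas as pd

# Hypothetical data (assumption): a header line plus 10,000 records.
rows = "\n".join(f"{i},{i * 2}" for i in range(10_000))
csv_text = "id,value\n" + rows

# Skip the last 9990 lines of the file.
# Without engine="python", pandas may emit a ParserWarning here.
df = pd.read_csv(io.StringIO(csv_text), skipfooter=9990)
print(len(df))
```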

For example, skipfooter=9990 skips the bottom 9990 rows, which is a huge number of rows to skip from the bottom of a file. This is not the ideal way of skipping rows, but you can adjust the value as per your requirement. When you run read_csv with skipfooter, you might get a ParserWarning saying that pandas is falling back to the 'python' engine because the default 'c' engine does not support skipfooter.

Don’t worry, it is just a warning, and you can avoid it by passing engine='python' to read_csv.
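Here is a sketch of the same kind of read with the engine set explicitly. The data is again generated in memory as an assumption, and the warnings module is used only to confirm that no ParserWarning is raised.

```python
import io
import warnings

import pandas as pd

# Hypothetical data (assumption): a header line plus 10,000 records.
rows = "\n".join(f"{i},{i * 2}" for i in range(10_000))
csv_text = "id,value\n" + rows

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # engine="python" supports skipfooter, so no fallback warning is needed.
    df = pd.read_csv(io.StringIO(csv_text), skipfooter=9990, engine="python")

# Collect any ParserWarning that was raised during the read.
parser_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.ParserWarning)
]
print(len(df), len(parser_warnings))
```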

To select only part of the records from a huge file, we can use the nrows parameter of read_csv. It reads only the given number of records from the start of the file.
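A short sketch of nrows, assuming an invented file with 1,000 records:

```python
import io

import pandas as pd

# Hypothetical data (assumption): a header line plus 1,000 records.
rows = "\n".join(f"{i},{i * 2}" for i in range(1_000))
csv_text = "id,value\n" + rows

# Read only the first 550 records; the header line is not counted.
df = pd.read_csv(io.StringIO(csv_text), nrows=550)
print(len(df))
```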

For example, nrows=550 reads only the first 550 records from the file.

In this tutorial, we learned different ways to read only the desired content and to set the desired header for a file. I hope this helps you in your routine analyst tasks.

Follow my future blogs for more tips on reading files with the pandas library, including different kinds of errors and ways to handle them. Keep Learning …

Data Science and machine learning enthusiast