Importing data in a right way using Pandas

Photo by Etienne Girardet on Unsplash

In continuation to my previous blogs on read_csv method in pandas , this is the last blog covering some interesting topics that help to read csv files in a better way.

The topics covered in this tutorial are

1) Handling date columns

2) Bad input lines

3) Dealing with Missing values

4) Reading compressed files

We have a special attribute in read_csv called parse_dates that has the ability to parse the column into a date datatype. Generally, when a date-based column is read, it was treated as a string object, but with the help of this attribute we can control the datatype of date-based column to reflect correctly. The default value of this attribute is false, to enable this attribute we should explicitly mention the value as True or pass values in list or dictionary for this attribute. Let us have quick glance on the dataset being used for this.

The dataset has 9 rows, the first & third column are date and time. We also observe that these columns are treated as object type. Now, we will convert the columns to correct datatype at the time importing data using parse_dates attribute.

The parse_dates attribute takes values in lists and dictionary. We can pass the names of the columns to be parsed or the column number that needs to be parsed.

Setting of parse_dates can be done in multiple ways, let us view few formats.

a. Setting by column name, the list can include multiple names

b. Setting by column number

c. a. To combine multiple columns to form a date column by using list of lists

You can see there are only two columns now, the date and time are merged to form the date_time column with data type as datetime, and both the original columns were deleted. If you wish to keep original columns , set the attribute keep_date_col to true. By default, the keep_date_col option is set to false.

When reading a file, if the python interpreter encounters a bad line, it halts the program. This can be quite challenging when we are processing a huge file and not aware how many bad lines may exist in the file. Here comes the attribute error_bad_lines to save us from reading the same file multiple times to detect and correct bad lines. By default, this attribute is set to True, as we set it to False the bad lines will be skipped from reading. We get below error when the python interpreter encounters a bad line.

As we read the file with the attribute error_bad_lines set to false as below, all the bad lines get skipped. Parallelly, we get messages for the skipped lines, the message can be viewed in the image below.

However, make sure you don’t change the default setting of warn_bad_lines from true to false. If you change, the value of warn_bad_lines to false, you will not receive notification of the line that was skipped while reading. Better to leave it as default, unless you need to change it.

If you want to skip bad lines and don’t want any messages to be displayed, use the below line of code.

One of the most common challenge in reading any file is about handling missing values. Sometimes, we might have missing data in different formats such as No data, Not Available or No information. Fortunately, read_csv method has attributes that can help us to format the missing data and organize it better while importing itself. Let us have quick glance on the attribute na_values.

Let us take this sample dataset that has missing data with different entries.

Now after setting the attribute with missing data values, we are able to replace all different entries of missing data with NaN. Now, with the replaced dataset we would be able to recognize all missing values better and that help us to handle missing data more effectively.

You can observe all missing data is named as NaN i.e. Not a Number.

We can also read compressed files using read_csv method, the attribute compression helps us to decompress files on the fly, that eliminates the need of decompressing files before reading it. Let us see how the magic is happening.

The compression attribute will take values ‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None with default value as ‘infer’.

In this tutorial we learned handling date columns , bad lines, missing values and reading compressed files at the time of importing.

This completes my tutorial on read_csv method of Pandas, hope you enjoyed the learning. Thanks for reading. Keep Learning !!!

Data Science and machine learning enthusiast