Importing data in a right way using Pandas

Photo by Etienne Girardet on Unsplash

In continuation to my previous blogs on read_csv method in pandas , this is the last blog covering some interesting topics that help to read csv files in a better way.

The topics covered in this tutorial are

1) Handling date columns

2) Bad input lines

3) Dealing with Missing values

4) Reading compressed files

1.Handling Date Columns:

The dataset has 9 rows, the first & third column are date and time. We also observe that these columns are treated as object type. Now, we will convert the columns to correct datatype at the time importing data using parse_dates attribute.

The parse_dates attribute takes values in lists and dictionary. We can pass the names of the columns to be parsed or the column number that needs to be parsed.

Setting of parse_dates can be done in multiple ways, let us view few formats.

a. Setting by column name, the list can include multiple names

b. Setting by column number

c. a. To combine multiple columns to form a date column by using list of lists

You can see there are only two columns now, the date and time are merged to form the date_time column with data type as datetime, and both the original columns were deleted. If you wish to keep original columns , set the attribute keep_date_col to true. By default, the keep_date_col option is set to false.

2. Avoid reading bad lines:

As we read the file with the attribute error_bad_lines set to false as below, all the bad lines get skipped. Parallelly, we get messages for the skipped lines, the message can be viewed in the image below.

However, make sure you don’t change the default setting of warn_bad_lines from true to false. If you change, the value of warn_bad_lines to false, you will not receive notification of the line that was skipped while reading. Better to leave it as default, unless you need to change it.

If you want to skip bad lines and don’t want any messages to be displayed, use the below line of code.

3. Dealing With Missing Values

Let us take this sample dataset that has missing data with different entries.

Now after setting the attribute with missing data values, we are able to replace all different entries of missing data with NaN. Now, with the replaced dataset we would be able to recognize all missing values better and that help us to handle missing data more effectively.

You can observe all missing data is named as NaN i.e. Not a Number.

4. Reading Compressed files

The compression attribute will take values ‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None with default value as ‘infer’.

In this tutorial we learned handling date columns , bad lines, missing values and reading compressed files at the time of importing.

This completes my tutorial on read_csv method of Pandas, hope you enjoyed the learning. Thanks for reading. Keep Learning !!!

Data Science and machine learning enthusiast