Reading files is a fundamental task for any data analyst. Pandas is a library that helps us do this task more efficiently, and its rich features make the life of an analyst easier. In this blog, I would like to show some tips for reading files with the Pandas library. I will also discuss some common errors and how to fix them. These are among the most frequent challenges encountered by beginner pandas programmers.
The first task is to install and import the pandas library, which is very simple.
If you haven’t installed the pandas library yet, install it from the command line with pip install pandas.
Then import the library with import pandas as pd.
pd is the widely accepted shorthand for pandas. Of course, you can use your own alias, but make sure you don’t get confused by inventing new ones.
Now, to read a file in Python we use the pandas method read_csv (for csv files only) and store the result in a dataframe. So, what is a dataframe? You can think of a dataframe as a simple Excel sheet containing rows and columns, or as a SQL table with relevant column names.
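To make the spreadsheet comparison concrete, here is a minimal sketch that builds a small dataframe by hand. The column names and values are made up purely for illustration.

```python
import pandas as pd

# Made-up data: each dictionary key becomes a column,
# and each list holds the values of that column.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "score": [82, 91, 77],
    }
)

print(df)                # rows and columns, much like a small spreadsheet
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column names
```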
To read a file, we pass its path to read_csv in a single line of code.
However, when the path is written with single backslashes, as Windows paths usually are, this throws an error.
Because the backslash “\” is an escape character in Python, we fix this by writing each backslash twice, “\\”.
Alternatively, we can use a raw string instead of doubling the backslashes, i.e. we simply add the character ‘r’ in front of the path string.
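The two fixes can be compared in plain Python, without reading any file. This is only a sketch: the path is hypothetical and was chosen because "\t" and "\n" are valid escape sequences, so the broken version is silently altered rather than rejected. Other sequences, such as a backslash followed by U, make Python refuse the string outright with a SyntaxError.

```python
# Hypothetical Windows path, chosen only to show the problem.
# In an ordinary string, "\t" is read as a tab and "\n" as a newline.
broken = "C:\temp\new_data.csv"
print("\t" in broken, "\n" in broken)  # True True

# Fix 1: double every backslash.
escaped = "C:\\temp\\new_data.csv"

# Fix 2: prefix the string with r to make it a raw string.
raw = r"C:\temp\new_data.csv"

print(escaped == raw)  # True
print(raw)             # C:\temp\new_data.csv

# Either form can then be passed to pandas, for example:
# df = pd.read_csv(raw)
```

Both fixes produce exactly the same string, so the choice between them is a matter of readability.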
Once we have read the file, we can check how the data is arranged in the dataframe using the head method. By default, head displays the first 5 rows of the dataframe.
Here, we find that all the columns have been placed in a single column, which means the file was read incorrectly. The file uses a semicolon as the delimiter between columns, while read_csv expects a comma by default. To fix this, we pass the “sep” parameter to the read_csv method so that the columns are split on the delimiter actually used in the file.
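The effect of the sep parameter can be reproduced with a tiny file. The sketch below writes a made-up semicolon-delimited file to a temporary folder (the file name and contents are assumptions for illustration) and reads it with and without sep.

```python
import os
import tempfile

import pandas as pd

# Write a small semicolon-delimited file to a temporary folder (made-up data).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("id;city;sales\n1;Pune;250\n2;Oslo;310\n")

# The default separator is a comma, so each line lands in one single column.
wrong = pd.read_csv(path)
print(wrong.head())
print(wrong.shape)   # (2, 1)

# Telling read_csv about the semicolon splits each line into three columns.
right = pd.read_csv(path, sep=";")
print(right.head())
print(right.shape)   # (2, 3)
```

Checking the shape alongside head is a quick way to confirm that the number of columns matches what you expect from the file.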
In most cases the above way of reading files works fine, but there are situations where we get an error instead (typically a UnicodeDecodeError). This happens when the file uses a different type of encoding.
By default, read_csv expects ‘utf-8’ encoding, and if the file deviates from this it simply throws that error. Fortunately, we can fix this issue easily.
We do so with the encoding parameter of read_csv, which states the encoding used in the file.
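As a rough illustration, the sketch below creates a small file saved as latin-1 (an assumption made for this example, as are the file name and contents), shows that the default read fails, and then reads it again with the encoding parameter.

```python
import os
import tempfile

import pandas as pd

# Create a small file that is NOT utf-8: it is saved as latin-1 (made-up data).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "latin1_sample.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("city,drink\nSão Paulo,café\n")

# Reading with the default utf-8 encoding fails on the accented characters.
default_failed = False
try:
    pd.read_csv(path)
except UnicodeDecodeError as err:
    default_failed = True
    print("Default read failed:", err)

# Naming the actual encoding lets pandas decode the file correctly.
df = pd.read_csv(path, encoding="latin-1")
print(df)
```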
Now, the big challenge is to identify the encoding of a file. Unfortunately, we will not be able to identify every encoding, but we can identify some of the most frequently used ones using a Python package called chardet.
To install chardet, run pip install chardet.
Import the package with import chardet.
Then run a short code snippet to identify the file encoding.
The code opens the file and reads it as bytes; the chardet package then reads the first hundred bytes and determines the encoding used in the file. Printing the variable “Encoding_Details” displays the result.
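The steps just described could look roughly like the sketch below. The temporary latin-1 file is made up so that the example runs on its own; the variable name Encoding_Details follows the text above, and chardet.detect is the chardet function assumed here for the detection step.

```python
import os
import tempfile

import chardet

# Made-up sample file saved as latin-1, so there is something to detect.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "unknown_encoding.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("city,drink\nSão Paulo,café\n" * 5)

# Open in binary mode ("rb") to get raw bytes,
# then let chardet guess the encoding from the first 100 bytes.
with open(path, "rb") as f:
    Encoding_Details = chardet.detect(f.read(100))

# A dictionary that includes the keys "encoding" and "confidence".
print(Encoding_Details)
```

The result is a guess, not a guarantee, and the "encoding" value can be None when chardet cannot decide. When a name is returned, it can be passed to the encoding parameter of read_csv, and reading more than 100 bytes may help if the guess looks wrong.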
In this tutorial we learned how to read files, handle path issues and encoding issues, validate the dataframe format, and detect the encoding of a file. This is basic stuff, but it should really help pandas beginners.
Follow my future blogs for more tips on reading files with the pandas library, covering different kinds of errors and different ways to handle them. Keep Learning …