Importing data using Python
December 13, 2023
In today's world, a huge amount of data is generated by a wide variety of devices, and its format ranges from flat files to tabular structures. In this post, we will look at Python packages for importing data, covering techniques for the following file types:
- Flat files: .txt, .csv
- Pickled files
- Excel files
- SAS files
- Stata files
- HDF5 files
- MATLAB .mat files
- Relational databases
- Reading data from the web
Let's get started with flat files.
Flat Files
A flat file is a file where data is stored in plain text format. Some of the formats of flat files are .txt (text file), .csv (comma-separated file), etc. Let's see how to read data from a text file.
To read data from a text file there are three steps:
- Open the file
- Read data from the file
- Close the file
Code demonstrating reading data from the text file:
file = open('sample_file.txt', mode='r')  # 'r' stands for read mode
print(file.readline()) # Reading 1st line of file
file.close()
A better approach is to read the file with a context manager; with this approach, we don't have to worry about closing the file after reading from it.
with open('sample_file.txt') as file:
    print(file.readline())  # Reading 1st line of file
You can try this with the file sample_file.txt.
We can read a .csv file using NumPy or pandas. Here I will demonstrate both ways to read .csv files.
Using numpy array:
NumPy provides two different functions that we can use to read .csv files:
loadtxt()
recfromcsv()
Let's see how to read data from files using these two functions. Refer to the code shown below:
import numpy as np
file = 'population.csv'
population = np.recfromcsv(file, delimiter=',')  # Record array with field names taken from the header row
population_data = np.loadtxt(file, delimiter=',', dtype=float, skiprows=1)  # Plain numeric array, header row skipped
Here we use the skiprows option because the first row of the dataset contains the headers.
You can try with the population.csv file.
Using pandas:
If you have been following along, we have already imported a .csv file using pandas. If not, don't worry, I will demonstrate it one more time. Here we will go one step further and convert the resulting pandas DataFrame to a NumPy array. Let's see this in action.
import pandas as pd
file = 'population.csv'
data = pd.read_csv(file, nrows=5)  # Read the first 5 rows
data_array = data.values           # Convert the DataFrame to a NumPy array
print(type(data_array))
# Output: <class 'numpy.ndarray'>
Pickled file
The pickle file format is associated with the Python pickle module, which implements algorithms for serializing and de-serializing a Python object structure.
The pickle file stores the byte stream representing objects. Let's first understand how to create a pickled file and then we will see how to import a pickled file.
import pickle
# Writing to a pickled file
a = ['Python', 'Pickled', 'file']
file_Name = "pickledDemo"
fileObject = open(file_Name, 'wb')
pickle.dump(a, fileObject)
fileObject.close()
# Reading from the pickled file
fileObject = open(file_Name, 'rb')
b = pickle.load(fileObject)
print(b)
fileObject.close()
Now let's see how to import pickle files using a context manager approach:
import pickle
file_Name = "pickledDemo"
with open(file_Name, 'rb') as file:
    text = pickle.load(file)
print(text)
You can try with the pickledDemo file.
Excel files
If you have used Microsoft Office, you are probably familiar with Excel files. An Excel file looks very similar to a table. Refer to the example shown below to import an Excel file using the pandas package:
import pandas as pd
file = 'sampleFile.xlsx'
xl = pd.ExcelFile(file)
# Sheet names in the Excel file
print(xl.sheet_names)
# Load a sheet into a DataFrame by name
df = xl.parse('Sheet1')
print(df.head())
# Load a sheet into a DataFrame by index
df_index = xl.parse(0)
# Print the head of the DataFrame df_index
print(df_index.head())
You can test with sampleFile.xlsx
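As an alternative to pd.ExcelFile, pandas also provides a read_excel() function that loads a sheet straight into a DataFrame. A minimal sketch using the same sampleFile.xlsx and sheet name as above:
import pandas as pd
# Read a single sheet of the workbook directly into a DataFrame
df = pd.read_excel('sampleFile.xlsx', sheet_name='Sheet1')
print(df.head())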
SAS files
SAS files are generated by SAS (Statistical Analysis System) software.
This software is widely used for data modeling and analytics, which makes it popular among data scientists. We can import SAS files in Python using the sas7bdat package. Let's look at the code to import a SAS file:
from sas7bdat import SAS7BDAT
with SAS7BDAT('test.sas7bdat') as file:
    df_sas = file.to_data_frame()
print(df_sas.head())
You can test with the test.sas7bdat file.
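pandas also has a read_sas() function, so the same file can be imported without the sas7bdat package. A minimal sketch, assuming the same test.sas7bdat file:
import pandas as pd
# Read the SAS file directly into a DataFrame
df_sas = pd.read_sas('test.sas7bdat', format='sas7bdat')
print(df_sas.head())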
Stata files
There are multiple tools that can produce a .dta file, but the most common is the Stata data analysis software, which stores its datasets in .dta files.
The way to import a .dta file using Python is as shown below:
import pandas as pd
df = pd.read_stata('profits.dta')
print(df.head())
You can test with the profits.dta file.
HDF5 files
HDF5 is a technology for managing extremely large and complex data files. An HDF5 file contains multidimensional arrays of scientific data. HDF5 files are commonly used in aerospace, physics, engineering, finance, etc. Let's see how we can import an HDF5 file using the h5py package:
import numpy as np
import h5py
file = 'test_data.hdf5'
data = h5py.File(file, 'r')
for key in data.keys():
    print(key)
group = data['testing']
for key in group.keys():
    print(key)
You can try with test_data.hdf5.
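Once you know the group and dataset names, you can read a dataset into a NumPy array. A minimal sketch, assuming the 'testing' group contains a dataset named 'values' (a hypothetical name):
import numpy as np
import h5py
with h5py.File('test_data.hdf5', 'r') as data:
    # 'values' is a hypothetical dataset name inside the 'testing' group
    array = np.array(data['testing']['values'])
    print(array.shape)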
.mat files
This type of file is generated by the MATLAB program.
To import a .mat file using the scipy.io package, refer to the code shown below:
import scipy.io
mat = scipy.io.loadmat('test_file.mat')
# Print the keys of the MATLAB file
print(mat.keys())
You can try with test_file.mat.
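The object returned by loadmat() is a dictionary mapping MATLAB variable names to NumPy arrays (keys beginning with '__' hold file metadata). A minimal sketch of extracting one variable, where the name 'x' is hypothetical:
import scipy.io
mat = scipy.io.loadmat('test_file.mat')
# 'x' is a hypothetical variable name stored in the .mat file
x = mat['x']
print(type(x), x.shape)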
Relational databases
A relational database consists of a set of tables that store data in rows and columns, and relationships can be established between the columns of different tables.
Popular relational databases:
- PostgreSQL
- MySQL
- SQLite
I won't be covering the basics of how relational databases work; the assumption is that you have some knowledge of at least one of them.
Let's connect to a SQLite database and import a table into a dataset.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///localhost.database') #connection string to database
table_names = engine.table_names()
print(table_names)
Here we are using the sqlalchemy package to connect to the SQLite database. The code above lists all the tables in our database.
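Note that engine.table_names() has been deprecated and removed in recent SQLAlchemy releases. If you are on SQLAlchemy 1.4 or newer, a minimal sketch using the inspection API instead:
from sqlalchemy import create_engine, inspect
engine = create_engine('sqlite:///localhost.database')
# List the table names via the inspection API (SQLAlchemy 1.4+)
print(inspect(engine).get_table_names())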
Let's see how we can import data using SQL queries:
from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
with engine.connect() as con:
    rs = con.execute(text("SELECT LastName, Title FROM Employee"))
    df = pd.DataFrame(rs.fetchmany(size=3))  # Fetch only the first 3 rows
    df.columns = rs.keys()
    # Use rs.fetchall() instead to fetch every row of the result set
print(df.head())
It is recommended to use a context manager to connect to the database; doing so removes the need to close the database connection manually.
In the approach shown above, we query the database by creating an engine, connecting to it, running the query, and storing the results in pandas. pandas provides a more concise way to query the database:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
df = pd.read_sql_query("select * from Album", engine)
print(df.head())
pandas provides the read_sql_query() function, with which we can query the database and store the results in a DataFrame directly.
Reading data from the web
Python packages make it possible to read a file from the web.
There are two ways to create our datasets from files on the web:
- Download the file from the web to the local system and create the dataset from it
- Create the dataset directly, without storing the file on the local system
Firstly, let's see how we can create a dataset by downloading a file to our local system:
from urllib.request import urlretrieve
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save file locally
urlretrieve(url, 'test.csv')
df = pd.read_csv('test.csv', sep=';')
print(df.head())
We can also avoid saving the file to the local system and read it directly from the URL. Let's see how to create a dataset without downloading the file:
import pandas as pd
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
df = pd.read_csv(url, sep=';')
print(df.head())
NOTE: We can also read an Excel file from the web using the pd.read_excel() function. Passing sheet_name=None returns a dictionary of DataFrames keyed by sheet name.
The requests and urllib packages:
Using the urllib and requests packages, we can read files directly from the web.
Firstly, let's see the way to import files from the web using the urllib package:
from urllib.request import urlopen, Request
url = "https://www.techladder.in"
request = Request(url, headers={'User-Agent' : "Magic Browser"})
response = urlopen(request)
html = response.read()
print(html)
response.close()
We can perform the same operations with the requests package that we performed with the urllib package. requests is among the most downloaded Python packages. Let's see it in action:
import requests
url = "https://www.techladder.in"
r = requests.get(url)
text = r.text
print(text)
BeautifulSoup
The raw HTML we retrieved above is not very useful on its own. To make the most of it, we need to extract structured data from that HTML. To do so, we use the BeautifulSoup package:
import requests
from bs4 import BeautifulSoup
url = 'https://www.techladder.in'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, "lxml")
# Prettify the HTML DOM
pretty_soup = soup.prettify()
print(pretty_soup)
# Extract page title
page_title = soup.title
print(page_title)
#Extract all the text from the page
page_text = soup.get_text()
print(page_text.replace('\n\n','\n'))
# Find all a tags on page
a_tags = soup.find_all('a')
for link in a_tags:
    print(link.get('href'))
Read API response as JSON
JSON is commonly used for communication between a server and a browser. A JSON document is a collection of key-value pairs. Let's look at a sample API that returns its response as JSON.
import requests
url = 'https://reqres.in/api/users?page=1'
r = requests.get(url)
json_data = r.json()
print(json_data)
# Iterate over the keys and values of the JSON response
for k in json_data.keys():
    print(k + ': ', json_data[k])
NOTE: We use the json.load() function to read a JSON file stored on the local system.
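A minimal sketch, assuming a local file named data.json (the file name is hypothetical):
import json
# 'data.json' is a hypothetical JSON file on the local system
with open('data.json') as f:
    json_data = json.load(f)
print(json_data)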
Now we are familiar with techniques for loading various types of files into datasets. In the next post, I will explain the techniques that data scientists use to clean the data.