Skip to main content

Importing Datasets

Google Colaboratory

Key Concepts

  • Understand the Data
  • Importing and Exporting Data with Python
  • Getting Started Analyzing Data with Python
  • Python Packages for Data Science

Understanding the Data

Importing Datasets

Data is usually stored in plain formats that facilitate teh delivery and use of it. With python libraries like pandas the call and processing to a dataframe is eased while having a convenient way of manipulating the data like matrices and tensors in a fluid fashion with the flexible embedded functions.

Some basic/useful commands:

import pandas as pd
df = pd.read_csv("path_to_datafile") #can swap the "_csv" with _json, _excel, _hdf, _sql
df.describe(include = "all") # Returns a dataframe with basic statistics from the dataset columns like count, mean, std, min, max and quartiles
df.[['col1','col2']].describe()
df.info() # Provides a concise summary of the dataframe
df.head(n) # shows "n" first values
df.tail(n) # shows "n" last values
df.columns = headers # Strings arrays with names, ej. headers = [col1, col2, col3, col4, coln]
df.dropna(subset = ["col_name"], axis = 0)
df.to_csv("path_to_file_export") # exports dataframe to csv, can also swap to other data formats
df.dtypes # shows dataframe data types

Python Packages for Data Science

  • Scientific Computing:
    • Pandas (Data structures and tools)
    • NumPy (Arrays and Matrices)
    • SciPy (Integrals, solving differential equations, optimization)
  • Visulization:
    • Matplotlib (Plots and Graphs)
    • Seaborn (Plots, avanced (?))
  • Algorithmic:
    • Scikit-learn (Machine Learning, regression, classification)
    • Statsmodels

Basic Data Types

week_1_importing_datasets/screen_shot_20200906_at_192149.png

Connecting to Databases

Python can connect to databases and extract data using SQL API's. Basic functionality is shown:

week_1_importing_datasets/screen_shot_20200906_at_192629.png

A basic code template is like:

from dbmodule import connect

# Create connection object
connection = connect('database_name','username','password')

# Create a cursor object
cursor = connection.cursor()

# Run queries
cursor.execute('select * from table')
results = cursor.fetchall()

# Free resources
cursor.close()
connection.close()