DS2002 Data Science Systems

Course materials and documentation for DS2002

View the Project on GitHub ksiller/ds2002-course

Working with file formats and cleaning data

The goal of this activity is to familiarize you with working with various file formats (JSON, CSV, TSV, XML, SQL) and cleaning data. These skills are essential for processing real-world data, handling different data structures, and preparing data for analysis. They also lay the foundation for working with databases.

Note: Work through the examples below in your terminal (Codespace or local), experimenting with each command and its various options. If you encounter an error message, don’t be discouraged—errors are learning opportunities. Reach out to your peers or instructor for help when needed, and help each other when you can.

In-class Exercises

Let’s dive into a real-world case study. Start with Lab 04: Data Formats - ETL Pipeline Basics

Additional Practice

Working with csv and tsv files

CSV (Comma-Separated Values) files use commas to separate columns:

name,age,city
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Chicago

TSV (Tab-Separated Values) files use tabs to separate columns:

name	age	city
Alice	25	New York
Bob	30	San Francisco
Charlie	35	Chicago

Exercise: Downloading and Processing CSV Data

In this exercise, we’ll download the 2018 Central Park Squirrel Census data and perform various operations using Linux CLI commands.

Step 1: Download the CSV file

Download the squirrel census data and save it as squirrel-census.csv:

curl -o squirrel-census.csv https://data.cityofnewyork.us/api/views/ej9h-v6g2/rows.csv

Explanation: The curl command fetches the file from the URL. The -o flag specifies the output filename.

Step 2: View the column headers

Display the first line (header row) of the CSV file:

head -n 1 squirrel-census.csv

Step 3: Count the number of rows

Count the total number of rows in the file:

wc -l squirrel-census.csv

The wc -l command counts the number of lines in the file. Note: This includes the header row, so subtract 1 for the actual data row count.
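The subtraction can be scripted directly. A minimal sketch on a throwaway sample file (the `sample.csv` name and contents are made up for illustration; substitute squirrel-census.csv):

```shell
# Create a toy CSV: 1 header line + 3 data rows
printf 'name,age\nAlice,25\nBob,30\nCarol,35\n' > sample.csv

# Count all lines, then subtract 1 for the header row
total=$(wc -l < sample.csv)
echo $(( total - 1 ))   # prints 3
```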

Step 4: Count the number of columns

Count the number of columns by counting commas in the header row:

head -n 1 squirrel-census.csv | tr ',' '\n' | wc -l

Approach: tr replaces each comma in the header with a newline so that every column name sits on its own line, and wc -l then counts those lines. The result is the number of columns (one more than the number of commas). Note that this simple count is thrown off if a header field itself contains a quoted comma.
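An equivalent count can come from awk's built-in field counter NF; a small sketch on a toy header line (the same caveat about quoted commas applies):

```shell
# NF is the number of comma-separated fields on the line
printf 'name,age,city\n' | awk -F',' '{print NF}'   # prints 3
```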

Step 5: Filter rows by index and save to a new file

Extract rows 1-100 (including header) and save to a new file:

head -n 101 squirrel-census.csv > squirrel-census-sample.csv

The head -n 101 command gets the first 101 lines (1 header + 100 data rows), and > redirects the output to a new file.

Step 6: Extract specific rows by index

Extract rows 50-100 (excluding header) and save to a new file:

sed -n '51,101p' squirrel-census.csv > squirrel-census-rows50-100.csv

The sed -n '51,101p' command prints lines 51 through 101 of the file, which correspond to data rows 50-100 because line 1 is the header. The -n flag suppresses sed's default output so that only the selected range is printed.

Step 7: Extract the first three columns using cut

Extract the first three columns from the original file:

cut -d',' -f1,2,3 squirrel-census.csv > squirrel-census-first3cols.csv

Approach: cut splits each line on the delimiter given by -d (a comma here) and -f1,2,3 keeps only fields 1 through 3. Like tr, cut is unaware of CSV quoting, so fields containing quoted commas will be split incorrectly.

Step 8: Filter rows based on content and extract columns

Filter rows containing a specific value in the first column, then extract the first three columns:

grep "AM" squirrel-census.csv | cut -d',' -f1,2,3 > squirrel-census-filtered.csv

Approach: grep keeps only the lines that contain "AM" anywhere on the line (not just in the first column), and cut then trims each surviving line to its first three fields.
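Because grep matches the pattern anywhere in the line, it can also keep unintended rows (for example, a value like "TEAM" contains "AM"). A sketch of an exact first-column match using awk, on a toy file made up for illustration:

```shell
printf 'shift,age\nAM,25\nPM,30\nTEAM,28\n' > sample.csv

# $1 == "AM" requires the entire first field to equal AM,
# so the TEAM row is excluded
awk -F',' '$1 == "AM"' sample.csv   # prints only: AM,25
```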

Step 9: Combine filtering and column extraction

Extract rows 1-50 and get only the first three columns:

head -n 51 squirrel-census.csv | cut -d',' -f1,2,3 > squirrel-census-sample-first3cols.csv

This combines head to limit rows and cut to limit columns, creating a smaller subset of the data.

Convert CSV to TSV

Now let’s convert the squirrel census CSV file to TSV format. We’ll use the squirrel-census.csv file we downloaded earlier and convert it to tab-separated format.

Method 1: Using tr

Replace commas with tabs:

tr ',' '\t' < squirrel-census.csv > squirrel-census.tsv

The tr command translates characters. ',' is replaced with '\t' (tab character). The < redirects input from the CSV file, and > redirects output to the TSV file.
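One caveat worth demonstrating: tr replaces every comma, including commas that are data inside quoted fields. A toy example showing the corruption:

```shell
# The quoted field "New York, NY" contains a comma that is data, not a delimiter
printf 'name,city\nAlice,"New York, NY"\n' | tr ',' '\t'
# The quoted comma is also replaced, splitting one field into two
```

For files with quoted fields, prefer a CSV-aware tool such as the pandas method below.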

Method 2: Using sed

Replace commas with tabs using sed:

sed 's/,/\t/g' squirrel-census.csv > squirrel-census.tsv

The sed command performs stream editing. s/,/\t/g substitutes commas with tabs globally (all occurrences in each line).

Method 3: Using awk

Convert CSV to TSV using awk:

awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' squirrel-census.csv > squirrel-census.tsv

Approach: in the BEGIN block, FS and OFS set the input and output field separators. The $1=$1 assignment forces awk to rebuild each record using the new output separator before printing it.

Method 4: Using Python with string replace

Create a Python script csv_to_tsv_simple.py:

#!/usr/bin/env python3
import sys

# Get input and output filenames from command line arguments
if len(sys.argv) != 3:
    print("Usage: python3 csv_to_tsv_simple.py <input.csv> <output.tsv>")
    sys.exit(1)

input_file = sys.argv[1]
output_file = sys.argv[2]

# Read CSV file as text and convert to TSV
with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
    for line in infile:
        # Replace commas with tabs
        tsv_line = line.replace(',', '\t')
        outfile.write(tsv_line)

Run the script:

python3 csv_to_tsv_simple.py squirrel-census.csv squirrel-census.tsv

Explanation: The script reads the CSV line by line and replaces every comma with a tab using plain string replacement. Like the tr and sed methods, it does not understand CSV quoting, so commas inside quoted fields are replaced as well.

Method 5: Using Python with pandas

Create a Python script csv_to_tsv_pandas.py:

#!/usr/bin/env python3
import pandas as pd
import sys

# Get input and output filenames from command line arguments
if len(sys.argv) != 3:
    print("Usage: python3 csv_to_tsv_pandas.py <input.csv> <output.tsv>")
    sys.exit(1)

input_file = sys.argv[1]
output_file = sys.argv[2]

# Read CSV file
df = pd.read_csv(input_file)

# Write as TSV (tab-separated)
df.to_csv(output_file, sep='\t', index=False)

print(f"Conversion complete! {input_file} -> {output_file}")

Run the script:

python3 csv_to_tsv_pandas.py squirrel-census.csv squirrel-census.tsv

Explanation: pandas parses the CSV properly, including quoted fields with embedded commas, then writes it back out with sep='\t'. The index=False argument prevents pandas from adding a row-index column to the output. This is the most robust of the five methods.

Verify the conversion:

Check that the TSV file was created correctly:

head -n 3 squirrel-census.tsv

Explanation: View the first 3 lines to verify the tabs are in place. You can also use cat -A to see tab characters as ^I:

head -n 1 squirrel-census.tsv | cat -A

Working with JSON files

The two main tools here are the jq command-line utility and, in Python, the requests and json packages. Both are exercised in the JSON sections below.

Data Cleaning Exercises

In this section, we’ll practice cleaning the squirrel census CSV data using Linux CLI commands. Data cleaning is an essential step before analysis to ensure data quality and consistency.

Exercise 1: Identify and count empty fields

Find rows with empty values in a specific column:

awk -F',' '$2 == "" {print NR, $0}' squirrel-census.csv | head -20

Explanation: awk splits each line on commas (-F','), tests whether field 2 is empty, and prints the line number (NR) followed by the full row ($0) for each match; head limits the output to the first 20 hits. Change $2 to inspect a different column.

Exercise 2: Remove duplicate rows

Identify and remove duplicate rows:

# First, identify duplicates
sort squirrel-census.csv | uniq -d | head -10

Explanation: sort groups identical rows together (a requirement for uniq, which only compares adjacent lines), and uniq -d prints only the rows that occur more than once. To actually remove duplicates, pipe through uniq without -d and redirect to a new file; note that sorting also moves the header row.

Exercise 3: Filter rows by date range

Extract rows where a date column falls within a specific range:

awk -F',' '$1 >= "2018-10-01" && $1 <= "2018-10-31" {print}' squirrel-census.csv > squirrel-census-october.csv

Explanation: awk compares the date strings lexicographically, which only gives correct results when the dates are zero-padded in a sortable format such as YYYY-MM-DD. Adjust the date column index ($1) and the date range to match your data; the squirrel census stores its date in a different column and format, so treat this command as a template.

Exercise 4: Count unique values in a column

Count how many unique values exist in a specific column:

cut -d',' -f5 squirrel-census.csv | sort | uniq | wc -l

Explanation: cut extracts column 5, sort groups identical values together (required for uniq to work), uniq collapses adjacent duplicates, and wc -l counts the distinct values that remain.
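A closely related idiom yields a frequency table rather than a bare count; a sketch on a toy value list:

```shell
printf 'oak\npine\noak\nmaple\noak\n' > trees.txt

# uniq -c prefixes each distinct value with its count;
# sort -rn then ranks the values by frequency, most common first
sort trees.txt | uniq -c | sort -rn
# oak appears first, with a count of 3
```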

Exercise 5: Replace specific values

Replace a specific value throughout the file:

sed 's/OldValue/NewValue/g' squirrel-census.csv > squirrel-census-replaced.csv

Explanation: sed substitutes all occurrences of “OldValue” with “NewValue” globally (g flag).
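sed, as used above, replaces the value wherever it appears on the line. To restrict the replacement to a single column, awk can target one field; a sketch on a toy file (the column number and values are made up for illustration):

```shell
printf 'id,status,note\n1,old,old-style\n2,new,keep\n' > rows.csv

# Replace field 2 only when it equals "old" exactly;
# the "old" inside field 3 ("old-style") is left untouched
awk -F',' 'BEGIN { OFS="," } $2 == "old" { $2 = "new" } { print }' rows.csv
```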

Additional Resources:

For more advanced data cleaning techniques, explore these Kaggle tutorials:

  1. Handling Missing Values
  2. Scaling and Normalization
  3. Parsing Dates
  4. Character Encoding
  5. Inconsistent Data

JSON - JavaScript Object Notation

Let’s move on to JSON files.

Some iterations to try using the jq tool in the command-line:

Filter the mock_data.json file containing “flat”, non-nested data.

cd /root/course/01-data/
cat mock_data.json
cat mock_data.json | jq -r .[]
cat mock_data.json | jq -r .[] | jq ."dob"
cat mock_data.json | jq -r .[] | jq ."dob" | grep "1998"
cat mock_data.json | jq -r .[] | jq ."dob" | grep "1998" | wc -l

Filter the mock_data_nested.json file containing nested data.

cd /root/course/01-data/
cat mock_data_nested.json
cat mock_data_nested.json | jq ."healthChecks"
cat mock_data_nested.json | jq ."healthChecks" | jq .[]."delaySeconds"
cat mock_data_nested.json | jq ."healthChecks" | jq -r .[]."delaySeconds"
cd /root/course/01-data/
cat mock_data_nested.json | jq ."container"
cat mock_data_nested.json | jq ."container" | jq ."volumes"
cat mock_data_nested.json | jq ."container" | jq ."volumes" | jq -r .[]."hostPath"

Using Python to Parse JSON

The above example can easily be implemented in Python.

Create a Python script extract_hostpath.py:

#!/usr/bin/env python3
import json

with open('mock_data_nested.json', 'r') as f:
    data = json.load(f)

for volume in data['container']['volumes']:
    print(volume['hostPath'])

Run the script:

python3 extract_hostpath.py

In the jq examples above, notice the -r flag, which toggles "raw" output (plain strings) versus the default quote-wrapped JSON output.

Explore jq play for more lessons, inputs, filters, etc.

csv

cat mock_data.csv

Note that the 6 columns are separated by 5 commas. Fields that must contain a comma should be quote-enclosed.

tsv

Tab-separated files work like CSV files but use tabs as the delimiter. Tabs can fool the naked eye, since they often look like ordinary spaces, and can throw off import processes when stray tabs end up inside data fields.
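Stray tabs can be caught by flagging lines whose field count deviates from the expected number. A sketch, assuming three columns in a toy file:

```shell
# The Bob line has a stray tab inside the city field (4 fields instead of 3)
printf 'name\tage\tcity\nAlice\t25\tNew York\nBob\t30\tSan\tFrancisco\n' > rows.tsv

# Print the line number and content of any line without exactly 3 fields
awk -F'\t' 'NF != 3 { print NR": "$0 }' rows.tsv
```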

To convert TSV to CSV, or vice versa, use a text search+replace function such as awk, tr, or a good IDE/text editor:

tr

tr '\t' ',' < file.tsv > file.csv

sed

sed 's/'$'\t''/,/g' file.tsv > file.csv

awk

awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv

xml

Structured data. Note that every record, and every data field within each record, is fully wrapped in markup that is opened and closed:

<dataset>
  . . .
  <record>
    <id>97</id>
    <first_name>Tamarra</first_name>
    <last_name>Jeannaud</last_name>
    <email>tjeannaud2o@fema.gov</email>
    <ip_address>26.106.176.174</ip_address>
    <dob>11/19/1981</dob>
  </record>
  <record>
    <id>98</id>
    <first_name>Korney</first_name>
    <last_name>Hazlegrove</last_name>
    <email>khazlegrove2p@wsj.com</email>
    <ip_address>218.117.101.96</ip_address>
    <dob>01/06/1981</dob>
  </record>
  . . .
</dataset>
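For quick, informal inspection, values can be pulled out of such regular, one-tag-per-line XML with grep and sed. A real XML parser is more robust, but this fits the CLI workflow used here; a toy sketch:

```shell
printf '<record>\n  <email>tjeannaud2o@fema.gov</email>\n</record>\n' > sample.xml

# Grab the <email> elements, then strip the surrounding tags
grep -o '<email>[^<]*</email>' sample.xml | sed 's/<[^>]*>//g'
```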

sql

cat, head, and tail the SQL snippet. Notice that each line is a standalone query: the file is not a single bulk INSERT statement (not properly) but a concatenated series of independent SQL statements. This is a best practice because any line that triggers a failure can be identified easily, and all inserts before it will already have succeeded.

INSERT INTO mock_data_tbl (id, first_name, last_name, email, ip_address, dob) VALUES (1, 'Berkley', 'Annon', 'bannon0@accuweather.com', '193.95.255.138', '10/20/1991');
INSERT INTO mock_data_tbl (id, first_name, last_name, email, ip_address, dob) VALUES (2, 'Doro', 'Morse', 'dmorse1@moonfruit.com', '170.67.183.172', '12/01/1995');
INSERT INTO mock_data_tbl (id, first_name, last_name, email, ip_address, dob) VALUES (3, 'Charmain', 'Halden', 'chalden2@europa.eu', '170.112.37.136', '03/03/1982');
INSERT INTO mock_data_tbl (id, first_name, last_name, email, ip_address, dob) VALUES (4, 'Allissa', 'Wakefield', 'awakefield3@usgs.gov', '23.46.25.161', '10/05/1988');
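Because every statement stands alone, simple line-oriented tools can audit the file; a toy sketch counting the independent INSERT statements (the `mock.sql` filename and contents are just for illustration):

```shell
printf 'INSERT INTO t VALUES (1);\nINSERT INTO t VALUES (2);\n' > mock.sql

# One statement per line, so counting matching lines counts statements
grep -c '^INSERT' mock.sql   # prints 2
```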

Working with remote data

Fetch Remote Data using curl and jq

Replace USER with your GitHub username

curl https://api.github.com/users/USER/events

Scroll through events

curl https://api.github.com/users/nmagee/events | jq .[] | less

Filter out values:

curl https://api.github.com/users/nmagee/events | jq .[].id
curl https://api.github.com/users/nmagee/events | jq .[].payload.commits
curl https://api.github.com/users/nmagee/events | jq .[].payload.commits | jq .[].message

Format output:

curl 'https://api.github.com/repos/stedolan/jq/commits'
curl 'https://api.github.com/repos/stedolan/jq/commits' \
   | jq '.[] | {message: .commit.message, name: .commit.committer.name}'

There are plenty of other examples in the tutorial at https://stedolan.github.io/jq/tutorial/

Tools for Retrieving Data

curl - a common Linux-based tool to fetch raw files. You've been using it in the exercises above.

curl https://www.virginia.edu/ > index.html

wget - another common Linux-based tool, similar to curl.

wget https://www.virginia.edu/

http - runs the HTTPie tool to fetch web resources:

http --head https://www.virginia.edu/
http --body https://www.virginia.edu/

Windows 10 and above come with curl.exe installed:

# example 1
curl.exe --output index.html --url https://superuser.com
# example 2
curl.exe -o index.html https://superuser.com

Working with San Francisco Airport Passenger Data

In this exercise, we’ll work with the San Francisco Airport Passenger Statistics dataset. The data is available as JSON from the Socrata Open Data API.

Download the data:

curl -o airports.json https://data.sfgov.org/api/views/rkru-6vcg/rows.json

Exercise 1: View basic information with jq

View the dataset name:

jq '.meta.view.name' airports.json

Explanation: The .meta.view.name path accesses the name field in the nested JSON structure.

Exercise 2: Count the number of records

Count how many data records are in the file:

jq '.data | length' airports.json

Explanation: the .data path selects the array of records, and piping it to the length function returns the number of elements in that array.

Exercise 3: View a sample record

Look at the first data record:

jq '.data[0]' airports.json

Explanation: .data[0] accesses the first element (index 0) of the data array.

Exercise 4: Extract a specific field

Extract the operating airline from the first 10 records:

jq -r '.data[0:10][] | .[10]' airports.json

Explanation: .data[0:10] slices the first 10 records, [] iterates over them, and .[10] picks the element at index 10 of each record array, which in this dataset holds the operating airline. The -r flag prints the values without quotes. Column positions in Socrata exports can shift, so verify the index against the column list under .meta.view before relying on it.

Exercise 5: Using Python’s json package

Create a simple Python script parse_airports.py:

#!/usr/bin/env python3
import json

# Load the JSON file
with open('airports.json', 'r') as f:
    data = json.load(f)

# Print basic information
print("Dataset name:", data['meta']['view']['name'])
print("Number of records:", len(data['data']))

# View the first record
print("\nFirst record:")
print(data['data'][0])

# Extract operating airline from first 10 records
print("\nOperating airlines (first 10 records):")
for i in range(10):
    airline = data['data'][i][10]  # Index 10 contains operating airline
    print(f"  Record {i+1}: {airline}")

Run the script:

python3 parse_airports.py

Explanation: json.load parses the file into nested Python dictionaries and lists, after which the same paths used with jq (meta/view/name, the data array, index 10 of each record) are reachable with ordinary indexing.

Advanced Concepts (Optional)

Streaming Large Files

When working with very large files that don’t fit in memory, you need to process them in chunks rather than loading everything at once. Python’s pandas library supports chunking:

import pandas as pd

# Process CSV in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    process(chunk)

For JSON files, use streaming parsers like ijson:

import ijson

# Stream parse large JSON files
with open('large_file.json', 'rb') as f:
    parser = ijson.items(f, 'item')
    for item in parser:
        process(item)

Schema Validation

Validating data against a schema ensures data quality and catches errors early. For JSON, use JSON Schema:

import jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"}
    },
    "required": ["name", "age"]
}

# A record that satisfies the schema; validate() raises
# jsonschema.ValidationError if the instance does not conform
data = {"name": "Alice", "age": 25}
jsonschema.validate(instance=data, schema=schema)

For CSV files, consider using libraries like pandera or great_expectations for data validation.

Binary and Columnar Formats

For better performance with large datasets, consider binary, columnar formats such as Parquet, Feather, or ORC.

These formats offer smaller files through built-in compression, faster reads when only a subset of columns is needed, and preserved data types (no re-parsing of numbers and dates on load).

Example with Parquet:

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# Write to Parquet (requires the pyarrow or fastparquet engine)
df.to_parquet('data.parquet', compression='snappy')

# Read from Parquet
df = pd.read_parquet('data.parquet')

Resources

File Format Documentation

Tools and Libraries

Data Cleaning and Validation

API and Web Data