Pandas

This tutorial introduces pandas, from beginner to advanced, in simple language. pandas is a powerful and versatile library for data manipulation and analysis in Python. The sections below cover its core concepts and functionality.

### Introduction to pandas

pandas is built on top of NumPy and provides easy-to-use data structures and data analysis tools. The two primary data structures in pandas are Series and DataFrame.

#### Installation

To install pandas, you can use pip:

```bash
pip install pandas
```

### Basics

#### Importing pandas

First, you need to import pandas to use it in your Python script:

```python
import pandas as pd
```

#### Series

A Series is a one-dimensional labeled array capable of holding any data type.

```python
import pandas as pd

# Creating a Series
s = pd.Series([1, 3, 5, 7, 9])
print(s)
```
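Because a Series is labeled, each value can be reached by its label as well as by its position. The sketch below assumes hypothetical labels `'a'` through `'e'`, which are not part of the example above.

```python
import pandas as pd

# Hypothetical labels 'a'..'e'; without an explicit index, pandas uses 0, 1, 2, ...
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])

print(s['c'])       # access by label
print(s.iloc[0])    # access by position
print(s[s > 4])     # boolean filtering keeps the labels
```

Filtering returns a new Series whose index still carries the original labels, which is what makes later alignment between Series possible.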

#### DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)
```

### Data Manipulation

#### Reading Data

pandas can read data from many sources, including CSV files, Excel workbooks, and SQL databases.

```python
# Reading data from a CSV file
df = pd.read_csv('path/to/file.csv')

# Reading data from an Excel file
# (.xlsx files need the openpyxl package installed)
df = pd.read_excel('path/to/file.xlsx')
```
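To try reading without a file on disk, `read_csv` also accepts a file-like object. The following is a minimal sketch that assumes a small in-memory CSV with made-up rows; the variable names `csv_text`, `people`, and `round_trip` are illustrative only.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical data)
csv_text = "Name,Age,City\nJohn,28,New York\nAnna,24,Paris\n"

# read_csv accepts file paths and file-like objects alike
people = pd.read_csv(io.StringIO(csv_text))
print(people)

# to_csv with no path returns the CSV as a string;
# index=False leaves out the row labels
round_trip = people.to_csv(index=False)
print(round_trip)
```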

#### Viewing Data

You can quickly view the top and bottom rows of the DataFrame using `head()` and `tail()`.

```python
print(df.head())  # First 5 rows
print(df.tail())  # Last 5 rows
```

#### Data Information

Get a summary of the DataFrame using `info()` and basic statistics using `describe()`.

```python
# info() prints its summary directly and returns None,
# so it does not need to be wrapped in print()
df.info()

print(df.describe())
```

### Indexing and Selecting Data

#### Selecting Columns

```python
# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])
```

#### Selecting Rows

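Use `loc` for label-based indexing and `iloc` for integer position-based indexing. A `loc` slice includes both endpoints, while an `iloc` slice excludes the end position, as in ordinary Python slicing.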

```python
# Select rows by label
print(df.loc[0])     # First row
print(df.loc[0:2])   # First three rows

# Select rows by position
print(df.iloc[0])    # First row
print(df.iloc[0:2])  # First two rows
```
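With the default index, labels and positions happen to coincide, which can hide the difference between `loc` and `iloc`. The sketch below assumes a hypothetical index of `10, 20, 30` so that the two disagree; `df_idx` is an illustrative name.

```python
import pandas as pd

# Hypothetical index labels chosen so labels and positions differ
df_idx = pd.DataFrame(
    {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 35]},
    index=[10, 20, 30],
)

print(df_idx.loc[10])          # row whose label is 10
print(df_idx.iloc[0])          # row at position 0 (the same row here)
print(df_idx.loc[10:20])       # label slice: both endpoints included
print(df_idx.iloc[0:2])        # position slice: end position excluded
print(df_idx.loc[20, 'Age'])   # single value by row label and column name
```

Here `df_idx.loc[0]` would raise a `KeyError`, because no row carries the label `0`.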

#### Conditional Selection

```python
# Select rows where Age is greater than 30
print(df[df['Age'] > 30])
```
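Conditions can be combined with `&` (and) and `|` (or), provided each condition is wrapped in parentheses. The following sketch rebuilds the small people table from earlier under the illustrative name `people`; the specific filters are made up for demonstration.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# AND: each condition needs its own parentheses
over_30_not_berlin = people[(people['Age'] > 30) & (people['City'] != 'Berlin')]

# OR
young_or_paris = people[(people['Age'] < 25) | (people['City'] == 'Paris')]

# Membership test against a list of values
in_cities = people[people['City'].isin(['Paris', 'London'])]

# query() expresses the same kind of filter as a string
queried = people.query("Age > 30 and City != 'Berlin'")

print(over_30_not_berlin)
print(young_or_paris)
print(in_cities)
```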

### Data Cleaning

#### Handling Missing Values

```python
df = pd.DataFrame({
    'A': [1, 2, None],
    'B': [4, None, 6],
    'C': [7, 8, 9]
})

# Check for missing values (count per column)
print(df.isnull().sum())

# Drop rows that contain any missing value
df_cleaned = df.dropna()
print(df_cleaned)

# Fill missing values with 0
df_filled = df.fillna(0)
print(df_filled)
```
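Filling with `0` or dropping every incomplete row are the bluntest options. As a sketch of two gentler alternatives, the code below fills each column with its own mean and drops rows only when a chosen column is missing; the names `df_missing`, `filled_mean`, and `dropped_a` are illustrative.

```python
import pandas as pd

df_missing = pd.DataFrame({
    'A': [1, 2, None],
    'B': [4, None, 6],
    'C': [7, 8, 9],
})

# Fill each column's gaps with that column's mean
filled_mean = df_missing.fillna(df_missing.mean())
print(filled_mean)

# Drop a row only when column 'A' is missing
dropped_a = df_missing.dropna(subset=['A'])
print(dropped_a)
```

Whether a mean is a sensible fill value depends on the data; it is shown here only as one common choice.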

#### Removing Duplicates

```python
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Linda'],
    'Age': [28, 24, 28, 32],
    'City': ['New York', 'Paris', 'New York', 'London']
})

# Drop duplicates
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
```
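By default, a row counts as a duplicate only when every column matches. The sketch below assumes a hypothetical table in which John appears twice with different ages, and uses `subset` and `keep` to decide what counts as a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical data: the two 'John' rows differ in Age
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Linda'],
    'Age': [28, 24, 29, 32],
    'City': ['New York', 'Paris', 'New York', 'London'],
})

# Flag rows whose Name has already appeared
flags = people.duplicated(subset=['Name'])
print(flags)

# Compare on Name only and keep the last occurrence
by_name_last = people.drop_duplicates(subset=['Name'], keep='last')
print(by_name_last)
```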

### Data Transformation

#### Adding/Removing Columns

```python
# Add a new column (one value per row, matching each row's City)
df['Country'] = ['USA', 'France', 'USA', 'UK']
print(df)

# Remove a column
df = df.drop('Country', axis=1)
print(df)
```

#### Renaming Columns

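```python
# Rename columns; rename() returns a new DataFrame, so the result is stored
# separately and df keeps 'Name' and 'Age' for the examples that follow
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})
print(df_renamed)
```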

#### Changing Data Types

```python
# Change data type of a column
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
```
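`astype` raises an error if any value cannot be converted. When a column may contain stray text, `pd.to_numeric` with `errors='coerce'` is a common alternative. The sketch below assumes a hypothetical column of age strings with one bad entry.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, plus one bad entry
raw = pd.Series(['28', '24', 'unknown', '32'])

# Values that cannot be parsed become NaN instead of raising an error
ages = pd.to_numeric(raw, errors='coerce')
print(ages)

# After dropping the NaN, the rest converts cleanly to integers
clean_ages = ages.dropna().astype(int)
print(clean_ages)
```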

### Aggregation and Grouping

#### Grouping Data

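```python
# Group by a column and calculate the mean of the numeric columns
# (numeric_only=True skips text columns such as 'Name', which would
# otherwise raise an error in pandas 2.x)
grouped = df.groupby('City').mean(numeric_only=True)
print(grouped)
```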

#### Aggregating Data

```python
# Apply multiple aggregate functions
agg_df = df.groupby('City').agg({
    'Age': ['mean', 'max'],
    'Name': 'count'
})
print(agg_df)
```
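Passing a dictionary of lists to `agg` produces two-level column names. Named aggregation is an alternative that yields flat, self-describing columns. The sketch below assumes a hypothetical table with two people per city; the output column names `mean_age`, `max_age`, and `n_people` are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two people per city
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'New York', 'Paris'],
})

# Each keyword is an output column: (input column, aggregation function)
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
    n_people=('Name', 'count'),
)
print(summary)
```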

### Merging and Joining

#### Merging DataFrames

```python
df1 = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35]
})

df2 = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'City': ['New York', 'Paris', 'Berlin']
})

# Merge DataFrames on a common column
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
```
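In the example above every name appears in both tables, so the join type makes no difference. The sketch below assumes hypothetical tables that only partly overlap, to show how the `how` parameter changes which rows survive.

```python
import pandas as pd

# Hypothetical tables: 'Peter' is only on the left, 'Linda' only on the right
left = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 35]})
right = pd.DataFrame({'Name': ['John', 'Anna', 'Linda'],
                      'City': ['New York', 'Paris', 'London']})

# Default (inner): only names present in both tables
inner = pd.merge(left, right, on='Name')

# Left: every row from the left table; missing cities become NaN
left_join = pd.merge(left, right, on='Name', how='left')

# Outer: every name from either table; indicator=True adds a '_merge'
# column recording where each row came from
outer = pd.merge(left, right, on='Name', how='outer', indicator=True)

print(inner)
print(left_join)
print(outer)
```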

#### Joining DataFrames

```python
df1 = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35]
}).set_index('Name')

df2 = pd.DataFrame({
    'City': ['New York', 'Paris', 'Berlin'],
    'Name': ['John', 'Anna', 'Peter']
}).set_index('Name')

# Join DataFrames on their index
joined_df = df1.join(df2)
print(joined_df)
```

### Time Series Data

#### Working with Dates

```python
# Assumes df has a 'Date' column holding date strings

# Convert a column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Set the date column as index
df = df.set_index('Date')
print(df)
```

#### Resampling

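```python
# Resample time series data to monthly totals
# (requires a datetime index; 'ME' is the month-end alias in pandas 2.2+,
# use 'M' on older versions)
resampled_df = df.resample('ME').sum()
print(resampled_df)
```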
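The two snippets above assume a DataFrame with a `Date` column that the tutorial has not defined. As a self-contained sketch, the code below builds a hypothetical `sales` table and resamples it by month. It uses the month-start alias `'MS'` rather than a month-end alias, purely so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical sales figures on four dates
sales = pd.DataFrame({
    'Date': ['2024-01-10', '2024-01-25', '2024-02-05', '2024-02-20'],
    'Sales': [100, 150, 200, 50],
})

# Parse the strings and use the dates as the index
sales['Date'] = pd.to_datetime(sales['Date'])
sales = sales.set_index('Date')

# Monthly totals, each labeled with the first day of its month
monthly = sales['Sales'].resample('MS').sum()
print(monthly)
```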

### Advanced Topics

#### Pivot Tables

```python
# Assumes df has 'Sales', 'Region', and 'Product' columns

# Create a pivot table
pivot_df = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')
print(pivot_df)
```
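To see what the pivot table produces, here is a small self-contained sketch. The `sales` table and its values are hypothetical; `fill_value=0` is an optional extra that replaces missing Region/Product combinations with zero.

```python
import pandas as pd

# Hypothetical sales records; East sold product A twice
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 200, 150, 50, 25],
})

# One row per Region, one column per Product, cells hold summed Sales
pivot = sales.pivot_table(values='Sales', index='Region',
                          columns='Product', aggfunc='sum', fill_value=0)
print(pivot)
```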

#### Applying Functions

```python
# Apply a function to each column
df = df.apply(lambda x: x * 2 if x.name == 'Sales' else x)
print(df)

# Apply a function to each row
df['Discount'] = df.apply(
    lambda row: row['Sales'] * 0.1 if row['Product'] == 'A' else row['Sales'] * 0.05,
    axis=1
)
print(df)
```
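Row-wise `apply` calls a Python function once per row, which can be slow on large tables. For a simple if/else rule like the discount above, a vectorized version with `numpy.where` is a common alternative. The sketch below assumes a hypothetical `orders` table and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Hypothetical orders
orders = pd.DataFrame({
    'Product': ['A', 'B', 'A'],
    'Sales': [100.0, 200.0, 50.0],
})

# Choose a rate per row in one vectorized step: 10% for product A, else 5%
rate = np.where(orders['Product'] == 'A', 0.10, 0.05)
orders['Discount'] = orders['Sales'] * rate

# Same rule written with row-wise apply, for comparison
row_wise = orders.apply(
    lambda row: row['Sales'] * 0.1 if row['Product'] == 'A' else row['Sales'] * 0.05,
    axis=1,
)

print(orders)
```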

### Visualization

#### Plotting with pandas

pandas integrates with matplotlib for easy plotting.

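```python
import matplotlib.pyplot as plt

# Assumes df has 'Date', 'Product', and 'Sales' columns.
# If 'Date' is the index instead, omit x='Date'; pandas plots against the index.

# Line plot
df.plot(kind='line', x='Date', y='Sales')
plt.show()

# Bar plot
df.plot(kind='bar', x='Product', y='Sales')
plt.show()

# Histogram
df['Sales'].plot(kind='hist')
plt.show()
```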

### Conclusion

This tutorial covers pandas from the basics through to more advanced features. Practice these concepts on real datasets to solidify your understanding. Here are some additional resources to help you along the way:

- [pandas Documentation](https://pandas.pydata.org/docs/)

- [Kaggle Datasets](https://www.kaggle.com/datasets)

- [Python for Data Analysis by Wes McKinney](https://www.oreilly.com/library/view/python-for-data/9781491957653/)

Happy coding!

Shivaram Babar
