Pandas DataFrame¶
Pandas is a powerful library for data manipulation and analysis. It provides data structures and operations for manipulating numerical tables and time series.
Basic Operations¶
import pandas as pd
# Basic DataFrame Creation
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
# Accessing Columns
column_a = df['A'].values
column_means = df.mean()
Search:
# Find index of specific value
index_of_value = df[df['A'] == 3].index[0]
# Find rows meeting a condition
filtered_rows = df[df['A'] > 3]
Delete: The drop()
method removes rows or columns from the DataFrame:
Slicing¶
# Create Sample DataFrame
sample_df = pd.DataFrame({
'A': range(10),
'B': range(10, 20),
'C': range(20, 30)
})
# Slicing Techniques
# First 3 rows
first_three_rows = sample_df.iloc[0:3]
# Select specific columns
selected_columns = sample_df.loc[:, ['A', 'B']]
# Conditional Filtering
filtered_rows = sample_df[sample_df['A'] > 5]
# Complex Slicing
specific_slice = sample_df.loc[1:3, ['A', 'C']]
Apply Function¶
# Apply Functions to DataFrame
# Exponential of each element
exp_df = df.apply(np.exp)
# Calculate Mean
column_means = df.apply(np.mean)
row_means = df.apply(np.mean, axis=1)
Performance Considerations¶
NumPy vs Lists¶
- Use Lists for:
- General-purpose programming
- Small datasets
-
Mixed data types
-
Use NumPy Arrays for:
- Scientific computing
- Vectorized operations
- Large numerical computations
Memory Efficiency¶
# Memory Comparison
list_size = sys.getsizeof(list(range(10000000)))
numpy_size = sys.getsizeof(np.arange(10000000))