Summary: Learn to detect, filter, and clean missing data using pandas and numpy. This guide covers identifying nulls, filtering rows, and removing or replacing incomplete data entries.
import pandas as pd
import numpy as np
# Create the DataFrame
data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [28, np.nan, 34, 45, np.nan],
    'Department': ['HR', 'IT', np.nan, 'Finance', np.nan],
    'Salary': [50000, 60000, 70000, np.nan, np.nan]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print("\nMissing Values (True indicates missing):")
print(df.isnull())
print("\nOnly Missing Values (row label and column of each missing cell):")
for row, col in zip(*np.where(df.isnull())):
    print(df.index[row], df.columns[col])
print("\nOnly Non-Missing Values (True where data is present):")
print(df.notnull())
print("\nMissing Values Count per Column:")
print(df.isnull().sum())
print("\nTotal Missing Values in the DataFrame:")
print(df.isnull().sum().sum())
print("\nRows with Missing Age:")
print(df[df['Age'].isnull()])
print("\nRows where Department AND Salary are missing:")
print(df[df['Department'].isnull() & df['Salary'].isnull()])
print("\nRows with All Non-Missing Values:")
print(df[df.notnull().all(axis=1)])
print("\nRows with Any Missing Value:")
print(df[df.isnull().any(axis=1)])
missing_counts = df.isnull().sum()
most_missing_column = missing_counts.idxmax()
print(f"\nColumn with the most missing values: {most_missing_column} ({missing_counts[most_missing_column]} missing)")
print("\nSuggestion: Use mean/median to fill missing 'Age' or 'Salary', and mode or a placeholder like 'Unknown' for 'Department'.")
clean_df = df.dropna()
print("\nCleaned DataFrame (rows with any missing values dropped):")
print(clean_df)
This program demonstrates how to handle missing data in a pandas DataFrame. Each section of the code identifies and processes missing values using functions such as isnull(), notnull(), and dropna().
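Calling dropna() with no arguments removes every row that has at least one missing value, which leaves only one of the five sample rows. The sketch below shows two narrower alternatives using the real `subset` and `thresh` parameters of `DataFrame.dropna`. The variable names `age_known` and `mostly_complete` are illustrative only, and the threshold of 4 is an assumed cutoff, not a recommendation from the program above.

```python
import numpy as np
import pandas as pd

# Same sample data as the program above.
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [28, np.nan, 34, 45, np.nan],
    'Department': ['HR', 'IT', np.nan, 'Finance', np.nan],
    'Salary': [50000, 60000, 70000, np.nan, np.nan],
})

# Drop a row only when 'Age' is missing; gaps in other columns are tolerated.
age_known = df.dropna(subset=['Age'])

# Keep rows that have at least 4 non-missing values out of the 5 columns.
mostly_complete = df.dropna(thresh=4)

print("Rows with a known Age:")
print(age_known)
print("\nRows with at least 4 non-missing values:")
print(mostly_complete)
```

Both calls return new DataFrames and leave `df` unchanged, so the results can be compared before deciding how much data to discard.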
isnull() - Detects missing data (returns True for NaN values)
notnull() - Detects valid (non-missing) entries
sum() - Counts missing values when applied to the result of isnull()
dropna() - Removes rows with missing values
Use mean(), median(), or mode() to fill missing numerical data.
Use mode() or a placeholder such as "Unknown" to fill missing categorical data.
Use dropna() when removing rows is acceptable.
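The program only prints the fill suggestions; it does not apply them. A minimal sketch of one way to do so with `fillna()` follows. The choice of median for Age, mean for Salary, and the "Unknown" placeholder for Department is an assumption made for illustration, and `filled_df` is a hypothetical name. A placeholder is used for Department because each known department appears exactly once in the sample data, so the mode would not single out one value.

```python
import numpy as np
import pandas as pd

# Same sample data as the program above.
df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [28, np.nan, 34, 45, np.nan],
    'Department': ['HR', 'IT', np.nan, 'Finance', np.nan],
    'Salary': [50000, 60000, 70000, np.nan, np.nan],
})

# Work on a copy so the original DataFrame keeps its NaN values.
filled_df = df.copy()

# Numerical columns: fill with a summary statistic of the known values.
filled_df['Age'] = filled_df['Age'].fillna(filled_df['Age'].median())
filled_df['Salary'] = filled_df['Salary'].fillna(filled_df['Salary'].mean())

# Categorical column: fill with a placeholder label.
filled_df['Department'] = filled_df['Department'].fillna('Unknown')

print(filled_df)
print("\nRemaining missing values:", filled_df.isnull().sum().sum())
```

Unlike dropna(), this keeps all five rows, at the cost of replacing unknown values with estimates.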