Revision Guide
📚 Quick Quiz Section
Parsing Questions
Q1: What’s the key difference between .find() and .index() when searching for a substring?
Show Answer
.find()returns-1if not found (safe but need to check).index()raisesValueErrorif not found (use with try/except)
Corner case: Using .find() result in slice without checking:
# BAD - if '[' not found, find() returns -1, slice becomes weird
start = line.find('[')
timestamp = line[start+1:end] # Could slice from end!
# GOOD
try:
start = line.index('[')
except ValueError:
continueQ2: What does .get() return if the key doesn’t exist and you don’t provide a default?
Show Answer
Returns None
Corner case: This can cause issues when you expect a specific type:
count = event.get("count")
count += 1 # TypeError if count is None!
# Better:
count = event.get("count", 0)
count += 1Q3: How do you safely access nested JSON like event["involvedObject"]["name"]?
Show Answer
Chain .get() calls with empty dict as default:
obj = event.get("involvedObject", {})
name = obj.get("name", "unknown")
# Or one-liner:
name = event.get("involvedObject", {}).get("name", "unknown")Why {} as default: If key missing, you get empty dict which also has .get() method, so chain doesn’t break.
Q4: When parsing /etc/passwd, what should you check BEFORE splitting the line?
Show Answer
Three things:
- Strip whitespace first:
cleaned = line.strip() - Check if empty:
if not cleaned: - Check if comment:
if cleaned.startswith("#"):
Corner case: Whitespace-only lines will pass if line: but fail when split.
# CORRECT ORDER:
cleaned = line.strip()
if not cleaned or cleaned.startswith("#"):
continue
fields = cleaned.split(":")Counting Questions
Q5: Fill in the blank - count occurrences using plain dict:
counts = {}
for item in items:
counts[item] = counts._____(item, ___) + 1Show Answer
counts[item] = counts.get(item, 0) + 1Q6: What’s the danger of using Counter when you just need simple counting?
Show Answer
No real danger, just:
- Slight memory overhead
- Overkill for simple cases
- Import dependency
However: If you might need .most_common() later, using Counter from start is fine.
Corner case: Counter returns 0 for missing keys (doesn’t raise KeyError), which might hide bugs:
c = Counter({'a': 5})
print(c['b']) # Returns 0, not KeyErrorQ7: When should you use a set instead of counting?
Show Answer
When you only care about uniqueness, not frequency:
- Unique IPs, pod names, user IDs
- Checking “have I seen this before?” (O(1) lookup)
- Deduplication
Don’t use set when: You need to know “how many times?” - use Counter/dict instead.
Q8: What’s wrong with this code?
endpoint_counts = defaultdict(int)
for log in logs:
endpoint_counts[log['endpoint']] + 1Show Answer
Missing = assignment! Should be:
endpoint_counts[log['endpoint']] += 1This is a silent bug - the code runs but doesn’t count anything. Python evaluates x + 1 but doesn’t store it.
Statistics Questions
Q9: What’s the critical check you MUST do before calculating an average?
Show Answer
Check for division by zero:
# BAD
avg = total / count
# GOOD
avg = total / count if count > 0 else 0
# Or explicit:
if count > 0:
avg = total / count
else:
avg = 0 # or None, or raise errorCorner case: Even if you “know” there’s data, defensive programming prevents crashes.
Q10: Why can’t you calculate percentiles in a single pass like you can with average?
Show Answer
Percentiles require sorted data or knowing the distribution.
- Average: Just need sum and count (single pass)
- Percentile: Need to know position in sorted order
Pattern: Collect all values → sort (or use numpy) → calculate percentile
# Must collect first:
latencies = []
for log in logs:
latencies.append(log['latency'])
p95 = np.percentile(latencies, 95) # Requires all dataQ11: What’s wrong with this max() call on a filtered list?
errors = [log for log in logs if log['status'] >= 400]
slowest_error = max(errors, key=lambda x: x['latency'])Show Answer
It crashes if errors is empty!
# Better:
slowest_error = max(errors, key=lambda x: x['latency'], default=None)
# Then check:
if slowest_error:
print(f"Slowest error: {slowest_error}")Corner case: Always use default parameter with max()/min() on filtered data.
Time & Date Questions
Q12: What type does subtracting two datetime objects return?
Show Answer
timedelta object
start = datetime.fromisoformat("2025-01-15T10:00:00")
end = datetime.fromisoformat("2025-01-15T10:05:23")
duration = end - start # timedelta object
print(duration) # "0:05:23"
seconds = duration.total_seconds() # 323.0Key: timedelta can be added together and divided by integers (for averages).
Q13: How do you calculate average duration across multiple sessions?
Show Answer
from datetime import timedelta
total_duration = timedelta(0) # Initialize to zero timedelta
count = 0
for session in sessions:
duration = session['end'] - session['start']
total_duration += duration # Add timedeltas
count += 1
avg_duration = total_duration / count # Divide by intCorner case: Initialize with timedelta(0), NOT 0 (won’t work with += timedelta).
Q14: When should you sort logs by timestamp before processing?
Show Answer
When the order matters for correctness:
- Session tracking (need login before logout in order)
- Calculating durations between sequential events
- Time-series analysis
- Any time you reference “previous” or “next” events
Trade-off: Must load all into memory, but ensures correctness.
logs = sorted(logs, key=lambda x: datetime.fromisoformat(x['timestamp']))Filtering & Grouping Questions
Q15: What’s the difference between these two?
# A
results = [x for x in items if condition]
# B
results = []
for x in items:
if condition:
results.append(x)Show Answer
Functionally identical, but:
A (list comprehension):
- More Pythonic
- Slightly faster
- Best for simple conditions
- One-liner
B (loop):
- More readable for complex logic
- Can have multiple statements
- Can use continue/break
- Better for debugging (can add prints)
When to use B: Complex multi-line conditions, need to track state, debugging.
Q16: Complete the pattern - track first occurrence only:
active_sessions = {}
for event in events:
user = event['user_id']
if event['action'] == 'login' and user ___ active_sessions:
active_sessions[user] = event['timestamp']Show Answer
if event['action'] == 'login' and user not in active_sessions:
active_sessions[user] = event['timestamp']Key pattern: not in ensures first occurrence only.
Alternative: Check if value is None:
if event['action'] == 'login' and active_sessions.get(user) is None:Q17: Why use defaultdict(list) for grouping instead of regular dict?
Show Answer
Avoids initialization boilerplate:
# Without defaultdict:
groups = {}
for item in items:
if key not in groups:
groups[key] = [] # Must initialize!
groups[key].append(item)
# With defaultdict:
groups = defaultdict(list)
for item in items:
groups[key].append(item) # Auto-creates empty listCleaner and less error-prone.
Corner case: Creates key on first access, even if you don’t append. Usually fine, but be aware.
🔥 Corner Case Drills
Drill 1: Empty Data
Scenario: Your log file is empty or has no matching records.
What breaks in this code?
latencies = [log['latency'] for log in logs if log['endpoint'] == '/checkout']
avg = sum(latencies) / len(latencies)
p95 = np.percentile(latencies, 95)Show Answer
Two things break:
avgcalculation: Division by zero iflatenciesis emptynp.percentile(): Raises error on empty array
Fixed:
if latencies:
avg = sum(latencies) / len(latencies)
p95 = np.percentile(latencies, 95)
else:
print("No data available")
avg = 0
p95 = 0Drill 2: Missing Keys
Scenario: JSON event is missing expected fields.
What’s wrong?
for event in events:
pod_name = event["involvedObject"]["name"]
if event["type"] == "Warning":
print(pod_name)Show Answer
KeyError if any key missing!
Fixed:
for event in events:
obj = event.get("involvedObject", {})
pod_name = obj.get("name")
event_type = event.get("type")
if event_type == "Warning" and pod_name:
print(pod_name)Always use .get() with nested JSON from external sources.
Drill 3: Whitespace Lines
Scenario: File has empty lines, whitespace-only lines, and comments.
What’s wrong?
with open("passwd.txt") as f:
for line in f:
if line.startswith("#"):
continue
fields = line.split(":")
uid = fields[2]Show Answer
Multiple issues:
- Whitespace-only lines will pass through
- Empty lines cause index error on
fields[2] - Not stripping means comment check might fail if indented
Fixed:
with open("passwd.txt") as f:
for line in f:
cleaned = line.strip()
if not cleaned or cleaned.startswith("#"):
continue
fields = cleaned.split(":")
if len(fields) < 3: # Extra safety
continue
uid = fields[2]Golden rule: Strip → check empty → check comments → parse
Drill 4: Out of Order Logs
Scenario: Session logs are not in chronological order.
What breaks?
active_sessions = {}
for log in logs:
if log['action'] == 'login':
active_sessions[log['user']] = log['timestamp']
elif log['action'] == 'logout':
duration = log['timestamp'] - active_sessions[log['user']]Show Answer
If logout comes before login in file: KeyError on active_sessions[log['user']]
Also: Duration could be negative if timestamps are out of order
Fixed:
# Sort first
logs = sorted(logs, key=lambda x: datetime.fromisoformat(x['timestamp']))
active_sessions = {}
for log in logs:
if log['action'] == 'login':
active_sessions[log['user']] = datetime.fromisoformat(log['timestamp'])
elif log['action'] == 'logout' and log['user'] in active_sessions:
start = active_sessions[log['user']]
end = datetime.fromisoformat(log['timestamp'])
duration = end - start
# Now guaranteed positive duration
del active_sessions[log['user']]Drill 5: Case Sensitivity
Scenario: Log levels appear as “ERROR”, “error”, “Error”.
What’s wrong?
valid_levels = {'ERROR', 'WARN', 'INFO', 'DEBUG'}
for line in logs:
level = extract_level(line)
if level in valid_levels:
counts[level] += 1Show Answer
Case variations won’t match, undercounting!
Fixed:
valid_levels = {'ERROR', 'WARN', 'INFO', 'DEBUG'}
for line in logs:
level = extract_level(line).upper() # Normalize
if level in valid_levels:
counts[level] += 1Always normalize case when matching log levels, error messages, etc.
Drill 6: Modifying While Iterating
What’s wrong?
for item in my_list:
if item['status'] == 'expired':
my_list.remove(item)Show Answer
Skips items! Removing during iteration changes indices.
Fixed options:
# Option 1: Create new list (cleanest)
my_list = [item for item in my_list if item['status'] != 'expired']
# Option 2: Iterate over copy
for item in my_list[:]: # [:] creates shallow copy
if item['status'] == 'expired':
my_list.remove(item)
# Option 3: Iterate backwards
for i in range(len(my_list) - 1, -1, -1):
if my_list[i]['status'] == 'expired':
del my_list[i]Best practice: Create new list with comprehension
🎯 Pattern Recognition Drill
For each scenario, identify the best approach:
Scenario 1
“Count how many requests each IP made”
Show Answer
defaultdict(int) or dict.get()
from collections import defaultdict
ip_counts = defaultdict(int)
for log in logs:
ip_counts[log['ip']] += 1Why not Counter? Could work, but overkill if you don’t need most_common().
Scenario 2
“Find the top 5 most common error messages”
Show Answer
Counter with most_common()
from collections import Counter
errors = [log['message'] for log in logs if log['level'] == 'ERROR']
Counter(errors).most_common(5)Why? Built-in sorting by frequency, clean API.
Scenario 3
“Track which pods have had at least one warning event”
Show Answer
set
pods_with_warnings = set()
for event in events:
if event['type'] == 'Warning':
pods_with_warnings.add(event['pod_name'])Why? Only care about uniqueness, not frequency.
Scenario 4
“Group all events by pod name”
Show Answer
defaultdict(list)
from collections import defaultdict
events_by_pod = defaultdict(list)
for event in events:
events_by_pod[event['pod_name']].append(event)Why? Auto-initializes empty lists, clean grouping pattern.
Scenario 5
“Calculate P95 latency for /api/checkout endpoint”
Show Answer
Collect in list → numpy.percentile()
import numpy as np
latencies = [log['latency'] for log in logs if log['endpoint'] == '/api/checkout']
if latencies:
p95 = np.percentile(latencies, 95)Why? Percentiles need sorted/distributed data, can’t do single-pass.
Scenario 6
“Calculate average session duration (login to logout)”
Show Answer
Sort by timestamp → Track active sessions → Sum timedeltas
from datetime import datetime, timedelta
logs = sorted(logs, key=lambda x: datetime.fromisoformat(x['timestamp']))
active = {}
total = timedelta(0)
count = 0
for log in logs:
user = log['user']
if log['action'] == 'login':
active[user] = datetime.fromisoformat(log['timestamp'])
elif log['action'] == 'logout' and user in active:
duration = datetime.fromisoformat(log['timestamp']) - active[user]
total += duration
count += 1
del active[user]
avg = total / count if count > 0 else timedelta(0)Why sort? Need chronological order for paired events.
🧠 Memory Tricks
“Strip Before Check”
When parsing files, always:
1. Strip whitespace: cleaned = line.strip()
2. Check empty: if not cleaned: continue
3. Check comments: if cleaned.startswith("#"): continue
4. Parse: fields = cleaned.split(":")“Get with Default”
For nested JSON:
Level 1: obj = event.get("key", {})
Level 2: value = obj.get("subkey", default)
One-liner: value = event.get("key", {}).get("subkey", default)“Check Before Divide”
Always:
result = numerator / denominator if denominator > 0 else 0“Default on Max”
When filtering:
result = max(filtered_list, key=lambda x: x['field'], default=None)
if result:
# use result“Not In for First Only”
Track first occurrence:
if key not in dict:
dict[key] = value“Sort for Sessions”
When events must be in order:
logs = sorted(logs, key=lambda x: datetime.fromisoformat(x['timestamp']))✅ Review Checklist
Use this for quick reviews:
Parsing:
- Do I strip before checking empty lines?
- Do I use .get() for JSON fields?
- Do I handle comments in config files?
Counting:
- Did I choose the right structure (dict/defaultdict/Counter/set)?
- Am I using += with counters, not just +?
Statistics:
- Did I check for division by zero?
- Did I use default with max()/min() on filtered data?
- Did I collect all values before calculating percentiles?
Time:
- Did I sort by timestamp if order matters?
- Did I initialize timedelta(0) not 0?
- Did I check for paired events before calculating duration?
General:
- Did I handle empty input data?
- Did I normalize case for comparisons?
- Did I avoid modifying list while iterating?