1 | 1 | --- |
2 | 2 | name: data-analysis |
3 | | -description: Analyze datasets, generate charts, and create summary reports. Use when the user needs to work with CSV, Excel, or other tabular data formats for analysis or visualization. |
| 3 | +description: > |
| 4 | + Analyze datasets, generate charts, and create summary reports from CSV, Excel, |
| 5 | + JSON, Parquet, or other tabular data. Capabilities: statistical profiling, |
| 6 | + outlier detection, pivot tables, groupby aggregation, time-series analysis, |
| 7 | + correlation matrices, and publication-ready visualizations. |
| 8 | + Trigger terms: analyze data, plot chart, summarize CSV, data profiling, |
| 9 | + statistics, histogram, scatter plot, dashboard, EDA, exploratory analysis. |
4 | 10 | --- |
5 | 11 |
6 | 12 | # Data Analysis |
7 | 13 |
8 | 14 | ## When to use this skill |
9 | 15 | Use this skill when the user needs to: |
10 | | -- Analyze CSV or Excel files |
11 | | -- Generate charts and visualizations |
12 | | -- Calculate statistics and summaries |
13 | | -- Clean and transform data |
14 | | - |
15 | | -## How to analyze data |
16 | | -1. Use pandas for data analysis: |
17 | | - ```python |
18 | | - import pandas as pd |
19 | | - df = pd.read_csv('data.csv') |
20 | | - summary = df.describe() |
21 | | - ``` |
22 | | - |
23 | | -## How to create visualizations |
24 | | -1. Use matplotlib or seaborn for charts: |
25 | | - ```python |
26 | | - import matplotlib.pyplot as plt |
27 | | - df.plot(kind='bar') |
28 | | - plt.savefig('chart.png') |
29 | | - ``` |
| 16 | +- Analyze CSV, Excel, JSON, or Parquet files |
| 17 | +- Generate charts and visualizations (bar, line, scatter, heatmap) |
| 18 | +- Calculate statistics, correlations, or distributions |
| 19 | +- Clean, transform, pivot, or aggregate data |
| 20 | +- Perform exploratory data analysis (EDA) |
| 21 | + |
| 22 | +## Workflow |
| 23 | + |
| 24 | +1. **Load & validate** -- Read the file, confirm shape and dtypes, report missing values. |
| 25 | +2. **Profile** -- Run `df.describe()`, check nulls, detect outliers (IQR or z-score). |
| 26 | +3. **Transform** -- Filter, group, pivot, or resample as needed. |
| 27 | +4. **Visualize** -- Generate charts; save to file with `plt.savefig()`. |
| 28 | +5. **Report** -- Summarize key findings in plain language. |
| 29 | + |
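Step 2 of the workflow names IQR as one way to detect outliers. The following is a minimal sketch of that rule, written against the standard library so it runs without pandas. The helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample `revenue` values are illustrative assumptions, not part of the skill.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the k * IQR fence rule.

    Hypothetical helper for illustration; k=1.5 is the conventional
    default, not something the skill file prescribes.
    """
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up sample data: one value sits far above the rest
revenue = [10, 12, 11, 13, 12, 14, 11, 95]
lower, upper, outliers = iqr_outliers(revenue)
print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```

On a DataFrame column the same fences can be computed from `Series.quantile(0.25)` and `Series.quantile(0.75)`, whose default linear interpolation should agree with the inclusive method used here.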
| 30 | +## Error Recovery |
| 31 | + |
| 32 | +| Problem | Action | |
| 33 | +|---------|--------| |
| 34 | +| File not found / wrong path | List directory contents, ask user to confirm filename | |
| 35 | +| Encoding error on read | Retry with `encoding='cp1252'`, then `'latin-1'` (latin-1 decodes any byte sequence, so inspect the result for garbled characters) |
| 36 | +| Mixed dtypes in column | Use `pd.to_numeric(col, errors='coerce')` and report coerced rows | |
| 37 | +| Empty dataframe after filter | Warn user, show original value counts for filter column | |
| 38 | +| Chart rendering fails | Fall back to text-based summary table | |
| 39 | + |
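The encoding row in the table above can be sketched as a small probe that runs before the file is handed to `pd.read_csv`. The helper name `first_working_encoding` and the candidate order are assumptions. Stricter codecs are tried first and `latin-1` last, because `latin-1` accepts every byte and would otherwise hide the other candidates.

```python
import os
import tempfile


def first_working_encoding(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return the first encoding that decodes the whole file, or None.

    Hypothetical helper for illustration. It reads the file into memory,
    so it suits small and medium inputs only.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes; try the next one
        return enc
    return None


# Demo: write a cp1252-encoded CSV to a temporary file, then probe it
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("region,revenue\nMünchen,10\n".encode("cp1252"))
    demo_path = tmp.name

chosen = first_working_encoding(demo_path)
os.remove(demo_path)
print(chosen)
```

The returned name can then be passed as `pd.read_csv(path, encoding=chosen)`. A successful decode only shows that the bytes are valid in that codec, not that the text is correct, so the loaded values are still worth a visual check.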
| 40 | +## Example: end-to-end EDA |
| 41 | + |
| 42 | +```python |
| 43 | +import pandas as pd, matplotlib.pyplot as plt, seaborn as sns |
| 44 | + |
| 45 | +df = pd.read_csv('sales.csv', parse_dates=['date']) |
| 46 | +assert not df.empty, "Dataset is empty" |
| 47 | + |
| 48 | +# Profile |
| 49 | +print(df.describe()) |
| 50 | +print(f"Missing values:\n{df.isnull().sum()}") |
| 51 | + |
| 52 | +# Visualize |
| 53 | +fig, axes = plt.subplots(1, 2, figsize=(12, 5)) |
| 54 | +df.groupby('region')['revenue'].sum().plot.bar(ax=axes[0], title='Revenue by Region') |
| 55 | +sns.heatmap(df.select_dtypes('number').corr(), annot=True, ax=axes[1]) |
| 56 | +plt.tight_layout() |
| 57 | +plt.savefig('eda_report.png', dpi=150) |
| 58 | +``` |