The Complete Guide to robots.txt Configuration

What is robots.txt

robots.txt is a plain-text file placed at the root of a website. It tells search engine crawlers (such as Googlebot and Bingbot) which URLs they may crawl and which they may not.

How robots.txt works:

                   ┌──────────────────────────┐
                   │       Your website       │
                   │  ┌────────────────────┐  │
                   │  │  robots.txt        │  │
  Googlebot ───────┼─→│  User-agent: *     │  │
                   │  │  Allow: /          │  │
  Bingbot ─────────┼─→│  Disallow: /admin  │  │
                   │  │  Disallow: /api    │  │
  Other crawlers ──┼─→│  Sitemap: /sitemap │  │
                   │  └────────────────────┘  │
                   │                          │
                   │  ✅ /products  → allowed │
                   │  ✅ /blog      → allowed │
                   │  ❌ /admin     → blocked │
                   │  ❌ /api       → blocked │
                   └──────────────────────────┘

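Why you need robots.txt

1. Optimize crawl budget

Search engines allot each site a limited amount of crawling. If crawlers spend much of it on useless pages, the pages that matter may not be refreshed promptly.

Example crawl budget allocation (illustrative figures):

Without robots.txt:
├── /admin/*          ████████ 30%  ← wasted
├── /api/*            ██████ 25%   ← wasted
├── /search?q=*       ████ 15%     ← unbounded
├── /products/*       ██ 10%       ← core content squeezed out
├── /blog/*           ██ 10%       ← core content squeezed out
└── /pages/*          ██ 10%

With robots.txt configured:
├── /products/*       ████████████ 40%  ✅
├── /blog/*           ████████████ 40%  ✅
├── /pages/*          ██████ 20%        ✅
└── Blocked areas     0%                ✅

2. Keep sensitive areas out of the crawl

Some pages should not show up in search results:

- Admin back ends
- User profile and account pages
- Test and development environments
- Duplicate content (such as print versions of pages)

3. Prevent server overload

A crawl rate limit keeps crawlers from hitting the server too often. Support varies by crawler: Bing honors Crawl-delay, for example, while Google ignores it.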

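robots.txt syntax in detail

Basic syntax structure

# This is a comment
# Sample robots.txt file

# Rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

# Rules for one specific crawler
# (a crawler that matches its own group ignores the * group above)
User-agent: Googlebot
Disallow: /no-google/

# Sitemap declaration
Sitemap: https://example.com/sitemap.xml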

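User-agent directive

User-agent names the crawlers that a group of rules applies to:

# Applies to all crawlers
User-agent: *

# Applies to Google's crawler
User-agent: Googlebot

# Applies to Google's image crawler
User-agent: Googlebot-Image

# Applies to Bing's crawler
User-agent: Bingbot

# Applies to Baidu's crawler
User-agent: Baiduspider

# Several User-agent lines can share one set of rules
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/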
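
The way a crawler picks its group can be sketched in a few lines. The snippet below is a simplified illustration, not a full implementation of the standard: parseGroups and selectRules are hypothetical helper names, and matching a crawler to a token by prefix is an assumption made for brevity. It shows two things: consecutive User-agent lines share one group, and a crawler that finds a group naming it does not fall back to the * group.

```javascript
// Split robots.txt text into groups. Consecutive User-agent lines share one group.
function parseGroups(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === 'user-agent') {
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (key === 'allow' || key === 'disallow') {
      if (current) current.rules.push({ type: key, path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Pick the rules for one crawler: the most specific matching token wins,
// and * is used only when no named group matches.
function selectRules(groups, crawlerName) {
  const name = crawlerName.toLowerCase();
  let best = null;
  for (const group of groups) {
    for (const agent of group.agents) {
      if (agent && agent !== '*' && name.startsWith(agent)
          && (best === null || agent.length > best.length)) {
        best = agent;
      }
    }
  }
  const token = best === null ? '*' : best;
  return groups.filter((g) => g.agents.includes(token)).flatMap((g) => g.rules);
}

const sample = [
  'User-agent: *',
  'Disallow: /admin/',
  '',
  'User-agent: Googlebot',
  'User-agent: Bingbot',
  'Disallow: /private/',
].join('\n');

const groups = parseGroups(sample);
console.log(selectRules(groups, 'Bingbot'));     // only the /private/ rule
console.log(selectRules(groups, 'DuckDuckBot')); // only the /admin/ rule from *
```

Note that Bingbot ends up with no /admin/ rule at all in this sample, which is exactly the behavior that surprises people when they add a crawler-specific group.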

Common search engine crawler identifiers:

Crawler	User-agent	Search engine
Googlebot	Googlebot	Google web search
Googlebot-Image	Googlebot-Image	Google image search
Googlebot-Video	Googlebot-Video	Google video search
Bingbot	Bingbot	Bing
Slurp	Slurp	Yahoo
Baiduspider	Baiduspider	Baidu
Sogou	Sogou Spider	Sogou
360Spider	360Spider	360 Search
DuckDuckBot	DuckDuckBot	DuckDuckGo

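Disallow and Allow directives

# Block every URL whose path starts with /admin (this also matches /administrator)
Disallow: /admin

# Block a directory and everything under it
Disallow: /admin/

# Block every URL that contains .pdf (use /*.pdf$ to match only URLs that end in .pdf)
Disallow: /*.pdf

# Block every URL that contains "?" (query parameters)
Disallow: /*?

# Block URLs that contain "print" followed later by ".html"
Disallow: /*print*.html

# Allow a specific path (the longest matching rule wins; on a tie, Allow wins)
Allow: /admin/public/

# Combined use
User-agent: *
Disallow: /admin/
Allow: /admin/login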
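
How Allow and Disallow interact can be shown with a small sketch. It assumes the precedence used by Google and described in RFC 9309 (the longest matching rule wins, and Allow wins a tie), and it uses plain prefix matching without wildcards to stay short. The function name isPathAllowed and the rule object shape are illustrative choices, not part of any library.

```javascript
// Decide whether a path may be crawled, given the rules of one group.
// Wildcards are deliberately left out: each rule is treated as a plain prefix.
function isPathAllowed(path, rules) {
  let winner = null;
  for (const rule of rules) {
    // An empty rule restricts nothing; a rule that is not a prefix does not apply
    if (rule.path === '' || !path.startsWith(rule.path)) continue;
    const better = winner === null
      || rule.path.length > winner.path.length
      || (rule.path.length === winner.path.length && rule.type === 'allow');
    if (better) winner = rule;
  }
  // With no matching rule, crawling is allowed by default
  return winner === null || winner.type === 'allow';
}

const rules = [
  { type: 'disallow', path: '/admin/' },
  { type: 'allow', path: '/admin/login' },
];

console.log(isPathAllowed('/admin/login', rules)); // true
console.log(isPathAllowed('/admin/users', rules)); // false
console.log(isPathAllowed('/blog', rules));        // true
```

The order of the lines in the file plays no part in the decision here; only the length of the matching rule does.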

Path matching rules

Path matching examples:

Disallow: /path
├── /path          ✅ match
├── /path/         ✅ match
├── /path/page     ✅ match
├── /path.html     ✅ match
├── /pathname      ✅ match
└── /other-path    ❌ no match

Disallow: /path/
├── /path          ❌ no match
├── /path/         ✅ match
├── /path/page     ✅ match
└── /path/sub/     ✅ match

Disallow: /*.pdf
├── /file.pdf      ✅ match
├── /docs/file.pdf ✅ match
├── /file.pdf?v=1  ✅ match
└── /file.PDF      ❌ no match (paths are case-sensitive)

Disallow: /file$
├── /file          ✅ match ($ anchors the end of the URL)
├── /file.html     ❌ no match
└── /file/         ❌ no match
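
The matching behavior in these examples can be reproduced by translating a pattern into a regular expression. The sketch below assumes the two special characters described above (* for any run of characters, a trailing $ for end of URL) and treats everything else literally; patternToRegExp and matches are hypothetical helper names.

```javascript
// Convert a robots.txt path pattern into an anchored regular expression.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split('*')
    // Escape regex metacharacters so ".", "?" and friends match literally
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  // Patterns always match from the start of the path; "$" also pins the end
  return new RegExp('^' + source + (anchored ? '$' : ''));
}

function matches(pattern, path) {
  return patternToRegExp(pattern).test(path);
}

console.log(matches('/path', '/pathname'));        // true
console.log(matches('/path/', '/path'));           // false
console.log(matches('/*.pdf', '/docs/file.pdf'));  // true
console.log(matches('/*.pdf', '/file.PDF'));       // false
console.log(matches('/file$', '/file.html'));      // false
```

Escaping matters in practice: without it, the "." in /*.pdf would match any character and the "?" in /*? would be read as a regex quantifier.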

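Crawl-delay directive

Controls how often a crawler sends requests:

# Wait 10 seconds between requests (honored by Bing and some other crawlers, not by Google)
User-agent: Bingbot
Crawl-delay: 10

# Google does not support Crawl-delay;
# Googlebot adjusts its crawl rate automatically based on how the server responds

Note: Google ignores the Crawl-delay directive. The crawl rate setting that Google Search Console used to offer was retired in early 2024; to slow Googlebot down temporarily, Google's documented approach is to return 500, 503, or 429 status codes.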

Sitemap directive

Declares where the sitemap lives. Use an absolute URL; the line applies to all crawlers, whichever group it appears next to:

# A single sitemap
Sitemap: https://example.com/sitemap.xml

# Several sitemaps
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml

# A sitemap index file
Sitemap: https://example.com/sitemap-index.xml
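
A quick way to audit Sitemap lines is to pull them out of the file and keep only absolute http(s) URLs. The following is a minimal sketch under that assumption; extractSitemaps is a hypothetical helper name, and it relies only on the standard URL class.

```javascript
// Collect the absolute sitemap URLs declared in a robots.txt file.
function extractSitemaps(text) {
  const urls = [];
  for (const raw of text.split(/\r?\n/)) {
    // The directive name is matched case-insensitively
    const match = /^sitemap\s*:\s*(\S+)/i.exec(raw.trim());
    if (!match) continue;
    try {
      const url = new URL(match[1]); // throws for relative values such as /sitemap.xml
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        urls.push(url.href);
      }
    } catch {
      // Not an absolute URL: skip it
    }
  }
  return urls;
}

const sample = [
  'User-agent: *',
  'Disallow: /admin/',
  'Sitemap: https://example.com/sitemap-products.xml',
  'sitemap: https://example.com/sitemap-posts.xml',
  'Sitemap: /sitemap.xml',
].join('\n');

console.log(extractSitemaps(sample));
```

In this sample the relative entry on the last line is dropped, which makes the function usable as a simple check in a deployment script.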

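Practical configuration examples

E-commerce site robots.txt

# E-commerce site robots.txt
# Last updated: 2024-12-24

User-agent: *

# Allow the main content areas
Allow: /products/
Allow: /categories/
Allow: /brands/
Allow: /blog/
Allow: /pages/

# Block admin areas
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/

# Block private user pages
Disallow: /my-account/
Disallow: /cart/
Disallow: /checkout/
Disallow: /wishlist/

# Block search result pages (avoids unbounded crawling)
Disallow: /search
Disallow: /*?q=
Disallow: /*?search=

# Block sort/filter parameter pages
# (these patterns match only when the parameter comes first in the query string)
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Block internal tracking links
Disallow: /*?ref=
Disallow: /*?utm_
Disallow: /*?fbclid=

# Block duplicate content
Disallow: /print/
Disallow: /*print=true

# Block temporary campaign pages
Disallow: /promo/temp/

# Google-specific rules
# Googlebot obeys only this group and ignores the * group above,
# so the rules it should still follow are repeated here
# (repeat the parameter rules too if they should apply to Googlebot)
User-agent: Googlebot
Allow: /api/products/schema
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Disallow: /my-account/
Disallow: /cart/
Disallow: /checkout/
Disallow: /wishlist/
Disallow: /search

# Image crawler (likewise obeys only its own group)
User-agent: Googlebot-Image
Allow: /images/products/
Allow: /images/categories/
Disallow: /images/admin/
Disallow: /images/temp/

# Sitemap
Sitemap: https://shop.example.com/sitemap-index.xml
Sitemap: https://shop.example.com/sitemap-products.xml
Sitemap: https://shop.example.com/sitemap-categories.xml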

Content site robots.txt

# Blog/content site robots.txt
# Last updated: 2024-12-24

User-agent: *

# Core content
Allow: /articles/
Allow: /tutorials/
Allow: /guides/
Allow: /news/
Allow: /authors/

# Block the back end
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cms/
Disallow: /login/

# Block drafts and previews
Disallow: /draft/
Disallow: /preview/
Disallow: /*?preview=

# Block tag archive pages (they can create duplicate content)
Disallow: /tag/
Disallow: /tags/

# Allow important categories
Allow: /category/

# Block archive pages
Disallow: /archive/
Disallow: /*?date=

# Block comment-related URLs
Disallow: /comments/
Disallow: /*?replytocom=

# Block print and PDF versions
Disallow: /print/
Disallow: /*format=pdf

# RSS and API
Disallow: /feed/
Disallow: /api/

# Static resource directories
# (check first that these do not serve CSS or JavaScript your pages need to render;
# blocking such files can keep search engines from rendering the pages properly)
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Allow: /wp-content/uploads/

# Sitemap
Sitemap: https://blog.example.com/sitemap.xml

SaaS application robots.txt

# SaaS application robots.txt
# Last updated: 2024-12-24

User-agent: *

# Public pages
Allow: /
Allow: /features/
Allow: /pricing/
Allow: /blog/
Allow: /docs/
Allow: /about/
Allow: /contact/
Allow: /legal/

# Block the application itself
Disallow: /app/
Disallow: /dashboard/
Disallow: /settings/
Disallow: /workspace/

# Block authentication pages
Disallow: /login/
Disallow: /signup/
Disallow: /register/
Disallow: /forgot-password/
Disallow: /reset-password/
Disallow: /verify/
Disallow: /oauth/

# Block APIs
Disallow: /api/
Disallow: /graphql/
Disallow: /webhook/

# Block user content
Disallow: /u/
Disallow: /users/
Disallow: /profile/

# Block internal tooling
Disallow: /admin/
Disallow: /internal/
Disallow: /debug/
Disallow: /health/
Disallow: /metrics/

# Block marketing tracking parameters
Disallow: /*?ref=
Disallow: /*?utm_

# Let help center articles be crawled
Allow: /help/
Allow: /support/articles/
Disallow: /support/tickets/

# Sitemap
Sitemap: https://app.example.com/sitemap.xml
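
Files like the three above tend to grow long, and crawler-specific groups do not inherit from the * group. One way to keep them consistent is to generate the file from data so that shared rules are written once and reused. This is only a sketch of that idea: buildRobotsTxt and its input shape are assumptions made for illustration, not the API of any existing tool.

```javascript
// Build robots.txt text from a list of groups and sitemap URLs.
function buildRobotsTxt({ groups, sitemaps = [] }) {
  const blocks = groups.map(({ agents, allow = [], disallow = [] }) => [
    ...agents.map((agent) => `User-agent: ${agent}`),
    ...allow.map((path) => `Allow: ${path}`),
    ...disallow.map((path) => `Disallow: ${path}`),
  ].join('\n'));
  const sitemapBlock = sitemaps.map((url) => `Sitemap: ${url}`).join('\n');
  // Groups are separated by blank lines; an empty sitemap block is dropped
  return [...blocks, sitemapBlock].filter(Boolean).join('\n\n') + '\n';
}

// Rules every crawler should obey, defined once and reused in each group
const shared = ['/admin/', '/api/'];

const text = buildRobotsTxt({
  groups: [
    { agents: ['*'], disallow: shared },
    { agents: ['Googlebot'], allow: ['/api/products/schema'], disallow: shared },
  ],
  sitemaps: ['https://shop.example.com/sitemap-index.xml'],
});

console.log(text);
```

Because the Googlebot group is built from the same shared list, adding a new blocked path updates both groups at once.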

Common mistakes and pitfalls

Error 1: Blocking the entire site by accident

# ❌ Wrong - blocks everything
User-agent: *
Disallow: /

# ✅ Right - blocks only specific directories
User-agent: *
Disallow: /admin/
Disallow: /private/
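
This mistake often ships when a staging configuration reaches production, so it can be worth checking automatically before deployment. The sketch below is one possible check under simple assumptions: blocksWholeSite is a hypothetical name, and it only looks for a "Disallow: /" (or "/*") rule inside a group that applies to all crawlers.

```javascript
// Return true if the group for "*" contains a rule that blocks every URL.
function blocksWholeSite(text) {
  let inWildcardGroup = false;
  let lastWasAgent = false;
  let blocked = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === 'user-agent') {
      // A User-agent line after a rule starts a new group
      if (!lastWasAgent) inWildcardGroup = false;
      if (value === '*') inWildcardGroup = true;
      lastWasAgent = true;
    } else if (key === 'allow' || key === 'disallow') {
      lastWasAgent = false;
      if (inWildcardGroup && key === 'disallow' && (value === '/' || value === '/*')) {
        blocked = true;
      }
    }
  }
  return blocked;
}

console.log(blocksWholeSite('User-agent: *\nDisallow: /'));       // true
console.log(blocksWholeSite('User-agent: *\nDisallow: /admin/')); // false
```

A deployment script could fetch the production robots.txt and fail the release when this function returns true.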

Error 2: Wrong file location

robots.txt must sit at the root of the host:

✅ Right: https://example.com/robots.txt
❌ Wrong: https://example.com/files/robots.txt
❌ Wrong: https://www.example.com/robots.txt (if the canonical host is example.com; the file only covers the host that serves it)

Each subdomain needs its own robots.txt:
https://example.com/robots.txt        → example.com
https://www.example.com/robots.txt    → www.example.com
https://blog.example.com/robots.txt   → blog.example.com
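
Since the file is looked up per host, the robots.txt that governs any page can be derived from the page URL alone. A minimal sketch using the standard URL class follows; robotsUrlFor is a hypothetical helper name.

```javascript
// Compute the robots.txt URL that applies to a given page URL.
// Resolving "/robots.txt" against the page keeps the scheme, host and port
// and discards the path, query string and fragment.
function robotsUrlFor(pageUrl) {
  return new URL('/robots.txt', pageUrl).href;
}

console.log(robotsUrlFor('https://blog.example.com/posts/hello?ref=home'));
// https://blog.example.com/robots.txt
console.log(robotsUrlFor('https://www.example.com/files/robots.txt'));
// https://www.example.com/robots.txt
```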

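Error 3: Mixing up rules across groups

# ❌ Wrong - the two groups do not combine
User-agent: *
Disallow: /admin/
User-agent: Googlebot
Allow: /admin/public/
# Googlebot obeys only its own group, so /admin/ is NOT blocked for it!

# ✅ Right - each group carries every rule its crawlers need
User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/
Disallow: /no-google-only/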

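Error 4: Syntax errors

# ❌ Common syntax problems

# Wrong: the path does not start with "/"
Disallow: admin/

# Risky: whitespace before the colon (tolerated by many parsers, but best avoided)
Disallow : /admin/

# Redundant wildcards
Disallow: /admin/*  # valid, but equivalent to /admin/
Disallow: /admin/** # not glob syntax: ** behaves like a single *

# Not an error: directive names are case-insensitive
user-agent: *       # accepted; User-agent is the conventional spelling
disallow: /admin/   # accepted; the path value, however, IS case-sensitive

# ✅ Conventional format
User-agent: *
Disallow: /admin/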
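
A few of these problems are easy to catch mechanically. The linter below is a rough sketch, not a validator for the full standard: lintRobotsTxt is a hypothetical name, and the list of known directives is an assumption limited to the ones covered in this guide. It treats directive names case-insensitively and flags paths that do not start with "/" or "*".

```javascript
// Directives this sketch recognizes (an assumption, limited to this guide)
const KNOWN_DIRECTIVES = new Set(['user-agent', 'allow', 'disallow', 'sitemap', 'crawl-delay']);

// Return a list of { line, message } problems found in robots.txt text.
function lintRobotsTxt(text) {
  const problems = [];
  let seenAgent = false;
  text.split(/\r?\n/).forEach((raw, index) => {
    const line = raw.split('#')[0].trim(); // drop comments
    if (line === '') return;
    const report = (message) => problems.push({ line: index + 1, message });
    const idx = line.indexOf(':');
    if (idx === -1) return report('missing ":" separator');
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (!KNOWN_DIRECTIVES.has(key)) return report(`unknown directive "${key}"`);
    if (key === 'user-agent') seenAgent = true;
    if (key === 'allow' || key === 'disallow') {
      if (!seenAgent) report('rule appears before any User-agent line');
      // An empty value is legal; otherwise the path should start with "/" or "*"
      if (value !== '' && !/^[\/*]/.test(value)) report('path should start with "/"');
    }
  });
  return problems;
}

const sample = [
  'User-agent: *',
  'Disallow: admin/',
  'Disalow: /private/',
  'disallow: /tmp/',
].join('\n');

console.log(lintRobotsTxt(sample));
```

For this sample the linter reports the relative path on line 2 and the misspelled directive on line 3, while the lowercase directive on line 4 passes.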

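Error 5: Relying on robots.txt for security

⚠️ Important:

robots.txt is not a security mechanism!

❌ It cannot stop:
- Malicious crawlers (they ignore the rules)
- Direct access to a URL
- Leaks of sensitive information (the file is public, so it advertises every path it lists)

✅ Real protective measures:
- Put sensitive pages behind authentication
- Enforce access control on the server
- Encrypt sensitive data
- Use a noindex meta tag to keep a page out of search results (this controls indexing, not access)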

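robots.txt alongside other SEO directives

robots.txt vs meta robots

<!-- meta robots tag -->
<meta name="robots" content="noindex, nofollow">

<!-- or as an HTTP header -->
X-Robots-Tag: noindex, nofollow

How the two differ:

                    robots.txt                          meta robots / X-Robots-Tag
──────────────────────────────────────────────────────────────────────────────────
Stage               Before crawling                     After crawling
Indexing            Not crawled, may still be indexed   Explicitly not indexed
Link equity         Uncertain                           Controllable (nofollow)
Granularity         URL path patterns                   Individual pages
Takes effect        Once robots.txt is refetched        Once the page is recrawled
Best for            Blocking in bulk                    Precise per-page control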

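Recommended combination

# robots.txt - controls crawling
User-agent: *
Disallow: /admin/        # no need to crawl
Disallow: /temp/         # temporary content

# Do not block pages that rely on noindex!
# Otherwise the crawler never sees the noindex directive
# Allow: /user/profile/  # let it be crawled, and set noindex on the page itself

<!-- In the page - controls indexing -->
<!-- /user/profile/ page -->
<head>
  <meta name="robots" content="noindex, nofollow">
</head>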
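
The conflict described above (a page that carries noindex but is also blocked from crawling) can be detected with a simple cross-check. This sketch assumes you already have two lists, the paths that carry noindex and the Disallow prefixes for a crawler, and it uses plain prefix matching without wildcards. The name findHiddenNoindex is illustrative.

```javascript
// Return the noindex pages that crawlers can never read because a
// Disallow prefix blocks them, which leaves their noindex directive unseen.
function findHiddenNoindex(noindexPaths, disallowPrefixes) {
  return noindexPaths.filter((path) =>
    disallowPrefixes.some((prefix) => prefix !== '' && path.startsWith(prefix)));
}

const noindexPaths = ['/user/profile/', '/temp/old-campaign', '/thanks'];
const disallowPrefixes = ['/admin/', '/temp/'];

console.log(findHiddenNoindex(noindexPaths, disallowPrefixes));
// [ '/temp/old-campaign' ]
```

Any path this returns needs a decision: either drop the Disallow rule so the noindex can be read, or accept that the URL may still be indexed from external links.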

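Configuring robots.txt in Nuxt.js

Using the @nuxtjs/robots module

# Install the module
npm install @nuxtjs/robots

// nuxt.config.ts
// Note: these option names match an older major version of @nuxtjs/robots.
// Newer releases use different keys, so check the docs for the version you install.
export default defineNuxtConfig({
  modules: ['@nuxtjs/robots'],
  
  robots: {
    // Basic configuration
    UserAgent: '*',
    Disallow: ['/admin', '/api', '/user'],
    Allow: ['/api/public'],
    
    // Sitemap
    Sitemap: 'https://example.com/sitemap.xml',
    
    // Alternatively, list the rules one by one
    // (pick one style or the other rather than combining them)
    rules: [
      { UserAgent: '*' },
      { Disallow: '/admin' },
      { BlankLine: true },
      // Googlebot will obey only this group, not the * group above
      { UserAgent: 'Googlebot' },
      { Allow: '/api/schema' },
      { Sitemap: 'https://example.com/sitemap.xml' }
    ]
  }
});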

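Configuring dynamically per environment

// nuxt.config.ts
// Function-style options depend on the module version; verify that yours supports them.
export default defineNuxtConfig({
  modules: ['@nuxtjs/robots'],
  
  robots: () => {
    // Block all crawlers outside production
    if (process.env.NODE_ENV !== 'production') {
      return {
        UserAgent: '*',
        Disallow: '/'
      };
    }
    
    // Normal configuration for production
    return {
      UserAgent: '*',
      Disallow: ['/admin', '/api', '/private'],
      Allow: ['/api/public'],
      // Assumes SITE_URL holds a bare hostname such as example.com
      Sitemap: `https://${process.env.SITE_URL}/sitemap.xml`
    };
  }
});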

A complete robots.txt server route

// server/routes/robots.txt.ts
export default defineEventHandler((event) => {
  const config = useRuntimeConfig();
  const isProduction = process.env.NODE_ENV === 'production';
  
  // Serve plain text in every environment
  setHeader(event, 'Content-Type', 'text/plain; charset=utf-8');
  
  // Block crawling outside production
  if (!isProduction) {
    return `User-agent: *
Disallow: /`;
  }
  
  // Assumes runtimeConfig.public.siteUrl is an absolute origin such as https://example.com
  const baseUrl = config.public.siteUrl;
  
  const robots = `
# robots.txt for ${baseUrl}
# Generated: ${new Date().toISOString()}

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /user/
Disallow: /checkout/
Disallow: /cart/
Disallow: /*?*

# Googlebot ignores the * group once it has its own, so shared rules are repeated
User-agent: Googlebot
Allow: /api/products/
Disallow: /api/
Disallow: /admin/
Disallow: /user/
Disallow: /checkout/
Disallow: /cart/
Disallow: /*?*

User-agent: Googlebot-Image
Allow: /images/
Disallow: /images/private/

Sitemap: ${baseUrl}/sitemap.xml
`.trim();
  
  setHeader(event, 'Cache-Control', 'public, max-age=86400'); // cache for 1 day
  
  return robots;
});

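Validation and testing

Testing in Google Search Console

1. Sign in to Google Search Console
2. Select your site
3. Open the URL Inspection tool
4. Enter the URL you want to test
5. Check whether robots.txt blocks that URL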

Using a robots.txt testing tool

// A simple robots.txt parser (a teaching sketch, not a complete implementation of the standard)
class RobotsTxtParser {
  constructor(content) {
    this.rules = this.parse(content);
  }
  
  parse(content) {
    const rules = {};
    let currentAgents = [];
    let lastWasUserAgent = false;
    
    for (const line of content.split(/\r?\n/)) {
      // Strip inline comments, then skip blank lines
      const trimmed = line.split('#')[0].trim();
      if (!trimmed) continue;
      
      const [directive, ...valueParts] = trimmed.split(':');
      const name = directive.trim().toLowerCase(); // directive names are case-insensitive
      const value = valueParts.join(':').trim();
      
      if (name === 'user-agent') {
        // Consecutive User-agent lines share one group of rules
        if (!lastWasUserAgent) currentAgents = [];
        const agent = value.toLowerCase();
        currentAgents.push(agent);
        if (!rules[agent]) {
          rules[agent] = { allow: [], disallow: [] };
        }
        lastWasUserAgent = true;
      } else if (name === 'allow' || name === 'disallow') {
        lastWasUserAgent = false;
        // An empty value (e.g. "Disallow:") restricts nothing, so skip it
        if (!value) continue;
        for (const agent of currentAgents) {
          rules[agent][name].push(value);
        }
      }
    }
    
    return rules;
  }
  
  isAllowed(url, userAgent = 'googlebot') {
    const ua = userAgent.toLowerCase();
    // A crawler uses its own group if one exists; otherwise it falls back to *
    const rules = this.rules[ua] || this.rules['*'] || { allow: [], disallow: [] };
    
    // Rules are matched against the path plus the query string
    const { pathname, search } = new URL(url);
    const path = pathname + search;
    
    // Find the longest pattern in a list that matches the path ('' if none)
    const longestMatch = (patterns) => patterns
      .filter((pattern) => this.matches(path, pattern))
      .reduce((best, pattern) => (pattern.length > best.length ? pattern : best), '');
    
    const matchedAllow = longestMatch(rules.allow);
    const matchedDisallow = longestMatch(rules.disallow);
    
    // With no matching Disallow rule, crawling is allowed by default
    if (!matchedDisallow) return true;
    
    // The longer rule wins; on a tie, Allow wins
    return matchedAllow.length >= matchedDisallow.length;
  }
  
  matches(path, pattern) {
    // Convert the robots.txt pattern to a regular expression:
    // a trailing $ anchors the end, * matches any run of characters,
    // and every other character is escaped so it matches literally
    const anchored = pattern.endsWith('$');
    const body = anchored ? pattern.slice(0, -1) : pattern;
    const source = body
      .split('*')
      .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
      .join('.*');
    
    return new RegExp(`^${source}${anchored ? '$' : ''}`).test(path);
  }
}

// Usage example
// The Googlebot group repeats Disallow: /admin/ because it does not inherit from *
const robotsTxt = `
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/
`;

const parser = new RobotsTxtParser(robotsTxt);

console.log(parser.isAllowed('https://example.com/products', 'Googlebot')); // true
console.log(parser.isAllowed('https://example.com/admin/', 'Googlebot')); // false
console.log(parser.isAllowed('https://example.com/admin/public/', 'Googlebot')); // true
console.log(parser.isAllowed('https://example.com/api/users', 'Bingbot')); // false
console.log(parser.isAllowed('https://example.com/api/public/data', 'Bingbot')); // true

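Best practices summary

robots.txt best practices checklist:

Basics:
✓ Always place the file at the site root
✓ Use UTF-8 encoding without a BOM
✓ Keep the file under 500 KB
✓ Include a Sitemap directive

Content strategy:
✓ Block admin back ends and APIs
✓ Block private user pages
✓ Block search result pages
✓ Block parameterized URLs
✓ Allow the main content areas

Security notes:
✓ Do not rely on robots.txt for security
✓ Protect truly sensitive content with authentication
✓ Use noindex when a page must stay out of the index entirely
✓ Review the blocked list regularly for paths that leak information

Maintenance:
✓ Add a last-updated comment to the file
✓ Check regularly for misconfigurations
✓ Verify with Google Search Console
✓ Monitor crawl statistics

robots.txt is one of the foundations of SEO configuration. Configured correctly, it helps search engines crawl your site more efficiently and keeps crawlers away from content that should not surface in search. Remember that it is a "gentlemen's agreement": well-behaved crawlers honor it, but it cannot be relied on for security.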
