
How do you identify malicious requests and fend off crawlers?

June 2, 2019 | 移动技术网 IT编程

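Foreword

Over the past few days I have felt more and more strongly that business needs drive technical progress. Without a business need behind it, everything is just talk.

An answer I wrote on Zhihu suddenly went viral, and traffic to my mini program surged.

At the peak, more than 200 distinct IPs were sending requests every minute, roughly 5 requests per second, i.e. 5 QPS. (Suddenly that feels tiny.)

My system has rate limiting and caching, so a QPS in the thousands is not a problem.

So what I want to write about today is not high concurrency, but how to identify malicious requests and attacks and block them.

Because the code is open source, the API is fully exposed, so there will always be people who hit my endpoints maliciously. It does no real damage, but it is annoying all the same.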

Limiting by IP

This is code I have had in place all along:

  1 package com.gdufe.osc.interceptor;
  2 
  3 import com.alibaba.fastjson.JSON;
  4 import com.gdufe.osc.common.OscResult;
  5 import com.gdufe.osc.enums.OscResultEnum;
  6 import com.gdufe.osc.service.RedisHelper;
  7 import com.gdufe.osc.utils.IpUtils;
  8 import lombok.extern.slf4j.Slf4j;
  9 import org.apache.commons.lang3.StringUtils;
 10 import org.springframework.beans.factory.annotation.Autowired;
 11 import org.springframework.lang.Nullable;
 12 import org.springframework.web.servlet.HandlerInterceptor;
 13 import org.springframework.web.servlet.ModelAndView;
 14 
 15 import javax.servlet.http.HttpServletRequest;
 16 import javax.servlet.http.HttpServletResponse;
 17 import java.util.Map;
 18 
 19 /**
 20  * @author: yizhen
 21  * @date: 2018/12/28 12:11
 22  */
 23 @Slf4j
 24 public class IpBlockInterceptor implements HandlerInterceptor {
 25 
 26     /** More than 50 requests within 10s counts as hammering the API and gets limited */
 27     private static final long TIME = 10;
 28     private static final long CNT = 50;
 29     private Object lock = new Object();
 30 
 31     /** Limit by the User-Agent header */
 32     private static final String USERAGENT = "user-agent";
 33     private static final String CRAWLER = "crawler";
 34 
 35     @Autowired
 36     private RedisHelper<Integer> redisHelper;
 37 
 38     @Override
 39     public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
 40         synchronized (lock) {
 41             boolean checkAgent = checkAgent(request);
 42             boolean checkIp = checkIp(request, response);
 43             return checkAgent && checkIp;
 44         }
 45     }
 46 
 47     private boolean checkAgent(HttpServletRequest request) {
 48         String header = request.getHeader(USERAGENT);
 49         if (StringUtils.isEmpty(header)) {
 50             return false;
 51         }
 52         if (header.contains(CRAWLER)) {
 53             log.error("请求头有问题，拦截 ==> user-agent = {}", header); // "bad request header, blocked"
 54             return false;
 55         }
 56         return true;
 57     }
 58 
 59     private boolean checkIp(HttpServletRequest request, HttpServletResponse response) throws Exception {
 60         String ip = IpUtils.getClientIp(request);
 61         String url = request.getRequestURL().toString();
 62         String param = getAllParam(request);
 63         boolean isExist = redisHelper.isExist(ip);
 64         if (isExist) {
 65             // Key exists: just increment the counter
 66             int cnt = redisHelper.incr(ip);
 67             if (cnt > IpBlockInterceptor.CNT) {
 68                 OscResult<String> result = new OscResult<>();
 69                 response.setCharacterEncoding("utf-8");
 70                 response.setHeader("content-type", "application/json;charset=utf-8");
 71                 result = result.fail(OscResultEnum.LIMIT_EXCEPTION);
 72                 response.getWriter().print(JSON.toJSONString(result));
 73                 log.error("ip = {}, 请求过快，被限制", ip); // "requesting too fast, limited"
 74                 // Store the IP without expiry, i.e. add it to the blacklist
 75                 redisHelper.set(ip, --cnt);
 76                 return false;
 77             }
 78             log.info("ip = {}, {}s之内第{}次请求{}，参数为{}，通过", ip, TIME, cnt, url, param); // "n-th request within {}s, passed"
 79         } else {
 80             // First request in this window
 81             redisHelper.setEx(ip, IpBlockInterceptor.TIME, 1);
 82             log.info("ip = {}, {}s之内第1次请求{}，参数为{}，通过", ip, TIME, url, param); // "1st request within {}s, passed"
 83         }
 84         return true;
 85     }
 86 
 87     private String getAllParam(HttpServletRequest request) {
 88         Map<String, String[]> map = request.getParameterMap();
 89         StringBuilder sb = new StringBuilder("[");
 90         map.forEach((x, y) -> {
 91             String s = StringUtils.join(y, ",");
 92             sb.append(x + " = " + s + ";");
 93         });
 94         sb.append("]");
 95         return sb.toString();
 96     }
 97 
 98     @Override
 99     public void postHandle(HttpServletRequest request, HttpServletResponse response, Object handler, @Nullable ModelAndView modelAndView) throws Exception {
100     }
101 
102     @Override
103     public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, @Nullable Exception ex) throws Exception {
104     }
105 }

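Let me walk through the code briefly.

See lines 41 and 42: I apply two layers of interception.

The first layer rejects requests with a non-compliant User-Agent header: if the header is empty or contains the string "crawler", the request is blocked outright.

The second layer limits by IP. If an IP calls my API more than 50 times within 10 s, I treat it as a crawler hammering the API.

At that point I write the IP's counter to Redis with no expiry, so every later request from it is blocked immediately.

That is the first approach.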
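
To make the second layer easier to follow, here is a minimal sketch of the same fixed-window idea without Redis or Spring. It is not the author's code: `FixedWindowLimiter`, `allow` and the in-memory maps are hypothetical names chosen for illustration, and the current time is passed in as a parameter only so that the logic can be exercised deterministically.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the Redis-backed IP check described above. */
class FixedWindowLimiter {
    private final long windowMillis;
    private final int maxRequests;
    // ip -> {window start in millis, request count in that window}
    private final Map<String, long[]> windows = new ConcurrentHashMap<>();
    // IPs that exceeded the limit once and stay blocked
    private final Set<String> blacklist = ConcurrentHashMap.newKeySet();

    FixedWindowLimiter(long windowMillis, int maxRequests) {
        this.windowMillis = windowMillis;
        this.maxRequests = maxRequests;
    }

    /** Returns true if the request from this IP may pass at time nowMillis. */
    boolean allow(String ip, long nowMillis) {
        if (blacklist.contains(ip)) {
            return false;
        }
        // compute() is atomic per key, so no global lock is required
        long[] w = windows.compute(ip, (key, old) ->
                (old == null || nowMillis - old[0] >= windowMillis)
                        ? new long[] {nowMillis, 1}          // start a new window
                        : new long[] {old[0], old[1] + 1});  // count within the window
        if (w[1] > maxRequests) {
            blacklist.add(ip); // never expires, like the key stored without a TTL
            return false;
        }
        return true;
    }
}
```

Calling `new FixedWindowLimiter(10_000, 50).allow(ip, System.currentTimeMillis())` mirrors the 10 s / 50 request thresholds from the interceptor. Unlike the Redis version, this sketch keeps its state in one JVM, so it only illustrates the counting logic and would not work across several application instances.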

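Counting requests per IP

But some IPs crawl slowly, say only 20 to 30 requests every 10 s, yet they keep going without a break and never stop, so they never trip the limit.

It does no real damage, but it is annoying all the same.

Let's look at roughly what the application logs: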

2019-06-01 16:21:24.271 [http-nio-8083-exec-5] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 106.121.145.154, 10s之内第1次请求zhihu/spider/get，参数为[type = 1;offset = 80;limit = 10;]，通过
2019-06-01 16:21:24.271 [http-nio-8083-exec-5] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：356
2019-06-01 16:21:24.775 [http-nio-8083-exec-3] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 120.229.218.95, 10s之内第1次请求zhihu/spider/get，参数为[type = 1;offset = 70;limit = 10;]，通过
2019-06-01 16:21:24.775 [http-nio-8083-exec-3] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：612
2019-06-01 16:21:32.050 [http-nio-8083-exec-10] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 105.235.134.202, 10s之内第1次请求zhihu/spider/get，参数为[type = 2;offset = 0;limit = 10;]，通过
2019-06-01 16:21:32.050 [http-nio-8083-exec-10] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：93
2019-06-01 16:21:32.320 [http-nio-8083-exec-7] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 120.229.218.95, 10s之内第2次请求zhihu/spider/get，参数为[type = 1;offset = 80;limit = 10;]，通过
2019-06-01 16:21:32.320 [http-nio-8083-exec-7] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：100
2019-06-01 16:21:33.755 [http-nio-8083-exec-2] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 106.17.6.118, 10s之内第1次请求zhihu/spider/get，参数为[type = 1;offset = 80;limit = 10;]，通过
2019-06-01 16:21:33.755 [http-nio-8083-exec-2] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：107
2019-06-01 16:21:33.805 [http-nio-8083-exec-9] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 123.120.29.78, 10s之内第1次请求zhihu/spider/get，参数为[type = 1;offset = 80;limit = 10;]，通过
2019-06-01 16:21:33.805 [http-nio-8083-exec-9] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：1057
2019-06-01 16:21:35.697 [http-nio-8083-exec-6] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 106.121.145.154, 10s之内第1次请求zhihu/spider/get，参数为[type = 1;offset = 90;limit = 10;]，通过
2019-06-01 16:21:35.697 [http-nio-8083-exec-6] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：1030
2019-06-01 16:21:36.197 [http-nio-8083-exec-1] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 120.229.218.95, 10s之内第1次请求zhihu/spider/get，参数为[type = 2;offset = 0;limit = 10;]，通过
2019-06-01 16:21:36.198 [http-nio-8083-exec-1] info  c.gdufe.osc.service.impl.zhihuspiderimpl - [] - 图片随机位置为：2384
2019-06-01 16:21:36.725 [http-nio-8083-exec-8] info  c.g.osc.interceptor.ipblockinterceptor - [] - ip = 183.236.187.208, 10s之内第1次请求zhihu/spider/get，参数为[type = 1;offset = 0;limit = 10;]，通过

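Each request should produce two log lines. One, written by the interceptor, records the client IP, the request count within the 10 s window, the request path and the parameters; the other is irrelevant here.

But how do I count how many requests each IP has made in total?

The shell script is as follows: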

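#!/bin/bash
# Work in the directory that receives the copy, so the relative paths below resolve
cd /home/shell/java || exit 1
# Copy the log into the current directory
cp /home/tomcat/apache-tomcat-8.5.23/workspace/osc/osc.log /home/shell/java/osc.log
# Replace the dots in IPs with colons, e.g. 120.74.147.123 becomes 120:74:147:123
sed -i "s/\./:/g" osc.log
# Keep only the request lines (they contain "limit") and print the IP, which is field 11
awk '/limit/ {print $11}' osc.log > temp.txt
# Sort numerically by all four octets, count duplicates, and keep the 50 most frequent
sort -t ':' -k1n -k2n -k3n -k4n temp.txt | uniq -c | sort -nr | head -n 50 > result.txt
# Remove the intermediate files
rm -f temp.txt osc.log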

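This uses quite a few commands, all of which I learned on the spot today (last-minute cramming).

In the result, the first column is the request count and the second column is the IP.

The IP 43.243.12.43 immediately stands out as abnormal. It is almost certainly a crawler, so I simply ban it.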
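
The same per-IP tally can also be sketched in Java, which may be convenient if the check should eventually run inside the application. This is an illustrative sketch, not part of the original project: `IpCounter` and `topIps` are hypothetical names, and the regular expression assumes the `ip = <address>,` format seen in the log excerpt. Unlike the `awk '/limit/'` filter, it counts every line that carries an IP, including lines for blocked requests.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper that counts how often each IP appears in the interceptor log. */
class IpCounter {
    // Matches the "ip = 1.2.3.4," fragment written by the interceptor
    private static final Pattern IP =
            Pattern.compile("ip = (\\d{1,3}(?:\\.\\d{1,3}){3}),");

    /** Returns the n most frequent IPs, highest count first (ties ordered by IP string). */
    static List<Map.Entry<String, Long>> topIps(List<String> lines, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = IP.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1L, Long::sum);
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

A call such as `IpCounter.topIps(Files.readAllLines(Paths.get("osc.log")), 50)` would roughly correspond to the `head -n 50` step of the script, with the log path being whatever file the application actually writes.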

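Afterword

I have really learned that business needs drive technology. It is having requirements and real business that pushes technology forward.

If you know other anti-crawler techniques, I would love to hear them and compare notes.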
