Jiajie's tech blog: 2017

Friday, December 15, 2017

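Mocking an HTTP server with Node.js

The problem I ran into was CORS: the mock server needs to return headers such as Access-Control-Allow-Headers and Access-Control-Allow-Origin.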
https://www.npmjs.com/package/node-mock-server
https://github.com/node-nock/nock

Friday, December 1, 2017

Friday, November 17, 2017

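CORS is a W3C standard; the name stands for "Cross-origin resource sharing".

CORS applies only to browsers. For security reasons the browser refuses to let a page use resources from a site that has not allowed it. The flow is
1. Ajax code on one site tries to access a resource on another site (HTML, images).
2. For requests that are not "simple" (for example PUT/DELETE or custom headers), the browser first automatically sends a preflight OPTIONS request carrying an Origin header.
3. If the server has whitelisted the origin, its response includes an Access-Control-Allow-Origin header naming that site. If the server does not allow it, the response lacks that header and the browser does not send the real request.
4. After that, the browser automatically adds an Origin header to every cross-origin request, and each response must again carry Access-Control-Allow-Origin.

This is only a browser restriction; Fiddler, Postman, or even an HTTP client library is not subject to it.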
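
The server side of the preflight step can be sketched as a small pure function. This is only an illustration, not the Jersey filter linked below: the whitelist, the header lists, and the name `preflight_headers` are assumptions made up for the example.

```python
# Hypothetical whitelist and allowed lists, chosen only for illustration.
ALLOWED_ORIGINS = {"https://app.example.com"}
ALLOWED_METHODS = ["GET", "POST", "PUT", "DELETE", "OPTIONS"]
ALLOWED_HEADERS = ["Content-Type", "Authorization"]


def preflight_headers(origin):
    """Return the CORS headers a server would add to a preflight response.

    An empty dict means the origin is not whitelisted, so the browser
    will not send the real request.
    """
    if origin not in ALLOWED_ORIGINS:
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": ", ".join(ALLOWED_METHODS),
        "Access-Control-Allow-Headers": ", ".join(ALLOWED_HEADERS),
    }
```

A whitelisted origin gets all three headers back; any other origin gets none, which is what makes the browser stop.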

HTTP methods include GET, POST, PUT, DELETE, and OPTIONS

A CORS implementation for Jersey (RESTful web services in Java)
https://gist.github.com/yunspace/36b0546245c5348a34ed

ref:
http://www.ruanyifeng.com/blog/2016/04/cors.html

Sunday, July 16, 2017

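MySQL tricky points

SELECT ... FOR UPDATE locks the selected rows against any other connection's SELECT ... FOR UPDATE, UPDATE, and INSERT until the transaction ends. It must of course run inside a transaction.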

CREATE DEFINER=`qfu`@`%` PROCEDURE `updateIndex`(IN batchNum INT)
BEGIN

DROP TEMPORARY TABLE IF EXISTS temp;
create temporary table temp(id int);
START TRANSACTION;
insert into temp
select id from Dow_Jones_Index2 where isUpdated=0 limit batchNum for update;
update Dow_Jones_Index2 set isUpdated=1 where id in (select id from temp);
commit;
select * from Dow_Jones_Index2 where id in (select id from temp);
DROP TEMPORARY TABLE IF EXISTS temp;
END

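A temporary table lives for the duration of the session; if the database connection drops, the temporary table disappears too. It is the equivalent of #temp in MSSQL.

---
explain select * from Student where id=5;
shows whether the query used an index.
Using where: rows are filtered by the WHERE clause after being read; on its own it usually means the index did not cover the filter
Using index: the index holds everything needed, so no table lookup is required. The difference is that "Using index" doesn't need a lookup from the index to the table, while "Using index condition" sometimes has to.
Using index condition: a table lookup is still needed

clear cache for performance tuning (workbench; MySQL 5.7 and earlier, since the query cache was removed in MySQL 8.0)
RESET QUERY CACHE;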

ref:
select for update

Saturday, June 24, 2017

Alexa Skill: Word Helper

Word Helper

{
"intents": [
{
"intent": "AMAZON.HelpIntent"
},
{
"intent": "AMAZON.StopIntent"
},
{
"intent": "AMAZON.CancelIntent"
},
{
"intent": "WordHelperAddIntent",
"slots": [
{
"name": "Word",
"type": "AMAZON.Food"
}
]
},
{
"intent": "WordHelperGetIntent"
},
{
"intent": "WordHelperDeleteIntent",
"slots": [
{
"name": "Word",
"type": "AMAZON.Food"
}
]
}
]
}

WordHelperAddIntent add {Word}
WordHelperAddIntent put {Word}
WordHelperAddIntent save {Word}
WordHelperDeleteIntent remove {Word}
WordHelperDeleteIntent delete {Word}
WordHelperGetIntent tell me a word
WordHelperGetIntent give me a word
WordHelperGetIntent tell me something
WordHelperGetIntent give me something

Lambda Test:
use Alexa Intent - GetNewFact

Skill Test:
add bacon
put scrambled egg
save chocolate cake
tell me a word
give me something
remove chocolate cake
delete lemon juice
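
The intent schema and the skill test above can be tied together with a minimal Lambda handler sketch. It assumes the standard Alexa request/response JSON shape; the in-memory `WORDS` list, the spoken texts, and the helper `intent_event` are assumptions for illustration (a real skill would persist words somewhere).

```python
WORDS = []  # in-memory stand-in for real storage (assumption)


def speak(text):
    """Wrap plain text in the Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }


def intent_event(name, word=None):
    """Build a minimal IntentRequest event, useful for local testing."""
    intent = {"name": name}
    if word is not None:
        intent["slots"] = {"Word": {"name": "Word", "value": word}}
    return {"request": {"type": "IntentRequest", "intent": intent}}


def lambda_handler(event, context=None):
    request = event["request"]
    if request["type"] != "IntentRequest":
        return speak("Welcome to Word Helper.")
    intent = request["intent"]
    word = intent.get("slots", {}).get("Word", {}).get("value")
    if intent["name"] == "WordHelperAddIntent" and word:
        WORDS.append(word)
        return speak("Added " + word)
    if intent["name"] == "WordHelperDeleteIntent" and word:
        if word in WORDS:
            WORDS.remove(word)
            return speak("Removed " + word)
        return speak("I do not know " + word)
    if intent["name"] == "WordHelperGetIntent":
        return speak(WORDS[0] if WORDS else "No words saved yet")
    return speak("Try: add bacon, or tell me a word")
```

Calling `lambda_handler(intent_event("WordHelperAddIntent", "bacon"))` mirrors the "add bacon" utterance from the skill test.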

Ref:
Create a skill steps
Example: Skill with Lambda
How to Config intent with parameter
Skill NodeJS lambda example
Skill NodeJS project example
Skill Java project example
Slot Type reference

https://stackoverflow.com/questions/41358552/how-to-get-the-account-info-of-the-user-when-the-user-uses-an-alexa-skill

Sunday, June 18, 2017

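Introduction to AWS CodePipeline

CodeCommit: the code repository. Any commit triggers the pipeline
CodeBuild: uses python:2.7.12 to build the code. A Python project needs a buildspec.yml (Python example); target files are the files to copy to the target machine
CodeDeploy: configure the target machines to deploy to, install the CodeDeploy agent on them, and trigger the initial run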

CodeDeploy official
CodeDeploy other

Thursday, June 15, 2017

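Commonly used Python libraries

Linux
sudo apt-get install python2.7

SQLAlchemy: the most popular ORM tool for Python
Logging: http://www.jianshu.com/p/feb86c06c4f4
YAML: http://www.ruanyifeng.com/blog/2016/07/yaml.html

__init__: names wrapped in double underscores are reserved special methods
__name: a leading double underscore marks a private method/attribute (Python mangles it to _ClassName__name)

person_1.py:

class Person:

    def __init__(self):
        self.__name = 'haha'  # private attribute
        self.age = 22

    def __get_name(self):  # private method
        return self.__name

    def get_age(self):
        return self.age

person = Person()
print(person.get_age())
print(person.__get_name())  # raises AttributeError: a private method is not reachable from outside the class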

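Without self, a function is not tied to any object (declare it with @staticmethod and call it on the class)
def read_tag(file_name, tag, freq_cutoff=-1, init_dict=True):

Dictionary.read_tag(file_name, 'dialogue', freq_cutoff=freq_cutoff)

With self, a function is tied to an object
def add_word(self, word):

dictionary.add_word(token)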
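
A small sketch can show the two call styles and the private-name rule side by side. The `Dictionary` class here is a made-up stand-in, not the class from the original project, and `normalize` is a hypothetical helper.

```python
class Dictionary:
    def __init__(self):
        self.__words = {}  # private: stored as _Dictionary__words

    def add_word(self, word):
        """Instance method: needs an object, updates its state."""
        self.__words[word] = self.__words.get(word, 0) + 1
        return self.__words[word]

    @staticmethod
    def normalize(token):
        """Static method: no self, callable on the class itself."""
        return token.strip().lower()


dictionary = Dictionary()
count = dictionary.add_word(Dictionary.normalize("  Hello "))
```

The static method is called on the class, the instance method on the object, and the private dict is only reachable under its mangled name.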

Inheritance:
http://www.cnblogs.com/feeland/p/4419121.html

Python syntax

Wednesday, June 7, 2017

Introduction to the Squid reverse proxy - Linux

sudo su
yum update
yum upgrade
yum -y install squid

vi /etc/squid/squid.conf

http_access allow all

sudo service squid stop
sudo service squid start

https://www.youtube.com/watch?v=BkjKINJIHsk

Sunday, May 21, 2017

Introduction to the Squid reverse proxy - Ubuntu

ifconfig

inet addr:172.31.64.201 Bcast:172.31.79.255 Mask:255.255.240.0
sudo su
vi /etc/network/interfaces

iface lo inet static
address 172.31.64.201
netmask 255.255.240.0
network 172.31.64.0
broadcast 172.31.79.255
gateway 172.31.79.1
dns-nameservers 172.31.79.1 8.8.8.8

use ec2 console to reboot (status changed in 4 mins)

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install squid3

sudo nano /etc/squid3/squid.conf

--------------------------------------------------------
Ctrl+W search/next search result
Ctrl+X exit
Ctrl+O write

(cache is disabled by default)
visible_hostname testproxy
http_port 3128
acl network src 172.31.0.0/20
http_access allow network
(http_access allow all, http://ipcalc.nmonitoring.com/)

cache_peer 10.1.2.3 parent 80 0 no-query default login=my_username:my_password
never_direct allow all

--------------------------------------------------------
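
The addresses used in the interface file and in the ACL can be sanity-checked with Python's standard ipaddress module. This is just a checking aid under the values shown in this post; the function names are made up for the example.

```python
import ipaddress


def describe(address, netmask):
    """Return (network address, broadcast address, prefix length)."""
    iface = ipaddress.ip_interface(address + "/" + netmask)
    net = iface.network
    return str(net.network_address), str(net.broadcast_address), net.prefixlen


def acl_covers(acl_cidr, address):
    """True if a Squid-style 'acl ... src <cidr>' range contains the address."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(acl_cidr)


network, broadcast, prefix = describe("172.31.64.201", "255.255.240.0")
in_acl = acl_covers("172.31.0.0/20", "172.31.64.201")
```

With the ifconfig values above, the mask works out to a /20 whose network address is 172.31.64.0. The range 172.31.0.0/20 does not contain 172.31.64.201, which may be worth double-checking against the intended client subnet.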

sudo service squid3 stop
sudo service squid3 start

inbound TCP rule: 3128 source: 0.0.0.0/0

Firefox:
in setting: 172.31.64.201 (private IP in AWS) and port 3128
in browser link type 172.31.64.201 -> see Access Denied page

Connection to 172.31.64.201 failed -> good

Log file:
/var/log/squid/access.log

Proxy:
Install node.js
sudo apt-get install build-essential curl git m4 ruby texinfo libbz2-dev libcurl4-openssl-dev libexpat-dev libncurses-dev zlib1g-dev

54.211.223.78

https://www.youtube.com/watch?v=iKtkp80gV04

http://www.cnblogs.com/derekchen/archive/2011/02/25/1964909.html

Thursday, May 18, 2017

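Introduction to AWS EC2 and VPC

Steps to connect to EC2 over SSH:

1. Download the private key file (.pem)
2. Use PuTTYgen to convert it into a private key file PuTTY can read (.ppk), my-key-pair
3. In PuTTY enter ubuntu@DNS as the host, and under SSH->Auth load the file produced in step 2 as the credential. Several EC2 instances can share the same key, but each PuTTY login still needs the file from step 2 loaded under SSH->Auth.
More

Templates

Use an existing instance as a template to create more instances like it (for example one with a lot of software already installed)
1. On the Instances page, select the instance to use.
2. Choose Actions, then Launch More Like This.
More
Or create an instance template as an AMI (Amazon Machine Image)
1. On the Instances page, select the instance to use.
2. Choose Actions, then Images->Create image.
New instances can then be created from My AMIs
More

terminate vs stop

The difference is that stop only shuts the machine down and the disk (EBS) remains, while terminate deletes the whole instance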

Desktop GUI

1. Install
sudo -s
sudo apt-get update
sudo apt-get install ubuntu-desktop
sudo apt-get install vnc4server
sudo apt-get install gnome-panel

2. Initialize the process
vncserver
vncserver -kill :1

3. Edit the config
vi .vnc/xstartup
Uncomment this line (remove the leading #):
unset SESSION_MANAGER
Add
gnome-session --session=gnome-classic &
gnome-panel&

4. Start the process
vncserver

5. Open port 5901
Add port 5901 to the inbound rules in security groups, with source 0.0.0.0/0. Take special care to add the port to the security group that actually belongs to this instance; if it is added to the wrong security group, the instance cannot be reached.

6. TightVNC
Connect to publicIP::5901

https://www.youtube.com/watch?v=ljvgwmJCUjw (the best guide)
http://stackoverflow.com/questions/25657596/how-to-set-up-gui-on-amazon-ec2-ubuntu-server

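VPC:

Suppose an EC2 instance is an internal resource that does not need to be exposed publicly (it is not a website). Giving it a public IP makes it easy to attack, so give it only a private IP, keep it inside the VPC, and add a bastion host with a public IP as the entry point into the VPC.

bastion host: public IP, sg1 -> allows port 22 only from the company's internal IPs (do not open it to all IPs)
vm01: private IP, sg2 -> allows port 22 only from sg1, with outbound allowing ports 80 and 443 to all IPs

With PuTTY you only log in to the bastion host, then ssh from there to vm01. In the bastion host session settings under SSH->Auth, tick Allow agent forwarding and leave Private key file for authentication empty; keep Pageant (Windows service) running and load the key into it (it works by holding the key in memory). See the tutorial

A private EC2 instance cannot reach the Internet by default, so you can create an Elastic IP in the VPC and assign it to the instance. A private EC2 instance has no outbound rules by default, quite unlike a public one, which by default allows All traffic on all ports. So every service must be added to outbound; for example, for EC2 to read and write RDS, it must be allowed to reach the RDS security group on 3306.

As the table below shows, ec2's outbound and mysql's inbound line up, so ec2 can reach RDS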

security group	Inbound	Outbound	IsPublic
bastion	SSH:devIP	ALL	Yes
ec2	SSH:bastion	HTTP(s):ALL MySql:mysql	no
mysql	MySql:ec2 MySql:devIP	ALL	yes
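
The table can be read as a pair of checks per connection: the source's outbound rules and the destination's inbound rules must both allow it. The sketch below models only that idea with plain dictionaries; it is a toy, not how AWS evaluates security groups, and the port numbers 22/80/443/3306 are filled in from the rule names.

```python
# Toy model of the table above: (peer, port) pairs, or "ALL" for everything.
SECURITY_GROUPS = {
    "bastion": {"inbound": {("devIP", 22)}, "outbound": "ALL"},
    "ec2": {
        "inbound": {("bastion", 22)},
        "outbound": {("ALL", 80), ("ALL", 443), ("mysql", 3306)},
    },
    "mysql": {"inbound": {("ec2", 3306), ("devIP", 3306)}, "outbound": "ALL"},
}


def allowed(rules, peer, port):
    if rules == "ALL":
        return True
    return (peer, port) in rules or ("ALL", port) in rules


def can_connect(src, dst, port):
    """Both the source's outbound and the destination's inbound must match."""
    # A source outside the model (e.g. devIP) has no outbound restriction here.
    out_ok = src not in SECURITY_GROUPS or allowed(
        SECURITY_GROUPS[src]["outbound"], dst, port)
    in_ok = allowed(SECURITY_GROUPS[dst]["inbound"], src, port)
    return out_ok and in_ok
```

In this model ec2 reaches mysql on 3306 and the developer reaches ec2 only through the bastion, which is the point of the layout.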

Monday, May 15, 2017

AWS CodeCommit

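How to set up AWS CodeCommit on Windows
http://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-https-windows.html?icmpid=docs_acc_console_connect#setting-up-https-windows-install-git

The main steps are
1. Create an IAM user gcen to serve as the git credential, attach the permissions
AWSCodeCommitFullAccess, IAMSelfManageServiceSpecificCredentials, IAMReadOnlyAccess, and obtain an Access Key ID and Secret Access Key, which act as the git access password
2. Install git. Do not tick the Git Credential Manager utility; it is incompatible with CodeCommit and causes 403 errors
3. Install the AWS CLI; the simplest way is the MSI installer or Python's pip. After installing, run aws configure and enter the Access Key ID and Secret Access Key. The region must match the region of the repository in CodeCommit, us-east-1 (shown in CodeCommit), otherwise you get a 403 error.
4. In a local folder run git clone https://git-codecommit.us-east-2.amazonaws.com/v1/repos/demo demo, and change the user name and email that git displays. For git over HTTPS you can generate a username and password in IAM, which avoids SSH.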

git config --global user.name "Gary"
git config --global user.email "gary@gmail.com"

ref:
Git commands for CodeCommit

Wednesday, May 10, 2017

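Introduction to machine learning

Association rule learning with the Apriori algorithm: used to discover relationships between variables, such as relationships between products bought in the same supermarket purchase.
Data:

TID	Tennis racket	Tennis ball	Sneakers	Badminton
1	1	1	1	0
2	1	1	0	0
3	1	0	0	0
4	1	0	1	0
5	0	1	1	1
6	1	1	0	0

Conclusion:
The association rule tennis racket => tennis ball is interesting: there is a strong association between buying a tennis racket and buying tennis balls.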
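
How strong the rule is can be measured with support and confidence. The sketch below computes both directly from the six transactions in the table; it is not a full Apriori implementation, and the English item names are just labels for the table columns.

```python
# The six transactions from the table, one set of items per TID.
TRANSACTIONS = [
    {"racket", "ball", "sneakers"},
    {"racket", "ball"},
    {"racket"},
    {"racket", "sneakers"},
    {"ball", "sneakers", "badminton"},
    {"racket", "ball"},
]


def count(items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for t in TRANSACTIONS if items <= t)


def support(items):
    """Fraction of all transactions that contain the item set."""
    return count(items) / len(TRANSACTIONS)


def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return count(lhs | rhs) / count(lhs)
```

For racket => ball, 3 of the 6 transactions contain both items and 5 contain a racket, so support is 3/6 and confidence is 3/5.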

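Machine learning workflow

Take a classification problem as an example. The data in this example uses several factors to decide whether a student passes the final exam.
Goal: a final exam grade G3>=10 is a pass, otherwise a fail, so this is a binary classification problem
Input: 32 attributes, such as famsize (family size), Mjob (mother's job), Fjob (father's job), reason (reason for choosing this school); G1 and G2 are the grades of the two earlier tests (attributes with a major influence on the result).

Data (http://archive.ics.uci.edu/ml/datasets/Student+Performance):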

sex	age	address	famsize	Mjob	Fjob	reason	G1	G2	G3
F	18	U	GT3	at_home	teacher	course	0	11	11
F	17	U	GT3	at_home	other	course	9	11	11
F	15	U	LE3	at_home	other	other	12	13	12
F	15	U	GT3	health	services	home	14	14	14

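The overall workflow also includes, up front, data collection (predicting product sales: season, promotion, sales history, price; whether an ad gets clicked: ad position, user gender, ad color, price), cleaning (replace nulls with the mean), and sanity checks (mean, variance); afterwards, the model is applied in prod to compute the target value for new data.

Feature extraction (turning features into numbers): in the example above Fjob is a String. Take the distinct values of the column (N of them), number them 1 to N, and build a map from feature string to number. Even numeric columns need the same treatment to narrow the range of values. The target column is treated as binary.

Feature selection: some features are strongly associated with the target, such as G1 and G2 with G3, while others such as age may be only weakly associated with G3. Choosing suitable features is crucial to the prediction score (accuracy). Which features affect the target can be determined with association rules.

Split data (separating training and test data): 10-fold cross-validation is the usual choice. The samples are randomly split into 10 parts, 9 for training and 1 for testing; this is repeated 10 times and the results are averaged.

Fit model: train the model with the training data. Each kind of model has its own parameters (such as the decision tree depth max_depth=5). This step requires parameter tuning.

Predict: use the trained model to compute the target value G3 for the test data.

Score (accuracy): compare the actual G3 of the test data with the G3 computed by the model, and determine the score from the number of false positives and false negatives.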
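
Three of the steps above (feature extraction, data splitting, scoring) are simple enough to sketch in plain Python. This is an illustration only: in practice a library such as scikit-learn would do this, the fold assignment here is a deterministic stride rather than the random split described above, and the function names are made up.

```python
def encode_column(values):
    """Map each distinct string to a number 1..N (sorted for stable output)."""
    mapping = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def binarize_target(grades, threshold=10):
    """G3 >= 10 is a pass (1), otherwise a fail (0)."""
    return [1 if g >= threshold else 0 for g in grades]


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


codes, fjob_map = encode_column(["teacher", "other", "other", "services"])
labels = binarize_target([11, 11, 12, 14])
```

The four Fjob values and G3 grades are taken from the sample rows above; every one of those students passes, so all four labels are 1.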

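Types of machine learning problems

Regression: predicting product sales
Classification: product categorization, erroneous orders, whether a user will click an ad
Recommender systems: recommending products
NLP: whether products are duplicates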
http://docs.aws.amazon.com/zh_cn/machine-learning/latest/dg/types-of-ml-models.html

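When to use which model

Supervised learning applies when the labels are known.

Semi-supervised learning applies when only part of the data is labeled.

Unsupervised learning applies when there are no labels and the structure comes from latent variables.

Model	Data dimensionality	Vulnerable to attack	Summary	Use cases
Random forest	Not high	No	An advanced version of the decision tree; an ensemble algorithm that is hard to attack. Gives decent results without much parameter tuning. First thing to try	Almost everything
SVM (support vector machine)	High	Very much not	Finds the separating surface between classes. Second choice
Neural network			Uses training samples to refine its parameters gradually. Take predicting height, where one input feature is gender and the output is height: if most of the training samples are tall men, the path from "male" to "tall" in the network gets reinforced. There are many, many layers. Limited by the speed of the computer.	Huge data volumes where the parameters are intrinsically related; also generating data and dimensionality reduction
KNN (nearest neighbors)			Finds the few data points nearest to the point in question and decides its class from theirs. It follows the data entirely and has no mathematical model to speak of. An easily explained model.	Recommendation algorithms
Bayesian	High		Computes the class of the point from conditional probabilities	Spam filters
Decision tree		Yes	It always splits along features; the partition gets finer layer by layer. The foundation of some more useful algorithms.	N/A
Logistic regression			The core of regression is finding the most suitable parameters for a function so that its values are as close as possible to the sample values. For example linear regression finds the best a, b for f(x)=ax+b. Logistic regression fits a function from probability theory. Its results are mediocre, but the model is clear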

Use cases
https://www.zhihu.com/question/26726794

classification->random forest, SVM, neural
clustering
regression
recommendation
spam detection
NLP

Recommender systems
https://www.zhihu.com/question/19971859
https://jlunevermore.github.io/2016/06/25/36.python%E5%AE%9E%E7%8E%B0%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F/
http://blog.csdn.net/u012050154/article/details/51438906

Evaluating classifiers
https://www.zybuluo.com/littlekid/note/66980#评估分类器

Thursday, May 4, 2017

AWS RDS

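Relational databases should be familiar to everyone; this post is mainly about configuration

When creating a new database instance, pay attention to security groups, and choose yes for publicly accessible
If you choose create new security group, a local client such as MySQL Workbench can connect directly, but other AWS resources such as Lambda cannot, so it is not recommended
I recommend default (VPC) instead: other AWS resources can then connect, but local access is blocked by default. Click default under security groups to open the EC2 security group, and add an inbound rule for TCP 3306 (MySQL) allowing My IP.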

Thursday, March 16, 2017

lombok

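A code-generation library (not an ORM tool): you write only the fields and it generates the getters and setters automatically

@Data
= @Getter + @Setter + @ToString + @EqualsAndHashCode + @RequiredArgsConstructor

Others
@ToString
@EqualsAndHashCode
@NoArgsConstructor(access=AccessLevel.PROTECTED)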

boolean getter
@Getter
private boolean isGood; // => isGood()

@Getter
private boolean good; // => isGood()

@Getter
private Boolean isGood; // => getIsGood()

Code
import lombok.Getter;
import lombok.Setter;
@Getter
@Setter
public class User {
private int id;
private String username;
private String email;
public User() {
}
public User(int id, String username, String email) {
this.id = id;
this.username = username;
this.email = email;
}
// If you provide an accessor yourself, Lombok does not generate one for it
public String getEmail() {
return "Email: " + email;
}
}

<=>
public class User {
private int id;
private String username;
private String email;
public User() {
}
public User(int id, String username, String email) {
this.id = id;
this.username = username;
this.email = email;
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getUsername() {
return username;
}
public void setUsername(String username) {
this.username = username;
}
public String getEmail() {
return "Email: " + email;
}
public void setEmail(String email) {
this.email = email;
}
}

@Builder
Student.builder().id(5).build();

Others:
import lombok.Data;

@Data
public class Teacher {
private String name;
private String title;
}

Install Lombok in Eclipse so the generated members show up automatically

https://stackoverflow.com/questions/22310414/how-to-configure-lombok-in-eclipse-luna

ref:

http://xtuer.github.io/java-lombok/

Saturday, March 11, 2017

AWS SQS

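A cloud queue service; clients fetch messages by polling. It is used for communication between different systems and for decoupling them.

Use cases

SQS handles tasks that need not be processed in real time: it turns synchronous calls into asynchronous ones, decouples systems, and smooths traffic peaks. It is a distributed queue (standard queues keep only best-effort ordering; FIFO queues keep strict order). Create a queue named QueueName, then one SQS client sends messages to the queue and another SQS client receives them from it.

Architecture

A queue (holding multiple messages) stores its messages redundantly across multiple Amazon SQS servers.

Short polling

Wait time set to 0. Messages are fetched by sampling; because storage is redundant, a message missed by one receive will be returned by a later one.

Queue parameters

Delivery delay: the queue accepts the message right away, but it stays invisible to consumers for this long after it is sent

Receive message wait time: the maximum time a receive call waits for a message to arrive before returning (long polling), 0 to 20 s. With 20 s, the receiver's SQS client polls again every 20 s even when the queue is empty, and that number shows up as client-side latency. SNS can turn the poll model into a push model.

Retention period: how long a message is kept, from 1 minute to 14 days

The latency of the sendMessage and receiveMessage APIs is tens to low hundreds of milliseconds.

Best practices

Often used together with SNS.
DynamoDB stream->SQS/SNS->Elastic search
System 1->SQS->System 2

In Queue Actions, choose Subscribe Queue to SNS Topic (or Subscribe Queues to SNS Topic).

In Queue Actions, choose Configure Trigger for Lambda Function.

With SQS as a trigger, 5 threads each poll once every 20 seconds, so there are 15 empty receives per minute, or 21,600 SQS requests per day. The free tier is 1 million requests; beyond that, a queue costs roughly 2 dollars a month.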
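
The request count for an idle queue with a Lambda trigger is plain arithmetic, sketched below. The 5 pollers and the 20-second wait come from the text; the 30-day month is an assumption for the monthly figure.

```python
def empty_receives(pollers=5, wait_seconds=20, days_per_month=30):
    """Requests made against an idle queue by the Lambda pollers."""
    per_minute = pollers * 60 // wait_seconds   # each poller: 3 polls/minute
    per_day = per_minute * 60 * 24
    per_month = per_day * days_per_month        # assumed 30-day month
    return per_minute, per_day, per_month


per_minute, per_day, per_month = empty_receives()
```

Under these numbers a single idle queue makes 15 requests a minute and 21,600 a day.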

Low-level API

final SendMessageResult sendMessageResult = sqs.sendMessage(
        new SendMessageRequest(myQueueUrl, "This is my message text."));
final String messageId = sendMessageResult.getMessageId();

SendMessageRequest request = new SendMessageRequest(myQueueUrl, "This is my message text.");
request.setDelaySeconds(5);

final ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
final List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
for (final Message message : messages) {
    System.out.println("Message");
    System.out.println(" MessageId: " + message.getMessageId());
    System.out.println(" ReceiptHandle: " + message.getReceiptHandle());
    System.out.println(" MD5OfBody: " + message.getMD5OfBody());
    System.out.println(" Body: " + message.getBody());
    for (final Entry<String, String> entry : message.getAttributes().entrySet()) {
        System.out.println("Attribute");
        System.out.println(" Name: " + entry.getKey());
        System.out.println(" Value: " + entry.getValue());
    }
}

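Features

Standard queues

Unlimited throughput: standard queues support a nearly unlimited number of transactions per second (TPS) per API action.

At-least-once delivery (duplicates possible): a message is delivered at least once, but occasionally more than one copy is delivered.

Because of the highly distributed architecture that allows nearly unlimited throughput, more than one copy of a message may occasionally be delivered, out of order. Standard queues provide best-effort ordering, so messages are generally delivered in the order they were sent. From my own experience: even though I sent only two messages to SQS, I saw 6 messages in flight, each one duplicated twice over. If the Lambda does heavy computation and writes to S3 or a DB, it is better to use FIFO.

In flight

When a message is picked up by a Lambda, it is moved to in flight; when the visibility timeout expires, it is moved from in flight back to the queue.

FIFO queues

High throughput: by default, with batching, FIFO queues support up to 3,000 messages per second.

Exactly-once processing (deduplicated): a message is delivered once and remains available until a consumer processes and deletes it. Duplicates are not introduced into the queue. This is achieved by having the sender specify a deduplication ID (MessageDeduplicationId) or by enabling the Content-Based Deduplication option.

First-in-first-out delivery: the order in which messages are sent and received is strictly preserved.

Use a standard queue when throughput matters; use FIFO when the order of events matters.

Visibility timeout: immediately after a message is received, it remains in the queue. To prevent other consumers from processing it again, Amazon SQS sets a visibility timeout, a period during which Amazon SQS prevents other consumers from receiving and processing the message. The default visibility timeout is 30 seconds, the minimum 0 seconds, and the maximum 12 hours. This is the console setting and applies to all messages; the API can set it for one or several messages.

It is usually set to 6 times the Lambda timeout, to leave room for 2 retries

To allow your function time to process each batch of records, set the source queue's visibility timeout to at least 6 times the timeout that you configure on your function.

Delay queues

A delay queue lets you postpone the delivery of new messages to the queue for a number of seconds. If you create a delay queue, any message sent to it stays invisible to consumers for the duration of the delay.

Dead-letter queue (DLQ): a dead-letter queue is a queue that other (source) queues can target for messages that cannot be processed (consumed) successfully. When a message's ReceiveCount exceeds the queue's maxReceiveCount, Amazon SQS moves the message to the dead-letter queue. The DLQ of a FIFO queue must also be a FIFO queue; likewise the DLQ of a standard queue must be a standard queue. The main job of a DLQ is handling message failures: it lets you set aside and isolate messages that cannot be processed correctly so you can determine why processing failed.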
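
How the visibility timeout, the receive count, and the dead-letter queue interact can be seen in a tiny in-memory simulation. This is a teaching toy with an explicit clock, not the SQS API: the class `TinyQueue` and its methods are invented for the example.

```python
class TinyQueue:
    """In-memory toy queue with a visibility timeout and a DLQ."""

    def __init__(self, visibility_timeout, max_receive_count):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = []        # dicts: body, visible_at, receive_count
        self.dead_letters = []    # bodies moved to the DLQ

    def send(self, body, now=0):
        self.messages.append(
            {"body": body, "visible_at": now, "receive_count": 0})

    def receive(self, now):
        """Return one visible message, hiding it for the timeout period."""
        for msg in list(self.messages):
            if msg["visible_at"] > now:
                continue  # still in flight
            if msg["receive_count"] >= self.max_receive_count:
                # Already received the maximum number of times: dead-letter it.
                self.messages.remove(msg)
                self.dead_letters.append(msg["body"])
                continue
            msg["receive_count"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """A consumer that succeeds deletes the message."""
        self.messages.remove(msg)
```

With a 15-minute timeout and a maximum receive count of 2, a message that is never deleted is handed out at minute 0 and minute 15 and lands in the DLQ at minute 30.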

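One example I built used SQS to trigger a Lambda. Throwing an exception in the Lambda is enough: the message waits out the visibility timeout, SQS automatically puts it back in the original queue for a retry, and its receive count goes up by 1. Once the count exceeds maxReceiveCount, the message is moved to the DLQ, so if maxReceiveCount is large the Lambda runs many times. The retention period must be longer than maxReceiveCount times the visibility timeout; otherwise the message is deleted before it has a chance to enter the DLQ. With the SQS-Lambda visibility timeout set to 15 minutes (= the Lambda timeout) and maxReceiveCount=2, it takes 30 minutes for the message to reach the DLQ. Every poll of the DLQ (including opening it in the UI) increases the count by 1. After a message is processed successfully it is deleted automatically; the consumer does not need to do it. It is best not to keep a cache in the Lambda (a class variable, or even a static one): the container hosting the Lambda is only restarted after a period with no requests, and only then is the cache refreshed. So if requests keep coming, the cache is never updated.

Summary:

1. The Lambda needs no retry logic of its own, because maxReceiveCount is the retry

2. The retention period and maxReceiveCount together determine the number of retries. Their defaults are 4 days and 500 respectively; with a 15-minute Lambda, 500 is a very large value and the Lambda would keep running for 4 days, so in my experience it is better to set retention to 12 hours and maxReceives to 3.

Use Amazon S3 to manage large Amazon SQS messages. This is available only in the Java SDK

Difference from Kinesis: Kinesis is real-time and can handle big data.

Official site
Java examples on GitHub
SQS->Lambda
Caching in Lambda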

Friday, March 10, 2017

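Second-level cache

The first-level cache is memory; the second-level cache is disk or distributed network memory. It extends the first-level cache, making up for its limited capacity and its inability to be shared: it can be shared by a whole application (disk) or by different servers (distributed network memory), although it is somewhat slower than the first-level cache.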
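
The lookup order (first level, then second level, then the real source) can be sketched as follows. The shared dict stands in for disk or a distributed cache, and the class name, the tiny capacity, and the oldest-first eviction are assumptions chosen to keep the example short.

```python
class TwoLevelCache:
    def __init__(self, shared_store, l1_capacity=2):
        self.l1 = {}               # first level: private, in-process memory
        self.l2 = shared_store     # second level: shared stand-in (assumption)
        self.l1_capacity = l1_capacity

    def get(self, key, loader):
        """Look in L1, then L2, and only then call the slow loader."""
        if key in self.l1:
            return self.l1[key]
        if key in self.l2:
            value = self.l2[key]
        else:
            value = loader(key)
            self.l2[key] = value
        self._put_l1(key, value)
        return value

    def _put_l1(self, key, value):
        if len(self.l1) >= self.l1_capacity:
            # Evict the oldest inserted entry; L2 still holds it.
            self.l1.pop(next(iter(self.l1)))
        self.l1[key] = value
```

Two cache objects built on the same shared store behave like two servers: the second one finds the value in the second level without calling the loader again.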

Monday, March 6, 2017

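Introduction to Guice

Guice also follows the DI (dependency injection) standard. It is lighter than Spring and starts slightly faster, but it is not as full-featured as Spring.
Guice was created to 1. separate dependencies and 2. make testing easier (by injecting fake dependencies), which are of course the goals DI itself set out to achieve.

bind is mainly used to describe relationships between classes (such as inheritance)

Comparison with Spring
@Named -> @Inject

No inheritance -> @Inject, no @Named needed
@Provides -> @Inject; supplying an instance has to be handled separately. You can of course add @Singleton, either on the @Provides method or on the class.

The simplest example

Simulating a system for buying food in a store with a credit card: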

public class App
{
public static void main( String[] args )
{
Injector injector = Guice.createInjector(new BillingModule());
BillingService billingService = injector.getInstance(BillingService.class);
billingService.chargeOrder("pizza", new CreditCard());
}
}

public class BillingModule extends AbstractModule {
@Override
protected void configure() {}
}

public class CreditCardProcessor {
String name = "Paypal";
}

public class BillingService {
private final CreditCardProcessor processor;
private final TransactionLog transactionLog;

@Inject
BillingService(CreditCardProcessor processor, TransactionLog transactionLog) {
this.processor = processor;
this.transactionLog = transactionLog;
}

public int chargeOrder(String order, CreditCard creditCard) {
System.out.println("processing purchase");
System.out.println(processor.toString());
System.out.println(transactionLog.toString());
return 0;
}
}

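1. Bindings in the Module are not mandatory; they only need to be configured there when there are class relationships (such as inheritance).
2. Guice injection usually happens in the constructor, so private members can be declared final, which is safer. With @Inject, Guice injects automatically; there is no need to add @Named to CreditCardProcessor (the Spring way)
3. Guice first builds an Injector from the Module, then obtains the entry class and starts the service.
Compare Spring's startup, which is very similar; the difference is whether class relationships are expressed in a class or in XML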
ApplicationContext context = new ClassPathXmlApplicationContext("com/vtasters/beans.xml");
Video t = (Video) context.getBean("video");

Binding

When you inject an interface, the configuration must say which implementation class stands in for it; that is what a binding does.
@Inject
BillingService(CreditCardProcessor processor, TransactionLog transactionLog)
Here CreditCardProcessor is an interface rather than an implementation class, so its implementation must be specified.

Add bind(CreditCardProcessor.class).to(PaypalCreditCardProcessor.class); to configure()

interface CreditCardProcessor {
String getName();
}

public class PaypalCreditCardProcessor implements CreditCardProcessor{
String name = "Paypal";
public String getName(){return name;}
}

Another way is a just-in-time binding with @ImplementedBy
@ImplementedBy(PaypalCreditCardProcessor.class)
public interface CreditCardProcessor
Then no bind call is needed
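
What a binding buys you can be shown with a toy injector. This is not Guice and not Java: it is a deliberately small Python sketch of the same idea (resolve constructor parameters by type, substituting the bound implementation for an interface), and every name in it is invented for the example.

```python
import inspect


class Injector:
    """Toy constructor injector: resolves parameters by type annotation."""

    def __init__(self):
        self._bindings = {}

    def bind(self, interface, implementation):
        self._bindings[interface] = implementation

    def get_instance(self, cls):
        impl = self._bindings.get(cls, cls)
        if impl.__init__ is object.__init__:
            return impl()  # no constructor of its own, nothing to inject
        params = list(inspect.signature(impl.__init__).parameters.values())[1:]
        return impl(*[self.get_instance(p.annotation) for p in params])


class CreditCardProcessor:              # plays the role of the interface
    def get_name(self):
        raise NotImplementedError


class PaypalCreditCardProcessor(CreditCardProcessor):
    def get_name(self):
        return "Paypal"


class BillingService:
    def __init__(self, processor: CreditCardProcessor):
        self.processor = processor


injector = Injector()
injector.bind(CreditCardProcessor, PaypalCreditCardProcessor)
billing_service = injector.get_instance(BillingService)
```

BillingService only names the interface; the binding decides that a PaypalCreditCardProcessor is what actually gets passed in, which is also how a fake can be swapped in for tests.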

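Binding annotations

Use annotations to make binding code more readable

The first way is the @interface approach: create a custom annotation with @interface.
Adding @PayPal makes the code easier to read: you can tell that CreditCardProcessor is bound to PaypalCreditCardProcessor without looking up the binding. It does add extra code (the @interface type)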

@BindingAnnotation
@Target({ FIELD, PARAMETER, METHOD })
@Retention(RUNTIME)
public @interface PayPal {}

bind(CreditCardProcessor.class).annotatedWith(PayPal.class).to(PaypalCreditCardProcessor.class);

@Inject BillingService(@PayPal CreditCardProcessor processor, TransactionLog transactionLog)

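The second way is the @Named approach, an inline annotation. It saves creating an @interface, but the compiler cannot catch a forgotten binding or a typo in the binding string, so Guice does not recommend it

Add the implementation class: public class DatabaseTransactionLog implements TransactionLog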

bind(TransactionLog.class).annotatedWith(Names.named("Database")).to(DatabaseTransactionLog.class);

@Inject BillingService(@PayPal CreditCardProcessor processor,
@Named("Database") TransactionLog transactionLog)

Instance bindings

Bind an instance of a String constant through an annotation; for compound types use @Provides instead
bind(String.class).annotatedWith(Names.named("url")).toInstance("jdbc:mysql://localhost/pizza");
A better way to bind a constant
bindConstant().annotatedWith(Names.named("dbname")).to("MySQL");

@Provides methods

Instead of a binding you can use @Provides; in the example below the provideStore method replaces bind(Store.class).to(RegularStore.class)

public class BillingModule extends AbstractModule {
@Provides
Store provideStore() {
RegularStore store = new RegularStore();
store.setId("123");
return store;
}
}

@Inject BillingService(@PayPal CreditCardProcessor processor, @Named("Database") TransactionLog transactionLog, Store store)

@Provides methods can be annotated just like bindings:
@Provides
@Named("regular")
Store provideStore()

@Inject BillingService(@PayPal CreditCardProcessor processor, @Named("Database") TransactionLog transactionLog, @Named("regular") Store store)

Provider bindings

If a @Provides method grows too large and needs to become its own class, implement Provider, and register it with bind.
public class RegularEmployeeProvider implements Provider<Employee> {
public Employee get() {
RegularEmployee employee = new RegularEmployee("Sue");
return employee;
}
}

@Inject BillingService(@PayPal CreditCardProcessor processor, @Named("Database") TransactionLog transactionLog, @Named("regular") Store store, Employee employee)

bind(Employee.class).toProvider(RegularEmployeeProvider.class);

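Another way is a just-in-time binding with @ProvidedBy
@ProvidedBy(RegularEmployeeProvider.class)
public interface Employee
Then no bind call is needed

Scopes (e.g. singleton)

The same instance is supplied every time one is needed; it is a singleton for the lifetime of the application.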
@Provides
@Named("regular")
@Singleton
Store provideStore()

Both services below get the same RegularStore
BillingService billingService = injector.getInstance(BillingService.class);
BillingService billingService2 = injector.getInstance(BillingService.class);

The above is the @Provides approach; you can also use a binding or annotate RegularStore itself. If they conflict, the binding wins.
bind(Store.class).to(RegularStore.class).in(Singleton.class);

@Singleton
public class RegularStore implements Store

Classes without inheritance

A class with no inheritance can have its instance supplied by @Provides; as mentioned before, no binding is needed
@Provides
@Singleton
Cache provideCache(){
Cache cache = new Cache();
cache.setName("mycache");
return cache;
}
@Inject BillingService(@PayPal CreditCardProcessor processor, @Named("Database") TransactionLog transactionLog, @Named("regular") Store store, Employee employee, Cache cache)

ref:
Official docs
HeroModule
provider design pattern
HelloGuiceServiceImpl
singleton

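Load testing

JMeter is used for load testing and performance testing; use it to find performance and load-handling problems, especially performance problems in web applications.

The idea is to write one or more requests by hand; JMeter then uses multiple threads to generate many such requests and sends them to the service under test. The test runs in stages, for example 5 requests in the first minute, 10 in the second, and so on. Typically latency stays roughly constant at first, then rises sharply past a certain point.

Throughput (TPS: transactions per second): when adding connections (requests) makes latency rise sharply, throughput stops growing or even drops; the server has hit a bottleneck. That is the maximum TPS, and it tells you whether the system can handle a given TPS such as 800.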
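
The relationship between concurrency, latency, and throughput can be sketched with Little's law (throughput = concurrent requests / latency). The sample numbers below are invented for illustration, not measurements, and the function names are hypothetical.

```python
def throughput_tps(concurrency, latency_ms):
    """Little's law: requests in flight divided by time per request."""
    return concurrency * 1000 / latency_ms


def find_max_tps(samples):
    """samples: list of (concurrency, latency_ms); return the best step."""
    best = max(samples, key=lambda s: throughput_tps(*s))
    return best, throughput_tps(*best)


# Hypothetical staged load test: latency is flat at first, then climbs.
SAMPLES = [(50, 100), (100, 125), (200, 250), (400, 800)]
best_step, max_tps = find_max_tps(SAMPLES)
```

In this made-up run, doubling the load from 100 to 200 connections doubles the latency, so throughput stays flat; at 400 connections it drops, which is the bottleneck pattern described above.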

ref:
JMeter

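How AWS Elasticsearch works

A newspaper company at first stored its articles in a database column called contents, but found multi-keyword search hard to do; even where possible (MySQL full-text search) it was slow and hard to extend across the whole table (for example also searching the title and tags columns). This is where full-text search engines come in. They provide:

1. Full-text search: search across multiple columns
2. Multiple keywords: without the hassle of writing SQL LIKE clauses
3. Higher speed
4. Advanced search: for example, weighting some keywords higher within a single search

Company use cases:
GitHub: searching code and check-in logs
StackOverflow: searching questions and answers
HotelTonight: this is structured data, which may raise doubts, but it uses ES for multi-column search such as price + rating + location
Wikipedia: searching articles, used for autocomplete

How it works

As in the project for the ADB course. First tokenize the documents, then record for each term (normalized and lowercased) the documents it appears in (an inverted index)
data Doc1, Doc2
engineer Doc1
program Doc2
These results are written to multiple shards. A shard is the Lucene index that actually stores the data (like a DB), and a node also holds a replica of some other shard: for example data is stored in primary shard 1 and engineer in primary shard 2, while the replica of primary shard 1 may sit alongside engineer. This lets each shard be searched independently, which speeds things up. The results are then merged and sorted, and finally the original documents are fetched (based on the merged results).
Updates work like a hashmap: the term determines the shard, the shard is updated and written to file, and ES updates the replicas asynchronously. Deletes are soft deletes: document version numbers are maintained and old versions are filtered out at query time.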
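
The inverted index itself is easy to sketch. The example below rebuilds the three-term index shown above from two tiny documents and runs an OR-style multi-keyword search ranked by the number of matched terms; sharding, replicas, and real relevance scoring are left out, and the document texts are made up to match the index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """OR search: rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


DOCS = {"Doc1": "Data engineer", "Doc2": "Program data"}  # made-up texts
INDEX = build_index(DOCS)
```

A query for "data program" matches both documents, with Doc2 first because it contains both terms.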

Basic concepts

Corresponding concepts in ES and MySQL

MySQL	Elasticsearch
Database	Index
Table	Type
Row	Document
Column	Field
Schema	Mapping
Index	Everything Indexed by default
SQL	Query DSL

Writing data:
client.index({
index : 'test',
type : 'article',
id : '100',
body : {
title : '什么是 JS？',
slug :'what-is-js',
tags : ['JS', 'JavaScript', 'TEST'],
content : 'JS 是 JavaScript 的缩写！',
update_date : '2015-12-15T13:05:55Z',
}
})

Full-text search for JS:
client.search({
index : 'test',
type : 'article',
q : 'JS',
});

Search results: the results are all in hits

Advanced search with the DSL (similar to SQL):
1. Search only one field (e.g. message): match
"query": {
"match" : {
"message" : "this is a test"
}
}
2. Exact lookup: term
"query": {
"term" : { "user" : "Kimchy" }
}
3. Range lookup: range
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20,
"boost" : 2.0
}
}
}
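boost is a weight, default 1.0; the value 2.0 here gives this query a higher weight

AWS

ES is built on top of Lucene; AWS hosts ES and offers it as its own product. The steps are
1. Create a domain, e.g. named movies
2. Upload JSON data

3. Search: full-text search for nightmare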
curl -XGET 'search-movies-4oy.us-west-1.es.amazonaws.com/movies/_search?q=nightmare'

The AWS example has 4 steps; set the domain to public access and set a master user name and password

Query examples:

Search one index

curl -XGET -u 'master-user:master-user-password' 'domain-endpoint/movies/_search?q=mars&pretty=true'

Search all indexes

curl -XGET -u 'master-user:master-user-password' 'domain-endpoint/_search?q=mars&pretty=true'

Search all indexes for multiple keywords (behaves like an OR search)

curl -XGET -u 'master-user:master-user-password' 'domain-endpoint/_search?q=mars%20Jack&pretty=true'

Search all indexes across multiple fields

curl -XGET -u 'master-user:master-user-password' 'domain-endpoint/_search?q=title:mars%20AND%20actors:Jack&pretty=true'

Ref

How it works

AWS example

AWS search example

Sunday, March 5, 2017

AWS CloudWatch

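AWS CloudWatch is Amazon's operational monitoring and alerting system, used to monitor a server's CPU, memory, IO reads and writes, throughput alarms, server start and stop, and so on.

Creating a custom dashboard:
All metrics, such as those of RDS and SQS, are collected here. Just create a custom dashboard, for example one about SQS containing msgSent, msgReceived, and so on, and then share it to a wiki.

ref:
Introduction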

AWS Kinesis Stream

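AWS Kinesis Stream collects and processes large streams of data records in near real time. Think of it as a near-real-time data pipeline: where data used to be transferred by FTP, it is now sent through an API as a byte stream. The average propagation delay is 1 second.

Use cases

Used for rapid and continuous data intake and aggregation
The data can include IT infrastructure logs, application logs, social media, market data feeds, and web clickstream data. Because the response time for data intake and processing is real-time, the processing is typically lightweight.

Log intake: push system and application logs (= metrics)
Real-time data analytics: process website clickstreams in real time
Aggregate data in real time, then load the aggregated data into a data warehouse or map-reduce cluster

Architecture

The key concept is the shard: a stream can be given multiple shards, and AWS Kinesis needs some time to allocate them for the stream. There are producers and consumers, which produce and consume the data respectively.

Kinesis Data Stream:

The retention period is the length of time data records remain accessible after they are added to the stream. A stream's retention period is set to the default of 24 hours at creation, and can be extended to at most 7 days.

A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. The total capacity of the stream is the sum of the capacities of its shards.
A partition key is used to group data by shard within a stream. An MD5 hash function maps the partition key to a 128-bit integer value, which maps the associated data record to a shard. The application must specify a partition key for every record. You specify the number of shards when you create the stream.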

number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)
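
The sizing formula above is a one-liner in code. Rounding up to a whole shard and the floor of one shard are additions made here (a stream cannot have a fractional shard); the example bandwidth figures are made up.

```python
import math


def number_of_shards(incoming_write_kib_per_s, outgoing_read_kib_per_s):
    """Shards needed: 1 MiB/s of writes and 2 MiB/s of reads per shard."""
    needed = max(incoming_write_kib_per_s / 1024,
                 outgoing_read_kib_per_s / 2048)
    return max(1, math.ceil(needed))  # rounding up is an assumption


shards = number_of_shards(3000, 5000)
```

For 3000 KiB/s of writes and 5000 KiB/s of reads the write side dominates (about 2.93 versus 2.44), so three shards are needed.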

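Consumers are called Amazon Kinesis Data Streams Applications

The Kinesis Client Library (KCL, used by consumers as the client of the Kinesis service) takes care of many complex tasks associated with distributed computing, such as load balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to resharding. The KCL lets you focus on writing record-processing logic.
It stores control data in an Amazon DynamoDB table, creating one table per application that is processing data. In the sample, choose the KinesisDataVisSampleApp-KCLDynamoDBTable-[randomString] table.
The table has two entries, indicating the particular shard (leaseKey), the position in the stream (checkpoint), and the application reading the data (leaseOwner).

Encryption uses an AWS KMS master key

Best practices

The AWS sample counts URLs in real time (2s window).

Analyzing real-time stock data

Low-level API

Create and describe a stream; the time stream creation takes is mainly spent allocating shards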

create-stream --stream-name Foo --shard-count 1

describe-stream --stream-name Foo

Response:

{
  "StreamDescription": {
    "StreamStatus": "ACTIVE",
    "StreamName": "Foo",
    "StreamARN": "arn:aws:kinesis:us-west-2:account-id:stream/Foo",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "EndingHashKey": "170141183460469231731687303715884105727",
          "StartingHashKey": "0"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "495469866831355442865074579357546397794"
        }
      }
    ]
  }
}

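The default shard limit for an AWS region is 500 shards

A single shard can ingest up to 1 MiB of data per second (including partition keys) or 1,000 records per second for writes.

Each PutRecord call requires the stream name, a partition key, and the data record the producer is adding to the stream.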

put-record --stream-name Foo --partition-key 123 --data testdata

Response:

{ "ShardId": "shardId-000000000000", "SequenceNumber": "49546156785154" }
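
Which shard a partition key lands on follows from the MD5 mapping described earlier. The sketch below hashes the key to a 128-bit integer and picks a shard by range; the assumption that the shards split the hash space evenly is made here for illustration (a real stream reports each shard's actual HashKeyRange in describe-stream).

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key):
    """MD5 of the partition key, read as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def shard_for_key(partition_key, shard_count):
    """Index of the shard owning the key, assuming evenly split ranges."""
    width = HASH_SPACE // shard_count
    return min(hash_key(partition_key) // width, shard_count - 1)
```

Records with the same partition key always hash to the same value, so they always go to the same shard; that is what keeps per-key ordering.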

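To put data into a stream, you must specify the stream name, a partition key, and the data blob to add. GetRecords can retrieve up to 10 MiB of data and up to 10,000 records per call from a single shard. Each shard supports a maximum total read rate of 2 MiB per second through GetRecords.

First obtain an iterator (a pointer) into the shard

get-shard-iterator --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON --stream-name Foo

Take the returned iterator ID and pass it to the GetRecords API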

get-records --shard-iterator AAAAAAAAAAHSywlj

Response:

{ "Records":[ { "Data":"dGVzdGRhdGE=", "PartitionKey":"123", "ApproximateArrivalTimestamp": 1.441215410867E9,
"SequenceNumber":"49544985256907370027570885864065577703022652638596431874" } ],
"MillisBehindLatest":24000, "NextShardIterator":"AAAA

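Data is base64-encoded

NextShardIterator is the next iterator; it is returned even when the stream has no new data

An empty Records array suggests everything so far has been read, though GetRecords can also return an empty array while data remains (see the FAQ section)

MillisBehindLatest indicates how far the read position is behind the tip of the stream (the latest data). A value of zero means record processing has caught up and there are no new records to process at this moment

Checkpointing means recording the point in the stream up to which data records have been consumed and processed, so that if the application crashes, the stream is read from that point instead of from the beginning.
Just record the SequenceNumber, then specify that checkpoint when obtaining an iterator.
GetShardIterator::StartingSequenceNumber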

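Wrapper layers

Producer - KPL

The Kinesis Producer Library (KPL) simplifies producer application development. The KPL is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream. It achieves high throughput by buffering records before sending them to Kinesis Data Streams. The KPL can incur an additional processing delay of up to RecordMaxBufferedTime (user-configurable) within the library. Larger values of RecordMaxBufferedTime give higher packing efficiency and better performance.

Consumer - KCL
Mentioned above; this is the library that consumer applications can use

Consumer - Kinesis Firehose - a higher-level wrapper

Consumers can also be built with Amazon Kinesis Data Firehose, AWS Lambda, or Amazon Kinesis Data Analytics. You can use Kinesis Data Firehose to read and process records from a Kinesis stream. Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations (such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk). You no longer need to write Java code to load the data into the destination.

For example, with Redshift as the destination

{"TICKER_SYMBOL":"QXZ","SECTOR":"HEALTHCARE","CHANGE":-0.05,"PRICE":84.51}

create table firehose_test_table
(
TICKER_SYMBOL varchar(4),
SECTOR varchar(16),
CHANGE float,
PRICE float
);

Firehose can also be used on its own to deliver a data stream to a supported destination such as S3. It is very similar to SQS in that it provides a time-limited buffer for passing information between systems; the difference is real-time vs non-real-time, and big data vs messages.
Compared with Kinesis streams, Firehose's buffer size for S3 is only 128 MB, much smaller than Kinesis; its buffer interval is only 60 s to 900 s, much shorter than Kinesis; and its destinations are limited to a few AWS products, whereas a Kinesis stream can feed any downstream product. Kinesis is similar to Kafka.
A best practice from a music application, sending metrics to a third-party partner: DDB->lambda->firehose->S3->Redshift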

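Features

A single shard -> a single worker thread fetches the data. When the data volume is large, several shards can be aggregated and fed into a next-level stream; aggregating through two levels of streams processes the data faster.

Resharding is supported, which lets you adjust the number of shards in the stream to adapt to changes in the data rate.

Enhanced fan-out is an Amazon Kinesis Data Streams feature that lets consumers receive records from a data stream with dedicated throughput of up to 2 MiB of data per second per shard. The throughput is dedicated, meaning consumers that use enhanced fan-out do not have to contend with other consumers receiving data from the stream. Kinesis Data Streams pushes data records from the stream to consumers that use enhanced fan-out, so these consumers do not need to poll for data.

Amazon Kinesis Data Streams dimensions and metrics are recorded with CloudWatch

FAQ

(Termination condition for reading) GetRecords returns an empty Records array even when there is data in the stream

There may be no data near the part of the shard the ShardIterator points to. The situation is subtle, but it is a necessary design tradeoff to avoid unbounded seek time (latency) when retrieving data. A consuming application should therefore loop, calling GetRecords and handling empty records as a matter of course. In a production scenario, the loop should exit only when NextShardIterator is NULL. When NextShardIterator is NULL, the current shard has been closed and the ShardIterator value would otherwise point past the last record. If the consuming application never calls SplitShard or MergeShards, the shard stays open and calls to GetRecords never return NULL for NextShardIterator.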
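
The loop described above can be sketched against a fake GetRecords. Nothing here talks to AWS: `fake_get_records` and the three canned pages are invented so that the example runs on its own, and the records use the base64 form shown in the CLI response earlier. A real consumer would also pause between calls.

```python
import base64


def read_shard(get_records, shard_iterator):
    """Read until NextShardIterator is None, tolerating empty pages."""
    payloads = []
    while shard_iterator is not None:
        response = get_records(shard_iterator)
        for record in response["Records"]:  # may legitimately be empty
            payloads.append(base64.b64decode(record["Data"]).decode("utf-8"))
        shard_iterator = response.get("NextShardIterator")
    return payloads


# Canned pages standing in for a closed shard (assumption for the demo).
PAGES = {
    "it-0": {"Records": [], "NextShardIterator": "it-1"},
    "it-1": {"Records": [{"Data": "dGVzdGRhdGE="}], "NextShardIterator": "it-2"},
    "it-2": {"Records": [], "NextShardIterator": None},
}


def fake_get_records(shard_iterator):
    return PAGES[shard_iterator]


payloads = read_shard(fake_get_records, "it-0")
```

The first page is empty even though data follows, so stopping at the first empty Records array would have missed the record; the loop ends only when the iterator runs out.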

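Java code example

The producer below publishes a stock trade to AWS through kinesisClient. The stream name is defined by the producer; it acts as the ID of the stream, and through it consumers can follow only the streams they care about.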
private static void sendStockTrade(StockTrade trade, AmazonKinesis kinesisClient, String streamName) {
byte[] bytes = trade.toJsonAsBytes();

PutRecordRequest putRecord = new PutRecordRequest();
putRecord.setStreamName(streamName);

// We use the ticker symbol as the partition key, explained in the Supplemental Information section below.
putRecord.setPartitionKey(trade.getTickerSymbol());
putRecord.setData(ByteBuffer.wrap(bytes));

kinesisClient.putRecord(putRecord);
}

Consumer:

The consumer obtains a record through kinesisClient and can then reconstruct the trade
StockTrade trade = StockTrade.fromJsonAsBytes(record.getData().array());
stockStats.addStockTrade(trade);

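Difference from SQS:

Kinesis is real-time and can handle big data, whereas SQS handles non-real-time tasks, turning synchronous calls asynchronous to decouple systems and smooth peaks.

ref:
Official guide
Firehose official guide
Distributed publish-subscribe system
Concepts
Producers
Consumers
Access permissions
Difference from SQS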
