Atom Feed

# myersguo's blog (last updated 2017-12-01)

## Python Multiple Thread (2017-11-26, /2017/11/26/python-multiple-thread)

### Defining a thread

- `start_new_thread`:

```python
thread.start_new_thread(f, ())   # Python 2: the args tuple is required
```

- Using the `Thread` class:

```python
t = threading.Thread(target=f)
t.start()
t.join()
```

What `t.join` does internally:

```python
self.__block.acquire()        # acquire the condition's lock
while not self.__stopped:     # keep waiting while the thread has not finished
    self.__block.wait()       # release the lock and wait to be notified
```

Internally a thread executes its `run` method, and the default `run` simply calls the `target` function.

- Subclassing the `Thread` class:

```python
class MyThread(threading.Thread):
    def __init__(self):
        # threading.Thread.__init__(self)
        super(MyThread, self).__init__()

    def run(self):
        print 'thread running here'
```

### Thread synchronization

The basic building block is the lock.
Several synchronization primitives are built on top of it: `Condition`, `Semaphore` (a kind of Condition), `Event` (a kind of Condition), and `Queue`.

`Condition` is implemented with a reentrant lock (`RLock`).

```python
def producer(cv):
    with cv:                # cv.acquire()
        print 'produce'
        cv.notifyAll()      # cv.notify(); cv.release() on exit

def consumer(cv):
    with cv:                # cv.acquire()
        cv.wait()           # wait; cv.release() on exit
        print 'consume'
```

`Event`:

```
event.set()    sets the event flag
event.clear()  clears the event flag
event.wait()   blocks until the flag is set
```
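A minimal usage sketch of `Event`, not from the original post (the worker/sleep arrangement is just for illustration):

```python
import threading
import time

event = threading.Event()

def worker():
    print 'waiting for the event'
    event.wait()               # blocks until event.set() is called
    print 'event received, doing work'

t = threading.Thread(target=worker)
t.start()
time.sleep(1)
event.set()                    # set the flag; all waiters are released
t.join()
event.clear()                  # reset the flag so it can be reused
```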
## About FastText (2017-11-23, /2017/11/23/about-fasttext)

### Background

#### Precision and recall

The [Wikipedia article](https://en.wikipedia.org/wiki/Precision_and_recall) explains this very well; a rough translation:

> Suppose a picture collection contains 12 pictures of dogs and the rest are other animals. A recognition program returns 8 pictures it believes are dogs.
> If 5 of those 8 really are dogs, then precision is 5/8, and recall is 5/12 (there were 12 dogs, only 5 of which were found).

Precision ranges from 0 to 1; higher is better.
Recall ranges from 0 to 1; higher is better.

#### Representing text and words mathematically

The approach that dominates machine learning today is to represent text and words as vectors.

For example, download the cooking.stackexchange dataset used by the fastText tutorial:

```sh
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
head -n 12404 cooking.stackexchange.txt > cooking.train
tail -n 3000 cooking.stackexchange.txt > cooking.valid
```

The downloaded data is the "training set": it is already labeled, and each line consists of

category, text content

The **steps** of **text classification** are:

training set -> preprocessing -> feature extraction (usually a vector representation) -> model building -> classification with the model

### fasttext

A walk-through of the official example.

Train a classifier model:

> ./fasttext supervised -input cooking.train -output model_cooking -epoch 25

Test the classifier interactively:

> ./fasttext predict model_cooking.bin -

Test it against the validation set:

> ./fasttext test model_cooking.bin cooking.valid

The test output is reported as precision and recall.
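To make those precision/recall numbers concrete, a small illustrative sketch of the dog-picture example above (not part of the original post):

```python
def precision_recall(retrieved, relevant):
    """Both arguments are sets of item ids."""
    hits = len(retrieved & relevant)
    return float(hits) / len(retrieved), float(hits) / len(relevant)

dogs = set(range(12))                        # the 12 actual dog pictures
predicted = {0, 1, 2, 3, 4, 20, 21, 22}      # classifier returns 8 pictures, 5 of them dogs
print precision_recall(predicted, dogs)      # (0.625, 0.4166...), i.e. 5/8 and 5/12
```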
The **foundation** of **text training** is samples: without training data there is nothing to learn from. Some public text-classification datasets:

Wikipedia dumps: https://dumps.wikimedia.org/ — for example the Chinese dump: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
dbpedia / yelp_review_full / amazon_review_full / sogou_news: https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M?spm=5176.100239.blogcont128589.13.L2tfdg
Sogou Lab corpora: http://www.sogou.com/labs/resource/list_yuliao.php
[THUCTC](http://thuctc.thunlp.org/)
…

### Theory

### References

[fasttext vs word2vec](http://www.jianshu.com/p/b7ede4e842f1)
[fasttext supervised tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html#content)
[Bayes_theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem)
[Positive_and_negative_predictive_values](https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values)
[Binary_classification](https://en.wikipedia.org/wiki/Binary_classification)
[Sensitivity_and_specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)
[Precision_and_recall](https://en.wikipedia.org/wiki/Precision_and_recall)
[word2vec experiments on Chinese/English Wikipedia corpora](http://www.52nlp.cn/%E4%B8%AD%E8%8B%B1%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E8%AF%AD%E6%96%99%E4%B8%8A%E7%9A%84word2vec%E5%AE%9E%E9%AA%8C)

## thinking in feed timeline (2017-11-22, /2017/11/22/how-to-realize-feed)

### timeline

Two kinds of timeline: the user timeline (a user's own profile page) and the home timeline (the merged stream of everyone the user follows).

Two ways to build them:

### Push

When a user posts, push the post to every follower (i.e. update each follower's home timeline):

```python
def get_followers(uid):
    return redis.smembers('followers:' + uid)

def write_post(uid, data):
    id = redis.incr('post:uuid:')              # generate a post id
    redis.hmset('posts:%s' % id, data)         # store the post content in a hash
    followers = get_followers(uid)             # fetch the follower list
    for follower in followers:
        # append the id to each follower's home timeline
        # (a sorted set; pseudocode — zadd also takes a score, e.g. the id or a timestamp)
        redis.zadd('home_timeline:%s' % follower, id)
```

Fetching the timeline:

```python
def get_timeline(uid):
    # latest 30 posts of the home timeline
    post_ids = redis.zrevrange('home_timeline:%s' % uid, 0, 29)
    ret = []
    for id in post_ids:
        ret.append(redis.hgetall('posts:%s' % id))   # fetch each post's content
    return ret
```

Here the push writes ids into a sorted set; a plain queue works too: on write, `lpush` the id onto each follower's timeline list, and the follower reads it back with `lrange`.

### Pull

```python
# write
def write_post(uid, data):
    id = redis.incr('post:uuid:')            # generate a post id
    redis.hmset('posts:%s' % id, data)       # store the post content in a hash
    redis.sadd('posts:user:%s' % uid, id)    # record the post under its author only

def get_following(uid):
    return redis.smembers('following:' + uid)

# read
def get_timeline(uid):
    followings = get_following(uid)          # everyone the user follows
    keys = ['posts:user:%s' % f for f in followings]
    # merge the followees' post sets into a temporary sorted set, then read it back
    redis.zunionstore('home_timeline:tmp:%s' % uid, keys)
    return redis.zrevrange('home_timeline:tmp:%s' % uid, 0, 29)
```

### Problems

Problems with push:
If a user has 10 million followers, a single post has to be written into 10 million timelines, which is very slow. Mitigations:

* Rank followers by activity: push to active followers first and push to the rest asynchronously, with a delay.
* When the follower count is too large, skip the push entirely; those followers' timelines fetch the latest posts from the big accounts they follow at read time and merge them in.

Problems with pull:
Every read is a large union computation, whereas with push a single get returns the whole list. Mitigation:

* Rank followees by activity and only fetch records from a limited number of active ones.

### References

[The Architecture Twitter Uses To Deal With 150M Active Users, 300K QPS, A 22 MB/S Firehose, And Send Tweets In Under 5 Seconds](http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html)
[The Infrastructure Behind Twitter: Scale](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html)
[Redis timeline](https://segmentfault.com/a/1190000004650279)

## django celery transaction error (2017-11-12, /2017/11/12/django-celery-transaction)

Background: a request submits a task that is processed asynchronously.

Async solution: `celery`.

The problem:
When the `celery` task runs, it looks the object up by id and then does its work, but occasionally it fails with:

`matching query does not exist`

Checking the database, the row is clearly there. So why the error? It can only mean celery processed the task before the DB commit. From the references below:

> The data will only be externally accessible when the view finishes its execution, and the transaction is committed. This usually will happen **after** Celery executes the task.

Django autocommits by default, but here the commit still lands after celery's `.delay()` call, so the error shows up intermittently. **How to fix it?**

On Django 1.9+, use `transaction.on_commit`: `transaction.on_commit(lambda: do_stuff.delay(my_data.pk))`

On versions before 1.9, use [django-transaction-hooks](https://github.com/carljm/django-transaction-hooks).

Another option (still verifying): commit manually with `transaction.commit()`.
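A hedged sketch of the Django ≥ 1.9 fix in context; the model and task names are made up for illustration:

```python
# Hypothetical Django view: enqueue the celery task only after the surrounding
# transaction commits, so the worker is guaranteed to see the committed row.
from django.db import transaction
from django.http import HttpResponse

from .models import Job            # hypothetical model
from .tasks import process_job     # hypothetical celery task

def submit_job(request):
    job = Job.objects.create(owner=request.user)
    transaction.on_commit(lambda: process_job.delay(job.pk))
    return HttpResponse('queued job %d' % job.pk)
```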
### References

[Dealing with database transactions in Django + Celery](https://www.hypertrack.com/blog/2016/10/08/dealing-with-database-transactions-in-django-celery/)
[Database concurrency in Django the right way](https://www.vinta.com.br/blog/2016/database-concurrency-in-django-the-right-way/)
[django-transaction-hooks](https://django-transaction-hooks.readthedocs.io/en/latest/)

## A quick look at the ironic-python-agent source code (2017-11-09, /2017/11/09/ironic-agent-study)

The entry point, cmd/run:

```python
from ironic_python_agent import agent

def run():
    agent.IronicPythonAgent(api_url, agent.Host, ....).run()
```

The agent:

```python
# extensions/base.py
class BaseAgentExtension(object):
    def __init__(self, agent=None):
        super(BaseAgentExtension, self).__init__()
        self.agent = agent
        # map command_name -> bound method for every member that declares one
        self.command_map = dict(
            (v.command_name, v)
            for k, v in inspect.getmembers(self)
            if hasattr(v, 'command_name')
        )

    def execute(self, command_name, **kwargs):
        cmd = self.command_map.get(command_name)
        if cmd is None:
            raise
        return cmd(**kwargs)


class ExecuteCommandMixin(object):
    def __init__(self):
        self.command_lock = threading.Lock()
        self.command_results = collections.OrderedDict()
        self.ext_mgr = None

    # commands are always named "<extension>.<name>"
    def split_command(self, command_name):
        command_parts = command_name.split('.', 1)
        return (command_parts[0], command_parts[1])

    def get_extension(self, extension_name):
        ext = self.ext_mgr[extension_name].obj
        ext.ext_mgr = self.ext_mgr
        return ext

    def execute_command(self, command_name, **kwargs):
        with self.command_lock:
            extension_part, command_part = self.split_command(command_name)
            try:
                ext = self.get_extension(extension_part)
                result = ext.execute(command_part, **kwargs)
            except KeyError:
                ...
            self.command_results[result.id] = result


class IronicPythonAgent(base.ExecuteCommandMixin):
    def __init__(self, api_url, advertise_address, ...):
        super(IronicPythonAgent, self).__init__()
        # extensions are loaded via an ExtensionManager over this namespace
        self.ext_mgr = extensions.ExtensionManager(
            namespace='ironic_python_agent.extensions',
            invoke_on_load=True,
            invoke_kwds={'agent': self})
        if self.api_url:
            self.api_client = ironic_api_client.APIClient(self.api_url)
            # heartbeat back to the ironic API
            self.heartbeater = IronicPythonAgentHeartbeater(self)

    def run(self):
        """Start the agent."""
        self.started_at = _time()
        ...
        wsgi = simple_server.make_server(
            self.listen_address.hostname,
            self.listen_address.port,
            self.api,
            server_class=simple_server.WSGIServer)
        if not self.standalone and self.api_url:
            # start the heartbeat towards the server side
            self.heartbeater.start()
        try:
            wsgi.serve_forever()
        except:
            ...
```

The application controller:

```python
# api/controllers/v1/command.py
class CommandController(rest.RestController):

    @wsme_pecan.wsexpose(CommandResult, types.text, body=Command)
    def post(self, wait=None, command=None):
        """Post a command for the agent to run."""
        if command is None:
            command = Command()
        agent = pecan.request.agent
        result = agent.execute_command(command.name, **command.params)
        if wait and wait.lower() == 'true':
            result.join()
        return result
```

Commands are executed as plugins (extensions). The agent starts an API server to receive commands and starts the heartbeat at the same time; each incoming REST request triggers a command execution.
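To show how a caller would drive the controller above, here is an illustrative request. The route (assumed to be /v1/commands), the port, and the payload shape are inferred from the excerpt, not taken from the project docs:

```python
# Illustrative only: post a "<extension>.<command>" command to a running agent.
import requests

resp = requests.post(
    'http://agent-host:9999/v1/commands',   # assumed listen address and route
    params={'wait': 'true'},                # block until the command finishes
    json={
        'name': 'standby.cache_image',      # "<extension>.<command>" form
        'params': {},                        # extension-specific arguments go here
    })
print resp.json()
```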
## django session & user handling (2017-10-21, /2017/10/21/django-session)

Written in the gaps between campus-recruiting interviews: a walk through the logic of Django's default session middleware (`django.contrib.sessions.middleware.SessionMiddleware`).

### session

Configuration:

```python
SESSION_COOKIE_NAME = 'sessionid'            # cookie name for the session id
SESSION_COOKIE_AGE = 60 * 60 * 24 * 7 * 2    # session cookie lifetime
SESSION_COOKIE_DOMAIN = None                 # cookie domain
SESSION_COOKIE_SECURE = False                # whether the cookie is https-only
SESSION_COOKIE_PATH = '/'                    # cookie path
SESSION_COOKIE_HTTPONLY = False              # whether to use the non-RFC standard httpOnly flag (IE, FF3+, others)
SESSION_SAVE_EVERY_REQUEST = False           # whether to save the session data on every request
SESSION_EXPIRE_AT_BROWSER_CLOSE = False      # whether a user's session cookie expires when the Web browser is closed
SESSION_ENGINE = 'django.contrib.sessions.backends.db'  # the module used to store session data
SESSION_FILE_PATH = None                     # directory for session files when using the file engine; None means a sensible default
```

The default session engine is `django.contrib.sessions.backends.db`; others include cache, file, and so on. Every session engine inherits from `django.contrib.sessions.backends.base.SessionBase` and must implement the `save`, `delete`, `load` and `create` methods:

- load: load the session contents from the cache/db by session key; if it does not exist, create it
- create: generate a unique session key and persist it to the cache/db
- save: save the session data to the cache/db
- delete: delete the session from the cache/db

Now the session middleware's processing flow:

```python
import time

from django.conf import settings
from django.utils.cache import patch_vary_headers
from django.utils.http import cookie_date
from django.utils.importlib import import_module


class SessionMiddleware(object):

    def process_request(self, request):
        engine = import_module(settings.SESSION_ENGINE)
        # the session key is the value of the session cookie
        session_key = request.COOKIES.get(settings.SESSION_COOKIE_NAME, None)
        # initialize request.session
        request.session = engine.SessionStore(session_key)

    def process_response(self, request, response):
        """
        If request.session was modified, or if the configuration is to save the
        session every time, save the changes and set a session cookie.
        """
        try:
            accessed = request.session.accessed
            modified = request.session.modified
        except AttributeError:
            pass
        else:
            if accessed:
                patch_vary_headers(response, ('Cookie',))
            if modified or settings.SESSION_SAVE_EVERY_REQUEST:
                if request.session.get_expire_at_browser_close():
                    max_age = None
                    expires = None
                else:
                    max_age = request.session.get_expiry_age()
                    expires_time = time.time() + max_age
                    expires = cookie_date(expires_time)
                # Save the session data and refresh the client cookie.
                request.session.save()
                response.set_cookie(settings.SESSION_COOKIE_NAME,
                        request.session.session_key, max_age=max_age,
                        expires=expires, domain=settings.SESSION_COOKIE_DOMAIN,
                        path=settings.SESSION_COOKIE_PATH,
                        secure=settings.SESSION_COOKIE_SECURE or None,
                        httponly=settings.SESSION_COOKIE_HTTPONLY or None)
        return response
```

This middleware does the session handling: `process_request` loads the session from the cookie, and `process_response` saves the session and refreshes the cookie if the session was modified.
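As a usage-level illustration of that flow (a hypothetical view, not from the post): writing to `request.session` marks it modified, which is what makes `process_response` save it and refresh the cookie.

```python
# Hypothetical Django view illustrating the middleware flow described above.
from django.http import HttpResponse

def visit_counter(request):
    # Reading only marks the session "accessed"; the middleware just patches Vary: Cookie.
    count = request.session.get('visits', 0)

    # Writing marks it "modified", so process_response will call session.save()
    # and set the session cookie again (SESSION_SAVE_EVERY_REQUEST would force it anyway).
    request.session['visits'] = count + 1
    return HttpResponse('visit #%d' % (count + 1))
```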
### user

With the `session` middleware in place you can read the current session through `request.session`; but how do we get from the session to the user? Through `django.contrib.auth.middleware.AuthenticationMiddleware`:

```python
class LazyUser(object):
    def __get__(self, request, obj_type=None):
        if not hasattr(request, '_cached_user'):
            from django.contrib.auth import get_user
            request._cached_user = get_user(request)
        return request._cached_user


class AuthenticationMiddleware(object):
    def process_request(self, request):
        assert hasattr(request, 'session'), (
            "The Django authentication middleware requires session middleware "
            "to be installed. Edit your MIDDLEWARE_CLASSES setting to insert "
            "'django.contrib.sessions.middleware.SessionMiddleware'.")
        request.__class__.user = LazyUser()
        return None
```

`request.user` is produced by Django's auth middleware. Let's see how the lazy user is resolved:

```python
def get_user(request):
    from django.contrib.auth.models import AnonymousUser
    try:
        user_id = request.session[SESSION_KEY]
        backend_path = request.session[BACKEND_SESSION_KEY]
        backend = load_backend(backend_path)
        user = backend.get_user(user_id) or AnonymousUser()
    except KeyError:
        user = AnonymousUser()
    return user
```

When nobody is logged in, the user is `AnonymousUser`. Logging in:

```python
def login(request, user):
    if user is None:
        user = request.user
    # TODO: It would be nice to support different login methods, like signed cookies.
    if SESSION_KEY in request.session:
        if request.session[SESSION_KEY] != user.id:
            # To avoid reusing another user's session, create a new, empty
            # session if the existing session corresponds to a different
            # authenticated user.
            request.session.flush()
    else:
        request.session.cycle_key()
    request.session[SESSION_KEY] = user.id
    request.session[BACKEND_SESSION_KEY] = user.backend
    if hasattr(request, 'user'):
        request.user = user
    user_logged_in.send(sender=user.__class__, request=request, user=user)
```

After a successful login, the user id and `user.backend` are stored in the session.
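Putting the pieces together, a hedged sketch of a login view that calls `authenticate()` (covered next) and then `login()`, which is what writes SESSION_KEY and BACKEND_SESSION_KEY into the session; form handling and error paths are omitted:

```python
# Hypothetical login view.
from django.contrib.auth import authenticate, login
from django.http import HttpResponse, HttpResponseForbidden

def login_view(request):
    user = authenticate(username=request.POST['username'],
                        password=request.POST['password'])
    if user is None:
        return HttpResponseForbidden('bad credentials')
    login(request, user)   # stores user.id and user.backend in request.session
    return HttpResponse('welcome, %s' % user.username)
```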
The auth module provides the `authenticate` function:

```python
def authenticate(**credentials):
    for backend in get_backends():
        try:
            user = backend.authenticate(**credentials)
        except TypeError:
            continue
        if user is None:
            continue
        user.backend = "%s.%s" % (backend.__module__, backend.__class__.__name__)
        return user


def get_backends():
    from django.conf import settings
    backends = []
    for backend_path in settings.AUTHENTICATION_BACKENDS:
        backends.append(load_backend(backend_path))
    if not backends:
        raise ImproperlyConfigured('No authentication backends have been '
                                   'defined. Does AUTHENTICATION_BACKENDS '
                                   'contain anything?')
    return backends
```

Callers invoke `authenticate` to verify credentials.
Let's look at how `django.contrib.auth.backends.ModelBackend` authenticates:

```python
def authenticate(self, username=None, password=None):
    try:
        user = User.objects.get(username=username)
        if user.check_password(password):
            return user
    except User.DoesNotExist:
        return None
```
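Since `AUTHENTICATION_BACKENDS` is just a list of importable classes, a custom backend only needs `authenticate()` and `get_user()`. A hedged sketch of an email-based backend (the module path in the settings comment is hypothetical):

```python
# Hypothetical backend: log in with email instead of username.
from django.contrib.auth.models import User

class EmailBackend(object):
    def authenticate(self, email=None, password=None):
        try:
            user = User.objects.get(email=email)
        except User.DoesNotExist:
            return None
        return user if user.check_password(password) else None

    def get_user(self, user_id):
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            return None

# settings.py (keeping ModelBackend as a fallback):
# AUTHENTICATION_BACKENDS = ('myapp.backends.EmailBackend',
#                            'django.contrib.auth.backends.ModelBackend')
```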
OK, at this point the picture is clear:

Session data is stored in `django_session`, users in `auth_user`, groups in `auth_group`, and group-permission links in `auth_group_permissions`.

Permissions deserve a post of their own. Next time...

## Big-data counting (2017-10-15, /2017/10/15/redis-bitmap-2)

As mentioned in [the trouble with counting daily active users](http://myersguo.me/2017/10/11/redis-bitmap.html), I had a requirement to count distinct things. My first idea was a redis bitmap, but the higher the id ceiling, the more space the bitmap takes, and in the end the bits are extremely sparse.

To spread the counted ids **uniformly**, the first thought is a hash function that maps positions uniformly into n bits, but that still takes a lot of space (on the order of 1 GB for a billion-scale id space). This 2012 article, [Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of Memory](http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html), introduces **cardinality estimation** ([cardinality](http://www.google.com/url?q=http%3A%2F%2Fdblab.kaist.ac.kr%2FPublication%2Fpdf%2FACM90_TODS_v15n2.pdf&sa=D&sntz=1&usg=AFQjCNHtGk7728Hnh8XgbMFCEQwDMNd1kw), [estimation](http://www.google.com/url?q=http%3A%2F%2Falgo.inria.fr%2Fflajolet%2FPublications%2FDuFl03.pdf&sa=D&sntz=1&usg=AFQjCNHij4lkOvtFkqLQ7BwDCOX3DHx2IQ)).

One such estimator is [Linear Counting](http://blog.codinglabs.org/articles/algorithms-for-cardinality-estimation-part-ii.html):

> The idea of LC: take a hash function H whose result space has m values (from 0 to m−1) and whose output is uniformly distributed. Use a bitmap of length m, one bit per bucket, all initialized to 0. Hash every element of a set with cardinality n into the bitmap; if an element hashes to bit k and bit k is 0, set it to 1. When all elements have been hashed, let u be the number of bits still 0. Then
>
> n̂ = −m·ln(u/m)
>
> is an estimate of n, and in fact the maximum-likelihood estimate (MLE). Compared with a direct bitmap, Linear Counting saves a lot of memory (roughly a factor of 10), but that is only a constant-factor improvement: the space complexity is still O(N_max).

**[LogLog Counting](http://blog.codinglabs.org/articles/algorithms-for-cardinality-estimation-part-iii.html)** (LLC) comes from the paper "Loglog Counting of Large Cardinalities". Its space complexity is only O(log2(log2(N_max))), which makes it possible to estimate cardinalities in the hundreds of millions with KB-level memory, so practical systems today mostly use LLC or one of its variants.

LLC splits the hash space evenly into m buckets. For each element, **the first k bits of its hash are the bucket number** (with 2^k = m), and **the remaining L−k bits are the bit string actually used for estimation**. Elements with the same bucket number go to the same bucket. To estimate, first compute, for each bucket, the largest position of the leading "1" among its elements, call it M[i]; then average these m values and estimate

n̂ = 2^((1/m)·Σ M[i])

(the full LLC estimator multiplies this by the bucket count m and a bias-correction constant, which is exactly what the code below does). This is like the usual trick of averaging repeated trials in a physics experiment: it smooths out error due to chance.

> Suppose the hash H is 16 bits long and the bucket count m is 32. An element's hash bit string is "0001001010001010". Since m is 32, the first 5 bits are the bucket number, so this element belongs to bucket "00010", i.e. bucket 2 (buckets are numbered from 0 up to m−1). The remaining part is "01010001010", and clearly ρ(01010001010) = 2, so M[2] is the largest ρ among the elements that land in bucket "00010".

[This article](http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation) gives an approximate implementation:

```python
import random


def trailing_zeroes(num):
    """Counts the number of trailing 0 bits in num."""
    if num == 0:
        return 32  # Assumes 32 bit integer inputs!
    p = 0
    while (num >> p) & 1 == 0:
        p += 1
    return p


def estimate_cardinality(values, k):
    """Estimates the number of unique elements in the input set values.

    Arguments:
        values: An iterator of hashable elements to estimate the cardinality of.
        k: The number of bits of hash to use as a bucket number;
           there will be 2**k buckets.
    """
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket = h & (num_buckets - 1)  # Mask out the k least significant bits as bucket ID
        bucket_hash = h >> k
        max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(bucket_hash))
    return 2 ** (float(sum(max_zeroes)) / num_buckets) * num_buckets * 0.79402


# sanity check: ratio of true count to estimate, repeated 10 times
# (a large n, e.g. 100000, gives a more meaningful check)
[5 / estimate_cardinality([random.random() for i in range(5)], 10) for j in range(10)]
```

From [the translated write-up](http://blog.jobbole.com/78255/):

> **We keep an array of counts of leading (or trailing) zeros and average them at the end**; if the average is x, the estimate is 2^x times the number of buckets. What I have not mentioned yet is the magic number 0.79402. Statistics show the procedure has a predictable bias towards over-estimation; this magic constant, derived in the Durand–Flajolet paper, corrects for it. The constant actually varies with the number of buckets used (up to 2^64), but for larger bucket counts it converges to the value used in the algorithm above.
>
> This gives a very good estimate: for m buckets the average error is about 1.3/sqrt(m), so with 1024 buckets we expect roughly 4% error. To estimate data sets of up to 2^27 items, 5 bits per bucket are enough — less than 1 KB of memory (1024 × 5 = 5120 bits, i.e. 640 bytes). That is pretty great!
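For contrast with the LogLog-style code above, here is a minimal, illustrative Linear Counting sketch following the n̂ = −m·ln(u/m) formula quoted earlier; the bucket count and hashing choices are arbitrary, not from the post:

```python
import math
import random


def linear_counting(values, m=1024):
    """Linear Counting: hash into an m-bit map, count the zero bits u,
    and estimate the cardinality as -m * ln(u / m)."""
    bitmap = [0] * m
    for v in values:
        bitmap[hash(v) % m] = 1
    u = bitmap.count(0)
    if u == 0:
        return float(m)  # bitmap saturated; the estimate breaks down here
    return -m * math.log(float(u) / m)


print linear_counting(random.random() for _ in range(500))  # roughly 500
```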
http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
[hyperloglog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
http://blog.jobbole.com/78255/
http://blog.codinglabs.org/articles/cardinality-estimate-exper.html#ref1

## Puzzles from counting daily active users (2017-10-11, /2017/10/11/redis-bitmap)

Recently I needed to count how many installs a certain version had. I had handled a similar requirement before — counting daily active users — with a redis bitmap. For example, with 11-digit user ids:

```
setbit today 4294967295 1
setbit today 1 1
redis-cli --bigkeys
# reports: Biggest string found 'today' has 536870912 bytes (512 MB, i.e. 2^32 bits)
```

The maximum bit offset is 2^32, i.e. 4294967295 — only 10 digits — while our ids are usually 11 digits. So how do we count them?

My approach: use offset % 100 as a key suffix and offset / 100 as the bit offset to record "seen before or not", with a separate counter for the total:

```
prefix = offset % 100
offset = offset / 100
setbit today:prefix offset 1
incr today:cnt

# This stretches the countable id range to roughly 430 billion (12 digits),
# but each of the 100 keys can still grow to 512 MB, so the worst case is
# on the order of 50 GB.
```

And if we need separate daily-active counts per client type and per version, that footprint multiplies yet again.
That is a huge amount of storage — is there a better way?

From this article: http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

> Sparse bitmaps can be compressed in order to gain space efficiency, but that is not always helpful.

It points to counting with the HyperLogLog algorithm, which redis supports natively since 2.8.9:

```
pfadd today 1
pfadd today 2
# count:
pfcount today
```

The space used is only 12304 bytes, about 12 KB.

Problem solved. For what HyperLogLog actually is, see the article above for the details.
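A minimal sketch of the same idea with redis-py; the connection details and key names are placeholders:

```python
import redis

r = redis.StrictRedis(host='localhost', port=6379)

def mark_active(user_id, day='today'):
    # HyperLogLog: roughly 12 KB per key regardless of how many ids are added
    r.pfadd('dau:%s' % day, user_id)

def daily_active_users(day='today'):
    return r.pfcount('dau:%s' % day)
```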
### References

http://blog.codinglabs.org/articles/cardinality-estimate-exper.html
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
http://blog.codinglabs.org/articles/algorithms-for-cardinality-estimation-part-i.html

## Miscellany (2017-09-18, /2017/09/18/all)

1. Because of the company's security restrictions, many cloud note-taking products are on the risk list, so I went looking for an offline note-taking tool. `typora` caught my eye: markdown editing with a rendered view, very well suited to notes. It will do for now.

2. There never seems to be enough time. Maybe my time management has gone wrong; time quietly slips away under my nose, and some items on the todo list are still on the todo list. Work done "in parallel" keeps faulting mid-flight (the things that break the parallelism: interviews, meetings, DingTalk messages).

3. My mood has turned somewhat negative. There are frustrations — technical ones, and in pushing work forward — that I cannot quite let go of. I am adjusting; let time wear the questions down. We will see what the future looks like.

4. Otherwise, same as ever. It is all for a living, and just staying alive is already quite an achievement.

## nsq (2017-09-15, /2017/09/15/about-nsq)

### Terminology

topic: a stream of data; messages are always published to some topic.
channel: each topic can have multiple channels, and every message published to the topic is copied into every channel (multicast from topic to channels); within a channel, messages are distributed evenly across consumers, so the channel acts as load balancing between consumers — in a sense, a channel is a "queue". A channel is created the first time a consumer subscribes to it. If no consumer is reading, messages queue in memory first and are written to disk once the volume gets too large.
Topic message queue: the queue of messages on a topic.
Channel message queues: each channel queues its messages until a worker consumes them; if the queue exceeds the memory limit, messages are written to disk. A channel usually has several connected clients; assuming all connected clients are ready to receive, each message is delivered to one random client.

Producer: produces messages.
Message broker: stores and forwards messages (forwarding is either pull — the consumer fetches messages from the broker — or push — the broker pushes the messages a consumer is interested in to it).
Consumer: consumes messages.

### Components

nsqd is the daemon that receives, queues, and delivers messages to clients. It listens on two TCP ports, one for clients and one for the HTTP API, and can optionally listen for HTTPS on a third.
nsqlookupd is the daemon that manages topology information. Clients query nsqlookupd to discover the producers of a given topic, and nsqd nodes broadcast their topic and channel information to it. It has two interfaces: a TCP interface that nsqd uses for broadcasts, and an HTTP interface that clients use for discovery and administration.
nsqadmin is a web UI that aggregates realtime cluster statistics and performs administrative tasks.

![nsq](/public/nsq.gif)

Note that the message queues we usually see look like this:

![broker](/public/broker.png)

nsq, by contrast, is broker-less: no middleman, no message broker, no single point of failure. This topology removes the single chain, the aggregation, and the feedback loop; instead, consumers connect directly to all producers. Technically it does not matter which nsqd a client connects to, as long as enough consumers are connected to all producers to keep up with the message volume — everything will eventually get processed.

### Practice

```sh
brew install nsq
nsqlookupd
nsqd --lookupd-tcp-address=127.0.0.1:4160
nsqadmin --lookupd-http-address=127.0.0.1:4161

# publish
curl -d 'hello world 1' 'http://127.0.0.1:4151/pub?topic=test'

# consume
nsq_to_file --topic=test --output-dir=/tmp --lookupd-http-address=127.0.0.1:4161
```
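The same publish/consume round trip from Python — a hedged sketch assuming the `requests` library for the HTTP publish and the `pynsq` client (`pip install pynsq`) for the consumer; the hosts, ports, and channel name mirror the local setup above and are otherwise placeholders:

```python
import nsq
import requests

# publish over nsqd's HTTP API (the same endpoint the curl example uses)
requests.post('http://127.0.0.1:4151/pub', params={'topic': 'test'},
              data='hello world from python')

# consume: subscribe to topic "test" on a channel named "logger"
def handler(message):
    print message.body
    return True                      # returning True marks the message as finished

reader = nsq.Reader(topic='test', channel='logger',
                    message_handler=handler,
                    lookupd_http_addresses=['http://127.0.0.1:4161'])
nsq.run()                            # starts the event loop and blocks
```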