I had a red flag on my Elasticsearch cluster these days and found that the cause was unassigned shards between the nodes.
As the data I collect is not that sensitive, I could easily delete it and recreate it if I need it in the future. But first we need to find it. There are many articles on the internet to help one understand shard allocation, but I offer here a simple solution, which is: simply delete the bastard.
First, we check on the cluster health and get the count of unassigned shards.
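As a rough sketch of that first check, the snippet below reads the cluster health API (GET /_cluster/health) and pulls out the "status" and "unassigned_shards" fields it returns. The base URL, the lack of authentication and the function names are my assumptions; adjust them to your own cluster.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Query the cluster health API and return the parsed JSON body.

    base_url is an assumption: change host, port and authentication as needed.
    """
    with urlopen(base_url + "/_cluster/health") as response:
        return json.load(response)


def summarize_health(health):
    """Pick the two fields we care about from a _cluster/health response."""
    return health["status"], health["unassigned_shards"]


# Hypothetical response, trimmed to the fields used here, so this runs offline.
sample_health = {"status": "red", "unassigned_shards": 5}
status, unassigned = summarize_health(sample_health)
print("status: %s, unassigned shards: %d" % (status, unassigned))
```

Against a live cluster you would call summarize_health(fetch_cluster_health()) instead of using the sample dictionary.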
There are several ways to protect audiovisual content, and watermarking is one of them. It is arguably the best solution against content redistribution via streaming, simply because it allows one to identify the source of the media theft.
Watermarking, which was originally created for image protection, has been intensively researched in the past decade and can now be applied not only to static videos but also to live streaming. That can be done at the hardware or software level, and the mark can be inserted in the frames, key frames, bits, video samples and in many other ways. It is an amazing technology! It is offered by specialized companies as the ultimate protection against piracy... for a lot of money, of course.
There are certain desirable characteristics in this type of forensic measure that make it useful for preventing piracy, and I would like to discuss them first, before getting to the real purpose of this article.
Can you imagine a soccer transmission where you see a giant logo or number on the screen? That would not be the best way to put a mark on the content. Yet, it needs to be there somewhere. Nobody cares if the owner of the content has to insert something in it, as long as it doesn’t impact the end-user experience. There should be no degradation in the quality of the video either. The best way to do it is by inserting a mark that is invisible to the human eye. Or if not invisible, imperceptible.
Robustness means that it should be difficult (if not impossible) to remove the watermark from the media. What about making it not only invisible but moving? Or random? Uhh... what about having it injected at different intervals? Or a mix of sound and video marks? So the essence of the term robustness applied to this type of technology is to make the mark resistant to actions such as resizing, cropping, compression, rotation, noise, and many other attacks that may be applied in an effort to remove it.
This one is the easiest! Pairwise independence refers to the fact that there shouldn’t be two equal marks in the same media. Although you can carry multiple different marks in the same media (say, from different distribution paths), they should not be equal.
Ok. Now that I have covered what a watermarking algorithm should have to be good, I want to discuss a little bit what can be done to break it. Recent watermarking solutions are resistant to the common attacks: resizing, cropping, noise, compression and image overlay. There is one attack, however, that still remains a challenge for most companies, and it is called the collusion attack. The attack consists of merging two sources of the same video to form a third one. That new product would then be without the watermark, or in some cases it would have two marks, making it difficult to identify the source.
Colluders collect several watermarked documents and combine them to produce digital content without underlying watermarks.
There are two basic types of collusion attack:
Type 1 – In this type of collusion attack, the attacker obtains several copies of the same work with different watermarks. The attacker tries to find the video frames that are similar in nature, since frames belonging to the same scene have a high degree of correlation. The attacker then separates the various scenes of the video. A statistical average of the neighboring frames is then computed to mix the different marks together and produce a new unmarked frame. A Type 1 collusion attack can only be successful if successive frames are different enough.
Type 2 – In this type of attack, the attacker obtains several different copies that contain the same watermark and studies them to learn about the algorithm. The attacker then averages the copies. If all copies have the same reference pattern added to them, this averaging operation returns something that is close to the pattern. The averaged pattern can then be subtracted from the copies to generate an unmarked video.
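To make the Type 2 idea a bit more concrete, here is a toy sketch in Python. It is not how a real encoder does it: the "frames" are just short lists of zero-mean random numbers (a loud assumption, since real video content does not average out this nicely), and the watermark pattern, sizes and names are all made up for illustration.

```python
import random


def make_frame(size, rng):
    # Stand-in for video content: zero-mean noise, so it cancels out when averaged.
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


def add_mark(frame, pattern):
    # Additive watermark: the same reference pattern is added to every copy.
    return [pixel + mark for pixel, mark in zip(frame, pattern)]


def average(frames):
    # Position-by-position mean over all frames.
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


rng = random.Random(42)  # fixed seed so the toy experiment is repeatable
size = 16
pattern = [0.5 if i % 2 == 0 else -0.5 for i in range(size)]  # hypothetical watermark

originals = [make_frame(size, rng) for _ in range(5000)]
marked = [add_mark(frame, pattern) for frame in originals]

# Averaging many marked copies leaves something close to the shared pattern.
estimate = average(marked)
# Subtracting the estimate from each copy gives an (almost) unmarked version.
cleaned = [[p - e for p, e in zip(frame, estimate)] for frame in marked]

worst_pattern_error = max(abs(e - m) for e, m in zip(estimate, pattern))
worst_residual = max(abs(c - o) for c, o in zip(cleaned[0], originals[0]))
print(worst_pattern_error, worst_residual)
```

Both printed numbers should be small compared to the 0.5 amplitude of the mark, which is the whole point of the attack: the more copies the attacker averages, the better the estimate of the pattern.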
It seems complicated but there are several encoders out there that are able to perform the collusion attack without you having to study all this stuff.
Collusion or Convolution?
I was caught in a curious discussion with a friend when the term collusion was first presented to me. Although the technique made sense and sounded reasonable, I had never heard about it before. He, on the other hand, didn’t know about convolution either. So which term is the correct one when referring to merging two sources to produce a third? In the literature, the term convolution is used to describe a math operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. The term convolution refers both to the resulting function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reversed and shifted. Collusion, meanwhile, is about people getting together to defraud a system. Both terms are correct, in my humble opinion, and context helps to employ them properly. If one is talking about people getting together to remove a watermark, that would be collusion (it could be a single guy, btw). If you are talking about the math process to merge two different signals and produce a third, then it is convolution.
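For the curious, the definition above is usually written as (f * g)(t) = ∫ f(τ) g(t − τ) dτ. For sampled signals the integral becomes a sum, and a minimal sketch of that discrete version in plain Python could look like the following. The function name and the sample values are mine, just for illustration.

```python
def convolve(f, g):
    """Discrete convolution: out[n] = sum over k of f[k] * g[n - k]."""
    out = [0] * (len(f) + len(g) - 1)
    for i, f_value in enumerate(f):
        for j, g_value in enumerate(g):
            # f[i] * g[j] contributes to output position i + j.
            out[i + j] += f_value * g_value
    return out


signal = [1, 2, 3]
kernel = [1, 1]
result = convolve(signal, kernel)
print(result)  # [1, 3, 5, 3]
```

Here the kernel [1, 1] simply adds each sample to its neighbor, which shows how the shape of one sequence is modified by the other.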
The concept of diceware is pretty awesome! You can read the nasty details here. It requires you to have a dice… lame! Who will bring a dice to work to generate passwords!? Come on?! The principle is really cool, regardless.
We need some code to do it. This lady did it! It is meant for you to use words that you will remember without losing the security aspect, and to prevent you from using “abc123”.
What is the composition of a diceware password?
The recommended size is 4 words, separated by spaces. The size of the words can vary from 1 to 5 characters each.
At work I wanted to make it simple and standardized, so I chose a set of 4 words with 4 letters each. You can have a list with a lot of words within that pattern. But you can use the list however you want. The more the better. The idea is that the words are easy to remember, so they have to be in your language's dictionary and fit your pattern. In my case it is Portuguese… check this one:
volt come rena xepa giro poti roxa …
So the password would be something like – “volt come giro poti”
How strong is it?
Considering the 4 sets of 4-letter words (no pun intended) I’m using, the size of the password is 152 bits. I’m counting the spaces as well, since each one takes a byte (we have 3 spaces): 16 letters plus 3 spaces make 19 bytes, or 152 bits. That basically yields a gigantic number of possible bit patterns, 2^152, which is something around 5.7e+45. That’s right, a number with 46 digits.
If we count only random letters plus the spaces, without requiring the words to mean anything, the number of combinations would be smaller but still very big. But it would be hard to remember 4 sets of random letters, and we would be back to this “autg xdrv gvcn xmg”, right? That’s not a choice. What we need then is a list of words that make sense in our language. So let’s get one. You can generate yours (good idea), look on the internet, or grab one from a book.
Say you have finished editing your list and ended up with 1000 words in it. That would give you 1000 * 1000 * 1000 * 1000 = 1,000,000,000,000. Yep, that is a trillion. With your crappy list of 1000 random 4-letter words, you would get a trillion different passwords that would actually be easy to remember.
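If you want to check the arithmetic yourself, here is a small sketch. The function name is mine; the only assumption is that every word is picked independently and uniformly from the list.

```python
import math


def password_space(list_size, words_per_password=4):
    """Number of possible passwords and the equivalent entropy in bits."""
    combinations = list_size ** words_per_password
    bits = words_per_password * math.log2(list_size)
    return combinations, bits


combos_1000, bits_1000 = password_space(1000)
print(combos_1000)          # 1000000000000
print(round(bits_1000, 1))  # bits of entropy for 4 words out of 1000

# For comparison: how many digits the "152 bits" figure has.
raw_bit_patterns = 2 ** 152
print(len(str(raw_bit_patterns)))  # 46
```

The gap between the two results is the difference between counting every possible 19-byte string and counting only the passwords your word list can actually produce.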
So basically what you have to do is process that list and spit each word randomly to compose your password. It would be rolling the dice for you.
import random

# words.txt holds one word per line
with open('words.txt', 'r') as f:
    mywords = [line.strip() for line in f]
# pick four words at random, separated by spaces
print('New Password: %s' % ' '.join(random.choice(mywords) for _ in range(4)))
But this is not diceware per se. The real diceware requires you to roll the dice 5 times to get each word. So each word of your dictionary would be assigned a number that goes from 11111 to 66666, getting you a list of 7776 unique words. Then our calculation becomes even more interesting, resulting in 7776 * 7776 * 7776 * 7776 = 3,656,158,440,062,976. I don’t know how to say that number in English (it is roughly 3.6 quadrillion)! This is where true randomness in Python needs to be explained, because I’m not rolling no freaking dice 5 times!
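Just to show the mechanics of the five rolls, here is a sketch that simulates the dice and maps each group of rolls to a position in a 7776-word list. This is my own illustration, not the official diceware tooling: the function names are made up, the word list is a placeholder, and I use Python's secrets module for the simulated rolls.

```python
import secrets

FACES = 6
ROLLS_PER_WORD = 5


def roll_die():
    """One die roll, 1 to 6, from the operating system's randomness source."""
    return secrets.randbelow(FACES) + 1


def rolls_to_index(rolls):
    """Map five rolls (11111 .. 66666) to a position 0 .. 7775 in the word list."""
    index = 0
    for roll in rolls:
        index = index * FACES + (roll - 1)
    return index


def pick_words(wordlist, count=4):
    """Pick words by simulated dice rolls; wordlist must hold 6**5 = 7776 entries."""
    if len(wordlist) != FACES ** ROLLS_PER_WORD:
        raise ValueError("expected a list of 7776 words")
    words = []
    for _ in range(count):
        rolls = [roll_die() for _ in range(ROLLS_PER_WORD)]
        words.append(wordlist[rolls_to_index(rolls)])
    return words


# Placeholder list just to show the mechanics; use a real 7776-word list instead.
demo_list = ["word%04d" % i for i in range(FACES ** ROLLS_PER_WORD)]
print(" ".join(pick_words(demo_list)))
print(7776 ** 4)  # 3656158440062976
```

Treating the five rolls as a base-6 number is what makes 11111 land on the first word and 66666 on the last one.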
True or Pseudo Randomness
In computer systems, true randomness (rolling the dice) is hard to achieve. Randomness is described as follows:
Randomness is the lack of pattern or predictability in events. A random sequence of events, symbols or steps has no order and does not follow an intelligible pattern or combination. Individual random events are by definition unpredictable, but in many cases the frequency of different outcomes over a large number of events (or “trials”) is predictable. For example, when throwing two dice, the outcome of any particular roll is unpredictable, but a sum of 7 will occur twice as often as 4. In this view, randomness is a measure of uncertainty of an outcome, rather than haphazardness, and applies to concepts of chance, probability, and information entropy.
In Python, for example, pseudorandomness is used, and it is based on an algorithm called the Mersenne Twister. The “random” module generates a sequence of numbers and takes a “seed” to start off. That is a deterministic way of generating numbers. You can choose that seed; if you don’t, Python takes one from the operating system’s randomness source or, when that is not available, from the system time. Let me give you an example.
from random import seed
from random import random
# seed random number generator
seed(1)
# generate some random numbers
print(random(), random(), random())
# resetting the seed to 1 again
seed(1)
# see the pseudo thing happening
print(random(), random(), random())
You get two sets of random but predictable numbers: both print calls output exactly the same three values.
You can see that after resetting the seed value, the sequence started off again from the same point, and the “randomness” is the same from that point onwards... hence the terms pseudorandomness and deterministic.
The values returned by random() always fall in the interval between 0 and 1, whatever the seed; the seed only decides where the sequence starts. Being able to reproduce the sequence can be useful in production financial, engineering or machine learning systems.
If we use Python's pseudorandom choice function on a list, without setting a seed value (there is no point in it anyway), the result follows a uniform likelihood; in other words, the choices are distributed evenly. In a list of 1000 words like the one I used, the likelihood of a given word being returned is 1/1000, or 0.1%.
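A quick way to see that even distribution is to draw a lot of words and count them. The sketch below uses the seven sample words from earlier instead of a 1000-word list, and a fixed seed (my choice, only so the experiment is repeatable); with 7 words each share should come out close to 1/7.

```python
import random
from collections import Counter

words = ["volt", "come", "rena", "xepa", "giro", "poti", "roxa"]
draws = 70000

rng = random.Random(2024)  # fixed seed only so the experiment is repeatable
counts = Counter(rng.choice(words) for _ in range(draws))

for word in words:
    share = counts[word] / draws
    print("%s %.3f" % (word, share))
```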
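All that to say that we don’t really need to roll the dice five times, since the “entropy” is embedded in the Python function. One caveat: Python’s documentation recommends the secrets module rather than random for anything security related, such as passwords.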
If none of that is sufficient for you, you can order a true diceware password (made on paper) for 2 dollars.
This Feb 28th is the so-called “Thesis Defense” day. It is when me, myself and I, after submitting the thesis, put myself at the disposal of the thesis committee. In this case, “defend” does not imply that I will have to argue aggressively about my work (although I see myself doing it).
Rather, the thesis defense is designed so that faculty members can ask questions and make sure that students actually understand their field and focus area. It serves as a formality because the paper will already have been evaluated (mine has been... it is called the Qualification Process). During a defense, a student will be asked questions by members of the thesis committee. Questions are usually open-ended and require that the student think critically about his or her work. The event is supposed to last from one to three hours; I have heard it could take more... geez!
The defense is the crowning event of at least 2 years of hard study, dedication and sacrifice. And I want to tell you what I have learned from it.
It is not as hard and mystic as it seems
At least here in Brazil, a master's degree, or “Mestrado” as we call it, is not as common as it should be. It has some sort of mysticism around it, as if it were reserved for a certain “class” of student and society. People really tend to go for an MBA. In Brazil the MBA, although it stands for Master of Business Administration, has nothing to do with a Stricto Sensu master's; it is just a Lato Sensu course, or a specialization. I’m not taking credit away from those who choose the MBA, but it is different. The MBA in this country has a more commercial, practical focus. It is also offered at night or on weekends, which helps a lot those who actually have a job to attend to. I guess that is the main reason people tend to go that way. On the other hand, a full-blown master's course is considered too academic, or meant for those who want to pursue a career as a professor, researcher or academic. This is not entirely true! You can enroll in a master's course, continue to work in the industry and solve a real-world problem.
This is part of the mysticism that goes around a master's course. It is not meant only for academics, it is not meant only for super-nerd researchers, and you can do it while you keep your current job! It is true that your job must give you a certain flexibility and freedom to cope with the crazy schedules that some schools push, but it is possible.
You can have a job that is not related to universities and the academic world. You can work anywhere you want. In fact, big tech companies are the ones that employ master's graduates the most. And there is a chance that your salary increases by up to 80% if you have a master's degree.
The professors and the board of teachers are pretty much regular people, with experience in some topics and areas of research... but they are NOT the owners of all knowledge. It is pretty common that the student knows more about a given topic than the professor. He is there to help you adjust your thought process and write your ideas within the accepted scientific methodology, but he is not a God with omniscience. You can argue, you can defend a statement, heck, you can actually fight with your professor (although it is not a good idea) if you think X is equal to Y.
The other aspect of the mysticism of the master's course is that you are there to learn and have classes. Myth! There is no way for the school to teach each student the specifics of his work. What you actually learn is how to organize your thought process, how to treat the numbers you collect from your research, and how to use others' work to build the foundations of yours. That is it. Don’t expect to be there and have classes of advanced math or signal processing or any in-depth classes about your field. That’s not going to happen. Instead you have several debate classes and several seminars to expose your ideas and have the other students confront, challenge and disagree with you. You have some writing classes, you have basic statistics classes and a lot of seminars. That’s where the knowledge and ideas are born. That’s where you mature and learn how to “defend” your work and identify potential flaws in it, by discussing it with other people.
Since it depends on you to understand, collect, test, treat and present the work, it is not as hard as it seems. Give it a try. You might surprise yourself.
It is a victory in solitude
It is sad! I know. But it is the hard truth. There is a big chance that not a single soul around you will actually understand what you are doing. I mean friends, family, co-workers. None of them will, one, be interested in what you discovered, or two, be willing to discuss it in detail. Nobody cares! If you are married, your wife will be interested in when you will finish it so she can have you back to regular life. Or when you will stop spending the night reading to give more attention to your kids, or, in my case, when you will be finished so you can request a raise at your current job!
There won't be any questions about the inner details of your research, and if they initially show interest... that is rapidly lost when you start going on and on about it.
Upon completion, your friends will be excited to know that you have finished it with success... remember the myth that it is super hard and reserved for some people? Sure, they will be thrilled with the news. But don’t think they are really interested in the bits and bytes of it. And it is not because they don’t like you, or have no interest in your stuff... it is just because they don’t understand it, and it is really hard to relate to something you have no idea about. They are (like everybody else) afraid to look stupid.
Isn’t it sad? That you research, and successfully develop something that could be used by society and have the potential to put your name in the field history.. and nobody cares?! You can’t share it, or brag about it :)?! Come on!! So sad!
It is indeed a victory in solitude. Be glad you made it. The hours and the sacrifice are totally worth it!! Knowledge is one of the few things you can actually keep, and its value can’t be measured. It is, however, a satisfaction that only you will feel in its fullness. It is ok!
Are you curious to know what I have studied? Probably not, Maybe?!
My thesis is called “Image Perceptual Hash applied for Video Copy identification”, and here is the abstract.
With the advent of the Internet, video and image files are widely shared and consumed by users from all over the world. Thus, methods to identify these files have emerged as a way to preserve intellectual and commercial rights. Content-based identification, or perceptual hashing, is a technique capable of generating a numeric identifier from the image characteristics. With this identifier, it is possible to compare two images and decide if they are equal, similar or different. The objective of this study is to discuss the application of image perceptual hashing to identify video copies. It proposes the usage of known and public methods such as the Average and Difference Hash, which are based on statistics of the image, and also Phash and Wavelet hash, which are based on the image frequency.
An identification technique was applied using a Hamming distance similarity threshold and the combination of perceptual hash algorithms for video copy identification. The method was tested by applying several attacks to the candidate video, and the results were properly detailed. It is possible to use perceptual hash algorithms for video copy identification, and there are benefits in combining more than one of them: the combination fills performance gaps and vulnerabilities and eliminates false positives.
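To give a taste of what the abstract is talking about, here is a tiny sketch of an Average Hash compared with a Hamming distance threshold. It is not the implementation from the thesis: the 4x4 "images", the threshold value and the function names are all made up for illustration, and a real pipeline would first shrink each frame to a small grayscale image (commonly 8x8).

```python
def average_hash(pixels):
    """Average Hash: one bit per pixel, 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of bit positions where the two hashes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


def is_copy(hash_a, hash_b, threshold):
    """Treat two images as copies when their hashes differ in few enough bits."""
    return hamming_distance(hash_a, hash_b) <= threshold


THRESHOLD = 3  # hypothetical similarity threshold, in bits

# Toy grayscale "frames": dark on the left, bright on the right.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
brighter = [[value + 20 for value in row] for row in original]  # mild "attack"
inverted = [[255 - value for value in row] for row in original]  # different image

h_original = average_hash(original)
print(is_copy(h_original, average_hash(brighter), THRESHOLD))  # True
print(is_copy(h_original, average_hash(inverted), THRESHOLD))  # False
```

The brightness change does not move any pixel across the mean, so the hash stays the same, while the inverted image flips every bit.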
I work in IT, with computers, with computing. You get the idea, right? To be more specific, I work with information security and development for task automation. In practice, this means that the literature on things related to the field is rarely in Portuguese.
So, since I read in English, why not write in it too? Because I am not a native speaker? Because the number of grammatical mistakes might be large? Maybe. But a cooler look at the whole thing makes me realize that even writing in Portuguese I would not be free from that risk. In the age of social networks, writing incorrectly has become the standard of communication, because people value “making oneself understood” more than actually writing correctly. In this paradigm, writing correctly actually increases the risk of the speaker not being understood... ( 🙂 ). So why not take the chance?
So the posts on this blog about technical and nerdy things will be in English. The other posts, about running, biking and other adventures, will stay in Portuguese.